Sometimes faster is worse
September 24, 2012
I’ve added support for multipoint evaluation and interpolation to the Arb fmprb_poly module. The fast multipoint evaluation algorithm evaluates a length-$n$ (i.e. degree $n-1$) polynomial simultaneously at $n$ points in $O(n^{1+\varepsilon})$ operations, compared to $O(n^2)$ for repeated use of Horner’s rule. This works by viewing evaluation at a point $a$ as reduction of the polynomial modulo $x - a$, so that all $n$ evaluations can be carried out as polynomial divisions down a balanced product tree, which in turn lets FFT polynomial multiplication do the heavy lifting.
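To make the product-tree idea concrete, here is a minimal sketch in plain Python with exact rational arithmetic. It is emphatically not the Arb implementation: the helper routines below are quadratic-time stand-ins for the FFT-based multiplication and fast division with remainder that the real code uses, there is no ball arithmetic, and the function names are invented for this post.

    from fractions import Fraction

    def poly_mul(f, g):
        """Schoolbook product of coefficient lists (low to high); the real
        implementation would use FFT-based multiplication here."""
        h = [Fraction(0)] * (len(f) + len(g) - 1)
        for i, a in enumerate(f):
            for j, b in enumerate(g):
                h[i + j] += a * b
        return h

    def poly_rem(f, g):
        """Remainder of f modulo g, by naive long division."""
        r = [Fraction(c) for c in f]
        d = len(g) - 1
        while len(r) > d:
            q = r[-1] / g[-1]
            for i in range(d + 1):
                r[len(r) - d - 1 + i] -= q * g[i]
            r.pop()
        return r

    def product_tree(points):
        """Balanced subproduct tree; level 0 holds the linear factors x - x_i."""
        layer = [[-Fraction(x), Fraction(1)] for x in points]
        tree = [layer]
        while len(layer) > 1:
            nxt = []
            for i in range(0, len(layer), 2):
                if i + 1 < len(layer):
                    nxt.append(poly_mul(layer[i], layer[i + 1]))
                else:
                    nxt.append(layer[i])       # odd node carried up unchanged
            tree.append(nxt)
            layer = nxt
        return tree

    def eval_fast(f, points):
        """Evaluate f at all points by reducing it down the subproduct tree:
        f mod (x - x_i) is exactly the value f(x_i)."""
        tree = product_tree(points)
        rems = [poly_rem(f, tree[-1][0])]
        for level in range(len(tree) - 2, -1, -1):
            nxt = []
            for i, r in enumerate(rems):
                nxt.append(poly_rem(r, tree[level][2 * i]))
                if 2 * i + 1 < len(tree[level]):
                    nxt.append(poly_rem(r, tree[level][2 * i + 1]))
            rems = nxt
        return [r[0] for r in rems]            # each remainder is a constant

    print(eval_fast([Fraction(1), Fraction(2), Fraction(3)], [0, 1, 2, 3]))
    # [1, 6, 17, 34]  ==  values of 1 + 2x + 3x^2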
Interpolation is the inverse operation, recovering a polynomial (its coefficients in the usual monomial basis) from a list of $n$ distinct evaluation points and values. In ball arithmetic, we allow the inputs to be approximate numbers, and the output is a polynomial with ball coefficients, guaranteed to envelop the interpolating polynomials of all particular exact inputs contained in the input balls. Three different algorithms are provided: barycentric Lagrange interpolation, interpolation using Newton basis polynomials, and fast interpolation (based on the same principles as the fast evaluation algorithm). The first two cost $O(n^2)$ floating-point operations, while the last costs $O(n^{1+\varepsilon})$ floating-point operations.
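To give a feel for what the $O(n^2)$ Newton variant does, here is a minimal sketch in plain Python with exact rational arithmetic (again not the Arb code, which works on balls; the function name is just for illustration): first compute the divided differences, then expand the Newton form into monomial coefficients.

    from fractions import Fraction

    def interpolate_newton(xs, ys):
        """Return the monomial coefficients (low to high) of the polynomial
        of degree < n passing through the points (xs[i], ys[i])."""
        n = len(xs)
        # Divided differences, computed in place: afterwards c[i] = f[x_0,...,x_i].
        c = [Fraction(y) for y in ys]
        for k in range(1, n):
            for i in range(n - 1, k - 1, -1):
                c[i] = (c[i] - c[i - 1]) / (xs[i] - xs[i - k])
        # Horner-style expansion of the Newton form
        #   c[0] + c[1](x - x_0) + c[2](x - x_0)(x - x_1) + ...
        poly = [c[n - 1]]
        for k in range(n - 2, -1, -1):
            shifted = [Fraction(0)] + poly             # multiply by x ...
            for i in range(len(poly)):
                shifted[i] -= xs[k] * poly[i]          # ... minus x_k
            shifted[0] += c[k]                         # add next Newton coefficient
            poly = shifted
        return poly

    xs = [Fraction(0), Fraction(1), Fraction(2)]
    ys = [Fraction(1), Fraction(2), Fraction(5)]       # samples of 1 + x^2
    print(interpolate_newton(xs, ys))                  # [1, 0, 1]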
Before continuing, I will just quickly mention that there is more to be said and done about evaluation and interpolation at special points, particularly roots of unity and roots of Chebyshev polynomials. Another interesting topic, of course, is to solve over- or underdetermined systems. More about such things some other day, perhaps.
As a test problem, we consider evaluating the Legendre polynomial $P_{n-1}(x)$ at $n$ equidistant points in the unit interval, and then recovering its coefficients from the resulting $(x, y)$ pairs using interpolation.
For example, with $n = 4$, and 53-bit precision, the input polynomial is [0 ± 0, -1.5 ± 0, 0 ± 0, 2.5 ± 0]. The x points are [0 ± 0, 0.33333333333333331 ± 5.5511e-17, 0.66666666666666663 ± 1.1102e-16, 1 ± 0], and the y values obtained by applying Horner’s rule are [0 ± 0, -0.40740740740740733 ± 2.5905e-16, -0.25925925925925936 ± 5.9212e-16, 1 ± 0].
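The midpoints here are just rounded versions of exact rational values; purely as a sanity check of the numbers (plain rational arithmetic, nothing to do with the ball arithmetic in Arb), they can be reproduced with a few lines of Python before we look at the interpolation output:

    from fractions import Fraction

    def horner(f, x):
        """Evaluate a polynomial given by low-to-high coefficients at x."""
        acc = Fraction(0)
        for c in reversed(f):
            acc = acc * x + c
        return acc

    P3 = [Fraction(0), Fraction(-3, 2), Fraction(0), Fraction(5, 2)]   # 2.5 x^3 - 1.5 x
    xs = [Fraction(k, 3) for k in range(4)]                            # 0, 1/3, 2/3, 1
    print([horner(P3, x) for x in xs])
    # [0, -11/27, -7/27, 1]  ~  [0, -0.407407..., -0.259259..., 1]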
Barycentric interpolation gives: [8.6042284408449642e-16 ± 1.1381e-14, -1.4999999999999978 ± 2.877e-14, -2.6645352591003757e-15 ± 5.6319e-14, 2.5000000000000022 ± 2.9426e-14]
Newton interpolation gives: [0 ± 0, -1.4999999999999989 ± 8.6659e-15, -2.2204460492503131e-15 ± 2.8505e-14, 2.5000000000000009 ± 2.0511e-14]
Fast interpolation gives: [0 ± 0, -1.4999999999999929 ± 2.5466e-13, -1.5987211554602254e-14 ± 9.5788e-13, 2.5000000000000098 ± 7.497e-13]
Knowing that the polynomial has integer coefficients when multiplied by two, we can, in fact, recover it exactly by picking the unique half-integer contained in each interval. (Were the radii not strictly smaller than 0.25, so that an interval might contain more than one half-integer, we could repeat the calculation at higher precision until the output converges to the desired accuracy.)
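In code, that recovery step could look something like the sketch below, assuming each output coefficient is handed to us as a floating-point midpoint and radius (the function name and data layout are invented here; this is not the Arb interface):

    from fractions import Fraction

    def snap_to_half_integer(mid, rad):
        """Round a ball [mid - rad, mid + rad] to the half-integer it encloses.
        If rad < 0.25 the ball contains at most one half-integer, so the result
        is provably the exact coefficient; otherwise give up so the caller can
        retry the whole computation at higher precision."""
        if rad >= 0.25:
            return None
        return Fraction(round(2 * mid), 2)

    # midpoints and radii of the Newton interpolation output above (n = 4)
    mids = [0.0, -1.4999999999999989, -2.2204460492503131e-15, 2.5000000000000009]
    rads = [0.0, 8.6659e-15, 2.8505e-14, 2.0511e-14]
    print([snap_to_half_integer(m, r) for m, r in zip(mids, rads)])
    # [Fraction(0, 1), Fraction(-3, 2), Fraction(0, 1), Fraction(5, 2)]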
We increase the precision to 100 (decimal) digits and check what happens for larger $n$. The table shows the number of points (n), the automatically computed maximum error (i.e. coefficient radius) in the polynomial and in the x and y inputs, followed by the maximum coefficient radius in the interpolated polynomial (B – barycentric, N – Newton, F – fast), and the timings for each algorithm.
  n   err(P)  err(x)  err(y)   err(B)  err(N)  err(F)   T(B)     T(N)     T(F)
------------------------------------------------------------------------------
  4   0       e-101   e-100    e-98    e-98    e-97     41 us    15 us    33 us
  8   0       e-101   e-99     e-94    e-94    e-87     0.22 ms  89 us    0.18 ms
 16   0       e-101   e-96     e-86    e-86    e-63     1.0 ms   0.41 ms  0.72 ms
 32   0       e-101   e-90     e-68    e-68    e-12     3.9 ms   1.9 ms   2.1 ms
 64   0       e-101   e-78     e-31    e-31    inf      17 ms    7.8 ms   5.0 ms
128   0       e-101   e-53     e+42    e+42    inf      65 ms    31 ms    11 ms
256   e-06    e-101   e-04     e+189   e+189   inf      270 ms   130 ms   27 ms
The asymptotically fast interpolation algorithm does become faster than the quadratic-time algorithms for quite small $n$, but it also appears to be less numerically stable, making the output useless. But the results with barycentric and Newton interpolation also start to degenerate quite soon, reflecting the ill-conditioned nature of this kind of interpolation problem, so it’s clear that we need to work at much higher precision to handle polynomials of high degree anyway. The following happens when we crank the precision up to 10000 digits (in each case err(P) = 0, err(x) = e-10001):
   n   err(y)    err(B)   err(N)   err(F)    T(B)     T(N)     T(F)
------------------------------------------------------------------------
   4   e-10000   e-9998   e-9998   e-9997    5.1 ms   2.7 ms   4.1 ms
   8   e-9999    e-9994   e-9994   e-9987    31 ms    15 ms    24 ms
  16   e-9996    e-9986   e-9986   e-9963    0.14 s   0.07 s   0.11 s
  32   e-9990    e-9968   e-9968   e-9912    0.55 s   0.30 s   0.33 s
  64   e-9978    e-9931   e-9931   e-9808    2.2 s    1.2 s    0.93 s
 128   e-9953    e-9858   e-9858   e-9596    8.9 s    4.9 s    2.4 s
 256   e-9905    e-9711   e-9711   e-9169    35 s     20 s     5.9 s
 512   e-9807    e-9417   e-9417   e-8312    140 s    80 s     14 s
1024   e-9611    e-8828   e-8828   e-6595    557 s    322 s    33 s
For multipoint evaluation, accuracy and timings look similar:
   n   err(Horner)   err(fast)   T(Horner)   T(fast)
------------------------------------------------------------------------
   4   e-10000       e-9999      0.71 ms     0.89 ms
   8   e-9999        e-9995      6.3 ms      8.6 ms
  16   e-9996        e-9982      34 ms       50 ms
  32   e-9990        e-9952      0.16 s      0.20 s
  64   e-9978        e-9890      0.67 s      0.62 s
 128   e-9953        e-9762      2.8 s       1.7 s
 256   e-9905        e-9500      11 s        4.3 s
 512   e-9807        e-8974      45 s        10 s
1024   e-9611        e-7917      182 s       25 s
In general, it seems that on the order of $n$ bits of extra precision are required to achieve full accuracy with fast evaluation or interpolation. Thus, the “fast” algorithms mostly make sense when the coefficients (and hence the working precision) already grow at least like $n \log n$, or like $C n$ for some large $C$, so that adding another $O(n)$ bits of precision is cheap. This does occur in practice, but in many cases it means that algorithms claimed to be “fast” on the basis of operation counts alone are actually slower than the “slow” algorithms, or yield only a modest speedup for extremely large inputs.
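As a rough heuristic (ignoring constants and iterated logarithms, and writing $M(p)$ for the cost of a $p$-bit multiplication), evaluation at $n$ points with Horner’s rule costs about $n^2 \, M(p)$ bit operations at precision $p$, while the fast algorithm performs on the order of $n \log^2 n$ coefficient operations at a precision inflated to roughly $p + cn$ for some constant $c$, i.e. about $n \log^2 n \, M(p + cn)$ bit operations. If $p$ stays fixed while $n$ grows, the $cn$ term dominates and the advantage disappears; if $p$ already grows at least proportionally to $n$, the inflation is negligible and the speedup is genuine.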
I’m currently not entirely sure whether the poor stability is an inherent difficulty with the fast algorithms, or whether the evaluation structure can be tweaked to mitigate the problem. But it does seem to be inherent; for example, I’ve tried switching to classical polynomial multiplication and division behind the scenes to see if that would improve stability, and it made no significant difference.
Similar observations are made in the paper “On the Stability of Fast Polynomial Arithmetic” by S. Köhler and M. Ziegler (Proceedings of the 8th Conference on Real Numbers and Computers, Santiago de Compostela, Spain, 2008), where the authors conclude:
We thus have shown that, in spite of its name, ‘fast’ polynomial arithmetic is as slow as the naive algorithms: for practically relevant instances over real numbers, due to the increased intermediate precision required for reasonable output accuracy. Surprisingly (at least to us), these instabilities are not so much due to polynomial division with remainder alone; it is the multiplication of several real polynomials which leads to ill-conditioned input to the polynomial division.
The big open challenge thus consists in devising some variant of fast polynomial arithmetic which is numerically sufficiently mild to yield a net benefit in running time over the naive $O(n^2)$ approach. As a first step, we suggest considering a replacement for repeated squaring to calculate polynomial powers.
Finally, here is one application of fast multipoint evaluation (there are many others): computing the $n$th entry of a hypergeometric sequence in $O(n^{1/2+\varepsilon})$ operations instead of the usual $O(n)$ (of course, one can also compute the same thing as a quotient of gamma functions, but that scales poorly if very high precision is also needed). In particular, we can use fast multipoint evaluation to speed up the computation of rising factorials of a real argument. This is virtually the same algorithm that I described some time ago for quickly computing factorials modulo an integer, and it is potentially useful for speeding up the argument reduction of the gamma function, as discussed in the previous post. In that setting we need to compute $x(x+1)(x+2)\cdots(x+n-1)$ to a precision of about $5n$ bits. We compare the $O(n)$ binary splitting algorithm at a precision of $5n$ bits with the fast multipoint evaluation algorithm at a precision of $6n$ bits (increased to compensate for the poor stability), so that both yield $5n$ bits ($1.5n$ digits) of accuracy:
     n    digits   bsplit     multipoint
------------------------------------------
    16        24   0.014 ms   0.028 ms
    32        48   0.030 ms   0.062 ms
    64        96   0.063 ms   0.14 ms
   128       192   0.16 ms    0.38 ms
   256       385   0.55 ms    1.2 ms
   512       770   2.1 ms     5.8 ms
  1024      1541   10 ms      28 ms
  2048      3082   53 ms      120 ms
  4096      6165   0.26 s     0.46 s
  8192     12330   1.2 s      1.9 s
 16384     24660   6.9 s      7.0 s
 32768     49320   28 s       24 s
 65536     98641   127 s      83 s
131072    197283   601 s      289 s
The fast multipoint algorithm does win asymptotically, but only beyond 25,000 digits, and it’s starting to eat all the available memory by the time it has gained much more than a factor of two in speed (this comparison will look much better in a few years, when I have a computer with 1 TB of RAM to feed it!).
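To show what the multipoint variant does structurally, here is a small Python sketch of the square-root block decomposition, in exact rational arithmetic. It captures only the shape of the algorithm, not the Arb code: the function name is invented for this post, the per-block evaluations are done naively with Horner instead of with a single fast multipoint evaluation, and none of the precision management of the real implementation is present.

    from fractions import Fraction
    from math import isqrt

    def horner(f, x):
        acc = Fraction(0)
        for c in reversed(f):
            acc = acc * x + c
        return acc

    def rising_factorial_blocks(x, n):
        """Compute x (x+1) ... (x+n-1) by splitting the product into blocks of
        length m ~ sqrt(n): build f(y) = y (y+1) ... (y+m-1) once, evaluate it
        at y = x, x+m, x+2m, ..., and multiply the values together."""
        x = Fraction(x)
        m = max(isqrt(n), 1)
        f = [Fraction(1)]                      # build f(y) = y (y+1) ... (y+m-1)
        for j in range(m):
            f = [Fraction(0)] + f              # multiply by y ...
            for i in range(len(f) - 1):
                f[i] += j * f[i + 1]           # ... plus j, i.e. by (y + j)
        result = Fraction(1)
        k = 0
        while k + m <= n:
            result *= horner(f, x + k)         # block (x+k) ... (x+k+m-1)
            k += m
        for j in range(k, n):                  # leftover partial block
            result *= x + j
        return result

    print(rising_factorial_blocks(Fraction(1, 2), 5))
    # 945/32  ==  (1/2)(3/2)(5/2)(7/2)(9/2)

The real speedup comes from replacing the per-block Horner calls with one fast multipoint evaluation of $f$ at the roughly $n/m$ shifted arguments, which is what brings the operation count down to $O(n^{1/2+\varepsilon})$; the binary splitting algorithm instead splits the product recursively into two halves.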
Now, as you may recall from the last post, I used a trick in the binary splitting rising factorial to speed it up by a factor two. If I had just neglected to do that, the fast multipoint algorithm would have started winning close to 1000 digits, which would have been a much more spectacular achievement. This must be what Knuth had in mind when he said that premature optimization is the root of all evil.