# Numerically stable fast polynomial arithmetic

April 2, 2013

In the previous post, I used numerical evaluation and power series arithmetic to test the positivity of the Li coefficients $\lambda_n$ up to $n = 4000$. After finishing the writeup, I let a computation up to $n = 10000$ run over the weekend. It finished successfully, and the three last coefficients turn out to be:

$$\lambda_{9998} \approx 3.47356225328307$$

$$\lambda_{9999} \approx 3.47361066865106$$

$$\lambda_{10000} \approx 3.47365797322416$$

I used a working precision of $10n$ bits, roughly 30000 decimal digits, which resulted in approximately 5600 accurate digits for $\lambda_{10000}$ (meaning that the precision could have been set about 20% lower to just prove positivity, saving some time). The whole computation took a bit more than 40 hours, nearly all of which was spent evaluating the Riemann zeta function.

The precision had to be set that high in order to use the fast power series arithmetic in Arb, which basically works by converting floating-point polynomials to scaled integer polynomials (discarding small parts) and multiplying via FLINT, plus cross-multiplying error norms. This works extremely well when all polynomial coefficients have roughly the same magnitude (or only the absolute precision matters). In this computation, however, the coefficients apparently have wildly varying magnitudes, and when implemented this way, the power series operations catastrophically amplify the errors from the computation of the zeta function. With perfectly accurate power series arithmetic, I would expect something like $n$ or $2n$ bits of working precision to be sufficient.

During the weekend, I therefore also tried disabling fast polynomial multiplication in Arb, falling back to the numerically stable $O(n^2)$ classical multiplication algorithm. With this change, it turned out that only about $1.3n$ bits of precision is necessary, and computing $\lambda_n$ up to $n = 10000$ only took three and a half hours, more than ten times faster! In fact, the cost for the zeta function dropped essentially by a factor 100 (as might be expected from a reduction from $b = 10n$ to $b = 1.3n$ bits, because the complexity for the zeta function is roughly $O(nb^2)$); the computation time was now instead dominated by the accurate but slow power series operations.

This motivated me to finally implement a slightly more clever polynomial multiplication algorithm, which now is enabled by default (though I will need to do some more testing before I can claim that it is production-ready). The new implementation essentially guarantees similar accuracy as the classical multiplication algorithm, is no slower than the classical algorithm for worst-case input, and typically has performance close to the truncating multiplication for input where the truncating multiplication would be accurate. (The truncating multiplication algorithm is still available for use in algorithms where it is numerically satisfactory.)

The idea is pretty simple. If we take a typical truncated power series with floating-point coefficients, say $\sum_k x^k / k!$, we can visualize it as a 2D array of bits with the monomial index $k$ on the horizontal axis and the exponents on the vertical axis. The coefficients are enveloped between two curves $b$ bits apart. Truncating to a scaled integer polynomial means that we just try to cover the most significant part of the polynomial with a rectangle of height $b$: It is clear that this throws away a lot of information when the coefficients have very different magnitudes (as in the picture). We could avoid this problem by fitting a rectangle to the whole polynomial without throwing away any bits: Unfortunately, this algorithm can be wasteful. If the magnitudes vary too much, it will perform much worse than the classical multiplication algorithm, and we may even run out of memory. A compromise is to split the polynomial into blocks so that the whole polynomial is accounted for but so that each block only is slightly taller than $b$ bits: After splitting both input polynomials this way, we perform a nested multiplication with classical multiplication for the outer loop and exact integer polynomial multiplication for the individual blocks. In the worst-case scenario, each block has length 1 and we end up with classical multiplication; in the best case, both polynomials are split into just a few blocks and the algorithm is just some factor slower than the truncating multiplication.

This is really just the most rudimentary possible approach to numerically stable fast multiplication. Joris van der Hoeven has written a paper that describes a much more sophisticated algorithm. Van der Hoeven likewise uses block decomposition, but he chooses the blocks more cleverly and in such a way that subproducts that do not contribute to the result can be discarded. He also incorporates scaling (substituting $x \to c x$ in both polynomials to make rectangles fit better). These improvements seem very complicated to implement optimally, but would of course be great to have in the future (the scaling trick can probably be implemented with relatively little effort, requiring only some decent heuristic for picking $c$).

In ball arithmetic, it is also necessary to compute the propagated error, bounding the coefficients of $(P_1+\epsilon_1)(P_2+\epsilon) – P_1 P_2 = P_1 \epsilon_2 + P_2 \epsilon_1 + \epsilon_1 \epsilon_2$. This requires extra polynomial multiplications. For now, I just do three $O(n^2)$ multiplications with exponent bounds (the time constant in the big-$O$ being one nanosecond or so); later on, this too should be done more cleverly for large $n$.

As a comparison of the algorithms, consider multiplying the power series $P_1 = \sum_{k=0}^{n-1} x^k/k!$, $P_2 = \sum_{k=0}^{n-1} x^k/(k+1)$, $P_3 = \sum_{k=0}^{n-1} x^k k!$ with $n = 1000$ (we only compute the low $n$ coefficients of the output). The output below shows timings for each product with the respective algorithm, followed by the value of the coefficient of $x^0$, $x^{500}$ and $x^{999}$ in the output, written as (midpoint) +/- (error bound).

At low precision (30 bits):

P1 * P1, prec = 30
class  140 ms    1 +/- 0          2.6828e-984 +/- 8.2e-991    1.3314e-2264 +/- 8.5e-2271
block   35 ms    1 +/- 0          2.6828e-984 +/- 8.3e-991    1.3314e-2264 +/- 4.5e-2271
trunc  0.3 ms    1 +/- 4.2e-09    0           +/- 2.1e-06     0            +/- 4.2e-06

P1 * P2, prec = 30
class  130 ms    1 +/- 0          0.0054366 +/- 3.6e-09       0.002721 +/- 3.6e-09
block   16 ms    1 +/- 0          0.0054366 +/- 8.3e-09       0.002721 +/- 8.2e-09
trunc  0.4 ms    1 +/- 4.4e-09    0.0054366 +/- 2.2e-06       0.002721 +/- 4.4e-06

P1 * P3, prec = 30
class  130 ms    1 +/- 0          1.2226e+1134 +/- 8.9e+1127  4.0279e+2564 +/- 5.6e+2558
block   35 ms    1 +/- 0          1.2226e+1134 +/- 3.6e+1128  4.0279e+2564 +/- 2.2e+2559
trunc  0.4 ms    0 +/- 2.0e+2556  0            +/- 9.8e+2558  4.0279e+2564 +/- 2.0e+2559

P2 * P2, prec = 30
class   120 ms   1 +/- 0           0.027071 +/- 1.1e-08       0.014956 +/- 1.2e-08
block     3 ms   1 +/- 0           0.027071 +/- 1.5e-08       0.014956 +/- 1.5e-08
trunc   0.6 ms   1 +/- 4.7e-09     0.027071 +/- 2.3e-06       0.014956 +/- 4.7e-06


At higher precision:

P1 * P1, prec = 3000
class 1460 ms   1 +/- 0           2.6828e-984 +/- 7.1e-1885   1.3314e-2264 +/- 7.4e-3165
block  130 ms   1 +/- 0           2.6828e-984 +/- 3.5e-1885   1.3314e-2264 +/- 3.7e-3165
trunc   24 ms   1 +/- 3.7e-903    2.6828e-984 +/- 1.8e-900    0            +/- 3.7e-900

P1 * P2, prec = 3000
class 1460 ms   1 +/- 0           0.0054366 +/- 3.2e-903      0.002721 +/- 3.2e-903
block   77 ms   1 +/- 0           0.0054366 +/- 7.2e-903      0.002721 +/- 7.2e-903
trunc   24 ms   1 +/- 3.9e-903    0.0054366 +/- 1.9e-900      0.002721 +/- 3.9e-900

P1 * P3, prec = 3000
class 1060 ms   1 +/- 0           1.2226e+1134 +/- 7.8e+233   4.0279e+2564 +/- 4.8e+1664
block  110 ms   1 +/- 0           1.2226e+1134 +/- 3.1e+234   4.0279e+2564 +/- 1.9e+1665
trunc   25 ms   0 +/- 1.7e+1662   0            +/- 8.5e+1664  4.0279e+2564 +/- 1.7e+1665

P2 * P2, prec = 3000
class 1410 ms   1 +/- 0           0.027071 +/- 9.4e-903       0.014956 +/- 1.1e-902
block   27 ms   1 +/- 0           0.027071 +/- 1.3e-902       0.014956 +/- 1.3e-902
trunc   24 ms   1 +/- 4.0e-903    0.027071 +/- 2.0e-900       0.014956 +/- 4.1e-900


For $P_1 P_1$, the truncating multiplication is adequate if we care about the absolute error (and not the relative error) of the high-index coefficients. It is rather terrible for $P_1 P_3$, where we end up generating huge amounts of noise for the low-index coefficients (in a power series, these are likely to be important). Truncating multiplication is optimal for $P_1 P_2$ (where the truncated part of $P_1$ does not contribute to the result) and $P_2 P_2$ (where almost no truncation occurs). The block multiplication is nearly as fast as the truncating multiplication for $P_2 P_2$ at least at high precision (since the coefficients have similar magnitude); it is somewhat slower for $P_1 P_2$ since it does not recognize that some blocks could be discarded. Overall, the block algorithm performs much better than classical multiplication, while having similar numerical accuracy in all cases.

Back to the Li coefficients, below is a table comparing the cost of each stage of the computation, when done respectively with truncating multiplication (at $7n$ to $10n$ bits of working precision), classical multiplication ($1.3n$ bits of precision), and block multiplication (also $1.3n$ bits of precision). As before, zeta is the cost of evaluating the series expansion of the Riemann zeta function; log is the power series logarithm (for the computation done with truncating multiplication, I combined this time with zeta); gamma is the evaluation of the gamma function Taylor series; compose is the final power series composition. The timings are in seconds.

 Step Truncating Classical Block n = 1000 zeta 33.2 1.3 1.4 log - 1.5 0.05 gamma 0.6 0.02 0.02 composition 1.7 6.7 0.5 total 36 9.5 2.0 n = 4000 zeta 4144 66.5 69 log - 131 1.2 gamma 29 0.4 0.4 composition 98 373 15 total 4271 571 86 n = 10000 zeta 147236 1242 1272 log - 2760 8.3 gamma 781 3.4 3.4 composition 1994 7971 185 total 150011 11976 1469

To summarize, with the truncating multiplication, evaluating zeta is slow (due to the required extra precision) and power series arithmetic is fast (though slower than necessary due to the high precision needed); with classical multiplication evaluating zeta is fast and power series arithmetic is slow; with block multiplication, both zeta and the power series arithmetic is fast. For this task, “fast and accurate” (blockwise multiplication) is better than “slow and accurate” (classical multiplication) is better than “fast and inaccurate” (truncating multiplication), each improvement being one order of magnitude! We are down to less than half an hour for computing $\lambda_n$ up to $n = 10000$, which seems pretty good.