Experiments with low-level ball arithmetic

In the implementation of ball arithmetic I wrote about previously, arithmetic is done naively using FLINT (fmpz) integers. This is easy to implement but adds quite a lot of overhead at low precision (up to a precision of several hundred digits).

I’ve started writing some experimental code for a low level implementation (based on mpn functions). The goal is that a single ball addition or multiplication should be as cheap as or cheaper than a single MPFR operation, ideally close to a mpz fixed-point arithmetic with no error bounding at all.

There are many different ways to do the low-level implementation, giving various tradeoffs between simplicity, speed, and accuracy. So far I’ve tried two approaches.

The first is to do straight up floating-point arithmetic with a full-limb radix (i.e. the digits are base 232 or 264, and the exponent measures the number of digits), like the GMP mpf type, but with a single limb (aligned with the bottom of the mantissa) for the radius. When the radius overflows one limb, the precision of the midpoint is reduced one limb. This has the advantage of requiring few adjustments, and adding numbers can be done with a single mpn_add without temporary space or extra shifts. One disadvantage is that since the top limb can be nearly empty, about half an extra limb is needed on average to obtain the same effective precision as with base 2 arithmetic.

The second is to do base-2 floating-point arithmetic, i.e. with a base-2 exponent, but with the mantissa still having a precision measured in a whole number of limbs, and with a single-limb radius aligned with the last limb as in the first representation. This requires some more adjustment code to make sure that the top bit of the top limb always is set. Advantages of this representation include the fact that the full representable precision is used, and that the error propagated in multiplication can be bounded very easily if one accepts always losing 1-2 bits of precision (this is too much in general, but can be useful for dot products).

Both representations have some minor disadvantages: whenever the radius overflows a limb, the error is amplified by a large factor (however, this is a rare event on average), the results are machine-dependent (this is not a problem if one uses the ball arithmetic as a lower-level routine for a computation intended to produce an exact or precisely defined result in the end, using adaptive precision or Ziv’s strategy for correct rounding), and the lack of precise semantics makes it somewhat difficult to write strong test code.

Other possible approaches include using base-2 arithmetic with a precision measured in bits, perhaps with the radius taking up less than a whole limb (making radius calculations even cheaper). This is likely to require more code, with more (potentially expensive) adjustment steps, but might make the arithmetic more well-behaved with fewer strange worst-case situations.

In any case, I’ve compared how long time it takes to multiply two n-limb numbers using mpn, mpz, FLINT fmpz, and MPFR functions, as well as the working fmpz-based ball arithmetic code (arb), and the two experimental implementations mentioned above (arb1 and arb2). To be precise, the mpz and fmpz timings indicate the combined time to do a multiplication and a shift, necessary to do a fixed-point multiplication.

limbs mpn mpz fmpz mpfr arb arb1 arb2
1 10 30 57 48 139 26 32
2 18 44 51 58 230 57 51
3 30 54 68 89 259 70 64
4 44 68 83 100 279 88 82
5 62 88 100 129 300 110 100
6 83 110 129 150 340 130 130
7 110 129 150 180 380 160 150
8 139 170 180 200 400 180 190
9 170 210 220 240 440 230 220
10 210 259 270 300 510 280 270
11 259 310 320 490 550 320 310
12 300 340 360 530 610 360 360
13 350 400 420 519 660 410 410
14 410 450 470 640 700 480 470
15 470 519 530 640 769 540 530

We see that arb1 and arb2 are nearly equal, and both are faster than MPFR and nearly as fast as mpz, accomplishing the goal. There are a few special cases left (e.g. zero radius) that would need handling in a proper implementation, which should add a few nanoseconds to each, but it’s also possible that the code for either could be streamlined a bit.

I also implemented addition for the arb1 type:

limbs mpn mpz fmpz mpfr arb arb1
1 7 17 30 83 87 31
2 6 17 29 88 95 39
3 7 17 29 92 97 45
4 8 20 32 79 100 48
5 11 20 32 82 100 51
6 11 23 36 83 100 58
7 12 23 34 100 100 59
8 16 26 40 98 110 61
9 17 31 41 110 120 62
10 14 32 44 110 110 67
11 20 27 45 120 110 72
12 15 27 39 110 110 71
13 18 28 40 110 120 73
14 18 31 43 110 110 78
15 19 28 41 120 120 83

Again, speed is much better for arb1 than for MPFR and arb. The timings for arb1 should actually be a bit better, closer to the mpz numbers. Right now I’m actually not sure why the difference is so large. The numbers get better if you only time it with a fixed set of inputs.

So far, the code is to incomplete to draw any conclusions about which approach is better. Code complexity is probably the biggest issue. In both versions, multiplication takes 150 lines or so of very dense code, and this code does not even handle all cases yet. It would be great to get rid of more branches, but that alone is difficult. Addition is not any easier (the arb1 implementation, which is supposed to be the one where addition is simple, has 150 lines of code and about 20 branches).

Posted in Uncategorized | Leave a comment

Logarithms as well

In the last post, I discussed computing the exponential function. With that being implemented, it’s not much work to do logarithms as well. If we can evaluate the exponential function, then we can compute $y = \log x$ by using Newton’s method to find a root of the equation $\exp(y) – x = 0$, i.e. using the iteration $y_{n+1} + 1 – x / \exp(y_n)$.

Newton iteration is extremely efficient. Starting from a sufficiently accurate initial value, the number of correct digits roughly doubles with each iteration; moreover, the algorithm is “self-correcting”, so we only need to use full precision for the last iteration, 1/2 the precision for the second last iteration, 1/4 the precision for the iteration before that, etc. The total cost turns out to be only a fraction more than the time to evaluate a single exponential to the target precision.

In fact, it seems even better to use Halley’s method, which has the update step $y_{n+1} = y_n + 2 – 4t/(x + t)$ where $t = \exp(y_n)$. The amount of work per step is basically the same (one exponential and one division), but now the number of digits triples with each iteration! (We could avoid the division in the ordinary Newton iteration using $1/\exp(y_n) = \exp(-y_n)$, but it is more convenient to limit the basecase exponential to positive arguments at least for now.)

Newton’s method and Halley’s method are just the first two instances of Householder’s method, of order $d = 1$ and $d = 2$ respectively. Using higher-order Householder iterations, we can obtain arbitrarily rapid convergence, but for $d > 2$ the formulas start to become unwieldy. For $d = 3$, we have

$$y_{n+1} = \frac{6 x \left(x+2 t\right)}{x^2+4 x t+t^2}+t-3.$$

I have now implemented a first version of a basecase logarithm function, which evaluates $\log(1+x)$ on the standard interval $[0,1)$. It starts from a double-precision approximation and refines it to the target precision (modulo some rounding error) using Halley iteration. Here is how it performs (timings are in nanoseconds):

prec system dd qd mpfr mpmath new code
53 67 4900 22000 390
64 4800 22000 390
106 900 6700 24000 590
128 7800 26000 600
192 9200 29000 1100
212 20000 10000 30000 1500
256 13000 32000 1500
320 14000 35000 2100
384 16000 39000 2500

Comparing the numbers to the table from the last post, we find that log runs at around half the speed of exp. This ratio should improve at higher precision, but even at low precision, it is not a bad result, and the code is still about an order of magnitude faster than MPFR. Some savings are possible: spending around 70 nanoseconds on the initial value by calling the system logarithm is a bit excessive (a faster, sloppier double-precision logarithm could be used instead), and the division could probably be done slightly faster. But apart from that, the code is quite lean, and the only way to improve it is to improve the exponential function.

I should mention that the current code typically gives 1-2 bits less than full accuracy. To guarantee full accuracy, it is necessary to add a few guard bits. It shouldn’t actually be all that difficult to compute a rigorous error bound for the Halley iteration to make this automatic, at least in principle (translating it to correct and efficient code is probably a bit hairy), but this will have to wait since it requires error bounding for exp first.

This is actually probably the best way to compute the logarithm, up to a precision of several thousand digits. We could of course use a table lookup and a Taylor series correction, just as for exp, avoiding the overhead of Newton/Halley iteration. But a table lookup requires at least one division to transform the argument, and moreover nested table lookups aren’t possible, so we would need a much larger table for the same speed. If we use the convergence acceleration formula $\log(1+x) = 2^n \log((1+x)^{1/2^n})$), the complexity is roughly the same as for exp (with $\exp(x) = (\exp(x/2^n))^{2^n}$), but square roots are so expensive that this only becomes competitive at very high precision. Either way, it would be difficult to get rid of the 2x overhead compared to exp.

Posted in Uncategorized | Comments Off

Revisiting transcendental functions

The overwhelming majority of the time when I use arbitrary-precision arithmetic, I only need precision slightly higher than hardware precision; typically 30-40 digits, occasionally perhaps 100 digits, and only very rarely 1000 digits or more. About two years ago, I started doing some experiments to see how quickly the most common transcendental functions (exp, log, cos, sin, atan, gamma) can be computed in this range (the old test code, now obsolete, is available here).

There are three main ideas. The first is simply to do arithmetic very efficiently, only using the GMP functions that are written in assembly (mpn_add_n, mpn_addmul_1, mpn_mul_basecase, etc.), avoiding divisions like the plague, eliminating temporary memory allocations and copies, and so on.

The second is to use complexity reduction techniques aggressively. For elementary (or hypergeometric) functions, one can get a complexity of $d^{2.333}$ instead of the naive $d^3$ at $d$-digit precision by combining argument reduction with baby-step/giant-step polynomial evaluation. With the proper low-level implementation, this should be effective essentially from $d = 1$.

The third is to take advantage of precomputation to speed up repeated evaluations. RAM is cheap, and in the range up to 100 digits or so, the space needed to speed up the most commonly used functions by large factors is measured in kilobytes. Of course, use of lookup tables increases cache pressure, but a cache miss is much cheaper than a whole sequence of operations on multi-limb integers, so this tradeoff is generally worth it (some benchmarking I did a while back also indicated that prefetch instructions worked wonderfully for this purpose, at least on my CPU).

I have now written a new basecase implementation for exp(), substantially improved compared to my previous experiments; the code can be found here. Using a modest 224 KB lookup table (with a small modification, this could be trimmed to about 160 KB) for 16-bit argument reduction allows going up to a precision of about 384 bits, or 115 decimal digits, on a 64-bit system. In other words plenty for most practical purposes. The lookup table itself takes just a few microseconds to generate dynamically.

Here is how it compares to some other libraries: the system double-precision exp, the double-double (dd) and quad-double (qd) types from D. H. Bailey’s QD library (measured using the supplied tests/qd_timer program), MPFR, and mpmath (using the Python implementation with GMPY types, not the code in Sage which wraps MPFR). All timings are in nanoseconds.

prec system dd qd mpfr mpmath new code
53 49 4300 20000 180
64 5500 20000 190
106 790 7200 24000 330
128 7800 26000 340
192 10000 29000 520
212 6500 10000 31000 690
256 11000 31400 780
320 12000 34000 1100
384 14000 38000 1500

A couple of things need to be pointed out.

Right now, this code is experimental. I have not tested it thoroughly, and it does not come with an error bound (however, the error is never larger than a few ulp, and a bound can be bounded quite easily, for which I intend to include code later on).

All other libraries take an arbitrary $x$, not necessarily restricted to $[0,\ln 2)$. The initial argument reduction is a division with remainder, where a low-precision value of $1/\ln 2$ can be precomputed, meaning that only a single mpn_submul_1 plus some adjustments should be necessary.

The system exponential function, MPFR and mpmath also effectively use higher internal precision to compute the exponential with 0.5 or 1 ulp error, so to be fair a few guard bits should be added.

Even after adding all remaining corrections, the code can actually probably be made faster. At 1 or 2 limb precision, replacing all calls to GMP functions with inline assembly should improve performance; the current code also contains a redundant division, which should save 60 nanoseconds or so when removed.

Among the elementary functions, exp is the easiest to implement; cos, sin, atan and log can be computed using similar principles, although there is some more overhead, so they will perhaps be a factor two slower. The next logical step is to add a second version of exp, for precisions between 384 bits and a few thousand digits. It will necessarily be a bit slower than the basecase version, but should still be a decent factor faster than the MPFR function in this range.

Posted in Uncategorized | Comments Off

Algorithm selection for zeta(n)

In my last post, I mentioned using binary splitting to compute $\zeta(n)$ to extremely high precision for small $n$. I have now added a function to Arb for evaluating $\zeta(n)$ that selects between several different algorithms depending on both $n$ and the precision.

Here are timings compared to MPFR for even $n$:

Timings for zeta(n), even n

Here the Arb function either uses the formula $\zeta(2n) = (-1)^{n+1} B_{2n} (2\pi)^{2n} / (2(2n)!)$ (computing the Bernoulli number $B_{2n}$ exactly using FLINT), or the Euler product $(\zeta(n))^{-1} = \prod_{p} (1-p^{-n})$ for large $n$. You can see the cutoff quite clearly in the plot. At any size, this strategy is much faster than the generic algorithm used by MPFR.

In fact, since FLINT uses the Euler product to compute Bernoulli numbers, the Euler product effectively always ends up being used. That is, one basically uses the Euler product to compute $\zeta(n)$ to full precision, or to compute $B_n$ exactly, whichever requires less precision.

Timings for odd $n$ look a bit different:

Timings for zeta(n), odd n

There is a visible speedup at very high precision for small $n$ due to the use of binary splitting (timings can be improved further by writing better code for the special cases $n = 5, 7$), and a speedup for asymptotically large $n$ where Arb again uses the Euler product. In between, the code simply falls back to the MPFR function, so the timings are identical.

The pressing question is whether one can do better than MPFR here. Euler-Maclaurin summation might be a bit faster if optimized carefully, but I can’t really think of anything that would make a huge difference.

Another way to compute $\zeta(n)$ to high precision when $n$ is not too large is to use the algorithm of E. Karatsuba. However, my tests so far indicate that it is much slower than other methods in practice.

These timings are all for computing $\zeta(n)$ as an isolated value. Much better performance is possible when computing $\zeta(2), \zeta(3), \ldots, \zeta(n)$ simultaneously. More on that at a later time…

Posted in Uncategorized | Comments Off