Experiments with low-level ball arithmetic

May 16, 2012

In the implementation of ball arithmetic I wrote about previously, arithmetic is done naively using FLINT (fmpz) integers. This is easy to implement but adds quite a lot of overhead at low precision (up to a precision of several hundred digits).

I’ve started writing some experimental code for a low level implementation (based on mpn functions). The goal is that a single ball addition or multiplication should be as cheap as or cheaper than a single MPFR operation, ideally close to a mpz fixed-point arithmetic with no error bounding at all.

There are many different ways to do the low-level implementation, giving various tradeoffs between simplicity, speed, and accuracy. So far I’ve tried two approaches.

The first is to do straight up floating-point arithmetic with a full-limb radix (i.e. the digits are base 232 or 264, and the exponent measures the number of digits), like the GMP mpf type, but with a single limb (aligned with the bottom of the mantissa) for the radius. When the radius overflows one limb, the precision of the midpoint is reduced one limb. This has the advantage of requiring few adjustments, and adding numbers can be done with a single mpn_add without temporary space or extra shifts. One disadvantage is that since the top limb can be nearly empty, about half an extra limb is needed on average to obtain the same effective precision as with base 2 arithmetic.

The second is to do base-2 floating-point arithmetic, i.e. with a base-2 exponent, but with the mantissa still having a precision measured in a whole number of limbs, and with a single-limb radius aligned with the last limb as in the first representation. This requires some more adjustment code to make sure that the top bit of the top limb always is set. Advantages of this representation include the fact that the full representable precision is used, and that the error propagated in multiplication can be bounded very easily if one accepts always losing 1-2 bits of precision (this is too much in general, but can be useful for dot products).

Both representations have some minor disadvantages: whenever the radius overflows a limb, the error is amplified by a large factor (however, this is a rare event on average), the results are machine-dependent (this is not a problem if one uses the ball arithmetic as a lower-level routine for a computation intended to produce an exact or precisely defined result in the end, using adaptive precision or Ziv’s strategy for correct rounding), and the lack of precise semantics makes it somewhat difficult to write strong test code.

Other possible approaches include using base-2 arithmetic with a precision measured in bits, perhaps with the radius taking up less than a whole limb (making radius calculations even cheaper). This is likely to require more code, with more (potentially expensive) adjustment steps, but might make the arithmetic more well-behaved with fewer strange worst-case situations.

In any case, I’ve compared how long time it takes to multiply two n-limb numbers using mpn, mpz, FLINT fmpz, and MPFR functions, as well as the working fmpz-based ball arithmetic code (arb), and the two experimental implementations mentioned above (arb1 and arb2). To be precise, the mpz and fmpz timings indicate the combined time to do a multiplication and a shift, necessary to do a fixed-point multiplication.

limbs mpn mpz fmpz mpfr arb arb1 arb2
1 10 30 57 48 139 26 32
2 18 44 51 58 230 57 51
3 30 54 68 89 259 70 64
4 44 68 83 100 279 88 82
5 62 88 100 129 300 110 100
6 83 110 129 150 340 130 130
7 110 129 150 180 380 160 150
8 139 170 180 200 400 180 190
9 170 210 220 240 440 230 220
10 210 259 270 300 510 280 270
11 259 310 320 490 550 320 310
12 300 340 360 530 610 360 360
13 350 400 420 519 660 410 410
14 410 450 470 640 700 480 470
15 470 519 530 640 769 540 530

We see that arb1 and arb2 are nearly equal, and both are faster than MPFR and nearly as fast as mpz, accomplishing the goal. There are a few special cases left (e.g. zero radius) that would need handling in a proper implementation, which should add a few nanoseconds to each, but it’s also possible that the code for either could be streamlined a bit.

I also implemented addition for the arb1 type:

limbs mpn mpz fmpz mpfr arb arb1
1 7 17 30 83 87 31
2 6 17 29 88 95 39
3 7 17 29 92 97 45
4 8 20 32 79 100 48
5 11 20 32 82 100 51
6 11 23 36 83 100 58
7 12 23 34 100 100 59
8 16 26 40 98 110 61
9 17 31 41 110 120 62
10 14 32 44 110 110 67
11 20 27 45 120 110 72
12 15 27 39 110 110 71
13 18 28 40 110 120 73
14 18 31 43 110 110 78
15 19 28 41 120 120 83

Again, speed is much better for arb1 than for MPFR and arb. The timings for arb1 should actually be a bit better, closer to the mpz numbers. Right now I’m actually not sure why the difference is so large. The numbers get better if you only time it with a fixed set of inputs.

So far, the code is to incomplete to draw any conclusions about which approach is better. Code complexity is probably the biggest issue. In both versions, multiplication takes 150 lines or so of very dense code, and this code does not even handle all cases yet. It would be great to get rid of more branches, but that alone is difficult. Addition is not any easier (the arb1 implementation, which is supposed to be the one where addition is simple, has 150 lines of code and about 20 branches).