Experiments with low-level ball arithmetic
May 16, 2012
In the implementation of ball arithmetic I wrote about previously, arithmetic is done naively using FLINT (fmpz) integers. This is easy to implement but adds quite a lot of overhead at low precision (up to a precision of several hundred digits).
I’ve started writing some experimental code for a low level implementation (based on mpn functions). The goal is that a single ball addition or multiplication should be as cheap as or cheaper than a single MPFR operation, ideally close to a mpz fixed-point arithmetic with no error bounding at all.
There are many different ways to do the low-level implementation, giving various tradeoffs between simplicity, speed, and accuracy. So far I’ve tried two approaches.
The first is to do straight up floating-point arithmetic with a full-limb radix (i.e. the digits are base 232 or 264, and the exponent measures the number of digits), like the GMP mpf type, but with a single limb (aligned with the bottom of the mantissa) for the radius. When the radius overflows one limb, the precision of the midpoint is reduced one limb. This has the advantage of requiring few adjustments, and adding numbers can be done with a single mpn_add without temporary space or extra shifts. One disadvantage is that since the top limb can be nearly empty, about half an extra limb is needed on average to obtain the same effective precision as with base 2 arithmetic.
The second is to do base-2 floating-point arithmetic, i.e. with a base-2 exponent, but with the mantissa still having a precision measured in a whole number of limbs, and with a single-limb radius aligned with the last limb as in the first representation. This requires some more adjustment code to make sure that the top bit of the top limb always is set. Advantages of this representation include the fact that the full representable precision is used, and that the error propagated in multiplication can be bounded very easily if one accepts always losing 1-2 bits of precision (this is too much in general, but can be useful for dot products).
Both representations have some minor disadvantages: whenever the radius overflows a limb, the error is amplified by a large factor (however, this is a rare event on average), the results are machine-dependent (this is not a problem if one uses the ball arithmetic as a lower-level routine for a computation intended to produce an exact or precisely defined result in the end, using adaptive precision or Ziv’s strategy for correct rounding), and the lack of precise semantics makes it somewhat difficult to write strong test code.
Other possible approaches include using base-2 arithmetic with a precision measured in bits, perhaps with the radius taking up less than a whole limb (making radius calculations even cheaper). This is likely to require more code, with more (potentially expensive) adjustment steps, but might make the arithmetic more well-behaved with fewer strange worst-case situations.
In any case, I’ve compared how long time it takes to multiply two n-limb numbers using mpn, mpz, FLINT fmpz, and MPFR functions, as well as the working fmpz-based ball arithmetic code (arb), and the two experimental implementations mentioned above (arb1 and arb2). To be precise, the mpz and fmpz timings indicate the combined time to do a multiplication and a shift, necessary to do a fixed-point multiplication.
We see that arb1 and arb2 are nearly equal, and both are faster than MPFR and nearly as fast as mpz, accomplishing the goal. There are a few special cases left (e.g. zero radius) that would need handling in a proper implementation, which should add a few nanoseconds to each, but it’s also possible that the code for either could be streamlined a bit.
I also implemented addition for the arb1 type:
Again, speed is much better for arb1 than for MPFR and arb. The timings for arb1 should actually be a bit better, closer to the mpz numbers. Right now I’m actually not sure why the difference is so large. The numbers get better if you only time it with a fixed set of inputs.
So far, the code is to incomplete to draw any conclusions about which approach is better. Code complexity is probably the biggest issue. In both versions, multiplication takes 150 lines or so of very dense code, and this code does not even handle all cases yet. It would be great to get rid of more branches, but that alone is difficult. Addition is not any easier (the arb1 implementation, which is supposed to be the one where addition is simple, has 150 lines of code and about 20 branches).