# Accelerated arithmetic in Arb 2.15

September 18, 2018

I have just issued version 2.15 of Arb. As usual, downloads are available via GitHub.

The big news in Arb 2.15 is that polynomial and matrix arithmetic is significantly faster in the basecase regimes. The speedup is typically about 2x for polynomials and matrices of size 10 to 30 when the precision is 1-2 words (up to 128 bits), with varying levels of improvement in other ranges. For certain operations, more than a 4x speedup is possible. The improvement is due to use of fast dot product evaluation in combination with more accurate cutoffs between different algorithms.

Arb 2.15 adds the functions arb_dot and acb_dot for computing dot products $\sum_{k=1}^N a_k b_k$. The normal way to compute a dot product until now was to do fused multiply-adds with arb_addmul / acb_addmul in a loop. However, each such multiply-add has to produce fully normalized arb_t or acb_t output. The main idea behind the new dot product functions is to treat the whole dot product as an atomic operation and avoid normalization and bookkeeping overhead for each partial sum. In addition to this, the new dot product functions inspect the error bounds for all entries in the vectors and for each term $a_k b_k$ only compute the leading bits that actually contribute to the final result, saving even more time when the terms vary in magnitude or have large radii.

There are also new functions arb_approx_dot and acb_approx_dot which compute floating-point dot products without error bounds, useful for approximate linear algebra.

The following table compares the performance of four methods to compute a dot product of two length-N vectors of real numbers (here, all numbers have roughly the same magnitude).

• mpfr_dot is a simple MPFR implementation calling mpfr_mul and mpfr_add in a loop (MPFR has a fused multiply-add function, but it is actually slower)
• arb_dot_simple is a simple Arb implementation calling arb_addmul in a loop
• arb_dot is the new, optimized dot product
• arb_approx_dot is the new, optimized dot product without error bounds

The table shows timings in nanoseconds (lower is better).

N prec mpfr_dot arb_dot_simple arb_dot arb_approx_dot
10 64 300 710 217 109
10 128 900 840 295 177
10 256 1240 1150 660 560
10 1024 3600 3400 2780 2700
10 4096 21300 24800 21300 21100
10 16384 173000 202000 172000 172000
100 64 3200 7700 1760 830
100 128 9300 8800 2320 1470
100 256 12700 12200 6000 5100
100 1024 35000 35000 26600 26000
100 4096 218000 256000 210000 211000
100 16384 1760000 2070000 1730000 1720000

At low precision, the new dot product functions are not only several times faster than the simple algorithm using arb_addmul; they are even significantly faster than MPFR arithmetic! Under ideal conditions, arb_dot reaches an equivalent performance of more than 100 million ball operations per second (where one ball addition or multiplication counts as an operation) while arb_approx_dot reaches more than 200 million floating-point operations per second. (This is on a 1.9 GHz Intel i5-4300U CPU.)

Complex numbers are uniformly about four times more expensive than real numbers:

N prec acb_dot_simple acb_dot acb_approx_dot
10 64 3000 850 420
10 128 3600 1070 690
10 256 5700 2920 2560
10 1024 15100 11400 10600
10 4096 96000 85000 85000
10 16384 660000 660000 660000

In Arb 2.15, calls to the new arb_dot and acb_dot functions have replaced the inner loops in basecase polynomial multiplication, matrix multiplication, power series inversion, triangular solving and many other operations. The following plot shows the speedup between Arb 2.14 and Arb 2.15 for various polynomial and power series operations, where N is the polynomial or series length. The comparison here is for complex arithmetic (acb_poly) at 64-bit precision. The following plot shows the speedup for matrix operations, where the matrices have size N. The comparison here is likewise for complex arithmetic (acb_mat) at 64-bit precision. When N is large enough, the new dot product code gives no improvement for most operations since Arb tries to reduce everything asymptotically to multiplying huge polynomials or matrices via FLINT. One visible exception is charpoly which currently does not use a reduction to matrix multiplication, so the pure dot product speedup is exhibited.

I would be interested in reports from users about whether the improvements are visible in "real-world" applications!

The dot product code was a lot of work to design and implement. In fact, I started working on it in spare evenings during my summer vacation in August so that I could take the time to meditate on the problem and figure it out slowly without other work-related distractions. I have not had time to work on any other major features in time for this release, but a small new feature in Arb 2.15 worth pointing out is a function for computing Airy function zeros (arb_hypgeom_airy_zero). Having this function in the library mostly serves the purpose of "special functions table-filling", but it will hopefully be useful to someone at some point!