May 29, 2022

The obvious way to parallelize typical mathematical code is to do independent function calls in parallel. However, there are times where you'd want each individual function call to exploit multiple cores, either because you only want a single value with huge precision, because the task is inherently serial (for example, you are evaluating a recurrence relation $x_n = f(x_{n-1})$), or because you just want some existing non-parallel code to run faster without changes.

I'm currently experimenting with multithreaded elementary functions in a development branch of Arb, which should hopefully soon be ready to merge. These are some preliminary observations. For these timings, I linked Arb against a patched version of FLINT to enable multithreaded integer multiplication in the fmpz type (without this patch the speedups will be a bit smaller).

## Parallel bit-burst algorithm

Arb uses the asymptotically fast bit-burst algorithm to compute the core elementary functions exp, sin/cos and atan at precision above a couple of thousand digits (the remaining core function, log, is currently based on MPFR's implementation of the arithmetic-geometric mean, and will need different treatment). The idea for the exponential function is that one writes $\exp(x) = \exp(x_0) \exp(x_1) \exp(x_2) \cdots$ where $x_i$ extracts $2^{i}$ consecutive bits in the binary expansion of $x$. The exponentials $\exp(x_i)$ are then evaluated using binary splitting.

The easiest way to parallelize this algorithm is to perform the $\exp(x_i)$ calls in parallel. This change in arb_exp results in the following speedup computing $\exp(\sqrt{2}-1)$:

To give a concrete timing example, a 10-million digit exponential which takes 20 seconds on a single core now only takes 4.5 seconds exploiting all 8 cores.

An asymptotic speedup of more than 4x with 8 threads is not too bad; there are probably various memory-related bottlenecks, and the exponential function in Arb actually combines the bit-burst method with argument reduction ($\exp(x) = \exp(x/2^q)^{2^q}$), where the only possible parallelization is in the individual squaring operations. With some fine-tuning perhaps 5x speedup is possible.

The threshold for using the bit-burst algorithm is currently set around 19000 bits; with multiple threads, it seems that this should be set lower. Or perhaps the basecase algorithms should be parallelized as well. It would be interesting to see if one could achieve a decent parallel speedup below 10000 bits.

## Parallel binary splitting

An alternative method is to keep the function calls $\exp(x_i)$ serial but use parallel binary splitting in each such function evaluation. This more fine-grained parallelism should be more economical in terms of memory usage. Unfortunately, the speedup is not quite as good:

One reason for the worse results is that the final divisions in the binary splitting remain single-threaded, though there could be other factors as well (also discussed in the previous blog post). I've also tested the speedup with parallel binary splitting for trigonometric functions:

The worse speedup compared to exp is probably due to the fact that the argument reductions for these functions involve additional divisions and/or square roots which, again, remain single-threaded operations for now.

## Memory usage

Speed is not the sole concern; for extreme high-precision arithmetic, memory usage is also an issue. Let us look at the peak memory consumption when computing $\exp(x)$ using 8 threads, relative to the single-threaded algorithm:

Parallel binary splitting uses about twice as much memory as the serial version, while the parallel bit-burst algorithm uses more than four times as much memory. This suggests that the parallel bit-burst algorithm should not be used beyond a certain point (above a billion bits maybe). Parallel binary splitting may be slightly slower, but we certainly don't want to run into unnecessary swapping for billion-digit computations. Elementary functions already use more than 50 times more memory than the size to store the input and output operands, and this could be interesting to optimize even for the serial code.