Parallelizing Arb for fun and profit

May 19, 2022

I finally got my new work laptop, and with a modern 8-core CPU (Zen3), one of the fun things I can do is add more multithreaded algorithms to FLINT and Arb.

Contents

Multithreading approach
Numerical integration
Plotting
Pi and friends
Bernoulli numbers
Parallel speedup

Multithreading approach

Certain key algorithms in FLINT and Arb support parallel computation via pthreads (as of FLINT 2.6 a thread pool is used internally). Notably, polynomial and matrix multiplication can take advantage of multiple threads. Only a single thread is used by default; to use up to $m$ threads, the user must call flint_set_num_threads(m).

For the upcoming FLINT 2.9 release, I have added two experimental helper functions to simplify parallel programming and abstract away details of the underlying thread pool implementation.

The first helper, flint_parallel_do, simply executes a given function $f(i)$ for $0 \le i < n$, assuming that the function calls can be run independently in parallel. A typical application is to evaluate a result mod $p$ for several primes $p$, where $f(i)$ reads the $i$th prime from an array and writes the result to the $i$th entry of another array.

The second function, flint_parallel_binary_splitting, performs the evaluate-and-fold operation $f(a) \circ f(a+1) \circ \cdots \circ f(b-1)$ where $f$ is a given function and $\circ$ is a given (associative) operation. As the name suggests, the evaluation is done with binary splitting to provide optimal complexity for accumulation-type operations with growing entries.

Not all parallel jobs can be expressed in terms of these primitives, but quite a few can. Using this approach, I have started parallelizing some interesting functions in the git version of Arb (with more to come).

Numerical integration

The rigorous numerical code in Arb combines adaptive domain splitting and degree-adaptive Gaussian quadrature. The domain splitting can certainly be parallelized, but this will involve implementing a work queue which is more than I care to do at the moment. To keep things extremely simple, I've just parallelized the function calls in the Gaussian quadrature as well as the computation of the Gauss-Legendre nodes and weights with flint_parallel_do. To see the effects, we can run the integrals.c example program:

build/examples/integrals -i 4 -prec 3333 -threads n -twice

This evaluates the test integral $\int_0^8 \sin(x+e^x) dx$ with 1000 decimals of accuracy. The -twice flag computes the same integral twice, so that we can get the timings for integration when Gauss-Legendre nodes have already been cached.

	Threads = 1	Threads = 2	Threads = 4	Threads = 8
Integral	2.413 s	1.208 s	0.671 s	0.397 s
Integral, second run	0.541 s	0.27 s	0.152 s	0.088 s

Both the nodes computation and the actual integration parallelize quite well. This will also translate to a parallel speedup for special functions like the Lerch transcendent that use numerical integration internally.

Plotting

Parallelizing the complex_plot.c example program is even easier than the integration code; the plotting code is (currently) nonadaptive and each pixel can be evaluated inependently. As a benchmark problem, we take

build/examples/complex_plot -range -20 20 -20 20 -size 1024 1024 -threads n zeta

which plots $\zeta(s)$ on $[-20,20] + [-20,20]i$ with 1 megapixel resolution.

	Threads = 1	Threads = 2	Threads = 4	Threads = 8
Complex plot	18.69 s	9.576 s	5.425 s	3.328 s

Again a nice speedup.

Pi and friends

Arb computes most mathematical constants like $\pi$, $e$, $\log(2)$, $\gamma$ and $\zeta(3)$ via binary splitting evaluation of hypergeometric or holonomic series $\sum_{n=a}^{b-1} f(n)$ with rational terms. To parallelize this process efficiently, we need to do two things:

assign the recursive computations of partial sums involving smaller numbers to separate threads, and
parallelize the work in the individual arithmetic operations on huge numbers near the root of the tree.

For (1), it is straightforward to rewrite the binary splitting functions in Arb to use flint_parallel_binary_splitting (I have now done this for most constants). For (2), I have changed functions like arb_mul to call FLINT's multithreaded integer multiplication code instead of GMP's serial mpn_mul for huge input (in fact, this also makes Arb slightly faster on a single thread, at least on my CPU).

A couple of timings:

	Threads = 1	Threads = 2	Threads = 4	Threads = 8
$\pi$, $10^6$ digits	0.223 s	0.140 s	0.118 s	0.114 s
$\pi$, $10^7$ digits	4.043 s	2.47 s	1.884 s	1.573 s
$\gamma$, $10^6$ digits	7.316 s	3.978 s	2.88 s	2.341 s

We get some nice parallel speedups, though not as nice as for the embarrassingly parallel jobs like integration and plotting. There are several reasons for this. First, there are some final full-precision operations (divisions and square roots) which are not yet based on multithreaded multiplication, so they remain serial bottlenecks for now. For million-digit computation of $\pi$ these operations account for roughly 30% of the time with 8 threads. Second, arithmetic on huge numbers is memory access-intensive, which is likely a limiting factor with multiple threads. Third, the overall scheduling is probably not optimal.

In the future, I expect to make most mathematical functions ($\exp(x)$ etc.) properly multithreaded as well at high precision, and hopefully with better efficiency; the standard mathematical constants are just the easiest to begin with.

Bernoulli numbers

In a recent post, I discussed the new algorithms in Arb for computing Bernoulli numbers $B_n$ and Euler numbers $E_n$ by combining David Harvey's multimodular method to retrieve the low bits with the Euler product to retrieve the high bits. I have now parallelized both these stages. The tasks are mostly embarrassingly parallel; some subproblems like computing $\pi$ also benefit from other improvements as described earlier, and the CRT combination stage now uses parallel binary splitting.

In addition, I have parallelized the separate code for computing the vector $B_0, \ldots, B_n$ of Bernoulli numbers simultaneously (which can be done far more efficiently than with isolated calls).

	Threads = 1	Threads = 2	Threads = 4	Threads = 8
$B_n$, $n = 10^6$	20.576 s	11.574 s	6.917 s	4.787 s
$E_n$, $n = 10^6$	227.0.48 s	120.216 s	66.074 s	40.18 s
$B_n$, vector $0 \le n \le 10^4$	0.738 s	0.451 s	0.256 s	0.166 s

I'm rather happy about being able to compute the millionth Bernoulli number in less than five seconds. Just one year ago, on a slower laptop and with a worse algorithm, it took almost four minutes!

Parallel speedup

Finally, a visual comparison of how various tasks scale with the number of threads:

In a perfect world all these graphs would be straight lines with a slope of 1. However, a 4x-6x speedup on 8 cores is not too bad. There is clearly some room for improvement, especially for the binary splitting based tasks.

fredrikj.net | Blog index | RSS feed | Follow me on Mastodon | Become a sponsor