fredrikj.net / blog /

Parallelism and scaling

August 24, 2013

Computers nowadays have lots of cores: the laptop I’m currently using has two, and the servers on which I do more expensive computations at RISC have 16 or 32. I’ve also had the opportunity to run a couple of jobs on the 2048-processor MACH computer.

Some tasks in computer algebra and computational number theory are inherently very easy to parallelize: it turns out that one has to compute 503,614 different cases of something, and then one just distributes the work to separate processes. But this is not always the case. Sometimes it comes down to a single, huge calculation, or perhaps a single job uses so much memory that there isn’t any chance to do several in parallel. To take advantage of many cores, one then needs multithreaded algorithms for single tasks.

I recently added an interface to FLINT for multithreading. It’s dead simple to use: one inserts the line

flint_set_num_threads(10);

in one’s program, and then (hopefully) gets a 10x speedup. At the moment, it doesn’t do very much, because the algorithms that take advantage of it are yet to be implemented. So far, multithreading is only supported by two functions in Arb: real matrix multiplication, and evaluation of the power sum in the Riemann zeta function when computing many derivatives.

For example, here are the timings for using fmprb_mat_mul to multiply a randomly generated 1887 x 561 matrix by a randomly generated 561 x 1045 matrix with 2050-bit floating-point ball entries (obtained on a 32-core system which was under some load at the moment):

Threads CPU time Wall time
1 1124.35 1124.79
2 1149.58 597.558
4 1157.59 310.316
8 1221.97 177.977

A smaller multiplication, (462 x 2) x (2 x 663) with 142-bit precision:

Threads CPU time Wall time
1 0.21 0.205
2 0.2 0.103
4 0.2 0.056
8 0.22 0.034

We get nearly linear scaling with a small number of cores. This is expected since the task largely is CPU-bound (multiplying arbitrary-precision floating-point numbers takes much longer time than reading them from memory). For very small matrices with small entries (where the multiplication would take less than a millisecond or so), fmprb_mat_mul does not attempt to parallelize the computations regardless of the number of requested threads, since the parallel gain would not be worth the overhead of starting the threads.

More multithreaded algorithms will surely be implemented in the future. The operations that would be most useful to parallelize in FLINT are multiplication of single huge integers and polynomials. Most other operations are based on multiplication, so this would provide a general speedup. I can’t predict when this feature will be available, though.

Another kind of scaling I’ve just implemented in Arb is automatic argument scaling for multiplication of real and complex polynomials. Instead of computing the product $P(x) Q(x)$ directly, we compute $P(2^s x) Q(2^s x)$ and then compose the result by $2^{-s} x$, where $s$ is chosen so that both input polynomials (hopefully) have slowly varying coefficients. This allows fitting the polynomials using fewer and smaller rectangular blocks when doing the subsequent exact multiplications over $\mathbb{Z}[x]$. Scaling by a power of two is cheap since it just requires adjusting exponents.

The difficulty is to choose $s$ optimally. In general, this is clearly a rather hard optimization problem, especially considering that choosing $s$ has to be extremely fast so as to not create further overhead. I settled for the simplest heuristic I could think of that should give near-optimal $s$ in most typical cases: namely to take the slopes of the $\log_2$-magnitudes between the first and last nonzero term in each polynomial, and then taking $s$ to be the average of the two slopes (actually, the average weighted by the respective lengths, since that is easier to compute, and probably makes sense anyway).

It does work quite well, especially with coefficients that grow like $n!$ or $1/n!$ (as is very common). For example, expanding the rising factorial $(\pi+x)(\pi+x+1)…(\pi+x+n-1)$ with n = 20000 and 100-bit precision previously took 16.4 seconds on my laptop and now takes 4.4 seconds, with no loss of accuracy. On some more complex computations I’ve been testing recently, I’m getting speedups between 1.2x and 2x for the polynomial stages. So far, I have not come up with a natural example where scaling results in a significant slowdown, though I suppose it wouldn’t be difficult to construct one intentionally.

Finally, here are old and new timings in (wall time) seconds for computing the first 10000 Li coefficients with Arb. The new computation is done with 10 threads for some parts of the evaluation of the Riemann zeta function, and other speedups come from various algorithmic improvements (including the scaling trick mentioned above):

Step Old New
Zeta 1272 87.5
Log 8.3 7.6
Gamma 3.4 2.3
Composition 185 3.6
Total 1469 101