Complex floats and some big root-finding speedups
June 10, 2024
The last post reported about the addition of faster medium-precision bigfloats (nfloat) to FLINT. Today, a quick update about the complex version (nfloat_complex) which has been added as well.
Basic operations
Speedup vs acf
Here are the speedups compared to FLINT's arbitrary-precision complex float type acf for various operations applied to length-100 vectors:
prec (bits) | add | mul | sqr | mul_scalar | addmul_scalar | sum | product | dot |
---|---|---|---|---|---|---|---|---|
64 | 4.148x | 3.683x | 4.234x | 3.552x | 3.385x | 2.465x | 3.453x | 1.797x |
128 | 5.024x | 3.461x | 3.159x | 3.596x | 3.883x | 3.626x | 3.466x | 1.599x |
192 | 4.850x | 2.164x | 2.533x | 2.245x | 2.805x | 4.062x | 2.179x | 2.283x |
256 | 4.265x | 2.000x | 2.311x | 2.000x | 2.539x | 3.791x | 2.041x | 2.079x |
512 | 2.068x | 1.750x | 2.074x | 1.776x | 1.921x | 2.518x | 1.800x | 1.478x |
1024 | 1.808x | 1.460x | 1.563x | 1.534x | 1.616x | 2.078x | 1.490x | 1.444x |
2048 | 1.644x | 1.781x | 1.751x | 1.819x | 1.878x | 1.871x | 1.894x | 1.330x |
4096 | 1.538x | 1.394x | 1.678x | 1.401x | 1.439x | 1.686x | 1.435x | 1.332x |
As in the real case, we gain factors in the realm of 2x-4x at low precision. Multiplications are also asymptotically faster due to combining complex Karatsuba multiplication (doing 3 instead of 4 real multiplies) with real mulhighs (which acf currently doesn't do). Since dot products were already quite well optimized for acf, we never gain much more than a factor two on dot products, but even a small speedup there is nice!
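The 3-multiplication trick mentioned above (often credited to Karatsuba or Gauss) trades one real multiplication for a few extra additions, which pays off when multiplications dominate. A minimal sketch, operating on plain Python numbers rather than bigfloats (the actual FLINT code additionally combines this with mulhigh, which this sketch does not show):

```python
def mul_complex_3m(a, b, c, d):
    """Compute (a + b*i) * (c + d*i) using 3 real multiplications
    instead of the naive 4, at the cost of extra additions."""
    p1 = a * c
    p2 = b * d
    p3 = (a + b) * (c + d)
    # real part: a*c - b*d; imaginary part: a*d + b*c = p3 - p1 - p2
    return (p1 - p2, p3 - p1 - p2)

# (1 + 2i) * (3 + 4i) = -5 + 10i
print(mul_complex_3m(1.0, 2.0, 3.0, 4.0))
```

For bigfloats the additions are cheap relative to multiplications, so the asymptotic cost of a complex product approaches 3 real products, consistent with the roughly 3x (rather than 4x) complex/real ratio in the tables below.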
By the way, the previous post reported that the precision for nfloat only goes up to 2112 bits; this has now been doubled to 4224 bits for both nfloat and nfloat_complex.
See the last section of this post for a few different views of this table.
Polynomial root-finding
Approximate floating-point arithmetic is useful in its own right, but my end goal is obviously to speed up rigorous numerics. One place where we can do so immediately is acb_poly_find_roots (and arb_fmpz_poly_complex_roots by extension), since this function first computes approximate roots and then certifies them.
The development version of FLINT now uses nfloat_complex here, falling back on acf only when the precision is too high. The result seems to be roughly a 2x speedup on many examples. I have also taken the opportunity to improve the selection of initial values. Here are some examples of how FLINT's performance has improved for isolating all the complex roots of integer polynomials:
Polynomial | Degree (deflated) | FLINT 3.1 | FLINT 3.2-dev | Speedup |
---|---|---|---|---|
Mignotte $x^{50} + (100x+1)^5$ | 50 | 0.278 s | 0.102 s | 2.73x |
Wilkinson $W_{100}(x)$ | 100 | 1.30 s | 0.52 s | 2.50x |
Chebyshev $T_{300}(x)$ | 150 | 3.75 s | 1.44 s | 2.60x |
Exponential $\sum_{i=0}^{256} x^i / i!$ | 256 | 8.397 s | 2.845 s | 2.95x |
Cyclotomic $\Phi_{777}(x)$ | 432 | 28.0 s | 0.65 s | 43.1x |
Bernoulli $B_{640}(x)$ | 640 | 114.1 s | 20.2 s | 5.65x |
Easy $\sum_{i=0}^{1000} (i+1) x^i$ | 1000 | 4134 s | 4.31 s | 959x |
A 1000-fold speedup for a degree-1000 polynomial! Well, only a factor two comes from using nfloat_complex arithmetic; the rest is due to the initial value selection.
There is much more left to improve with regard to root-finding in FLINT, but I'm probably going to leave this topic alone for the time being. Obvious things to look at are precision management, adaptive termination, and cluster detection. I should mention that I have implemented the Aberth iteration as an alternative to the current Weierstrass-Durand-Kerner method, but so far I've found that Aberth performs worse on most examples since its faster convergence is offset by slower evaluation. I have also experimented with parallel versions, but the speedup is disappointing on a modest number of cores (the obvious ways to parallelize these algorithms slow down convergence).
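For readers unfamiliar with the Weierstrass-Durand-Kerner method mentioned above, here is a minimal Python sketch. It uses the classic textbook initial guesses (powers of 0.4 + 0.9i), not FLINT's improved initial value selection, and omits the certification step entirely:

```python
def durand_kerner(coeffs, iters=100, tol=1e-12):
    """Approximate all complex roots of c0 + c1*x + ... + cn*x^n
    with the Weierstrass-Durand-Kerner simultaneous iteration.
    coeffs: [c0, c1, ..., cn] with cn != 0."""
    n = len(coeffs) - 1
    lead = coeffs[-1]

    def p(x):
        # Horner evaluation of the polynomial
        s = 0.0
        for c in reversed(coeffs):
            s = s * x + c
        return s

    # classic textbook starting points: powers of 0.4 + 0.9i
    # (FLINT's actual initial-value selection is more elaborate)
    roots = [(0.4 + 0.9j) ** k for k in range(n)]
    for _ in range(iters):
        new = []
        for i, zi in enumerate(roots):
            # Weierstrass correction: p(zi) / (cn * prod_{j != i} (zi - zj))
            denom = lead
            for j, zj in enumerate(roots):
                if j != i:
                    denom *= zi - zj
            new.append(zi - p(zi) / denom)
        done = max(abs(a - b) for a, b in zip(new, roots)) < tol
        roots = new
        if done:
            break
    return roots
```

For example, `durand_kerner([-1.0, 0.0, 1.0])` (that is, $x^2 - 1$) converges to approximations of $\pm 1$ in a couple of iterations. A rigorous solver would follow this up by certifying each approximate root with interval arithmetic, as acb_poly_find_roots does.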
Approximate linear solving
Some timings for solving a 100 by 100 linear system $Ax = b$ with complex coefficients:
prec (bits) | acf | nfloat_complex | Speedup |
---|---|---|---|
64 | 0.013 | 0.007 | 1.877x |
128 | 0.015 | 0.008 | 1.867x |
192 | 0.032 | 0.014 | 2.360x |
256 | 0.035 | 0.017 | 2.089x |
512 | 0.060 | 0.036 | 1.646x |
1024 | 0.175 | 0.109 | 1.606x |
2048 | 0.380 | 0.292 | 1.301x |
4096 | 1.341 | 0.943 | 1.422x |
As in the real case, this will be more interesting when there is fast matrix multiplication. Just like polynomial root-finding, approximate solving can be used to accelerate certified linear solving, but I have not yet implemented the switch to nfloat_complex here. The same goes for eigenvalues.
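The idea of accelerating certified solving with an approximate solve can be illustrated with the standard residual-based a posteriori bound: if $R$ is an approximate inverse of $A$ with $\|I - RA\| \le \beta < 1$, then $\|x - \hat{x}\| \le \|R(b - A\hat{x})\| / (1 - \beta)$. The sketch below is not FLINT's implementation; it uses NumPy for illustration, and a truly rigorous version would evaluate the norms with interval arithmetic or directed rounding rather than plain floating point:

```python
import numpy as np

def solve_certified(A, b):
    """Solve A x = b approximately, then compute an error bound from the
    residual. NOTE: the bound itself is evaluated in ordinary floating
    point here; a rigorous implementation would use interval arithmetic."""
    xhat = np.linalg.solve(A, b)      # approximate solution (fast path)
    R = np.linalg.inv(A)              # approximate inverse
    beta = np.linalg.norm(np.eye(len(b)) - R @ A, np.inf)
    if beta >= 1:
        raise ValueError("could not certify (A may be ill-conditioned)")
    residual = b - A @ xhat
    err = np.linalg.norm(R @ residual, np.inf) / (1 - beta)
    return xhat, err
```

The expensive steps (the solve and the approximate inverse) can run entirely in fast approximate arithmetic such as nfloat_complex; only the final bound needs rigorous evaluation.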
Basic operations (alternative comparisons)
Speedup versus GNU MPC
And here are the speedups compared to GNU MPC:
prec (bits) | add | mul | sqr | mul_scalar | addmul_scalar | sum | product | dot |
---|---|---|---|---|---|---|---|---|
64 | 3.594x | 6.514x | 9.771x | 6.528x | 5.182x | 2.000x | 9.048x | 14.044x |
128 | 2.740x | 6.390x | 9.028x | 6.733x | 5.452x | 1.904x | 8.221x | 11.533x |
192 | 3.210x | 4.097x | 6.713x | 4.507x | 4.061x | 2.635x | 5.330x | 6.183x |
256 | 2.824x | 3.781x | 6.473x | 4.156x | 3.778x | 2.458x | 4.743x | 5.061x |
512 | 1.426x | 3.054x | 4.864x | 3.156x | 2.757x | 1.655x | 3.663x | 3.133x |
1024 | 1.366x | 2.152x | 2.858x | 2.192x | 2.134x | 1.534x | 2.314x | 2.234x |
2048 | 1.395x | 2.205x | 3.797x | 2.225x | 2.238x | 1.537x | 2.304x | 2.292x |
4096 | 1.382x | 1.593x | 2.836x | 1.643x | 1.649x | 1.466x | 1.637x | 1.619x |
The mpc type has faster addition than acf but slower multiplication. nfloat_complex does much better on both operations.
Cost compared to real arithmetic
Vectorized nfloat_complex operations are this much slower than those for nfloat:
prec (bits) | add | mul | sqr | mul_scalar | addmul_scalar | sum | product | dot |
---|---|---|---|---|---|---|---|---|
64 | 1.972x | 9.660x | 10.515x | 7.766x | 5.934x | 2.244x | 4.033x | 4.646x |
128 | 2.027x | 8.560x | 8.698x | 4.860x | 3.975x | 2.056x | 3.996x | 4.172x |
192 | 1.961x | 7.030x | 4.900x | 2.472x | 3.695x | 2.222x | 5.668x | 3.812x |
256 | 2.030x | 6.300x | 4.756x | 2.350x | 3.160x | 2.138x | 5.469x | 4.010x |
512 | 2.123x | 5.429x | 5.206x | 2.139x | 2.419x | 2.000x | 5.335x | 4.071x |
1024 | 2.057x | 3.892x | 3.659x | 2.010x | 2.082x | 2.158x | 3.606x | 3.731x |
2048 | 2.008x | 3.524x | 3.479x | 2.008x | 2.046x | 2.115x | 3.480x | 3.390x |
4096 | 2.089x | 3.298x | 3.355x | 2.002x | 2.014x | 2.085x | 3.278x | 3.212x |
Complex addition is almost exactly 2x as expensive as real addition, no surprise there. Complex dot products are roughly 4x as expensive as real dot products at low precision where we use basecase multiplication, approaching 3x at high precision as we start relying on complex Karatsuba multiplication. Plain multiplications at low precision are much more than 4x as expensive as their real counterparts since the additions are quite costly; more inlined vector code could probably help a bit here.
Operations per second
Some absolute performance numbers: complex Mflop/s (each complex + or × counts as one operation, and each addmul or dot product term counts as two operations):
prec (bits) | add | mul | sqr | mul_scalar | addmul_scalar | sum | product | dot |
---|---|---|---|---|---|---|---|---|
64 | 177.0 | 71.9 | 110.0 | 70.9 | 91.3 | 100.3 | 68.5 | 242.4 |
128 | 125.0 | 49.5 | 70.4 | 50.0 | 65.1 | 88.5 | 48.5 | 144.9 |
192 | 102.1 | 27.2 | 47.2 | 27.0 | 41.4 | 80.6 | 27.2 | 63.9 |
256 | 87.0 | 23.1 | 39.2 | 22.7 | 35.3 | 72.5 | 22.4 | 49.8 |
512 | 37.6 | 11.6 | 19.4 | 11.5 | 17.7 | 42.7 | 11.8 | 20.4 |
1024 | 29.0 | 3.4 | 5.5 | 3.5 | 6.3 | 32.3 | 3.5 | 6.6 |
2048 | 21.2 | 1.2 | 2.2 | 1.2 | 2.3 | 22.8 | 1.2 | 2.4 |
4096 | 12.3 | 0.4 | 0.7 | 0.4 | 0.7 | 14.6 | 0.4 | 0.7 |
(The difference between add and sum shows the advantage of inlining inside vector operations: the former uses inlined code while the latter currently does not.)