
# Complex floats and some big root-finding speedups

*June 10, 2024*

The last post
reported on the addition of faster
medium-precision bigfloats (`nfloat`) to FLINT.
Today, a quick update about the complex version (`nfloat_complex`), which has
been added as well.


## Basic operations

### Speedup vs acf

Here are the speedups compared to FLINT's arbitrary-precision complex float type `acf` for various operations applied to length-100 vectors:

prec (bits) | add | mul | sqr | mul_scalar | addmul_scalar | sum | product | dot |
---|---|---|---|---|---|---|---|---|
64 | 4.148x | 3.683x | 4.234x | 3.552x | 3.385x | 2.465x | 3.453x | 1.797x |
128 | 5.024x | 3.461x | 3.159x | 3.596x | 3.883x | 3.626x | 3.466x | 1.599x |
192 | 4.850x | 2.164x | 2.533x | 2.245x | 2.805x | 4.062x | 2.179x | 2.283x |
256 | 4.265x | 2.000x | 2.311x | 2.000x | 2.539x | 3.791x | 2.041x | 2.079x |
512 | 2.068x | 1.750x | 2.074x | 1.776x | 1.921x | 2.518x | 1.800x | 1.478x |
1024 | 1.808x | 1.460x | 1.563x | 1.534x | 1.616x | 2.078x | 1.490x | 1.444x |
2048 | 1.644x | 1.781x | 1.751x | 1.819x | 1.878x | 1.871x | 1.894x | 1.330x |
4096 | 1.538x | 1.394x | 1.678x | 1.401x | 1.439x | 1.686x | 1.435x | 1.332x |

As in the real case, we gain factors in the realm of 2x-4x at low precision.
Multiplications are also asymptotically faster thanks to combining complex Karatsuba multiplication (three real multiplications instead of four)
with real mulhighs (which `acf` currently does not use).
Since dot products were already quite well optimized for `acf`,
we never gain much more than a factor of two on dot products, but even a small speedup there is nice!
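The three-multiplication trick mentioned above is the classical complex Karatsuba (Gauss) scheme. A minimal Python sketch of the arithmetic — FLINT's actual implementation of course works on bigfloat limbs with mulhigh truncation, not machine numbers:

```python
def complex_mul_karatsuba(a, b, c, d):
    """Compute (a + b*i) * (c + d*i) with three real multiplications
    instead of the schoolbook four."""
    t1 = a * c
    t2 = b * d
    t3 = (a + b) * (c + d)
    # real part: a*c - b*d; imaginary part: a*d + b*c = t3 - t1 - t2
    return (t1 - t2, t3 - t1 - t2)
```

The price is a few extra additions, which is why the trick mainly pays off once real multiplications dominate the cost.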

By the way, the previous post reported that the precision for `nfloat` only goes up to 2112 bits; this has now
been doubled to 4224 bits for both `nfloat` and `nfloat_complex`.

See the last section of this post for a few different views of this table.

## Polynomial root-finding

Approximate floating-point arithmetic is useful in its own right, but my end goal
is obviously to speed up rigorous numerics.
One place where we can do so immediately
is `acb_poly_find_roots`
(and `arb_fmpz_poly_complex_roots` by extension), since this
function first computes approximate roots
and then certifies them.
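FLINT's certification is interval-based, but the flavor of turning an approximate root into a rigorous statement can be illustrated with a classical bound: for a degree-$n$ polynomial $p$ with $p'(z) \neq 0$, at least one root lies within distance $n\,|p(z)/p'(z)|$ of $z$ (this follows from $p'(z)/p(z) = \sum_i 1/(z - r_i)$). A Python sketch of that bound — illustrative only, not FLINT's actual test:

```python
def inclusion_radius(coeffs, z):
    """Radius n*|p(z)/p'(z)| of a disc around z guaranteed to contain
    at least one root of p(x) = coeffs[0] + coeffs[1]*x + ..."""
    n = len(coeffs) - 1
    p = dp = 0j
    for c in reversed(coeffs):   # Horner for p and p' simultaneously
        dp = dp * z + p
        p = p * z + c
    return n * abs(p / dp)
```

In a rigorous implementation the evaluation itself must be done in ball arithmetic so that rounding errors are accounted for.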

The development version of FLINT now uses `nfloat_complex`
here, falling back on `acf` only when the precision is too high.
The result seems to be roughly a 2x speedup on many examples.
I have also taken the opportunity to improve the selection of initial values.
Here are some examples of how FLINT's performance has improved
for isolating all the complex roots of integer polynomials:

Polynomial | Degree (deflated) | FLINT 3.1 | FLINT 3.2-dev | Speedup |
---|---|---|---|---|
Mignotte $x^{50} + (100x+1)^5$ | 50 | 0.278 s | 0.102 s | 2.73x |
Wilkinson $W_{100}(x)$ | 100 | 1.30 s | 0.52 s | 2.50x |
Chebyshev $T_{300}(x)$ | 150 | 3.75 s | 1.44 s | 2.60x |
Exponential $\sum_{i=0}^{256} x^i / i!$ | 256 | 8.397 s | 2.845 s | 2.95x |
Cyclotomic $\Phi_{777}(x)$ | 432 | 28.0 s | 0.65 s | 43.1x |
Bernoulli $B_{640}(x)$ | 640 | 114.1 s | 20.2 s | 5.65x |
Easy $\sum_{i=0}^{1000} (i+1) x^i$ | 1000 | 4134 s | 4.31 s | 959x |

A 1000-fold speedup for a degree-1000 polynomial!
Well, only a factor two comes from using
`nfloat_complex` arithmetic; the rest
is due to the initial value selection.

There is much more left to improve with regard to root-finding in FLINT, but I'm probably going to leave this topic alone for the time being. Obvious things to look at are precision management, adaptive termination, and cluster detection.

I should mention that I have implemented the Aberth iteration as an alternative to the current Weierstrass-Durand-Kerner method, but so far I've found that Aberth performs worse on most examples: its faster convergence is offset by slower evaluation. I have also experimented with parallel versions, but the speedup is disappointing on a modest number of cores (the obvious ways to parallelize these algorithms slow down convergence).
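For reference, the Weierstrass-Durand-Kerner method updates all root approximations simultaneously. A toy machine-precision Python version — FLINT's implementation works on bigfloats with the improved initial value selection mentioned above, so the naive circle of starting points and the tolerance here are purely illustrative:

```python
import cmath

def durand_kerner(coeffs, maxiter=100, tol=1e-12):
    """Approximate all roots of p(x) = coeffs[0] + coeffs[1]*x + ...,
    with leading coefficient coeffs[-1] != 0."""
    n = len(coeffs) - 1
    monic = [c / coeffs[-1] for c in coeffs]

    def p(x):  # Horner evaluation of the monic polynomial
        r = 0j
        for c in reversed(monic):
            r = r * x + c
        return r

    # naive initial values: points spread on a circle
    z = [cmath.exp(2j * cmath.pi * (k + 0.25) / n) for k in range(n)]
    for _ in range(maxiter):
        w = []
        for i in range(n):
            denom = 1 + 0j
            for j in range(n):
                if j != i:
                    denom *= z[i] - z[j]
            # Newton step with the other approximations dividing out
            w.append(z[i] - p(z[i]) / denom)
        if max(abs(a - b) for a, b in zip(w, z)) < tol:
            return w
        z = w
    return z
```

Each sweep costs $O(n^2)$ operations for the pairwise differences plus the polynomial evaluations, and typically all roots converge together.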

## Approximate linear solving

Some timings for solving a $100 \times 100$ linear system $Ax = b$ with complex coefficients:

prec (bits) | acf (s) | nfloat_complex (s) | Speedup |
---|---|---|---|
64 | 0.013 | 0.007 | 1.877x |
128 | 0.015 | 0.008 | 1.867x |
192 | 0.032 | 0.014 | 2.360x |
256 | 0.035 | 0.017 | 2.089x |
512 | 0.060 | 0.036 | 1.646x |
1024 | 0.175 | 0.109 | 1.606x |
2048 | 0.380 | 0.292 | 1.301x |
4096 | 1.341 | 0.943 | 1.422x |

As in the real case, this will become more interesting once fast
matrix multiplication is available.
Just like polynomial root-finding, approximate solving can be used
to accelerate certified linear solving,
but I have not yet implemented the switch to `nfloat_complex` here.
The same goes for eigenvalues.
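The way an approximate solution feeds into certified solving: compute $\tilde{x}$ fast in floating point, then rigorously bound the error from the residual $b - A\tilde{x}$ (together with a bound on $\|A^{-1}\|$). A tiny $2 \times 2$ illustration — exact rationals stand in for the ball arithmetic FLINT actually uses, and the function names are made up for this sketch:

```python
from fractions import Fraction

def approx_solve_2x2(A, b):
    """Fast, unverified float solve of a 2x2 system by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def exact_residual(A, b, x):
    """Residual b - A*x computed exactly; a small residual combined
    with a bound on ||A^-1|| certifies the approximate solution."""
    return [Fraction(b[i]) - sum(Fraction(A[i][j]) * Fraction(x[j])
                                 for j in range(2))
            for i in range(2)]
```

The point is that the expensive arithmetic happens in the fast approximate solve; the certification step only needs a residual and a norm bound.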

## Basic operations (alternative comparisons)

### Speedup versus GNU MPC

And here are the speedups compared to GNU MPC:

prec (bits) | add | mul | sqr | mul_scalar | addmul_scalar | sum | product | dot |
---|---|---|---|---|---|---|---|---|
64 | 3.594x | 6.514x | 9.771x | 6.528x | 5.182x | 2.000x | 9.048x | 14.044x |
128 | 2.740x | 6.390x | 9.028x | 6.733x | 5.452x | 1.904x | 8.221x | 11.533x |
192 | 3.210x | 4.097x | 6.713x | 4.507x | 4.061x | 2.635x | 5.330x | 6.183x |
256 | 2.824x | 3.781x | 6.473x | 4.156x | 3.778x | 2.458x | 4.743x | 5.061x |
512 | 1.426x | 3.054x | 4.864x | 3.156x | 2.757x | 1.655x | 3.663x | 3.133x |
1024 | 1.366x | 2.152x | 2.858x | 2.192x | 2.134x | 1.534x | 2.314x | 2.234x |
2048 | 1.395x | 2.205x | 3.797x | 2.225x | 2.238x | 1.537x | 2.304x | 2.292x |
4096 | 1.382x | 1.593x | 2.836x | 1.643x | 1.649x | 1.466x | 1.637x | 1.619x |

The `mpc` type has faster addition than `acf` but slower multiplication. `nfloat_complex` does much better on both operations.

### Cost compared to real arithmetic

Vectorized `nfloat_complex` operations are this much slower than those for `nfloat`:

prec (bits) | add | mul | sqr | mul_scalar | addmul_scalar | sum | product | dot |
---|---|---|---|---|---|---|---|---|
64 | 1.972x | 9.660x | 10.515x | 7.766x | 5.934x | 2.244x | 4.033x | 4.646x |
128 | 2.027x | 8.560x | 8.698x | 4.860x | 3.975x | 2.056x | 3.996x | 4.172x |
192 | 1.961x | 7.030x | 4.900x | 2.472x | 3.695x | 2.222x | 5.668x | 3.812x |
256 | 2.030x | 6.300x | 4.756x | 2.350x | 3.160x | 2.138x | 5.469x | 4.010x |
512 | 2.123x | 5.429x | 5.206x | 2.139x | 2.419x | 2.000x | 5.335x | 4.071x |
1024 | 2.057x | 3.892x | 3.659x | 2.010x | 2.082x | 2.158x | 3.606x | 3.731x |
2048 | 2.008x | 3.524x | 3.479x | 2.008x | 2.046x | 2.115x | 3.480x | 3.390x |
4096 | 2.089x | 3.298x | 3.355x | 2.002x | 2.014x | 2.085x | 3.278x | 3.212x |

Complex addition is almost exactly 2x as expensive as real addition, no surprise there. Complex dot products are roughly 4x as expensive as real dot products at low precision, where we have basecase multiplication, approaching 3x at high precision as we start depending on complex Karatsuba multiplication. Plain multiplications at low precision are much more than 4x as expensive as their real counterparts since the additions are quite costly; more inlined vector code could probably help a bit here.

### Operations per second

Some absolute performance numbers: complex Mflop/s (each complex + or × counting as one operation, with each addmul or dot product term counting as two operations):

prec (bits) | add | mul | sqr | mul_scalar | addmul_scalar | sum | product | dot |
---|---|---|---|---|---|---|---|---|
64 | 177.0 | 71.9 | 110.0 | 70.9 | 91.3 | 100.3 | 68.5 | 242.4 |
128 | 125.0 | 49.5 | 70.4 | 50.0 | 65.1 | 88.5 | 48.5 | 144.9 |
192 | 102.1 | 27.2 | 47.2 | 27.0 | 41.4 | 80.6 | 27.2 | 63.9 |
256 | 87.0 | 23.1 | 39.2 | 22.7 | 35.3 | 72.5 | 22.4 | 49.8 |
512 | 37.6 | 11.6 | 19.4 | 11.5 | 17.7 | 42.7 | 11.8 | 20.4 |
1024 | 29.0 | 3.4 | 5.5 | 3.5 | 6.3 | 32.3 | 3.5 | 6.6 |
2048 | 21.2 | 1.2 | 2.2 | 1.2 | 2.3 | 22.8 | 1.2 | 2.4 |
4096 | 12.3 | 0.4 | 0.7 | 0.4 | 0.7 | 14.6 | 0.4 | 0.7 |

(The difference between `add` and `sum` shows the advantage of inlining inside vector operations; the former uses
inlined code while the latter currently does not.)
