
Complex floats and some big root-finding speedups

June 10, 2024

The last post reported about the addition of faster medium-precision bigfloats (nfloat) to FLINT. Today, a quick update about the complex version (nfloat_complex) which has been added as well.

Contents

Basic operations
Polynomial root-finding
Approximate linear solving
Basic operations (alternative comparisons)

Basic operations

Speedup vs acf

Here are the speedups compared to FLINT's arbitrary-precision complex float type acf for various operations applied to length-100 vectors:

prec (bits) add mul sqr mul_scalar addmul_scalar sum product dot
64 4.148x 3.683x 4.234x 3.552x 3.385x 2.465x 3.453x 1.797x
128 5.024x 3.461x 3.159x 3.596x 3.883x 3.626x 3.466x 1.599x
192 4.850x 2.164x 2.533x 2.245x 2.805x 4.062x 2.179x 2.283x
256 4.265x 2.000x 2.311x 2.000x 2.539x 3.791x 2.041x 2.079x
512 2.068x 1.750x 2.074x 1.776x 1.921x 2.518x 1.800x 1.478x
1024 1.808x 1.460x 1.563x 1.534x 1.616x 2.078x 1.490x 1.444x
2048 1.644x 1.781x 1.751x 1.819x 1.878x 1.871x 1.894x 1.330x
4096 1.538x 1.394x 1.678x 1.401x 1.439x 1.686x 1.435x 1.332x

As in the real case, we gain factors in the realm of 2x-4x at low precision. Multiplications are also asymptotically faster due to combining complex Karatsuba multiplication (doing 3 instead of 4 real multiplies) with real mulhighs (which acf currently doesn't do). Since dot products were already quite well optimized for acf, we never gain much more than a factor two on dot products, but even a small speedup there is nice!

By the way, the previous post reported that the precision for nfloat only goes up to 2112 bits; this has now been doubled to 4224 bits for both nfloat and nfloat_complex.

See the last section of this post for a few different views of this table.

Polynomial root-finding

Approximate floating-point arithmetic is useful in its own right, but my end goal is obviously to speed up rigorous numerics. One place where we can do so immediately is acb_poly_find_roots (and arb_fmpz_poly_complex_roots by extension), since this function first computes approximate roots and then certifies them.

The development version of FLINT now uses nfloat_complex here, falling back on acf only when the precision is too high. The result seems to be roughly a 2x speedup on many examples. I have also taken the opportunity to improve the selection of initial values. Here are some examples of how FLINT's performance has improved for isolating all the complex roots of integer polynomials:

Polynomial Degree (deflated) FLINT 3.1 FLINT 3.2-dev Speedup
Mignotte $x^{50} + (100x+1)^5$ 50 0.278 s 0.102 s 2.73x
Wilkinson $W_{100}(x)$ 100 1.30 s 0.52 s 2.50x
Chebyshev $T_{300}(x)$ 150 3.75 s 1.44 s 2.60x
Exponential $\sum_{i=0}^{256} x^i / i!$ 256 8.397 s 2.845 s 2.95x
Cyclotomic $\Phi_{777}(x)$ 432 28.0 s 0.65 s 43.1x
Bernoulli $B_{640}(x)$ 640 114.1 s 20.2 s 5.65x
Easy $\sum_{i=0}^{1000} (i+1) x^i$ 1000 4134 s 4.31 s 959x

A 1000-fold speedup for a degree-1000 polynomial! Well, only a factor two comes from using nfloat_complex arithmetic; the rest is due to the initial value selection.

There is much more left to improve with regard to root-finding in FLINT, but I'm probably going to leave this topic alone for the time being. Obvious things to look at are precision management, adaptive termination, and cluster detection. I should mention that I have implemented the Aberth iteration as an alternative to the current Weierstrass-Durand-Kerner method, but so far I've found that Aberth performs worse on most examples: its faster convergence is offset by the slower evaluation. I have also experimented with parallel versions, but the speedup is disappointing on a modest number of cores (the obvious ways to parallelize these algorithms slow down the convergence).

Approximate linear solving

Some timings, in seconds, for solving a 100 by 100 linear system $Ax = b$ with complex coefficients:

prec (bits) acf nfloat_complex Speedup
64 0.013 0.007 1.877x
128 0.015 0.008 1.867x
192 0.032 0.014 2.360x
256 0.035 0.017 2.089x
512 0.060 0.036 1.646x
1024 0.175 0.109 1.606x
2048 0.380 0.292 1.301x
4096 1.341 0.943 1.422x

As in the real case, this will be more interesting when there is fast matrix multiplication. Just like polynomial root-finding, approximate solving can be used to accelerate certified linear solving, but I have not yet implemented the switch to nfloat_complex here. The same goes for eigenvalues.

Basic operations (alternative comparisons)

Speedup versus GNU MPC

And here are the speedups compared to GNU MPC:

prec (bits) add mul sqr mul_scalar addmul_scalar sum product dot
64 3.594x 6.514x 9.771x 6.528x 5.182x 2.000x 9.048x 14.044x
128 2.740x 6.390x 9.028x 6.733x 5.452x 1.904x 8.221x 11.533x
192 3.210x 4.097x 6.713x 4.507x 4.061x 2.635x 5.330x 6.183x
256 2.824x 3.781x 6.473x 4.156x 3.778x 2.458x 4.743x 5.061x
512 1.426x 3.054x 4.864x 3.156x 2.757x 1.655x 3.663x 3.133x
1024 1.366x 2.152x 2.858x 2.192x 2.134x 1.534x 2.314x 2.234x
2048 1.395x 2.205x 3.797x 2.225x 2.238x 1.537x 2.304x 2.292x
4096 1.382x 1.593x 2.836x 1.643x 1.649x 1.466x 1.637x 1.619x

The mpc type has faster addition than acf but slower multiplication. nfloat_complex does much better on both operations.

Cost compared to real arithmetic

Vectorized nfloat_complex operations are this much slower than those for nfloat:

prec (bits) add mul sqr mul_scalar addmul_scalar sum product dot
64 1.972x 9.660x 10.515x 7.766x 5.934x 2.244x 4.033x 4.646x
128 2.027x 8.560x 8.698x 4.860x 3.975x 2.056x 3.996x 4.172x
192 1.961x 7.030x 4.900x 2.472x 3.695x 2.222x 5.668x 3.812x
256 2.030x 6.300x 4.756x 2.350x 3.160x 2.138x 5.469x 4.010x
512 2.123x 5.429x 5.206x 2.139x 2.419x 2.000x 5.335x 4.071x
1024 2.057x 3.892x 3.659x 2.010x 2.082x 2.158x 3.606x 3.731x
2048 2.008x 3.524x 3.479x 2.008x 2.046x 2.115x 3.480x 3.390x
4096 2.089x 3.298x 3.355x 2.002x 2.014x 2.085x 3.278x 3.212x

Complex addition is almost exactly 2x as expensive as real addition, no surprise there. Complex dot products are roughly 4x as expensive as real dot products at low precision, where we have basecase multiplication, approaching 3x at high precision as complex Karatsuba multiplication kicks in. Plain multiplications at low precision are much more than 4x as expensive as their real counterparts since the additions are quite costly; more inlined vector code could probably help a bit here.

Operations per second

Some absolute performance numbers: complex Mflop/s (each complex + or × counting as one operation, with each addmul or dot product term counting as two operations):

prec (bits) add mul sqr mul_scalar addmul_scalar sum product dot
64 177.0 71.9 110.0 70.9 91.3 100.3 68.5 242.4
128 125.0 49.5 70.4 50.0 65.1 88.5 48.5 144.9
192 102.1 27.2 47.2 27.0 41.4 80.6 27.2 63.9
256 87.0 23.1 39.2 22.7 35.3 72.5 22.4 49.8
512 37.6 11.6 19.4 11.5 17.7 42.7 11.8 20.4
1024 29.0 3.4 5.5 3.5 6.3 32.3 3.5 6.6
2048 21.2 1.2 2.2 1.2 2.3 22.8 1.2 2.4
4096 12.3 0.4 0.7 0.4 0.7 14.6 0.4 0.7

(The difference between add and sum shows the advantage of inlining inside vector operations: the former uses inlined code while the latter currently does not.)


