fredrikj.net / blog /

Faster, vector-friendly bigfloats in FLINT

May 19, 2024

Integer arithmetic mod multi-word $m$ recently got a big speed boost in FLINT with the addition of a flat, vector-friendly representation. We are now adding a similar representation for multi-word floating-point numbers to accelerate numerical computation.

Early results are promising. For example, solving a size-100 linear system with 256-bit precision, we currently achieve a 5.6x speedup over MPFR, a 5.1x speedup over QD and a 2.3x speedup over FLINT's existing arbitrary-precision floats (arf).

Contents

Data layout and features

An nfloat (tentative name) has a $64n$-bit mantissa where $n$ is a fixed number of words, currently in the range $1 \le n \le 33$. Thus in principle we have a list of floating-point types

all the way up to nfloat2112, though naturally the precision can be a runtime parameter. The actual size of an element is two words larger than the precision, as two header words are used to store the exponent and sign.

Key points:

Some benchmarks

I will compare performance on some basic operations against GMP's mpf, MPFR's mpfr FLINT's arbitrary-precision floating-point type arf and the ball type arb. For these benchmarks I created straightforward FLINT generics wrappers for mpf and mpfr.

I will also throw in double-double (dd) arithmetic with 106-bit precision, and quad-double (qd) arithmetic with 212-precision. For simplicity we will list these as having 128-bit and 256-bit precision respectively. I have a preliminary implementation of dd arithmetic for FLINT's generics, with inlined code for the basic vector operations to let GCC auto-vectorize things with AVX2. For qd I have just wrapped the QD library by David Bailey et al.; have in mind that this might not be optimally vectorized.

Everything here is on a AMD Ryzen 7 PRO 5850U (Zen3), single-threaded.

Addition and multiplication

On these benchmarks, we give performance numbers in Mflop/s (higher is better), first off for computing $r \gets x + y$ where $x$ and $y$ are two length-100 vectors, calling _gr_vec_add(r, x, y, 100, ctx). Precisions are 64, 128, ..., 2048 bits.

prec mpf mpfr arf arb nfloat dd/qd
64 57.1 100.0 78.1 52.1 343.6 -
128 56.5 94.3 49.5 38.0 263.9 763.4
192 55.6 64.5 42.2 32.6 210.1 -
256 55.6 62.1 40.2 32.6 180.2 60.2
512 53.5 54.6 37.5 30.9 78.7 -
1024 47.2 44.1 32.4 26.9 58.5 -
2048 39.4 31.0 25.3 21.8 42.2 -

Corresponding figures computing the elementwise product $r \gets x y$, calling _gr_vec_mul(r, x, y, 100, ctx).

prec mpf mpfr arf arb nfloat dd/qd
64 75.2 90.1 74.1 64.9 699.3 -
128 57.1 55.9 73.0 56.2 421.9 1207.7
192 50.8 39.5 63.7 37.2 187.3 -
256 40.0 34.6 57.1 34.5 141.0 26.7
512 21.9 20.5 29.9 22.7 63.3 -
1024 7.46 8.77 10.0 8.93 13.6 -
2048 2.49 3.70 3.13 3.37 4.41 -

Some nice speedups overall compared to all the arbitrary-precision types. We can observe that nfloat128 comes within a factor three of double-double arithmetic while nfloat256 is three times faster than quad-double.

It is interesting to note that nfloat additions are much slower than multiplications at lower precision, due to having very branchy code for different sign and exponent combinations. If someone knows how to do better, I'd be interested! Multiplications are extremely fast thanks to Albin Ahlb├Ąck's mulhigh assembly code in FLINT.

Multiply-add and dot product

Next, some results for the composite operation computing $r \gets r + x c $ where $r$ and $x$ are two length-100 vectors and $c$ is a fixed scalar, i.e. calling _gr_vec_addmul_scalar(r, x, 100, c, ctx). As above, performance numbers are given in Mflop/s. Note that each term here counts as two flops (one multiplication and one addition).

prec mpf mpfr arf arb nfloat dd/qd
64 70.9 120.5 77.5 58.0 680.3 -
128 59.7 91.3 57.6 45.7 389.9 854.7
192 54.8 67.1 47.8 34.0 256.4 -
256 50.0 61.0 45.1 31.7 188.7 31.0
512 31.7 37.1 31.5 25.0 81.6 -
1024 13.2 16.1 14.3 12.3 23.3 -
2048 4.69 6.97 5.31 5.60 8.10 -

Finally, results for the dot product $s \gets \sum_i x_i y_i $ where $x$ and $y$ are two length-100 vectors i.e. calling _gr_vec_dot(s, NULL, 0, x, y, 100, ctx).

prec mpf mpfr arf arb nfloat dd/qd
64 69.0 88.1 590.0 266.3 1129.9 -
128 58.3 69.2 416.7 202.2 671.1 888.9
192 54.2 48.1 125.8 102.0 305.3 -
256 49.6 44.0 111.7 92.6 222.0 31.3
512 31.6 29.0 58.3 52.2 85.8 -
1024 12.9 13.9 19.4 18.5 25.4 -
2048 4.55 6.33 6.85 6.85 8.40 -

Dot products are where we get the most performance for the buck, due to the amortized cost of additions. FLINT attempts to express other operations in terms of dot products rather than vector-scalar updates when possible for this reason. Dot products were already quite well optimized for arf and arb, and nfloat uses essentially the same algorithm but simply goes faster since there is a lot less overhead. It is worth noting that we are not so far behind dd for double-word precision!

Linear solving

For a more high-level benchmark, here is the time in seconds (lower is better) to solve $Ax = b$ where $A$ is a random well-conditioned $100 \times 100$ matrix and $b$ is a random vector, using Gaussian elimination (calling gr_mat_nonsingular_solve_lu):

prec mpf mpfr arf arb nfloat dd/qd
64 0.015 0.013 0.00356 - 0.00221 -
128 0.0154 0.0183 0.00425 - 0.00253 0.00193
192 0.0163 0.0225 0.00921 - 0.0036 -
256 0.0177 0.0243 0.0101 - 0.00435 0.0223
512 0.0255 0.0311 0.0163 - 0.00943 -
1024 0.0551 0.0546 0.044 - 0.00278 -
2048 0.15 0.115 0.0961 - 0.082 -

I have omitted arb since the algorithms are not comparable (Gaussian elimination loses significant precision in ball/interval arithmetic and it is preferable to certify a floating-point solution except at very high precision). All floating-point formats give solutions accurate within a few digits of the full precision here. For example, the first entry in the solution vector at 256-bit precision is as follows:

exact:  -1.73354315762792423119690191454237607171537884303354986117153375564639765847968...
mpf:    -1.73354315762792423119690191454237607171537884303354986117153375564639765848014
mpfr:   -1.73354315762792423119690191454237607171537884303354986117153375564639765805718    (using MPFR_RNDZ)
mpfr:   -1.73354315762792423119690191454237607171537884303354986117153375564639765874997    (using MPFR_RNDN)
arf:    -1.73354315762792423119690191454237607171537884303354986117153375564639765848190
nfloat: -1.73354315762792423119690191454237607171537884303354986117153375564639765848646
qd:     -1.7335431576279242311969019145423760717153788430335498611715430203

Currently, there is no asymptotically fast multiplication for nfloat, so for the time being it will not be competitive with arf for huge matrices. Stay tuned for future developments in that direction.

Ball arithmetic?

Obviously one of my goals is to implement faster-than-arb ball arithmetic based on the nfloat format. At the moment, I don't have an estimate for how long that will take to implement. Near term, I hope to use nfloat to speed up tasks like linear solving and root-finding which rely on certifying a heuristic numerical solution; this should be quite easy as it's just a matter of plugging in nfloat arithmetic in the algorithms which currently use arf.

There are also a bunch of potential optimizations for transcendental functions to be addressed in the future.



fredrikj.net  |  Blog index  |  RSS feed  |  Follow me on Mastodon  |  Become a sponsor