Faster, vector-friendly bigfloats in FLINT
May 19, 2024
Integer arithmetic mod multi-word $m$ recently got a big speed boost in FLINT with the addition of a flat, vector-friendly representation. We are now adding a similar representation for multi-word floating-point numbers to accelerate numerical computation.
Early results are promising. For example, solving a size-100 linear system with 256-bit precision, we currently achieve a 5.6x speedup over MPFR, a 5.1x speedup over QD and a 2.3x speedup over FLINT's existing arbitrary-precision floats (arf).
Data layout and features
An nfloat (tentative name) has a $64n$-bit mantissa where $n$ is a fixed number of words, currently in the range $1 \le n \le 33$. Thus in principle we have a list of floating-point types
- nfloat64
- nfloat128
- nfloat192
- nfloat256
- ...
Key points:
- Elements are represented as flat chunks of words, without pointers. They can therefore be created on the stack, placed contiguously in vectors, and moved or copied freely without the need for destructors or garbage collection.
- The radix is 2, with automatic mantissa normalization (like mpfr or arf). This is not clearly the best choice since radix $2^{64}$ (like GMP's mpf) should allow for faster additions, but radix 2 is more efficient for multiplications and arguably more user-friendly.
- The exponent is a full word, minus two bits reserved for detecting overflow and underflow more easily, so the usable exponent range is roughly $\pm 2^{61}$. This means that every nfloat, no matter the precision, has a range of around $10^{\pm 694127911065419641}$ (since $2^{61} \log_{10} 2 \approx 6.94 \times 10^{17}$). There are no subnormals (underflow goes to zero, or can be intercepted).
- The interface and implementation use FLINT 3's generics. The user selects a precision by creating a context object; a context object thus defines a floating-point "field" with fixed, immutable precision (see the sketch after this list). The context object also allows configuring certain other properties such as error handling and support for NaNs and infinities. Currently, NaNs and infinities are not fully implemented and are disabled by default (invalid results return status flags instead); this has some speed benefits, since checks for such elements can be omitted.
- The default arithmetic uses sloppy rounding for speed: operations may incur somewhat more than 1 ulp of error by default. I plan to add directed sloppy rounding (returning guaranteed lower and upper bounds, though not necessarily tight ones) in the future. Support for correct rounding is not a major goal (it is necessarily slower, and very few applications actually require it), but it might be added later as an optional extra.
- Vector operations including add, sub, mul, mul_scalar, addmul_scalar and dot have dedicated code with inlined arithmetic. This is much faster than elementwise function calls when one has many entries to process.
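To make the generics workflow concrete, here is a minimal sketch of creating a 256-bit nfloat context and computing with it through the gr interface. The constructor name nfloat_ctx_init and its flags argument reflect the current development code and may still change:

```c
#include <flint/gr.h>
#include <flint/nfloat.h>

int main(void)
{
    gr_ctx_t ctx;
    gr_ptr x, y;

    /* A context object defines an nfloat "field" with fixed precision
       (here 256 bits, i.e. nfloat256). Flags select optional features;
       0 gives the defaults (no NaNs or infinities). */
    nfloat_ctx_init(ctx, 256, 0);

    /* Elements are flat blocks of words; no pointers or destructors
       are involved beyond the allocation itself. */
    x = gr_heap_init(ctx);
    y = gr_heap_init(ctx);

    GR_MUST_SUCCEED(gr_set_d(x, 2.0, ctx));
    GR_MUST_SUCCEED(gr_mul(y, x, x, ctx));   /* y = x^2 at 256-bit precision */
    GR_MUST_SUCCEED(gr_println(y, ctx));

    gr_heap_clear(x, ctx);
    gr_heap_clear(y, ctx);
    gr_ctx_clear(ctx);
    return 0;
}
```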
Some benchmarks
I will compare performance on some basic operations against GMP's mpf, MPFR's mpfr, FLINT's arbitrary-precision floating-point type arf, and the ball type arb. For these benchmarks I created straightforward FLINT generics wrappers for mpf and mpfr.
I will also throw in double-double (dd) arithmetic with 106-bit precision and quad-double (qd) arithmetic with 212-bit precision. For simplicity these are listed as having 128-bit and 256-bit precision respectively. I have a preliminary implementation of dd arithmetic for FLINT's generics, with inlined code for the basic vector operations to let GCC auto-vectorize things with AVX2. For qd I have just wrapped the QD library by David Bailey et al.; keep in mind that this might not be optimally vectorized.
Everything here is on an AMD Ryzen 7 PRO 5850U (Zen 3), single-threaded.
Addition and multiplication
For these benchmarks, performance numbers are given in Mflop/s (higher is better), first for computing $r \gets x + y$ where $x$ and $y$ are two length-100 vectors, calling _gr_vec_add(r, x, y, 100, ctx) (a sketch of the timing harness follows the table). Precisions are 64, 128, ..., 2048 bits.
prec | mpf | mpfr | arf | arb | nfloat | dd/qd |
---|---|---|---|---|---|---|
64 | 57.1 | 100.0 | 78.1 | 52.1 | 343.6 | - |
128 | 56.5 | 94.3 | 49.5 | 38.0 | 263.9 | 763.4 |
192 | 55.6 | 64.5 | 42.2 | 32.6 | 210.1 | - |
256 | 55.6 | 62.1 | 40.2 | 32.6 | 180.2 | 60.2 |
512 | 53.5 | 54.6 | 37.5 | 30.9 | 78.7 | - |
1024 | 47.2 | 44.1 | 32.4 | 26.9 | 58.5 | - |
2048 | 39.4 | 31.0 | 25.3 | 21.8 | 42.2 | - |
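The harness for this kind of vector benchmark is essentially a repeated call to the vector function. A rough sketch, assuming the gr_vec helpers and _gr_vec_randtest keep their current FLINT signatures (the repetition count here is arbitrary):

```c
#include <time.h>
#include <flint/gr.h>
#include <flint/gr_vec.h>
#include <flint/nfloat.h>

/* Time r = x + y over length-100 vectors and report Mflop/s. */
void bench_vec_add(slong prec)
{
    gr_ctx_t ctx;
    gr_vec_t x, y, r;
    flint_rand_t state;
    slong len = 100, reps = 1000000, i;
    clock_t t0;
    double secs;

    nfloat_ctx_init(ctx, prec, 0);
    flint_rand_init(state);
    gr_vec_init(x, len, ctx);
    gr_vec_init(y, len, ctx);
    gr_vec_init(r, len, ctx);

    GR_MUST_SUCCEED(_gr_vec_randtest(x->entries, state, len, ctx));
    GR_MUST_SUCCEED(_gr_vec_randtest(y->entries, state, len, ctx));

    t0 = clock();
    for (i = 0; i < reps; i++)
        GR_MUST_SUCCEED(_gr_vec_add(r->entries, x->entries, y->entries, len, ctx));
    secs = (double) (clock() - t0) / CLOCKS_PER_SEC;

    /* One addition per entry counts as one flop. */
    flint_printf("prec %wd: %.1f Mflop/s\n", prec, (double) len * reps / secs / 1e6);

    gr_vec_clear(x, ctx);
    gr_vec_clear(y, ctx);
    gr_vec_clear(r, ctx);
    flint_rand_clear(state);
    gr_ctx_clear(ctx);
}
```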
Corresponding figures for the elementwise product $r \gets x y$, calling _gr_vec_mul(r, x, y, 100, ctx).
prec | mpf | mpfr | arf | arb | nfloat | dd/qd |
---|---|---|---|---|---|---|
64 | 75.2 | 90.1 | 74.1 | 64.9 | 699.3 | - |
128 | 57.1 | 55.9 | 73.0 | 56.2 | 421.9 | 1207.7 |
192 | 50.8 | 39.5 | 63.7 | 37.2 | 187.3 | - |
256 | 40.0 | 34.6 | 57.1 | 34.5 | 141.0 | 26.7 |
512 | 21.9 | 20.5 | 29.9 | 22.7 | 63.3 | - |
1024 | 7.46 | 8.77 | 10.0 | 8.93 | 13.6 | - |
2048 | 2.49 | 3.70 | 3.13 | 3.37 | 4.41 | - |
Some nice speedups overall compared to all of the arbitrary-precision types. Observe that nfloat128 comes within a factor of three of double-double arithmetic, while nfloat256 is three to five times faster than quad-double.
It is interesting to note that nfloat additions are much slower than multiplications at lower precision, due to having very branchy code for different sign and exponent combinations. If someone knows how to do better, I'd be interested! Multiplications are extremely fast thanks to Albin Ahlbäck's mulhigh assembly code in FLINT.
Multiply-add and dot product
Next, some results for the composite operation $r \gets r + x c$ where $r$ and $x$ are two length-100 vectors and $c$ is a fixed scalar, i.e. calling _gr_vec_addmul_scalar(r, x, 100, c, ctx) (a minimal call sketch follows the table). As above, performance numbers are given in Mflop/s. Note that each term here counts as two flops (one multiplication and one addition).
prec | mpf | mpfr | arf | arb | nfloat | dd/qd |
---|---|---|---|---|---|---|
64 | 70.9 | 120.5 | 77.5 | 58.0 | 680.3 | - |
128 | 59.7 | 91.3 | 57.6 | 45.7 | 389.9 | 854.7 |
192 | 54.8 | 67.1 | 47.8 | 34.0 | 256.4 | - |
256 | 50.0 | 61.0 | 45.1 | 31.7 | 188.7 | 31.0 |
512 | 31.7 | 37.1 | 31.5 | 25.0 | 81.6 | - |
1024 | 13.2 | 16.1 | 14.3 | 12.3 | 23.3 | - |
2048 | 4.69 | 6.97 | 5.31 | 5.60 | 8.10 | - |
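The scalar update operates in place on the first vector. A minimal sketch of the call, with the scalar value and the function framing chosen purely for illustration, and setup as in the addition sketch above:

```c
/* r <- r + c*x over len entries, with c a fixed scalar: the operation
   benchmarked above. Context and vectors set up as in the earlier sketch. */
void addmul_example(gr_ptr r, gr_srcptr x, slong len, gr_ctx_t ctx)
{
    gr_ptr c = gr_heap_init(ctx);
    GR_MUST_SUCCEED(gr_set_d(c, 1.5, ctx));   /* illustrative scalar value */
    GR_MUST_SUCCEED(_gr_vec_addmul_scalar(r, x, len, c, ctx));
    gr_heap_clear(c, ctx);
}
```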
Finally, results for the dot product $s \gets \sum_i x_i y_i$ where $x$ and $y$ are two length-100 vectors, i.e. calling _gr_vec_dot(s, NULL, 0, x, y, 100, ctx) (a sketch spelling out the NULL and 0 arguments follows below).
prec | mpf | mpfr | arf | arb | nfloat | dd/qd |
---|---|---|---|---|---|---|
64 | 69.0 | 88.1 | 590.0 | 266.3 | 1129.9 | - |
128 | 58.3 | 69.2 | 416.7 | 202.2 | 671.1 | 888.9 |
192 | 54.2 | 48.1 | 125.8 | 102.0 | 305.3 | - |
256 | 49.6 | 44.0 | 111.7 | 92.6 | 222.0 | 31.3 |
512 | 31.6 | 29.0 | 58.3 | 52.2 | 85.8 | - |
1024 | 12.9 | 13.9 | 19.4 | 18.5 | 25.4 | - |
2048 | 4.55 | 6.33 | 6.85 | 6.85 | 8.40 | - |
Dot products are where we get the most bang for the buck, due to the amortized cost of additions. For this reason, FLINT attempts to express other operations in terms of dot products rather than vector-scalar updates when possible. Dot products were already quite well optimized for arf and arb, and nfloat uses essentially the same algorithm but simply goes faster since there is much less overhead. It is worth noting that we are not far behind dd even at double-word precision!
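For reference, here is the dot product call with the NULL and 0 arguments spelled out (a sketch, with setup as in the earlier benchmarks):

```c
/* s <- sum x[i]*y[i]. The NULL and 0 arguments are the "initial value"
   and "subtract" parameters of _gr_vec_dot: passing some value t and 1
   instead would compute s <- t - sum x[i]*y[i]. */
void dot_example(gr_ptr s, gr_srcptr x, gr_srcptr y, slong len, gr_ctx_t ctx)
{
    GR_MUST_SUCCEED(_gr_vec_dot(s, NULL, 0, x, y, len, ctx));
    GR_MUST_SUCCEED(gr_println(s, ctx));
}
```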
Linear solving
For a more high-level benchmark, here is the time in seconds (lower is better) to solve $Ax = b$ where $A$ is a random well-conditioned $100 \times 100$ matrix and $b$ is a random vector, using Gaussian elimination (calling gr_mat_nonsingular_solve_lu):
prec | mpf | mpfr | arf | arb | nfloat | dd/qd |
---|---|---|---|---|---|---|
64 | 0.015 | 0.013 | 0.00356 | - | 0.00221 | - |
128 | 0.0154 | 0.0183 | 0.00425 | - | 0.00253 | 0.00193 |
192 | 0.0163 | 0.0225 | 0.00921 | - | 0.0036 | - |
256 | 0.0177 | 0.0243 | 0.0101 | - | 0.00435 | 0.0223 |
512 | 0.0255 | 0.0311 | 0.0163 | - | 0.00943 | - |
1024 | 0.0551 | 0.0546 | 0.044 | - | 0.0278 | - |
2048 | 0.15 | 0.115 | 0.0961 | - | 0.082 | - |
I have omitted arb since the algorithms are not comparable (Gaussian elimination loses significant precision in ball/interval arithmetic, so except at very high precision it is preferable to certify a floating-point solution). All the floating-point formats give solutions accurate to within a few digits of full precision here. For example, the first entry in the solution vector at 256-bit precision is as follows:
exact:  -1.73354315762792423119690191454237607171537884303354986117153375564639765847968...
mpf:    -1.73354315762792423119690191454237607171537884303354986117153375564639765848014
mpfr:   -1.73354315762792423119690191454237607171537884303354986117153375564639765805718 (using MPFR_RNDZ)
mpfr:   -1.73354315762792423119690191454237607171537884303354986117153375564639765874997 (using MPFR_RNDN)
arf:    -1.73354315762792423119690191454237607171537884303354986117153375564639765848190
nfloat: -1.73354315762792423119690191454237607171537884303354986117153375564639765848646
qd:     -1.7335431576279242311969019145423760717153788430335498611715430203
Currently, there is no asymptotically fast multiplication for nfloat, so for the time being it will not be competitive with arf for huge matrices. Stay tuned for future developments in that direction.
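For anyone wanting to try this, a rough sketch of the high-level call follows. Note that gr_mat_randtest produces random test entries rather than the well-conditioned matrix used in the timings above, so this only illustrates the call shape:

```c
#include <flint/gr.h>
#include <flint/gr_mat.h>
#include <flint/nfloat.h>

/* Solve A X = B for a random 100x100 system at 256-bit precision. */
int main(void)
{
    gr_ctx_t ctx;
    gr_mat_t A, B, X;
    flint_rand_t state;
    int status;

    nfloat_ctx_init(ctx, 256, 0);
    flint_rand_init(state);
    gr_mat_init(A, 100, 100, ctx);
    gr_mat_init(B, 100, 1, ctx);
    gr_mat_init(X, 100, 1, ctx);

    GR_MUST_SUCCEED(gr_mat_randtest(A, state, ctx));
    GR_MUST_SUCCEED(gr_mat_randtest(B, state, ctx));

    /* Returns GR_DOMAIN if the matrix is detected to be singular
       (or GR_UNABLE if this cannot be decided). */
    status = gr_mat_nonsingular_solve_lu(X, A, B, ctx);
    if (status == GR_SUCCESS)
        GR_MUST_SUCCEED(gr_mat_print(X, ctx));

    gr_mat_clear(A, ctx);
    gr_mat_clear(B, ctx);
    gr_mat_clear(X, ctx);
    flint_rand_clear(state);
    gr_ctx_clear(ctx);
    return 0;
}
```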
Ball arithmetic?
Obviously, one of my goals is to implement faster-than-arb ball arithmetic based on the nfloat format. At the moment, I don't have an estimate for how long that will take to implement. In the near term, I hope to use nfloat to speed up tasks like linear solving and root-finding which rely on certifying a heuristic numerical solution; this should be quite easy, as it is just a matter of plugging nfloat arithmetic into the algorithms that currently use arf.
There are also a bunch of potential optimizations for transcendental functions to be addressed in the future.