Optimizing multiprecision LLL in FLINT
August 21, 2025
This post describes some recent improvements to the multiprecision arithmetic in FLINT's LLL lattice basis reduction. LLL has many important applications, including two key tasks where it is used internally by FLINT:
- Factoring in $\mathbb{Z}[x]$ using van Hoeij's algorithm.
- Integer relation searches as part of equality proving and symbolic simplification in the Calcium implementation of $\overline{\mathbb{Q}}$ and $\mathbb{C}$.
Overview of FLINT's LLL
FLINT's LLL incorporates ideas from the following articles:
- Damien Stehlé, Floating-Point LLL: Theoretical and Practical Aspects
- Gilles Villard, Certification of the QR factor R and of lattice basis reducedness
The idea is essentially the following: we first do a heuristic LLL reduction using floating-point arithmetic and then rigorously certify its correctness using an a posteriori analysis of an approximate QR factorization. The certification itself can use floating-point arithmetic provided we make judicious use of directed rounding. This strategy is significantly faster than running the original LLL algorithm in exact rational arithmetic.
Both the heuristic LLL reduction and the certification come in two versions: a fast machine precision implementation, and a slower multiprecision implementation used as a fallback when machine precision fails. This typically becomes necessary in large enough dimensions. Finally, there is an exact certification (using $\mathbb{Q}$ arithmetic and the naive definition of LLL-reducedness) used as an ultimate fallback when even the multiprecision floating-point QR certification is inconclusive.
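For reference, the conditions checked by the exact fallback are the textbook $(\delta, \eta)$-LLL-reducedness conditions (nothing FLINT-specific): given a basis $b_1, \ldots, b_d$ with Gram-Schmidt orthogonalization $b_1^*, \ldots, b_d^*$ and coefficients $\mu_{i,j} = \langle b_i, b_j^* \rangle / \|b_j^*\|^2$, the basis is reduced when

$$|\mu_{i,j}| \le \eta \quad (1 \le j < i \le d), \qquad \delta \, \|b_{i-1}^*\|^2 \le \|b_i^*\|^2 + \mu_{i,i-1}^2 \, \|b_{i-1}^*\|^2 \quad (1 < i \le d),$$

where the classical choice of parameters is $\eta = 1/2$ and $\delta = 3/4$.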
(Also worth noting is that for lattices with large bit size, FLINT first reduces to smaller bit sizes using "ULLL", an unpublished algorithm by Andrew Novocin inspired by the $\tilde L^1$ algorithm from Andrew Novocin, Damien Stehlé, Gilles Villard, An LLL-reduction algorithm with quasi-linear time complexity. However, this high-level wrapper is largely independent of the underlying floating-point LLLs.)
Changes
Over the last month, I made the following improvements to FLINT's LLL code:
- The heuristic multiprecision LLL previously used GMP's mpf type. I changed it to use FLINT's faster nfloat instead.
- The multiprecision floating-point certification previously used MPFR's mpfr type with directed rounding. I changed it to use nfloat. This required adding directed rounding support for nfloat as a new feature.
- The heuristic multiprecision LLL previously attempted to detect cancellation in dot products in order to fall back to a more precise calculation when too many accurate digits are lost. This cancellation check was implemented by computing norms of the input vectors, which is about as expensive as computing the dot product itself. It would be possible to optimize the check, but I decided to remove this trick entirely as it does not seem to help much in practice. (Before removing it, I was reassured to find that Stehlé's article also discusses the possibility that it is a net slowdown, confirming my own limited tests.)
- When the multiprecision floating-point QR certification fails, FLINT previously jumped straight to naively checking LLL-reducedness using exact fmpq arithmetic as noted above, which suffers from horrendous coefficient blowup. I thus inserted another naive LLL-reducedness check with high-precision arb ball arithmetic before the fmpq test. (This should make the fmpq fallback extremely rare.)
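To illustrate why directed rounding makes a floating-point certification rigorous (this is the general principle only, not FLINT's actual nfloat code), one can enclose an exact dot product between two bounds computed entirely in low precision. A minimal sketch using Python's decimal module, whose rounding modes play the role of the directed rounding added to nfloat:

```python
from decimal import Decimal, localcontext, ROUND_FLOOR, ROUND_CEILING

def dot_enclosure(xs, ys, prec=8):
    """Enclose the exact dot product of xs and ys (lists of Decimal)
    between a lower and an upper bound, using only prec-digit
    arithmetic with directed rounding."""
    bounds = []
    for rounding in (ROUND_FLOOR, ROUND_CEILING):
        with localcontext() as ctx:
            ctx.prec = prec
            ctx.rounding = rounding
            s = Decimal(0)
            for x, y in zip(xs, ys):
                # every * and + rounds in the chosen direction, so the
                # running sum stays on the correct side of the exact value
                s = s + x * y
            bounds.append(s)
    return bounds[0], bounds[1]   # (lower bound, upper bound)
```

Rounding every operation toward $-\infty$ can only push the result down, and toward $+\infty$ only up, so the pair of runs brackets the exact value; this is the kind of enclosure the QR certification builds on.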
Polynomial factoring benchmark
Here are the results on FLINT's benchmark suite for factoring in $\mathbb{Z}[x]$ (running build/fmpz_poly_factor/profile/p-factor_hard -hard).
Quick explanation: each input polynomial has the specified degree and coefficients bounded in absolute value by $2^{\text{Bits}}$. "Factors" is the number of factors over $\mathbb{Z}[x]$ and "Local factors" is the number of factors over $\mathbb{Z}_p[x]$ (usually larger than the number of global factors; it is smaller in some entries of the table below due to a deflation hack). A large number of local factors is what necessitates running LLL to find the correct recombination; the dimension of the lattice is roughly the number of local factors.
Polynomial | Degree | Bits | Local factors | Factors | FLINT 3.3 | FLINT 3.4-dev | Speedup |
---|---|---|---|---|---|---|---|
P1 | 156 | 1407 | 10 | 36 | 0.007 s | 0.007 s | |
P2 | 196 | 1391 | 7 | 12 | 0.011 s | 0.011 s | |
P3 | 336 | 1982 | 7 | 16 | 0.022 s | 0.022 s | |
P4 | 462 | 2511 | 42 | 2 | 0.159 s | 0.158 s | |
P5 | 64 | 131 | 32 | 1 | 0.007 s | 0.007 s | |
P6 | 144 | 548 | 24 | 6 | 0.010 s | 0.010 s | |
P7 | 384 | 187 | 88 | 1 | 0.382 s | 0.218 s | 1.50x |
P8 | 972 | 707 | 54 | 1 | 0.120 s | 0.114 s | |
M12_5 | 792 | 3623 | 72 | 1 | 0.376 s | 0.328 s | 1.14x |
M12_6 | 924 | 4870 | 84 | 2 | 4.920 s | 4.851 s | |
T1 | 900 | 915 | 30 | 2 | 0.102 s | 0.102 s | |
T2 | 900 | 421 | 32 | 2 | 0.108 s | 0.110 s | |
T3 | 2401 | 1541 | 63 | 4 | 1.909 s | 1.768 s | 1.08x |
P7*M12_5 | 1176 | 3787 | 188 | 2 | 278 s | 51.2 s | 5.43x |
P7*M12_6 | 1308 | 5029 | 206 | 3 | 535 s | 96.5 s | 5.54x |
H1 | 960 | 955 | 131 | 28 | 1.927 s | 0.935 s | 2.06x |
H2 | 4096 | 2 | 256 | 6 | 444 s | 71.4 s | 6.21x |
C1 | 1024 | 478 | 64 | 32 | 0.556 s | 0.256 s | 2.17x |
S7 | 128 | 289 | 64 | 1 | 0.134 s | 0.088 s | 1.52x |
S8 | 256 | 624 | 128 | 1 | 1.431 s | 1.077 s | 1.32x |
S9 | 512 | 1334 | 256 | 1 | 372 s | 57.2 s | 6.50x |
S10 | 1024 | 2837 | 512 | 1 | 28548 s | 6292 s | 4.53x |
We observe some very nice speedups on all examples with more than 100 local factors, where the multiprecision LLL code typically starts to kick in, and on some smaller examples too. For the largest examples there is a fairly consistent 4x to 6x speedup, with the time for the adversarial Swinnerton-Dyer polynomial S10 dropping all the way from eight hours to less than two hours!
Actually, S9 (and presumably also S10) is still slower than in FLINT 1.6 from ten years ago (Bill Hart reported this example taking 21 seconds with FLINT 1.6 in 2016) due to some missing heuristics in the FLINT 2 (and FLINT 3) factoring code. These heuristics presumably result in easier lattice problems. Nevertheless, it is interesting that we seem to partly catch up just by improving LLL performance for "hard" lattices.
Integer relation benchmark
I added a new profile program (build/fmpz_lll/profile/p-lindep) to test performance on integer relation lattices. This program generates a vector of $n$ uniformly random real numbers, inserts $\log(2)$, $\log(3)$, $\log(6)$ at positions $0.25n$, $0.5n$ and $0.75n$, and runs _qqbar_acb_lindep with precision 64, 128, 256, ... bits until the relation $(\ldots, 0, \pm 1, 0, \ldots, 0, \pm 1, 0, \ldots, 0, \mp 1, 0, \ldots)$ is recovered. The "Precision" column in the table shows at what input precision (= bit size of the lattice) this occurs.
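The lattice encoding behind this kind of integer relation search is standard (the internals of _qqbar_acb_lindep are not shown here; this sketch only illustrates the general technique): scale the inputs, round to integers, and append them to an identity matrix, so that any integer relation corresponds to a short lattice vector that LLL can find.

```python
import math
import random

def lindep_lattice(xs, prec):
    """Standard integer relation encoding: row i of the lattice basis is
    the i-th unit vector with round(2^prec * xs[i]) appended.  A relation
    sum(c[i] * xs[i]) ~= 0 corresponds to the short lattice vector
    (c[0], ..., c[n-1], eps), which LLL can find if prec is large enough.
    (Plain floats limit the usable prec here; FLINT of course works with
    the inputs to full multiprecision accuracy.)"""
    scale = 1 << prec
    n = len(xs)
    return [[int(i == j) for j in range(n)] + [round(xs[i] * scale)]
            for i in range(n)]

# mimic the benchmark input: random reals with log(2), log(3), log(6) planted
n = 12
random.seed(1)
xs = [random.random() for _ in range(n)]
xs[n // 4], xs[n // 2], xs[3 * n // 4] = math.log(2), math.log(3), math.log(6)

B = lindep_lattice(xs, prec=48)

# the planted relation: log(2) + log(3) - log(6) = 0
c = [0] * n
c[n // 4], c[n // 2], c[3 * n // 4] = 1, 1, -1
residual = sum(ci * xi for ci, xi in zip(c, xs))
```

Running LLL on the rows of B with large enough prec recovers $(\ldots, 1, \ldots, 1, \ldots, -1, \ldots)$ as the first basis vector, since it is far shorter than anything a random combination of the other rows can produce.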
n | Precision | FLINT 3.3 | FLINT 3.4-dev | Speedup |
---|---|---|---|---|
10 | 64 | 0.000063 s | 0.000056 s | 1.12x |
20 | 64 | 0.000424 s | 0.000370 s | 1.15x |
30 | 64 | 0.00104 s | 0.000921 s | 1.13x |
40 | 64 | 0.00182 s | 0.00165 s | 1.10x |
50 | 128 | 0.00975 s | 0.00873 s | 1.12x |
60 | 128 | 0.0134 s | 0.0122 s | 1.10x |
70 | 128 | 0.0166 s | 0.0153 s | 1.08x |
80 | 256 | 0.0598 s | 0.0551 s | 1.09x |
90 | 256 | 0.0692 s | 0.0640 s | 1.08x |
100 | 256 | 0.0854 s | 0.0795 s | 1.07x |
110 | 512 | 0.386 s | 0.363 s | 1.06x |
120 | 512 | 0.415 s | 0.390 s | 1.06x |
130 | 512 | 0.488 s | 0.464 s | 1.05x |
140 | 512 | 0.547 s | 0.518 s | 1.06x |
150 | 512 | 0.598 s | 0.568 s | 1.05x |
160 | 1024 | 2.457 s | 2.355 s | 1.04x |
170 | 1024 | 2.679 s | 2.571 s | 1.04x |
180 | 1024 | 2.860 s | 2.750 s | 1.04x |
190 | 1024 | 3.163 s | 3.054 s | 1.04x |
200 | 1024 | 3.412 s | 3.306 s | 1.03x |
210 | 2048 | 33.59 s | 18.53 s | 1.81x |
220 | 2048 | 212.3 s | 19.78 s | 10.73x |
230 | 2048 | 56.78 s | 19.17 s | 2.96x |
240 | 2048 | 245.0 s | 20.05 s | 12.22x |
250 | 2048 | 36.11 s | 17.63 s | 2.05x |
260 | 2048 | 286.6 s | 20.47 s | 14.00x |
270 | 2048 | 316.3 s | 22.89 s | 13.82x |
280 | 2048 | 368.7 s | 26.20 s | 14.08x |
290 | 4096 | 1522 s | 24.35 s | 62.51x |
300 | 4096 | 1532 s | 150.0 s | 10.21x |
310 | 4096 | 1687 s | 155.0 s | 10.88x |
320 | 4096 | 1856 s | 160.4 s | 11.57x |
One can think of more interesting integer relation problems (this one would be solvable by brute force), but this one seemed appropriate for a simple benchmark.
One notices significant improvements for $n$ larger than about 200, with a consistent order-of-magnitude speedup for $n$ larger than 250 and a 60x speedup in one case ($n = 290$). It is worth noting that the running time of the new LLL code increases much more smoothly with $n$: the FLINT 3.3 LLL would randomly take 5 to 10 times longer for some $n$ than for adjacent values due to hitting the slow fmpq fallback, which is now avoided.
Future
There is much more to optimize about FLINT's LLL.
- The high-level algorithm was originally designed specifically for knapsack lattices (which occur in van Hoeij's algorithm), so various parameters could presumably be tuned differently for other kinds of lattices.
- The machine precision and integer arithmetic could be optimized.
- Certification could use ball arithmetic instead of floating-point arithmetic with directed rounding.
- Multithreading is not completely straightforward, but should be doable.
- Recently, new lattice reduction algorithms have been published that, at least in some applications, perform much better than the "classical" floating-point LLL by reducing more of the work to block linear algebra (QR factorization and matrix multiplication). See FLATTER by Keegan Ryan and Nadia Heninger, and BLASTER by Léo Ducas, Ludo Pulles and Marc Stevens. It would be interesting to implement these algorithms on top of FLINT's multiprecision nfloat linear algebra.