Optimizing multiprecision LLL in FLINT
August 21, 2025
This post describes some recent improvements to the multiprecision arithmetic in FLINT's LLL lattice basis reduction. LLL has many important applications, including two key tasks where it is used internally by FLINT:
- Factoring in $\mathbb{Z}[x]$ using van Hoeij's algorithm.
- Integer relation searches as part of equality proving and symbolic simplification in the Calcium implementation of $\overline{\mathbb{Q}}$ and $\mathbb{C}$.
Overview of FLINT's LLL
FLINT's LLL incorporates ideas from the following articles:
- Damien Stehlé, Floating-Point LLL: Theoretical and Practical Aspects
- Gilles Villard, Certification of the QR factor R and of lattice basis reducedness
The idea is essentially the following: we first do a heuristic LLL reduction using floating-point arithmetic and then rigorously certify its correctness using an a posteriori analysis of an approximate QR factorization. The certification itself can use floating-point arithmetic provided we make judicious use of directed rounding. This strategy is significantly faster than running the original LLL algorithm in exact rational arithmetic.
Both the heuristic LLL reduction and the certification come in two versions: a fast machine precision implementation, and a slower multiprecision implementation used as a fallback when machine precision fails. This typically becomes necessary in large enough dimensions. Finally, there is an exact certification (using $\mathbb{Q}$ arithmetic and the naive definition of LLL-reducedness) used as an ultimate fallback when even the multiprecision floating-point QR certification is inconclusive.
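For reference, the conditions checked by the exact fallback are the textbook $(\delta, \eta)$-LLL-reducedness conditions (nothing FLINT-specific): given a basis $b_1, \ldots, b_d$ with Gram-Schmidt orthogonalization $b_1^*, \ldots, b_d^*$ and coefficients $\mu_{i,j} = \langle b_i, b_j^* \rangle / \|b_j^*\|^2$, the basis is reduced when

$$|\mu_{i,j}| \le \eta \quad (1 \le j < i \le d), \qquad \delta \, \|b_{i-1}^*\|^2 \le \|b_i^*\|^2 + \mu_{i,i-1}^2 \, \|b_{i-1}^*\|^2 \quad (1 < i \le d),$$

where the classical choice of parameters is $\eta = 1/2$ and $\delta = 3/4$.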
(Also worth noting is that for lattices with large bit size, FLINT first reduces to smaller bit sizes using "ULLL", an unpublished algorithm by Andrew Novocin inspired by the $\tilde L^1$ algorithm from Andrew Novocin, Damien Stehlé, Gilles Villard, An LLL-reduction algorithm with quasi-linear time complexity. However, this high-level wrapper is largely independent of the underlying floating-point LLLs.)
Changes
Over the last month, I made the following improvements to FLINT's LLL code:
- The heuristic multiprecision LLL previously used GMP's mpf type. I changed it to use FLINT's faster nfloat instead.
- The multiprecision floating-point certification previously used MPFR's mpfr type with directed rounding. I changed it to use nfloat. This required adding directed rounding support for nfloat as a new feature.
- The heuristic multiprecision LLL previously attempted to detect cancellation in dot products in order to fall back to a more precise calculation when too many accurate digits are lost. This cancellation check was implemented by computing norms of the input vectors, which is about as expensive as computing the dot product itself. It would be possible to optimize the check, but I decided to remove this trick entirely as it does not seem to help much in practice. (Before removing it, I was reassured to find that Stehlé's article also discusses the possibility that it is a net slowdown, confirming my own limited tests.)
- When the multiprecision floating-point QR certification fails, FLINT previously jumped straight to naively checking LLL-reducedness using exact fmpq arithmetic as noted above, which suffers from horrendous coefficient blowup. I thus inserted another naive LLL-reducedness check with high-precision arb ball arithmetic before the fmpq test. (This should make the fmpq fallback extremely rare.)
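To illustrate why directed rounding makes a floating-point certification rigorous (this is the general principle only, not FLINT's actual nfloat code), one can enclose an exact dot product between two bounds computed entirely in low precision. A minimal sketch using Python's decimal module, whose rounding modes play the role of the directed rounding added to nfloat:

```python
from decimal import Decimal, localcontext, ROUND_FLOOR, ROUND_CEILING

def dot_enclosure(xs, ys, prec=8):
    """Enclose the exact dot product of xs and ys (lists of Decimal)
    between a lower and an upper bound, using only prec-digit
    arithmetic with directed rounding."""
    bounds = []
    for rounding in (ROUND_FLOOR, ROUND_CEILING):
        with localcontext() as ctx:
            ctx.prec = prec
            ctx.rounding = rounding
            s = Decimal(0)
            for x, y in zip(xs, ys):
                # every * and + rounds in the chosen direction, so the
                # running sum stays on the correct side of the exact value
                s = s + x * y
            bounds.append(s)
    return bounds[0], bounds[1]   # (lower bound, upper bound)
```

Rounding every operation toward $-\infty$ can only push the result down, and toward $+\infty$ only up, so the pair of runs brackets the exact value; this is the kind of enclosure the QR certification builds on.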
Polynomial factoring benchmark
Here are the results on FLINT's benchmark suite for factoring in $\mathbb{Z}[x]$ (running build/fmpz_poly_factor/profile/p-factor_hard -hard).
Quick explanation: each input polynomial has the specified degree and coefficients bounded in absolute value by $2^{\text{Bits}}$. "Factors" is the number of factors over $\mathbb{Z}[x]$ and "Local factors" is the number of factors over $\mathbb{Z}_p[x]$ (usually larger than the number of global factors; it is smaller in some entries of the table below due to a deflation hack). A large number of local factors is what necessitates running LLL to find the correct recombination; the dimension of the lattice is roughly the number of local factors.
Polynomial | Degree | Bits | Local factors | Factors | FLINT 3.3 | FLINT 3.4-dev | Speedup |
---|---|---|---|---|---|---|---|
P1 | 156 | 1407 | 10 | 36 | 0.007 s | 0.007 s | |
P2 | 196 | 1391 | 7 | 12 | 0.011 s | 0.011 s | |
P3 | 336 | 1982 | 7 | 16 | 0.022 s | 0.022 s | |
P4 | 462 | 2511 | 42 | 2 | 0.159 s | 0.158 s | |
P5 | 64 | 131 | 32 | 1 | 0.007 s | 0.007 s | |
P6 | 144 | 548 | 24 | 6 | 0.010 s | 0.010 s | |
P7 | 384 | 187 | 88 | 1 | 0.382 s | 0.218 s | 1.50x |
P8 | 972 | 707 | 54 | 1 | 0.120 s | 0.114 s | |
M12_5 | 792 | 3623 | 72 | 1 | 0.376 s | 0.328 s | 1.14x |
M12_6 | 924 | 4870 | 84 | 2 | 4.920 s | 4.851 s | |
T1 | 900 | 915 | 30 | 2 | 0.102 s | 0.102 s | |
T2 | 900 | 421 | 32 | 2 | 0.108 s | 0.110 s | |
T3 | 2401 | 1541 | 63 | 4 | 1.909 s | 1.768 s | 1.08x |
P7*M12_5 | 1176 | 3787 | 188 | 2 | 278 s | 51.2 s | 5.43x |
P7*M12_6 | 1308 | 5029 | 206 | 3 | 535 s | 96.5 s | 5.54x |
H1 | 960 | 955 | 131 | 28 | 1.927 s | 0.935 s | 2.06x |
H2 | 4096 | 2 | 256 | 6 | 444 s | 71.4 s | 6.21x |
C1 | 1024 | 478 | 64 | 32 | 0.556 s | 0.256 s | 2.17x |
S7 | 128 | 289 | 64 | 1 | 0.134 s | 0.088 s | 1.52x |
S8 | 256 | 624 | 128 | 1 | 1.431 s | 1.077 s | 1.32x |
S9 | 512 | 1334 | 256 | 1 | 372 s | 57.2 s | 6.50x |
S10 | 1024 | 2837 | 512 | 1 | 28548 s | 6292 s | 4.53x |
We observe some very nice speedups on all examples with more than 100 local factors, where the multiprecision LLL code typically starts to kick in, and on some smaller examples too. For the largest examples there is a fairly consistent 4x to 6x speedup, with the time for the adversarial Swinnerton-Dyer polynomial S10 dropping all the way from eight hours to less than two hours!
Actually, S9 (and presumably also S10) is still slower than in FLINT 1.6 from ten years ago (Bill Hart reported this example taking 21 seconds with FLINT 1.6 in 2016) due to some missing heuristics in the FLINT 2 (and FLINT 3) factoring code. These heuristics presumably result in easier lattice problems. Nevertheless, it is interesting that we seem to partly catch up just by improving LLL performance for "hard" lattices.
Integer relation benchmark
I added a new profile program (build/fmpz_lll/profile/p-lindep) to test performance on integer relation lattices. This program generates a vector of $n$ uniformly random real numbers, inserts $\log(2)$, $\log(3)$, $\log(6)$ at positions $0.25n$, $0.5n$ and $0.75n$, and runs _qqbar_acb_lindep with precision 64, 128, 256, ... bits until the relation $(\ldots, 0, \pm 1, 0, \ldots, 0, \pm 1, 0, \ldots, 0, \mp 1, 0, \ldots)$ is recovered. The "Precision" column in the table shows at what input precision (= bit size of the lattice) this occurs.
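The lattice encoding behind this kind of integer relation search is standard (the internals of _qqbar_acb_lindep are not shown here; this sketch only illustrates the general technique): scale the inputs, round to integers, and append them to an identity matrix, so that any integer relation corresponds to a short lattice vector that LLL can find.

```python
import math
import random

def lindep_lattice(xs, prec):
    """Standard integer relation encoding: row i of the lattice basis is
    the i-th unit vector with round(2^prec * xs[i]) appended.  A relation
    sum(c[i] * xs[i]) ~= 0 corresponds to the short lattice vector
    (c[0], ..., c[n-1], eps), which LLL can find if prec is large enough.
    (Plain floats limit the usable prec here; FLINT of course works with
    the inputs to full multiprecision accuracy.)"""
    scale = 1 << prec
    n = len(xs)
    return [[int(i == j) for j in range(n)] + [round(xs[i] * scale)]
            for i in range(n)]

# mimic the benchmark input: random reals with log(2), log(3), log(6) planted
n = 12
random.seed(1)
xs = [random.random() for _ in range(n)]
xs[n // 4], xs[n // 2], xs[3 * n // 4] = math.log(2), math.log(3), math.log(6)

B = lindep_lattice(xs, prec=48)

# the planted relation: log(2) + log(3) - log(6) = 0
c = [0] * n
c[n // 4], c[n // 2], c[3 * n // 4] = 1, 1, -1
residual = sum(ci * xi for ci, xi in zip(c, xs))
```

Running LLL on the rows of B with large enough prec recovers $(\ldots, 1, \ldots, 1, \ldots, -1, \ldots)$ as the first basis vector, since it is far shorter than anything a random combination of the other rows can produce.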
n | Precision | FLINT 3.3 | FLINT 3.4-dev | Speedup |
---|---|---|---|---|
10 | 64 | 0.000063 s | 0.000056 s | 1.12x |
20 | 64 | 0.000424 s | 0.000370 s | 1.15x |
30 | 64 | 0.00104 s | 0.000921 s | 1.13x |
40 | 64 | 0.00182 s | 0.00165 s | 1.10x |
50 | 128 | 0.00975 s | 0.00873 s | 1.12x |
60 | 128 | 0.0134 s | 0.0122 s | 1.10x |
70 | 128 | 0.0166 s | 0.0153 s | 1.08x |
80 | 256 | 0.0598 s | 0.0551 s | 1.09x |
90 | 256 | 0.0692 s | 0.0640 s | 1.08x |
100 | 256 | 0.0854 s | 0.0795 s | 1.07x |
110 | 512 | 0.386 s | 0.363 s | 1.06x |
120 | 512 | 0.415 s | 0.390 s | 1.06x |
130 | 512 | 0.488 s | 0.464 s | 1.05x |
140 | 512 | 0.547 s | 0.518 s | 1.06x |
150 | 512 | 0.598 s | 0.568 s | 1.05x |
160 | 1024 | 2.457 s | 2.355 s | 1.04x |
170 | 1024 | 2.679 s | 2.571 s | 1.04x |
180 | 1024 | 2.860 s | 2.750 s | 1.04x |
190 | 1024 | 3.163 s | 3.054 s | 1.04x |
200 | 1024 | 3.412 s | 3.306 s | 1.03x |
210 | 2048 | 33.59 s | 18.53 s | 1.81x |
220 | 2048 | 212.3 s | 19.78 s | 10.73x |
230 | 2048 | 56.78 s | 19.17 s | 2.96x |
240 | 2048 | 245.0 s | 20.05 s | 12.22x |
250 | 2048 | 36.11 s | 17.63 s | 2.05x |
260 | 2048 | 286.6 s | 20.47 s | 14.00x |
270 | 2048 | 316.3 s | 22.89 s | 13.82x |
280 | 2048 | 368.7 s | 26.20 s | 14.08x |
290 | 4096 | 1522 s | 24.35 s | 62.51x |
300 | 4096 | 1532 s | 150.0 s | 10.21x |
310 | 4096 | 1687 s | 155.0 s | 10.88x |
320 | 4096 | 1856 s | 160.4 s | 11.57x |
One can think of more interesting integer relation problems (this one would be solvable by brute force), but this one seemed appropriate for a simple benchmark.
One notices significant improvements for $n$ larger than about 200, with a consistent order-of-magnitude speedup for $n$ larger than 250 and a 60x speedup in one case ($n = 290$). It is worth noting that the running time of the new LLL code increases much more smoothly with $n$: the FLINT 3.3 LLL would randomly take 5 to 10 times longer for some $n$ than for adjacent values due to hitting the slow fmpq fallback, which is now avoided.
Future
There is much more to optimize about FLINT's LLL.
- The high-level algorithm was originally designed specifically for knapsack lattices (which occur in van Hoeij's algorithm), so various parameters could presumably be tuned differently for other kinds of lattices.
- The machine precision and integer arithmetic could be optimized.
- Certification could use ball arithmetic instead of floating-point arithmetic with directed rounding.
- Multithreading is not completely straightforward, but should be doable.
- Recently, new lattice reduction algorithms have been published that, at least in some applications, perform much better than the "classical" floating-point LLL by reducing more of the work to block linear algebra (QR factorization and matrix multiplication). See FLATTER by Keegan Ryan and Nadia Heninger, and BLASTER by Léo Ducas, Ludo Pulles and Marc Stevens. It would be interesting to implement these algorithms on top of FLINT's multiprecision nfloat linear algebra.