fredrikj.net / blog /

# FLINT furnished with faster FFT

*April 14, 2023*

There is a lot to say about the upcoming 3.0 release of FLINT, most notably the fact that FLINT, Arb, Antic, Calcium and Generic-Rings have all been merged into the same library. That probably deserves at least one dedicated blog post, but today's post will be a sneak peek of an entirely different major new feature that recently got merged into FLINT: a blazing-fast small-prime FFT by Dan Schultz, enabling much faster arithmetic with huge numbers and polynomials (in some cases 10x faster than GMP).

Contents

## Overview

The new `fft_small` module implements number-theoretic transforms (NTTs),
i.e. FFTs in $\mathbf{Z}/p\mathbf{Z}$, for word-size $p$.
Specifically, $p$ is limited to 50 bits which allows
doing the mod $p$ arithmetic efficiently using
SIMD-vectorized `double` fused multiply-adds.
Convolutions with larger coefficients can be performed using Chinese
remaindering.
Everything is multithreaded.

In contrast, FLINT's old `fft` module
and GMP's FFT both implement the
Schönhage-Strassen algorithm (SSA).
SSA is most useful for multiplying polynomials with
huge coefficients: it can be used for integer multiplication
and for small-coefficient polynomial multiplication (using Kronecker substitution),
but on modern hardware with vectorized
full-word multipliers, NTTs typically beat SSA for those tasks.
Having an NTT in FLINT is thus long overdue.

Currently, the FLINT small-prime FFT requires either an x86-64 CPU
with AVX2 support or an arm64 CPU with NEON support.
It must be enabled manually when building FLINT (with `--enable-avx2`
or `--enable-neon`); otherwise, FLINT will simply fall
back to the classical SSA or GMP functions.

## What is affected?

So far, the small-prime FFT is only used for a few operations: integer multiplication (for numbers larger than about 10,000 digits), and Arb arithmetic (multiplication, division, and square roots) starting from roughly 10,000 digits of precision.

Operations that reduce to FLINT-based integer multiplication or ball arithmetic benefit indirectly, but operations in FLINT that currently call GMP (e.g. integer GCD) do not benefit yet.

The plan is to eventually use the small-prime FFT much more pervasively, most importantly for arithmetic in $(\mathbf{Z}/p\mathbf{Z})[x]$, but expect this progress to be gradual: it will take a lot of work to roll out algorithms based on the new FFT everywhere!

## Integer multiplication profile

Here is a quick look at integer multiplication performance on my laptop
which has an 8-core AMD Ryzen 7 PRO 5850U CPU (Zen 3).
The plot shows the speedup of FLINT's `flint_mpn_mul` over GMP's `mpn_mul`
for multiplying two $n$-digits integers.

The improvement is quite good even for numbers in the moderate 10,000 - 100,000 digit range.

For huge operands, the new FFT is more than 3x faster than GMP or the old FLINT SSA FFT when using a single core. Using all eight cores, it is 10x faster in a sweet spot around a million digits, and asymptotically about 7x faster than GMP or 2x faster than FLINT's SSA.

## Computing pi

For a slightly more complex task, we look at the time to compute pi to $n$ digits. The standard algorithm uses binary splitting evaluation of the Chudnovsky series. This exercises arithmetic at all sizes up to $n$ digits fairly evenly, so one should expect smaller speedups than for doing isolated $n$-digit multiplications. I've compared the following five implementations:

- gmp-chudnovsky: an optimized implementation of the Chudnovsky series using plain GMP.
- y-cruncher: the fastest special-purpose pi program, used to set all recent world records.
- MPFR: calling
`mpfr_const_pi`. This is the only tested implementation that uses the AGM method instead of the Chudnovsky series. - FLINT 2: calling
`arb_const_pi`with the old SSA-based arithmetic in FLINT 2 / Arb. - FLINT 3: calling
`arb_const_pi`in FLINT 3 with the new FFT enabled.

Timings using a single thread:

Digits | gmp-chudnovsky.c | MPFR | y-cruncher | FLINT 2 | FLINT 3 |
---|---|---|---|---|---|

1,000,000 | 0.231 s | 0.579 s | 0.073 s | 0.257 s | 0.144 s |

10,000,000 | 3.91 s | 10.45 s | 3.94 s | 2.03 s | |

100,000,000 | 62.1 s | 172.9 s | 18.3 s | 69.3 s | 31.6 s |

Using 8 threads (supported by y-cruncher and FLINT), and going all the way up to 1 billion digits:

Digits | y-cruncher | FLINT 2 | FLINT 3 |
---|---|---|---|

1,000,000 | 0.032 s | 0.159 s | 0.057 s |

10,000,000 | 1.61 s | 0.67 s | |

100,000,000 | 4.45 s | 22.4 s | 9.34 s |

1,000,000,000 | 63.1 s | 318.9 s | 141.5 s |

The new FFT speeds up computing pi by 2x-3x, which is not bad since we have simply optimized the underlying arithmetic without making any changes to the implementation of the Chudnovsky algorithm. Indeed, many other transcendental functions ($\exp(x)$, $\log(x)$, etc.) are now faster by a similar factor as a result of the faster arithmetic.

I'm pretty happy to reach half the speed of y-cruncher (and more than 6x the speed of gmp-chudnovsky), considering that the FLINT/Arb implementation is fairly generic: it does not use gmp-chudnovsky's sieving trick nor does it optimize the binary splitting basecase, use the middle product trick for division, or reuse transforms; the parallel scheduling could also most likely be improved. In any case, I should note that FLINT is designed exclusively for in-memory computations. On my laptop, it is not possible to compute more than about a billion digits before running out of memory. In contrast, y-cruncher can work efficiently with operands on disk. I don't expect anyone to implement a disk mode for FLINT any time soon.

## Complex modular forms

For another benchmark of transcendental function performance,
here are timings to compute the Dedekind eta function value $\eta(ix)$ where $x = \sqrt{2}-1$.
For this we simply call `acb_modular_eta`.
Internally this involves the evaluation of a complex exponential
and a $q$-series.

Digits | FLINT 2 (1 thread) | FLINT 3 (1 thread) |
---|---|---|

100,000 | 0.109 s | 0.0613 s |

1,000,000 | 2.98 s | 1.32 s |

10,000,000 | 99.7 s | 37.6 s |

Digits | FLINT 2 (8 threads) | FLINT 3 (8 threads) |

100,000 | 0.09 s | 0.048 s |

1,000,000 | 1.82 s | 0.432 s |

10,000,000 | 33 s | 18.5 s |

Speedups are in the ballpark of 2x but can be as large as 4x (1 million digits with 8 threads).

## Credits

The small-prime FFT is the latest contribution to FLINT by the brilliant Dan Schultz who worked nearly full-time on FLINT development until 2022 but now (sadly for us!) has moved on to an entirely different job. I will use this post as an opportunity to thank Dan in public for his exceptional contributions to the FLINT project over the years.

I will also give a shoutout to Albin Ahlbäck who helped with merging the code and adapting the build system for the new FFT. My own contributions have been limited to some trivial integration work and writing Newton-based division and square root code for Arb arithmetic.

Any help from the community testing, tuning or employing the new FFT would be highly appreciated. As I already mentioned, what will have a big impact on FLINT's "real-world" performance is to put it to use for polynomial arithmetic. More on that in the near future, hopefully.

fredrikj.net | Blog index | RSS feed | Follow me on Mastodon | Become a sponsor