# Faster crypto-sized residue ring arithmetic in FLINT

April 18, 2024

The new mpn_mod module available in FLINT 3.2-dev implements residue rings $\mathbb{Z}/n\mathbb{Z}$ for "few-limb" moduli $n$, currently in the range $2^{64} \le n < 2^{1024}$. It offers far superior performance to FLINT's arbitrary-size fmpz_mod residue rings.

Each mpn_mod element is stored inline as a $k$-limb array where $k$ is the number of limbs in the modulus. The mpn_mod format can thus be thought of as an efficient multi-word generalization of nmod (nmod being specialized for single-word moduli $1 \le n < 2^{64}$).

I started working on this feature at the recent workshop and most of the functionality and optimizations are now in place, including fast linear algebra and polynomial arithmetic.

## Memory benefits

Like nmod, the mpn_mod format has the advantage that elements can be stored contiguously on the stack or in arrays without a separate heap allocation and pointer indirection for each element.

In contrast, fmpz_mod stores elements inline only when they happen to be a bit smaller than a single limb; larger values require two heap allocations: one for an mpz_t struct and the other for the limb data. To make matters worse, fmpz_mod often allocates twice the number of useful limbs in order to store unreduced products.

The benefits of mpn_mod are pronounced when one creates vectors, polynomials and matrices with many elements. Consider a program that executes the following steps:

1. Initialize three zeroed $2000 \times 2000$ matrices $A, B, C$ over $\mathbb{Z}/n\mathbb{Z}$
2. Fill $A$ and $B$ with random data
3. Set $C$ to the matrix product $A \times B$
4. Clear $A, B, C$

A test run results in the following memory usage with fmpz_mod and mpn_mod respectively:

 128-bit modulus After (1) After (2) During (3) After (3) After (4) fmpz_mod 96 MB 471 MB 1178 MB 721 MB 695 MB mpn_mod 187 MB 187 MB 229 MB 198 MB 15 MB 256-bit modulus After (1) After (2) During (3) After (3) After (4) fmpz_mod 96 MB 593 MB 1788 MB 964 MB 980 MB mpn_mod 371 MB 371 MB 1216 MB 372 MB 6 MB

At first, mpn_mod uses more memory than fmpz_mod to store the zero matrices. This does demonstrate that fmpz_mod can be more efficient when one has structured data. However, as soon as we fill in dense data, mpn_mod becomes much more economical.

Temporarily during step (3), varying amounts of extra memory are allocated depending on the matrix multiplication algorithm. With a 128-bit modulus, fmpz_mod uses multimodular multiplication while mpn_mod does a less memory-hungry Strassen multiplication in the original representation. With a 256-bit modulus, both fmpz_mod and mpn_mod use multimodular multiplication requiring a similar amount of extra memory, though the overall consumption is lower for mpn_mod since the input and output matrices use less space.

Once we have finished the matrix multiplication, fmpz_mod has allocated space in $C$ to store the unreduced matrix product over $\mathbb{Z}$, while mpn_mod does no such overallocation. At this point the mpn_mod version of the program uses essentially no more memory than before the matrix multiplication, and only a fraction of the memory of the fmpz_mod version.

After clearing the matrices, the fmpz_mod version of the program still uses a lot of memory. This is due to the fact that the fmpz_t type maintains a cache of mpz_t variables for quick reallocation. To free this cache, one must call flint_cleanup() or flint_cleanup_master(). Since mpn_mod has no need for such a cache, we reclaim all memory as soon as we clear the matrices.

## Speed benefits

Both mpn_mod and fmpz_mod use many of the same FLINT and GMP mpn functions internally, including routines for things like mod $n$ reduction using precomputed inverses (recently made faster thanks to assembly code by Albin Ahlbäck). However, mpn_mod functions are significantly faster than their fmpz_mod counterparts due to a combination of factors:

• There is no bookkeeping overhead to deal with variable-size entries
• We can have tighter loops for vector operations (addition, vector-scalar multiplication, dot products, etc.)
• There is better memory locality
• It is cheaper to create and destroy temporary variables
• Algorithm cutoffs (e.g. between two matrix multiplication algorithms) can be tuned more precisely for the problem, since there are fewer free parameters to worry about

Whereas fmpz_mod simply wraps fmpz_poly and fmpz_mat functions for asymptotically fast polynomial and matrix multiplication, mpn_mod reimplements those algorithms (Karatsuba, Kronecker substitution and multimodular multiplication for polynomials; Waksman and multimodular multiplication for matrices) in mpn format to cut down on overheads.

A table of benchmark data for some common operations on matrices and polynomials is attached at the end of this post. For a couple of quick examples, here is a plot of the speedup of mpn_mod over fmpz_mod for solving a dense linear system $Ax = b$ of dimension $n$, for prime moduli having 128, 256, 512 and 1024 bits:

Small moduli and medium-sized matrices benefit the most, with more than a 3x speedup for 128-bit moduli in some ranges. Linear solving reduces asymptotically to matrix multiplication which is quite well optimized for fmpz_mod when the matrices get large enough, but the overheads are noticeable in practice; even for matrices of size $n = 1024$ mpn_mod yields a factor 1.5 to 1.2 improvement.

Similarly, here is a plot of speedups for polynomial GCD:

## Using generics

The mpn_mod module is designed from the ground up to use the generics system introduced in FLINT 3.0. Compared to legacy FLINT types, this requires much less code as we get most of the higher level functionality for free. For example, there is no need for mpn_mod_poly and mpn_mod_mat modules; we simply use gr_poly and gr_mat with generic algorithms on top of some optimized core routines (dot products, matrix multiplication, etc.). Indeed, one of the express goals of the generics system was to facilitate adding specialized implementations of this kind to achieve better performance than "one-size-fits-all" solutions like fmpz_mod.

## Possible applications

Moduli with a few hundred bits are common in cryptography, so chances that someone will find this feature useful in cryptographic research. There are probably some number theoretic applications that will benefit as well. If someone has a concrete application where mpn_mod is useful, I would love to hear about it!

There are also various things in FLINT that could be accelerated in the future with the help of mpn_mod:

• Finite extension fields $F_{p^k}$ for multi-word $p$ (building on mpn_mod for the $k = 1$ case)
• Primality tests and factoring algorithms (though some of these already use mpn arithmetic internally)
• Polynomial factorisation over $\mathbb{Z}$, possibly, through faster Hensel lifting
• fmpz_mod functions, by converting to mpn_mod internally in some cases

## Benchmark data

All results are with a single thread, with FLINT configured without BLAS support.

Speedup of mpn_mod versus fmpz_mod for $n \times n$ matrix addition, multiplication, and solving a linear system $Ax = b$, with a modulus of the specified number of bits:

$build/mpn_mod/profile/p-mat_vs_fmpz_mod bits n add mul solve 80 1 1.633 3.970 1.405 80 2 2.342 1.841 1.719 80 4 3.316 1.591 2.127 80 8 4.170 1.470 2.698 80 16 4.326 1.319 2.909 80 32 3.725 1.178 3.427 80 64 5.222 1.121 3.204 80 128 2.997 1.216 2.772 80 256 2.145 1.506 2.254 80 512 4.090 1.114 2.043 80 1024 3.193 1.056 1.708 80 2048 2.964 1.038 1.506 bits n add mul solve 128 1 1.597 3.799 1.290 128 2 2.412 2.078 1.421 128 4 3.207 1.728 1.749 128 8 4.219 1.483 2.387 128 16 5.254 1.326 2.825 128 32 4.595 1.182 3.398 128 64 6.066 1.138 3.379 128 128 5.765 1.093 2.888 128 256 1.932 1.154 2.169 128 512 3.319 1.274 1.876 128 1024 3.656 1.227 1.651 128 2048 6.647 1.196 1.527 bits n add mul solve 180 1 1.890 2.339 1.535 180 2 3.531 1.581 1.486 180 4 3.309 1.725 1.551 180 8 4.194 1.582 1.893 180 16 4.396 1.614 2.352 180 32 4.440 1.703 2.593 180 64 4.093 1.163 2.693 180 128 3.500 1.075 2.224 180 256 3.967 1.068 1.841 180 512 4.880 1.054 1.609 180 1024 7.675 1.049 1.494 180 2048 9.446 1.031 1.350 bits n add mul solve 256 1 1.812 2.361 1.239 256 2 2.603 1.636 1.302 256 4 4.133 1.712 1.477 256 8 3.948 1.469 1.723 256 16 4.600 1.488 2.052 256 32 4.785 1.557 2.253 256 64 5.694 1.052 2.344 256 128 4.313 1.060 1.956 256 256 5.268 1.076 1.617 256 512 5.829 1.038 1.474 256 1024 6.679 1.032 1.383 256 2048 6.395 1.031 1.301 bits n add mul solve 512 1 2.086 1.963 1.159 512 2 2.590 1.522 1.193 512 4 3.146 1.402 1.290 512 8 3.743 1.457 1.473 512 16 4.059 1.563 1.670 512 32 4.244 1.619 1.752 512 64 4.325 1.137 1.755 512 128 3.581 1.071 1.559 512 256 5.437 1.076 1.482 512 512 5.116 1.046 1.359 512 1024 7.227 1.042 1.305 512 2048 8.393 1.048 1.235 bits n add mul solve 1024 1 1.946 1.311 1.118 1024 2 2.493 1.166 1.109 1024 4 2.354 1.209 1.152 1024 8 3.448 1.215 1.246 1024 16 3.264 1.274 1.250 1024 32 4.373 1.259 1.333 1024 64 3.217 1.102 1.358 1024 128 4.729 1.244 1.269 1024 256 5.141 1.061 1.209 1024 512 4.195 1.080 1.254 1024 1024 4.070 1.028 1.221 1024 2048 6.267 1.020 1.221  Speedup of mpn_mod versus fmpz_mod for operations on length$n$polynomials: $ build/mpn_mod/profile/p-poly_vs_fmpz_mod
bits    n   mul    sqr    mullow sqrlow divrem gcd
80     1   1.667  4.167  4.310  4.167  1.933  0.788
80     2   3.766  3.125  3.474  2.700  2.538  2.510
80     4   3.026  2.593  3.021  2.545  2.667  2.551
80     8   3.125  2.775  4.000  3.348  3.487  2.680
80    16   2.559  2.354  3.370  2.865  4.570  2.815
80    32   2.065  2.025  2.817  2.661  3.054  2.979
80    64   1.928  1.817  2.326  2.375  3.495  3.117
80   128   1.523  1.537  1.442  1.453  2.545  3.317
80   256   1.469  1.573  1.423  1.451  1.826  3.451
80   512   1.500  1.560  1.429  1.457  1.735  3.268
80  1024   1.530  1.598  1.480  1.479  1.690  3.057
80  2048   1.705  1.691  1.602  1.558  1.833  2.901
80  4096   1.836  1.810  1.833  1.733  1.782  2.625
80  8192   1.723  1.690  1.759  1.651  1.694  2.473
bits    n   mul    sqr    mullow sqrlow divrem gcd
128     1   3.983  3.898  3.950  4.052  1.703  0.815
128     2   3.783  3.085  3.459  2.816  2.205  2.039
128     4   3.056  2.500  3.114  2.584  2.288  1.975
128     8   3.476  3.171  4.557  3.777  2.947  2.054
128    16   2.920  2.689  4.033  3.704  3.934  2.302
128    32   2.235  2.337  3.471  3.409  3.022  2.533
128    64   1.741  1.932  2.870  2.973  2.985  2.794
128   128   1.316  1.316  1.269  1.388  2.578  3.049
128   256   1.311  1.324  1.324  1.306  1.938  3.107
128   512   1.360  1.368  1.314  1.327  1.619  2.863
128  1024   1.429  1.401  1.417  1.359  1.621  2.672
128  2048   1.553  1.538  1.500  1.435  1.694  2.590
128  4096   1.766  1.690  1.787  1.674  1.767  2.341
128  8192   1.772  1.708  1.835  1.773  1.745  2.211
bits    n   mul    sqr    mullow sqrlow divrem gcd
180     1   2.176  2.355  2.242  2.355  1.614  0.815
180     2   2.452  1.957  2.215  1.837  1.750  1.804
180     4   2.097  1.906  2.074  1.958  1.977  1.818
180     8   2.379  2.338  3.277  2.976  2.215  1.849
180    16   1.868  1.980  2.782  2.721  2.263  1.810
180    32   1.885  1.614  2.044  2.361  1.952  1.854
180    64   1.745  1.636  1.582  1.817  1.727  1.860
180   128   1.279  1.303  1.242  1.264  1.522  1.896
180   256   1.310  1.303  1.286  1.289  1.373  2.189
180   512   1.310  1.287  1.299  1.261  1.453  2.146
180  1024   1.400  1.377  1.450  1.367  1.528  2.062
180  2048   1.528  1.421  1.585  1.447  1.622  1.970
180  4096   1.664  1.557  1.745  1.608  1.711  1.910
180  8192   1.633  1.538  1.770  1.617  1.688  1.814
bits    n   mul    sqr    mullow sqrlow divrem gcd
256     1   2.171  2.406  2.143  2.437  1.429  0.815
256     2   2.153  2.252  2.065  2.139  1.585  1.533
256     4   1.821  2.031  1.850  2.063  1.664  1.545
256     8   2.254  2.551  3.254  3.674  1.872  1.613
256    16   1.882  2.023  2.713  2.910  2.000  1.684
256    32   1.660  1.557  2.138  2.366  1.838  1.750
256    64   1.357  1.402  1.473  1.851  1.529  1.840
256   128   1.138  1.154  1.105  1.065  1.294  1.820
256   256   1.143  1.163  1.112  1.112  1.214  1.911
256   512   1.176  1.194  1.107  1.113  1.221  1.812
256  1024   1.138  1.152  1.098  1.128  1.194  1.828
256  2048   1.174  1.175  1.120  1.120  1.163  1.612
256  4096   1.205  1.191  1.135  1.127  1.167  1.602
256  8192   1.269  1.225  1.239  1.143  1.265  1.523
bits    n   mul    sqr    mullow sqrlow divrem gcd
512     1   1.875  1.763  1.861  1.740  1.261  0.841
512     2   1.797  1.780  1.720  1.735  1.363  1.323
512     4   1.500  1.588  1.510  1.561  1.452  1.353
512     8   2.651  2.703  4.338  4.242  1.568  1.365
512    16   2.143  2.000  2.911  3.100  1.611  1.410
512    32   1.561  1.525  1.807  2.125  1.600  1.467
512    64   1.197  1.156  1.204  1.375  1.250  1.459
512   128   1.114  1.135  1.080  1.070  1.382  1.516
512   256   1.110  1.119  1.074  1.073  1.124  1.532
512   512   1.111  1.116  1.104  1.079  1.107  1.493
512  1024   1.104  1.146  1.087  1.075  1.122  1.486
512  2048   1.133  1.147  1.089  1.091  1.105  1.404
512  4096   1.245  1.158  1.182  1.088  1.158  1.365
512  8192   1.346  1.316  1.250  1.186  1.211  1.328
bits    n   mul    sqr    mullow sqrlow divrem gcd
1024     1   1.316  1.280  1.316  1.306  1.121  0.815
1024     2   1.253  1.299  1.246  1.321  1.225  1.200
1024     4   1.197  1.211  1.169  1.229  1.266  1.191
1024     8   2.026  2.068  1.255  1.500  1.250  1.205
1024    16   1.435  1.405  1.536  1.806  1.179  1.202
1024    32   0.984  1.116  1.000  1.182  1.039  1.143
1024    64   1.046  0.990  1.035  0.966  1.171  1.183
1024   128   1.060  1.076  1.038  1.044  1.033  1.175
1024   256   1.054  1.091  1.020  1.026  1.051  1.229
1024   512   1.059  1.074  1.048  1.037  1.048  1.243
1024  1024   1.066  1.067  1.041  1.048  1.046  1.213
1024  2048   1.098  1.077  1.089  1.029  1.094  1.203
1024  4096   1.204  1.185  1.134  1.113  1.132  1.190
1024  8192   1.174  1.220  1.123  1.157  1.122  1.171