Faster crypto-sized residue ring arithmetic in FLINT

April 18, 2024

The new mpn_mod module available in FLINT 3.2-dev implements residue rings $\mathbb{Z}/n\mathbb{Z}$ for "few-limb" moduli $n$, currently in the range $2^{64} \le n < 2^{1024}$. It offers far superior performance to FLINT's arbitrary-size fmpz_mod residue rings.

Each mpn_mod element is stored inline as a $k$-limb array where $k$ is the number of limbs in the modulus. The mpn_mod format can thus be thought of as an efficient multi-word generalization of nmod (nmod being specialized for single-word moduli $1 \le n < 2^{64}$).

I started working on this feature at the recent workshop and most of the functionality and optimizations are now in place, including fast linear algebra and polynomial arithmetic.

Memory benefits

Like nmod, the mpn_mod format has the advantage that elements can be stored contiguously on the stack or in arrays without a separate heap allocation and pointer indirection for each element.

In contrast, fmpz_mod stores elements inline only when they happen to be a bit smaller than a single limb; larger values require two heap allocations: one for an mpz_t struct and the other for the limb data. To make matters worse, fmpz_mod often allocates twice the number of useful limbs in order to store unreduced products.

The benefits of mpn_mod are pronounced when one creates vectors, polynomials and matrices with many elements. Consider a program that executes the following steps:

  1. Initialize three zeroed $2000 \times 2000$ matrices $A, B, C$ over $\mathbb{Z}/n\mathbb{Z}$
  2. Fill $A$ and $B$ with random data
  3. Set $C$ to the matrix product $A \times B$
  4. Clear $A, B, C$

A test run results in the following memory usage with fmpz_mod and mpn_mod respectively:

128-bit modulus

             After (1)   After (2)   During (3)   After (3)   After (4)
  fmpz_mod    96 MB       471 MB      1178 MB      721 MB      695 MB
  mpn_mod    187 MB       187 MB       229 MB      198 MB       15 MB

256-bit modulus

             After (1)   After (2)   During (3)   After (3)   After (4)
  fmpz_mod    96 MB       593 MB      1788 MB      964 MB      980 MB
  mpn_mod    371 MB       371 MB      1216 MB      372 MB        6 MB

Initially, mpn_mod uses more memory than fmpz_mod to store the zero matrices, since fmpz_mod keeps small values like zero inline in a single word. This illustrates that fmpz_mod can be more efficient for highly structured data. As soon as we fill in dense random data, however, mpn_mod becomes much more economical.

During step (3), varying amounts of temporary memory are allocated depending on the matrix multiplication algorithm. With a 128-bit modulus, fmpz_mod uses multimodular multiplication while mpn_mod performs a less memory-hungry Strassen multiplication in the original representation. With a 256-bit modulus, both fmpz_mod and mpn_mod use multimodular multiplication requiring a similar amount of extra memory, though the overall consumption is lower for mpn_mod since the input and output matrices themselves take less space.

Once we have finished the matrix multiplication, fmpz_mod has allocated space in $C$ to store the unreduced matrix product over $\mathbb{Z}$, while mpn_mod does no such overallocation. At this point the mpn_mod version of the program uses essentially no more memory than before the matrix multiplication, and only a fraction of the memory of the fmpz_mod version.

After clearing the matrices, the fmpz_mod version of the program still uses a lot of memory. This is due to the fact that the fmpz_t type maintains a cache of mpz_t variables for quick reallocation. To free this cache, one must call flint_cleanup() or flint_cleanup_master(). Since mpn_mod has no need for such a cache, we reclaim all memory as soon as we clear the matrices.

Speed benefits

Both mpn_mod and fmpz_mod use many of the same FLINT and GMP mpn functions internally, including routines for reduction mod $n$ using precomputed inverses (recently made faster thanks to assembly code by Albin Ahlbäck). However, mpn_mod functions are significantly faster than their fmpz_mod counterparts due to a combination of factors:

Whereas fmpz_mod simply wraps fmpz_poly and fmpz_mat functions for asymptotically fast polynomial and matrix multiplication, mpn_mod reimplements those algorithms (Karatsuba, Kronecker substitution and multimodular multiplication for polynomials; Waksman and multimodular multiplication for matrices) in mpn format to cut down on overheads.

A table of benchmark data for some common operations on matrices and polynomials is attached at the end of this post. For a couple of quick examples, here is a plot of the speedup of mpn_mod over fmpz_mod for solving a dense linear system $Ax = b$ of dimension $n$, for prime moduli having 128, 256, 512 and 1024 bits:

Small moduli and medium-sized matrices benefit the most, with more than a 3x speedup for 128-bit moduli in some ranges. Linear solving reduces asymptotically to matrix multiplication, which is already quite well optimized for fmpz_mod when the matrices get large enough, but the overheads remain noticeable in practice; even for matrices of size $n = 1024$, mpn_mod yields a factor 1.2 to 1.5 improvement depending on the modulus size.

Similarly, here is a plot of speedups for polynomial GCD:

Using generics

The mpn_mod module is designed from the ground up to use the generics system introduced in FLINT 3.0. Compared to legacy FLINT types, this requires much less code as we get most of the higher level functionality for free. For example, there is no need for mpn_mod_poly and mpn_mod_mat modules; we simply use gr_poly and gr_mat with generic algorithms on top of some optimized core routines (dot products, matrix multiplication, etc.). Indeed, one of the express goals of the generics system was to facilitate adding specialized implementations of this kind to achieve better performance than "one-size-fits-all" solutions like fmpz_mod.

Possible applications

Moduli with a few hundred bits are common in cryptography, so chances are good that someone will find this feature useful in cryptographic research. There are probably some number-theoretic applications that will benefit as well. If someone has a concrete application where mpn_mod is useful, I would love to hear about it!

There are also various things in FLINT that could be accelerated in the future with the help of mpn_mod.

Benchmark data

All results are with a single thread, with FLINT configured without BLAS support.

Speedup of mpn_mod versus fmpz_mod for $n \times n$ matrix addition, multiplication, and solving a linear system $Ax = b$, with a modulus of the specified number of bits:

$ build/mpn_mod/profile/p-mat_vs_fmpz_mod 
  bits    n   add    mul    solve
   80     1   1.633  3.970  1.405
   80     2   2.342  1.841  1.719
   80     4   3.316  1.591  2.127
   80     8   4.170  1.470  2.698
   80    16   4.326  1.319  2.909
   80    32   3.725  1.178  3.427
   80    64   5.222  1.121  3.204
   80   128   2.997  1.216  2.772
   80   256   2.145  1.506  2.254
   80   512   4.090  1.114  2.043
   80  1024   3.193  1.056  1.708
   80  2048   2.964  1.038  1.506
  bits    n   add    mul    solve
  128     1   1.597  3.799  1.290
  128     2   2.412  2.078  1.421
  128     4   3.207  1.728  1.749
  128     8   4.219  1.483  2.387
  128    16   5.254  1.326  2.825
  128    32   4.595  1.182  3.398
  128    64   6.066  1.138  3.379
  128   128   5.765  1.093  2.888
  128   256   1.932  1.154  2.169
  128   512   3.319  1.274  1.876
  128  1024   3.656  1.227  1.651
  128  2048   6.647  1.196  1.527
  bits    n   add    mul    solve
  180     1   1.890  2.339  1.535
  180     2   3.531  1.581  1.486
  180     4   3.309  1.725  1.551
  180     8   4.194  1.582  1.893
  180    16   4.396  1.614  2.352
  180    32   4.440  1.703  2.593
  180    64   4.093  1.163  2.693
  180   128   3.500  1.075  2.224
  180   256   3.967  1.068  1.841
  180   512   4.880  1.054  1.609
  180  1024   7.675  1.049  1.494
  180  2048   9.446  1.031  1.350
  bits    n   add    mul    solve
  256     1   1.812  2.361  1.239
  256     2   2.603  1.636  1.302
  256     4   4.133  1.712  1.477
  256     8   3.948  1.469  1.723
  256    16   4.600  1.488  2.052
  256    32   4.785  1.557  2.253
  256    64   5.694  1.052  2.344
  256   128   4.313  1.060  1.956
  256   256   5.268  1.076  1.617
  256   512   5.829  1.038  1.474
  256  1024   6.679  1.032  1.383
  256  2048   6.395  1.031  1.301
  bits    n   add    mul    solve
  512     1   2.086  1.963  1.159
  512     2   2.590  1.522  1.193
  512     4   3.146  1.402  1.290
  512     8   3.743  1.457  1.473
  512    16   4.059  1.563  1.670
  512    32   4.244  1.619  1.752
  512    64   4.325  1.137  1.755
  512   128   3.581  1.071  1.559
  512   256   5.437  1.076  1.482
  512   512   5.116  1.046  1.359
  512  1024   7.227  1.042  1.305
  512  2048   8.393  1.048  1.235
  bits    n   add    mul    solve
 1024     1   1.946  1.311  1.118
 1024     2   2.493  1.166  1.109
 1024     4   2.354  1.209  1.152
 1024     8   3.448  1.215  1.246
 1024    16   3.264  1.274  1.250
 1024    32   4.373  1.259  1.333
 1024    64   3.217  1.102  1.358
 1024   128   4.729  1.244  1.269
 1024   256   5.141  1.061  1.209
 1024   512   4.195  1.080  1.254
 1024  1024   4.070  1.028  1.221
 1024  2048   6.267  1.020  1.221

Speedup of mpn_mod versus fmpz_mod for operations on length $n$ polynomials:

$ build/mpn_mod/profile/p-poly_vs_fmpz_mod 
  bits    n   mul    sqr    mullow sqrlow divrem gcd
   80     1   1.667  4.167  4.310  4.167  1.933  0.788
   80     2   3.766  3.125  3.474  2.700  2.538  2.510
   80     4   3.026  2.593  3.021  2.545  2.667  2.551
   80     8   3.125  2.775  4.000  3.348  3.487  2.680
   80    16   2.559  2.354  3.370  2.865  4.570  2.815
   80    32   2.065  2.025  2.817  2.661  3.054  2.979
   80    64   1.928  1.817  2.326  2.375  3.495  3.117
   80   128   1.523  1.537  1.442  1.453  2.545  3.317
   80   256   1.469  1.573  1.423  1.451  1.826  3.451
   80   512   1.500  1.560  1.429  1.457  1.735  3.268
   80  1024   1.530  1.598  1.480  1.479  1.690  3.057
   80  2048   1.705  1.691  1.602  1.558  1.833  2.901
   80  4096   1.836  1.810  1.833  1.733  1.782  2.625
   80  8192   1.723  1.690  1.759  1.651  1.694  2.473
  bits    n   mul    sqr    mullow sqrlow divrem gcd
  128     1   3.983  3.898  3.950  4.052  1.703  0.815
  128     2   3.783  3.085  3.459  2.816  2.205  2.039
  128     4   3.056  2.500  3.114  2.584  2.288  1.975
  128     8   3.476  3.171  4.557  3.777  2.947  2.054
  128    16   2.920  2.689  4.033  3.704  3.934  2.302
  128    32   2.235  2.337  3.471  3.409  3.022  2.533
  128    64   1.741  1.932  2.870  2.973  2.985  2.794
  128   128   1.316  1.316  1.269  1.388  2.578  3.049
  128   256   1.311  1.324  1.324  1.306  1.938  3.107
  128   512   1.360  1.368  1.314  1.327  1.619  2.863
  128  1024   1.429  1.401  1.417  1.359  1.621  2.672
  128  2048   1.553  1.538  1.500  1.435  1.694  2.590
  128  4096   1.766  1.690  1.787  1.674  1.767  2.341
  128  8192   1.772  1.708  1.835  1.773  1.745  2.211
  bits    n   mul    sqr    mullow sqrlow divrem gcd
  180     1   2.176  2.355  2.242  2.355  1.614  0.815
  180     2   2.452  1.957  2.215  1.837  1.750  1.804
  180     4   2.097  1.906  2.074  1.958  1.977  1.818
  180     8   2.379  2.338  3.277  2.976  2.215  1.849
  180    16   1.868  1.980  2.782  2.721  2.263  1.810
  180    32   1.885  1.614  2.044  2.361  1.952  1.854
  180    64   1.745  1.636  1.582  1.817  1.727  1.860
  180   128   1.279  1.303  1.242  1.264  1.522  1.896
  180   256   1.310  1.303  1.286  1.289  1.373  2.189
  180   512   1.310  1.287  1.299  1.261  1.453  2.146
  180  1024   1.400  1.377  1.450  1.367  1.528  2.062
  180  2048   1.528  1.421  1.585  1.447  1.622  1.970
  180  4096   1.664  1.557  1.745  1.608  1.711  1.910
  180  8192   1.633  1.538  1.770  1.617  1.688  1.814
  bits    n   mul    sqr    mullow sqrlow divrem gcd
  256     1   2.171  2.406  2.143  2.437  1.429  0.815
  256     2   2.153  2.252  2.065  2.139  1.585  1.533
  256     4   1.821  2.031  1.850  2.063  1.664  1.545
  256     8   2.254  2.551  3.254  3.674  1.872  1.613
  256    16   1.882  2.023  2.713  2.910  2.000  1.684
  256    32   1.660  1.557  2.138  2.366  1.838  1.750
  256    64   1.357  1.402  1.473  1.851  1.529  1.840
  256   128   1.138  1.154  1.105  1.065  1.294  1.820
  256   256   1.143  1.163  1.112  1.112  1.214  1.911
  256   512   1.176  1.194  1.107  1.113  1.221  1.812
  256  1024   1.138  1.152  1.098  1.128  1.194  1.828
  256  2048   1.174  1.175  1.120  1.120  1.163  1.612
  256  4096   1.205  1.191  1.135  1.127  1.167  1.602
  256  8192   1.269  1.225  1.239  1.143  1.265  1.523
  bits    n   mul    sqr    mullow sqrlow divrem gcd
  512     1   1.875  1.763  1.861  1.740  1.261  0.841
  512     2   1.797  1.780  1.720  1.735  1.363  1.323
  512     4   1.500  1.588  1.510  1.561  1.452  1.353
  512     8   2.651  2.703  4.338  4.242  1.568  1.365
  512    16   2.143  2.000  2.911  3.100  1.611  1.410
  512    32   1.561  1.525  1.807  2.125  1.600  1.467
  512    64   1.197  1.156  1.204  1.375  1.250  1.459
  512   128   1.114  1.135  1.080  1.070  1.382  1.516
  512   256   1.110  1.119  1.074  1.073  1.124  1.532
  512   512   1.111  1.116  1.104  1.079  1.107  1.493
  512  1024   1.104  1.146  1.087  1.075  1.122  1.486
  512  2048   1.133  1.147  1.089  1.091  1.105  1.404
  512  4096   1.245  1.158  1.182  1.088  1.158  1.365
  512  8192   1.346  1.316  1.250  1.186  1.211  1.328
  bits    n   mul    sqr    mullow sqrlow divrem gcd
 1024     1   1.316  1.280  1.316  1.306  1.121  0.815
 1024     2   1.253  1.299  1.246  1.321  1.225  1.200
 1024     4   1.197  1.211  1.169  1.229  1.266  1.191
 1024     8   2.026  2.068  1.255  1.500  1.250  1.205
 1024    16   1.435  1.405  1.536  1.806  1.179  1.202
 1024    32   0.984  1.116  1.000  1.182  1.039  1.143
 1024    64   1.046  0.990  1.035  0.966  1.171  1.183
 1024   128   1.060  1.076  1.038  1.044  1.033  1.175
 1024   256   1.054  1.091  1.020  1.026  1.051  1.229
 1024   512   1.059  1.074  1.048  1.037  1.048  1.243
 1024  1024   1.066  1.067  1.041  1.048  1.046  1.213
 1024  2048   1.098  1.077  1.089  1.029  1.094  1.203
 1024  4096   1.204  1.185  1.134  1.113  1.132  1.190
 1024  8192   1.174  1.220  1.123  1.157  1.122  1.171
