fredrikj.net / blog /
Faster crypto-sized residue ring arithmetic in FLINT
April 18, 2024
The new mpn_mod module available in FLINT 3.2-dev implements residue rings $\mathbb{Z}/n\mathbb{Z}$ for "few-limb" moduli $n$, currently in the range $2^{64} \le n < 2^{1024}$. It offers far superior performance to FLINT's arbitrary-size fmpz_mod residue rings.
Each mpn_mod element is stored inline as a $k$-limb array where $k$ is the number of limbs in the modulus. The mpn_mod format can thus be thought of as an efficient multi-word generalization of nmod (nmod being specialized for single-word moduli $1 \le n < 2^{64}$).
I started working on this feature at the recent workshop and most of the functionality and optimizations are now in place, including fast linear algebra and polynomial arithmetic.
Memory benefits
Like nmod, the mpn_mod format has the advantage that elements can be stored contiguously on the stack or in arrays without a separate heap allocation and pointer indirection for each element.
In contrast, fmpz_mod stores elements inline only when they happen to be a bit smaller than a single limb; larger values require two heap allocations: one for an mpz_t struct and the other for the limb data. To make matters worse, fmpz_mod often allocates twice the number of useful limbs in order to store unreduced products.
The benefits of mpn_mod are pronounced when one creates vectors, polynomials and matrices with many elements. Consider a program that executes the following steps:
- Initialize three zeroed $2000 \times 2000$ matrices $A, B, C$ over $\mathbb{Z}/n\mathbb{Z}$
- Fill $A$ and $B$ with random data
- Set $C$ to the matrix product $A \times B$
- Clear $A, B, C$
A test run results in the following memory usage with fmpz_mod and mpn_mod respectively:
128-bit modulus | |||||
After (1) | After (2) | During (3) | After (3) | After (4) | |
fmpz_mod | 96 MB | 471 MB | 1178 MB | 721 MB | 695 MB |
mpn_mod | 187 MB | 187 MB | 229 MB | 198 MB | 15 MB |
256-bit modulus | |||||
After (1) | After (2) | During (3) | After (3) | After (4) | |
fmpz_mod | 96 MB | 593 MB | 1788 MB | 964 MB | 980 MB |
mpn_mod | 371 MB | 371 MB | 1216 MB | 372 MB | 6 MB |
At first, mpn_mod uses more memory than fmpz_mod to store the zero matrices. This does demonstrate that fmpz_mod can be more efficient when one has structured data. However, as soon as we fill in dense data, mpn_mod becomes much more economical.
Temporarily during step (3), varying amounts of extra memory are allocated depending on the matrix multiplication algorithm. With a 128-bit modulus, fmpz_mod uses multimodular multiplication while mpn_mod does a less memory-hungry Strassen multiplication in the original representation. With a 256-bit modulus, both fmpz_mod and mpn_mod use multimodular multiplication requiring a similar amount of extra memory, though the overall consumption is lower for mpn_mod since the input and output matrices use less space.
Once we have finished the matrix multiplication, fmpz_mod has allocated space in $C$ to store the unreduced matrix product over $\mathbb{Z}$, while mpn_mod does no such overallocation. At this point the mpn_mod version of the program uses essentially no more memory than before the matrix multiplication, and only a fraction of the memory of the fmpz_mod version.
After clearing the matrices, the fmpz_mod version of the program still uses a lot of memory. This is due to the fact that the fmpz_t type maintains a cache of mpz_t variables for quick reallocation. To free this cache, one must call flint_cleanup() or flint_cleanup_master(). Since mpn_mod has no need for such a cache, we reclaim all memory as soon as we clear the matrices.
Speed benefits
Both mpn_mod and fmpz_mod use many of the same FLINT and GMP mpn functions internally, including routines for things like mod $n$ reduction using precomputed inverses (recently made faster thanks to assembly code by Albin Ahlbäck). However, mpn_mod functions are significantly faster than their fmpz_mod counterparts due to a combination of factors:
- There is no bookkeeping overhead to deal with variable-size entries
- We can have tighter loops for vector operations (addition, vector-scalar multiplication, dot products, etc.)
- There is better memory locality
- It is cheaper to create and destroy temporary variables
- Algorithm cutoffs (e.g. between two matrix multiplication algorithms) can be tuned more precisely for the problem, since there are fewer free parameters to worry about
Whereas fmpz_mod simply wraps fmpz_poly and fmpz_mat functions for asymptotically fast polynomial and matrix multiplication, mpn_mod reimplements those algorithms (Karatsuba, Kronecker substitution and multimodular multiplication for polynomials; Waksman and multimodular multiplication for matrices) in mpn format to cut down on overheads.
A table of benchmark data for some common operations on matrices and polynomials is attached at the end of this post. For a couple of quick examples, here is a plot of the speedup of mpn_mod over fmpz_mod for solving a dense linear system $Ax = b$ of dimension $n$, for prime moduli having 128, 256, 512 and 1024 bits:
Small moduli and medium-sized matrices benefit the most, with more than a 3x speedup for 128-bit moduli in some ranges. Linear solving reduces asymptotically to matrix multiplication which is quite well optimized for fmpz_mod when the matrices get large enough, but the overheads are noticeable in practice; even for matrices of size $n = 1024$ mpn_mod yields a factor 1.5 to 1.2 improvement.
Similarly, here is a plot of speedups for polynomial GCD:
Using generics
The mpn_mod module is designed from the ground up to use the generics system introduced in FLINT 3.0. Compared to legacy FLINT types, this requires much less code as we get most of the higher level functionality for free. For example, there is no need for mpn_mod_poly and mpn_mod_mat modules; we simply use gr_poly and gr_mat with generic algorithms on top of some optimized core routines (dot products, matrix multiplication, etc.). Indeed, one of the express goals of the generics system was to facilitate adding specialized implementations of this kind to achieve better performance than "one-size-fits-all" solutions like fmpz_mod.
Possible applications
Moduli with a few hundred bits are common in cryptography, so chances that someone will find this feature useful in cryptographic research. There are probably some number theoretic applications that will benefit as well. If someone has a concrete application where mpn_mod is useful, I would love to hear about it!
There are also various things in FLINT that could be accelerated in the future with the help of mpn_mod:
- Finite extension fields $F_{p^k}$ for multi-word $p$ (building on mpn_mod for the $k = 1$ case)
- Primality tests and factoring algorithms (though some of these already use mpn arithmetic internally)
- Polynomial factorisation over $\mathbb{Z}$, possibly, through faster Hensel lifting
- p-adic arithmetic
- fmpz_mod functions, by converting to mpn_mod internally in some cases
Benchmark data
All results are with a single thread, with FLINT configured without BLAS support.
Speedup of mpn_mod versus fmpz_mod for $n \times n$ matrix addition, multiplication, and solving a linear system $Ax = b$, with a modulus of the specified number of bits:
$ build/mpn_mod/profile/p-mat_vs_fmpz_mod bits n add mul solve 80 1 1.633 3.970 1.405 80 2 2.342 1.841 1.719 80 4 3.316 1.591 2.127 80 8 4.170 1.470 2.698 80 16 4.326 1.319 2.909 80 32 3.725 1.178 3.427 80 64 5.222 1.121 3.204 80 128 2.997 1.216 2.772 80 256 2.145 1.506 2.254 80 512 4.090 1.114 2.043 80 1024 3.193 1.056 1.708 80 2048 2.964 1.038 1.506 bits n add mul solve 128 1 1.597 3.799 1.290 128 2 2.412 2.078 1.421 128 4 3.207 1.728 1.749 128 8 4.219 1.483 2.387 128 16 5.254 1.326 2.825 128 32 4.595 1.182 3.398 128 64 6.066 1.138 3.379 128 128 5.765 1.093 2.888 128 256 1.932 1.154 2.169 128 512 3.319 1.274 1.876 128 1024 3.656 1.227 1.651 128 2048 6.647 1.196 1.527 bits n add mul solve 180 1 1.890 2.339 1.535 180 2 3.531 1.581 1.486 180 4 3.309 1.725 1.551 180 8 4.194 1.582 1.893 180 16 4.396 1.614 2.352 180 32 4.440 1.703 2.593 180 64 4.093 1.163 2.693 180 128 3.500 1.075 2.224 180 256 3.967 1.068 1.841 180 512 4.880 1.054 1.609 180 1024 7.675 1.049 1.494 180 2048 9.446 1.031 1.350 bits n add mul solve 256 1 1.812 2.361 1.239 256 2 2.603 1.636 1.302 256 4 4.133 1.712 1.477 256 8 3.948 1.469 1.723 256 16 4.600 1.488 2.052 256 32 4.785 1.557 2.253 256 64 5.694 1.052 2.344 256 128 4.313 1.060 1.956 256 256 5.268 1.076 1.617 256 512 5.829 1.038 1.474 256 1024 6.679 1.032 1.383 256 2048 6.395 1.031 1.301 bits n add mul solve 512 1 2.086 1.963 1.159 512 2 2.590 1.522 1.193 512 4 3.146 1.402 1.290 512 8 3.743 1.457 1.473 512 16 4.059 1.563 1.670 512 32 4.244 1.619 1.752 512 64 4.325 1.137 1.755 512 128 3.581 1.071 1.559 512 256 5.437 1.076 1.482 512 512 5.116 1.046 1.359 512 1024 7.227 1.042 1.305 512 2048 8.393 1.048 1.235 bits n add mul solve 1024 1 1.946 1.311 1.118 1024 2 2.493 1.166 1.109 1024 4 2.354 1.209 1.152 1024 8 3.448 1.215 1.246 1024 16 3.264 1.274 1.250 1024 32 4.373 1.259 1.333 1024 64 3.217 1.102 1.358 1024 128 4.729 1.244 1.269 1024 256 5.141 1.061 1.209 1024 512 4.195 1.080 1.254 1024 1024 4.070 1.028 1.221 1024 2048 6.267 1.020 1.221
Speedup of mpn_mod versus fmpz_mod for operations on length $n$ polynomials:
$ build/mpn_mod/profile/p-poly_vs_fmpz_mod bits n mul sqr mullow sqrlow divrem gcd 80 1 1.667 4.167 4.310 4.167 1.933 0.788 80 2 3.766 3.125 3.474 2.700 2.538 2.510 80 4 3.026 2.593 3.021 2.545 2.667 2.551 80 8 3.125 2.775 4.000 3.348 3.487 2.680 80 16 2.559 2.354 3.370 2.865 4.570 2.815 80 32 2.065 2.025 2.817 2.661 3.054 2.979 80 64 1.928 1.817 2.326 2.375 3.495 3.117 80 128 1.523 1.537 1.442 1.453 2.545 3.317 80 256 1.469 1.573 1.423 1.451 1.826 3.451 80 512 1.500 1.560 1.429 1.457 1.735 3.268 80 1024 1.530 1.598 1.480 1.479 1.690 3.057 80 2048 1.705 1.691 1.602 1.558 1.833 2.901 80 4096 1.836 1.810 1.833 1.733 1.782 2.625 80 8192 1.723 1.690 1.759 1.651 1.694 2.473 bits n mul sqr mullow sqrlow divrem gcd 128 1 3.983 3.898 3.950 4.052 1.703 0.815 128 2 3.783 3.085 3.459 2.816 2.205 2.039 128 4 3.056 2.500 3.114 2.584 2.288 1.975 128 8 3.476 3.171 4.557 3.777 2.947 2.054 128 16 2.920 2.689 4.033 3.704 3.934 2.302 128 32 2.235 2.337 3.471 3.409 3.022 2.533 128 64 1.741 1.932 2.870 2.973 2.985 2.794 128 128 1.316 1.316 1.269 1.388 2.578 3.049 128 256 1.311 1.324 1.324 1.306 1.938 3.107 128 512 1.360 1.368 1.314 1.327 1.619 2.863 128 1024 1.429 1.401 1.417 1.359 1.621 2.672 128 2048 1.553 1.538 1.500 1.435 1.694 2.590 128 4096 1.766 1.690 1.787 1.674 1.767 2.341 128 8192 1.772 1.708 1.835 1.773 1.745 2.211 bits n mul sqr mullow sqrlow divrem gcd 180 1 2.176 2.355 2.242 2.355 1.614 0.815 180 2 2.452 1.957 2.215 1.837 1.750 1.804 180 4 2.097 1.906 2.074 1.958 1.977 1.818 180 8 2.379 2.338 3.277 2.976 2.215 1.849 180 16 1.868 1.980 2.782 2.721 2.263 1.810 180 32 1.885 1.614 2.044 2.361 1.952 1.854 180 64 1.745 1.636 1.582 1.817 1.727 1.860 180 128 1.279 1.303 1.242 1.264 1.522 1.896 180 256 1.310 1.303 1.286 1.289 1.373 2.189 180 512 1.310 1.287 1.299 1.261 1.453 2.146 180 1024 1.400 1.377 1.450 1.367 1.528 2.062 180 2048 1.528 1.421 1.585 1.447 1.622 1.970 180 4096 1.664 1.557 1.745 1.608 1.711 1.910 180 8192 1.633 1.538 1.770 1.617 1.688 1.814 bits n mul sqr mullow sqrlow divrem gcd 256 1 2.171 2.406 2.143 2.437 1.429 0.815 256 2 2.153 2.252 2.065 2.139 1.585 1.533 256 4 1.821 2.031 1.850 2.063 1.664 1.545 256 8 2.254 2.551 3.254 3.674 1.872 1.613 256 16 1.882 2.023 2.713 2.910 2.000 1.684 256 32 1.660 1.557 2.138 2.366 1.838 1.750 256 64 1.357 1.402 1.473 1.851 1.529 1.840 256 128 1.138 1.154 1.105 1.065 1.294 1.820 256 256 1.143 1.163 1.112 1.112 1.214 1.911 256 512 1.176 1.194 1.107 1.113 1.221 1.812 256 1024 1.138 1.152 1.098 1.128 1.194 1.828 256 2048 1.174 1.175 1.120 1.120 1.163 1.612 256 4096 1.205 1.191 1.135 1.127 1.167 1.602 256 8192 1.269 1.225 1.239 1.143 1.265 1.523 bits n mul sqr mullow sqrlow divrem gcd 512 1 1.875 1.763 1.861 1.740 1.261 0.841 512 2 1.797 1.780 1.720 1.735 1.363 1.323 512 4 1.500 1.588 1.510 1.561 1.452 1.353 512 8 2.651 2.703 4.338 4.242 1.568 1.365 512 16 2.143 2.000 2.911 3.100 1.611 1.410 512 32 1.561 1.525 1.807 2.125 1.600 1.467 512 64 1.197 1.156 1.204 1.375 1.250 1.459 512 128 1.114 1.135 1.080 1.070 1.382 1.516 512 256 1.110 1.119 1.074 1.073 1.124 1.532 512 512 1.111 1.116 1.104 1.079 1.107 1.493 512 1024 1.104 1.146 1.087 1.075 1.122 1.486 512 2048 1.133 1.147 1.089 1.091 1.105 1.404 512 4096 1.245 1.158 1.182 1.088 1.158 1.365 512 8192 1.346 1.316 1.250 1.186 1.211 1.328 bits n mul sqr mullow sqrlow divrem gcd 1024 1 1.316 1.280 1.316 1.306 1.121 0.815 1024 2 1.253 1.299 1.246 1.321 1.225 1.200 1024 4 1.197 1.211 1.169 1.229 1.266 1.191 1024 8 2.026 2.068 1.255 1.500 1.250 1.205 1024 16 1.435 1.405 1.536 1.806 1.179 1.202 1024 32 0.984 1.116 1.000 1.182 1.039 1.143 1024 64 1.046 0.990 1.035 0.966 1.171 1.183 1024 128 1.060 1.076 1.038 1.044 1.033 1.175 1024 256 1.054 1.091 1.020 1.026 1.051 1.229 1024 512 1.059 1.074 1.048 1.037 1.048 1.243 1024 1024 1.066 1.067 1.041 1.048 1.046 1.213 1024 2048 1.098 1.077 1.089 1.029 1.094 1.203 1024 4096 1.204 1.185 1.134 1.113 1.132 1.190 1024 8192 1.174 1.220 1.123 1.157 1.122 1.171
fredrikj.net | Blog index | RSS feed | Follow me on Mastodon | Become a sponsor