
Fast nonbinary arithmetic

February 26, 2026

A multiprecision integer is usually stored as an array of machine-word limbs $0 \le a_i < \beta$ in radix (base) $\beta$: $$a_0 + a_1 \beta + \ldots + a_{n-1} \beta^{n-1}.$$ It is generally most efficient to do arithmetic in a large binary radix and convert to nonbinary only when strictly necessary, e.g. when printing to decimal. Both GMP and FLINT work with "mpn" numbers in the full-word radix $\beta = 2^{64}$ internally (I assume a 64-bit machine in this post).

The new radix module in the upcoming version of FLINT supports multiprecision integers in any radix $2 \le \beta < 2^{64}$. Arithmetic can be done efficiently directly in radix $\beta$, in some cases nearly as efficiently as in mpn format. Radix-dependent operations are of course much faster in the proper radix; for example, one can read or write individual base-10 digits in time $O(1)$ when working with a decimal radix.

There are several intended uses of radix, including:

As I will show in this post, the radix module has already allowed speeding up certain operations in FLINT.


Interface

As in the mpn setting, we generally want to pack multiple "digits" in some small base into "limbs" in the largest radix that fits in a (64-bit) machine word. For example, for base $b = 10$ the optimal limb radix is generally $\beta = 10^{19} \approx 2^{63.1}$ and for base $b = 7$ it is generally $\beta = 7^{22} \approx 2^{61.8}$.

A radix_t context object defines such a limb radix $\beta = b^e$. The radix_t constructor lets the user either choose a specific exponent $e$ or have the largest admissible exponent chosen automatically by passing $e = 0$:

radix_t radix;   // Radix context object

radix_init(radix, 10, 3);  // Initialize radix 10^3
radix_clear(radix);

radix_init(radix, 10, 0);  // Initialize decimal radix with maximal limb size (10^19)
radix_clear(radix);

We provide a low-level interface for working directly with limb arrays, analogous to GMP's mpn layer (and its extension in FLINT). For example, radix_add does the equivalent of mpn_add:

cy = radix_add(res, x, 3, y, 2, radix);

This adds the three-limb operand x to the two-limb operand y, storing the low three limbs of the sum in res and returning the carry-out limb.

There are also memory-managed integers of type radix_integer_t, analogous to GMP's mpz_t. Unlike mpz_t, there is no legacy 32-bit limitation on the size, so a radix_integer_t can have trillions of limbs if the machine has enough memory. Memory-managed radix integers can also be used via FLINT's GR (generic rings) interface.

Implementation and speed of basic arithmetic

How does radix perform compared to mpn in practice? In this section I will discuss the algorithms for basic arithmetic and present some quick benchmark results for $\beta = 10^{19}$ compared to core mpn functions from FLINT and GMP. Since I've chosen a decimal radix, it also makes sense to compare with the wonderful mpdecimal library (libmpdec) which is the C library underlying Python's decimal module.

Addition

Adding two $n$-limb integers requires $n$ chained adds-with-carry $$a + b + c_{\text{in}} \to s + c_{\text{out}} \beta; \quad s = (a + b + c_{\text{in}}) \bmod \beta, \; c_{\text{out}} = \lfloor (a + b + c_{\text{in}}) / \beta \rfloor.$$ On most modern CPUs, add-with-carry in the mpn radix $\beta = 2^{64}$ is a single instruction which executes in a single clock cycle. Indeed, GMP's mpn_add can sustain 1 cycle/limb on my machine (Zen 3) at least for operands in L1 cache: loop instructions are entirely pipelined, and the limiting factor is the single-cycle latency of the carry chain.

With a custom radix $\beta$, each add-with-carry requires a short sequence of arithmetic and logical operations to reduce modulo $\beta$. The best C implementations I've come up with for my machine can be seen here; the loop body for addition does

sub_ddmmss(hi, lo, 0, a[i] + (cy + 1), 0, B - b[i]);
res[i] = lo + (hi & B);
cy = hi;

where B is the limb radix and cy holds the carry offset by one (zero when there is a carry out, all-ones when there is not). The loop body for subtraction does

sub_ddmmss(hi, lo, 0, a[i], 0, b[i] - cy);
res[i] = lo + (hi & B);
cy = hi;

These seem to run at 3-4 cycles/limb; I'd be curious to see if any assembly experts out there can improve these further.

Here are some sample timings for addition, with a number of decimal digits $d$ ranging between 19 (1 limb) and 1.72 billion (89.6 million mpn limbs, or 90.8 million decimal limbs):

Addition

Digits mpn limbs radix limbs Time (mpn) Time (radix) Time (mpdecimal)
19 1 1 2.26e-09 2.75e-09 1.74e-08
76 4 4 2.75e-09 4.26e-09 3.76e-08
247 13 13 4.14e-09 1.02e-08 4.73e-08
4009 209 211 5.24e-08 1.99e-07 2.83e-07
102505 5321 5395 1.65e-06 5.13e-06 6.35e-06
8865571 460169 466609 0.000169 0.000462 0.000556
1725409855 89557617 90811045 0.109 0.119 0.123

With just a handful of limbs there isn't a huge difference between mpn_add and radix_add, since function-call and loop warm-up overheads dominate the actual arithmetic. With many limbs, radix_add runs just under 4 times slower than mpn_add. For operands that no longer fit in L3 cache, memory access becomes the bottleneck, and radix_add is only about 10% slower than mpn_add.

My radix_add seems to be around 20% faster than mpdecimal's mpd_add, so ≈1 cycle/limb faster. The excess overhead of mpd_add for small operands is due to the fact that this function acts on a memory-managed type; mpdecimal doesn't provide a direct mpn_add-like interface for dealing directly with limb arrays.

Multiplication

Multiplication in radix $\beta$ can be viewed as reinterpreting the integers as polynomials in the formal variable $\beta$, multiplying in $\mathbb{Z}[\beta]$, and then splitting each coefficient into high and low parts $c \to c_{\text{lo}} + c_{\text{hi}} \beta$ and propagating the high parts as carries.

For short operands, it is best to multiply using the $O(n^2)$ schoolbook algorithm. In the mpn radix $\beta = 2^{64}$, schoolbook multiplication is usually implemented with multiplications, splittings and additions interleaved, since the splitting by $\beta$ in a widening limb multiplication $$ab \to c_{\text{lo}} + c_{\text{hi}} \beta, \quad c_{\text{lo}} = ab \bmod \beta, \quad c_{\text{hi}} = \lfloor ab / \beta \rfloor$$ is essentially free (the CPU does a single multiplication and delivers the low and high words in separate registers). Indeed, in my recent paper with Albin Ahlbäck we observe that $n$-limb mpn schoolbook multiplication in FLINT costs roughly $1.0 n^2$ cycles on Apple M1, or just 1 cycle per limb product, meaning that multiplications execute essentially as fast as the addition carry chains can keep up; AMD Zen 3 is just slightly slower at ≈1.3 cycles per limb product.

In a general nonbinary radix, the splitting is very expensive as it requires a double-word division and remainder by $\beta$. FLINT's radix module uses Granlund-Möller division with precomputed inverse rather than relying on the hardware division, but this is still far too expensive to do $O(n^2)$ times. Instead, the radix schoolbook multiplication computes the coefficients of the product in $\mathbb{Z}[\beta]$ one by one as three-word integers, requiring just $O(n)$ total splittings by $\beta$.

To multiply large radix integers in (practically speaking) time $M(n) = O(n \log n)$, we rely on fft_small, the same FFT backend FLINT uses to multiply mpn integers (and hence also fmpz_t, arb_t, etc.), nmod_poly, fmpz_poly etc. The FFT-based multiplication code for radix is almost identical to that for nmod_poly multiplication over $(\mathbb{Z} / \beta \mathbb{Z})[x]$, the only difference being that instead of just reducing each output coefficient mod $\beta$, we compute both a quotient and remainder.

There is currently no intermediate-complexity multiplication algorithm: radix_mul switches directly from schoolbook multiplication to FFT around $n \approx 80$. Karatsuba and Toom-Cook multiplication underperform in radix arithmetic since the linear interpolation/extrapolation operations are relatively expensive; optimizing radix operations in the Karatsuba range to work around this problem could be an interesting future project.

Here are some timing results comparing flint_mpn_mul, radix_mul, and mpdecimal's mpd_mul:

Multiplication

Digits mpn limbs radix limbs Time (mpn) Time (radix) Time (mpdecimal)
19 1 1 1.8e-09 7.09e-09 1.4e-08
76 4 4 5.9e-09 5.51e-08 6.69e-08
247 13 13 6.1e-08 2.54e-07 5.87e-07
4009 209 211 7.02e-06 1.31e-05 0.000169
102505 5321 5395 0.000277 0.000334 0.0043
8865571 460169 466609 0.0348 0.0419 0.462
1725409855 89557617 90811045 13.711 16.782 150.269

Multiplication with small operands has much worse relative performance than addition: radix_mul is up to 10x slower than flint_mpn_mul for operands with fewer than 100 digits (around 1-5 limbs). This makes sense, since this is the regime where we spend the largest proportion of time on divisions by $\beta$. For numbers with around 1000 digits, the ratio between flint_mpn_mul and radix_mul drops to a more reasonable 3x, and around 10,000 digits the ratio drops again; radix_mul becomes nearly interchangeable with flint_mpn_mul, being only about 10-30% slower asymptotically.

For large input, mpdecimal is 10-20 times slower than FLINT. This is actually very decent; mpdecimal clearly has pretty good FFT code with good asymptotic scaling, even if the speed isn't quite state of the art. Let me note that there is work from 2020 by Viktor Krapinevsky on fast decimal multiplication reporting a 3x-5x speedup over mpdecimal. It would thus seem that radix is at least twice as fast as Krapinevsky's code, but I have not tested it myself.

FLINT's FFT multiplication also supports multithreading, but I will not do any multithreaded benchmarks in this post.

Division

Finally, we look at the performance of computing the quotient $\lfloor A / B \rfloor$ where $A$ has $2d$ digits and $B$ has $d$ digits.

Basecase division for radix converts to mpn, divides, and then converts back. When the number of limbs $n$ gets large, we use Newton iteration which runs in $O(M(n))$. Newton iteration uses truncated products (radix_mulmid) which saves an appreciable amount of time over full products.

For mpn, we benchmark GMP's mpn_tdiv_qr for small operands and FLINT's fmpz_tdiv_q for large operands; fmpz_tdiv_q is asymptotically faster than GMP, but there is currently no direct mpn variant of this function in FLINT. These are compared to radix_divrem (with a NULL remainder argument) and mpdecimal's mpd_divint.

Division

Digits mpn limbs radix limbs Time (mpn) Time (radix) Time (mpdecimal)
19 1 1 1.29e-08 5.06e-09 2.18e-08
76 4 4 4.36e-08 1.28e-07 2.45e-07
247 13 13 1.81e-07 5.95e-07 1.41e-06
4009 209 211 1.47e-05 3.06e-05 0.000249
102505 5321 5395 0.000888 0.000873 0.0208
8865571 460169 466609 0.106 0.108 2.334
1725409855 89557617 90811045 44.31 51.578 831.818

Asymptotically radix division is nearly as fast as mpn/fmpz division, and actually slightly faster in some cases; this is due to incorporating some optimizations that I haven't yet been able to put in FLINT's fmpz division code. As noted earlier, a good use of radix is to experiment with algorithm optimizations!

Converting to mpn and back probably isn't the optimal way to do basecase division, but the benchmarks suggest that this approach isn't actually horrible: "just" 3x slower than direct mpn division, i.e. much better in relative terms than schoolbook-sized multiplication. Optimizing schoolbook division natively for radix looks pretty hard, though, so it's probably not something I'm going to attempt in the near term.

Fast radix conversion

Radix conversion is the task of converting an integer from base $\beta$ to another base $\gamma$: $$a_0 + a_1 \beta + \ldots + a_{n-1} \beta^{n-1} \to b_0 + b_1 \gamma + \ldots + b_{m-1} {\gamma}^{m-1}$$ This can either be implemented as repeated multiplication by powers of $\beta$ in the arithmetic of base $\gamma$, or as repeated division with remainder by powers of $\gamma$ in the arithmetic of base $\beta$. The complexity of either method is $O(M(n) \log n)$ when choosing powers in a divide-and-conquer fashion. However, multiplication is typically 2-4 times faster than division with remainder, so the method based on multiplication is clearly preferable.

The main use for radix conversion is to convert between binary and nonbinary. In the radix module, conversion between base $2^{64}$ and a custom radix is handled by the following methods:

The FLINT functions fmpz_set_str and fmpz_get_str internally convert between radix $\beta = 2^{64}$ and $\gamma = 10^{19}$ for decimal input and output. Decimal output (fmpz_get_str) previously used division since we only had binary mpn arithmetic for integers. In FLINT 3.5-dev, we now use the multiplication-based radix_set_mpn instead.


The result: decimal output is now almost 2x faster asymptotically. For example, printing a 100 million-digit fmpz_t previously took 7.3 seconds and now takes 4.2 seconds. There is also a speedup in the basecase code for small integers around 100 digits. These improvements also affect printing arb_t.

Note: there is a different way to achieve a similar speedup while still working in the binary radix, using scaled remainder trees, but it is more complex to implement.

Many-digits benchmarks: Fibonacci, sqrt(2), e

Let's look at how radix compares to FLINT's mpn-based fmpz_t integers and arb_t real numbers on three high-precision digit benchmarks:

The point is that if we use fmpz_t or arb_t, we have to follow up the actual computation of the number with a radix conversion to obtain the decimal digits. For the first two benchmark problems which have complexity $O(M(n))$, radix conversion should dominate the actual computation asymptotically; for the third task (calculating $e$) which costs $O(M(n) \log n)$, radix conversion should take comparable time to the actual computation. When we use decimal radix arithmetic instead, converting to a string just requires converting the limbs one by one in $O(n)$ time. It is thus possible for radix to be faster overall even if the actual computation of the number is slower.

To benchmark Fibonacci numbers, we compare the base types fmpz_t and radix_integer_t with the same generic implementation (gr_fib_ui) of the GMP Fibonacci number algorithm which uses a binary powering-like recurrence.

Fibonacci numbers

Time (fmpz) Time (radix)
$n$ Digits in $F_n$ Compute $F_n$ To string Compute $F_n$ To string Speedup
$10^3$ 209 1.06e-07 2.41e-07 1.03e-06 8.42e-08 0.31x
$10^4$ 2090 1.53e-06 7.10e-06 6.81e-06 7.16e-07 1.15x
$10^5$ 20899 4.95e-05 0.000198 0.000101 7.04e-06 2.29x
$10^6$ 208988 0.00127 0.00353 0.00103 7.06e-05 4.36x
$10^7$ 2,089,877 0.00800 0.0462 0.0108 0.000706 4.71x
$10^8$ 20,898,764 0.131 0.719 0.201 0.00706 4.09x
$10^9$ 208,987,640 1.67 9.87 2.54 0.145 4.30x
$10^{10}$ 2,089,876,403 22.0 133.2 27.3 1.38 5.41x

The speedup ratio shows the improvement in total time (compute $F_n$ + convert to string). As expected, the fmpz version is generally faster for the Fibonacci number computation itself (by up to 10x for small $n$ and 20-50% for large $n$), but radix pulls ahead already at $n = 10^4$ and wins by a large margin for larger $n$ thanks to the cheap string conversion.

For $\sqrt{2}$ and $e$ we compare the existing arb_sqrt_ui and arb_const_e to new radix implementations that I have tried to optimize as well as possible. The basic algorithms are the same as those used by the Arb functions (Newton iteration for $1/\sqrt{2}$; binary splitting for the $e$ series). There is no radix-based real float or ball type yet (watch for this in the future), so for this benchmark we just compute radix approximations of $\sqrt{2} \cdot 10^n$ and $e \cdot 10^n$ as integers using fixed-point arithmetic.

Digits of $\sqrt{2}$

Time (arb) Time (radix)
Digits Compute $\sqrt{2}$ To string Compute $\sqrt{2}$ To string Speedup
$10^3$ 9.04e-07 5.73e-06 2.49e-06 3.44e-07 2.34x
$10^4$ 2.98e-05 0.000105 5.18e-05 3.38e-06 2.44x
$10^5$ 0.000507 0.00198 0.000505 3.38e-05 4.62x
$10^6$ 0.00537 0.026 0.00536 0.000339 5.50x
$10^7$ 0.0638 0.375 0.0705 0.00341 5.94x
$10^8$ 1.154 5.142 1.09 0.0341 5.60x
$10^9$ 15.61 69.072 13.269 0.67 6.08x

The speedup is similar to that on the Fibonacci benchmark. The radix square root code actually performs a bit better than the one for arb at some sizes; it is clear that various existing functions in FLINT would benefit from some of the optimizations I've so far only implemented for radix.

Digits of $e$

Time (arb) Time (radix)
Digits Compute $e$ To string Compute $e$ To string Speedup
$10^3$ 4.01e-05 6.17e-06 3.56e-05 3.56e-07 1.29x
$10^4$ 0.000402 0.000126 0.000463 3.49e-06 1.13x
$10^5$ 0.00505 0.00213 0.00537 3.5e-05 1.33x
$10^6$ 0.0564 0.0278 0.0622 0.000351 1.35x
$10^7$ 0.673 0.375 0.79 0.00352 1.32x
$10^8$ 8.844 5.061 10.592 0.0355 1.31x
$10^9$ 119.609 70.872 140.731 0.674 1.35x

For $e$, radix conversion is a bit faster than computing the actual number, but radix nevertheless wins by a small factor.

What about digits of $\pi$? I expect radix to be slower than arb since the Chudnovsky series costs $O(M(n) \log^2 n)$. I have not implemented this algorithm for radix yet as it requires truncated binary splitting for best performance; this will make more sense to benchmark once there is proper floating-point code.

p-adic arithmetic

Finally, let us look at (finite-precision) arithmetic in the $p$-adic field $\mathbb{Q}_p$, where $p$ is assumed to be a word-size prime number. At a low level, this is essentially arithmetic in the residue ring $\mathbb{Z} / p^N \mathbb{Z}$ for some precision parameter $N$. There are several preexisting alternatives for such arithmetic in FLINT:

All implementations use binary arithmetic, so a multiplication generally requires computing the full integer product and then doing a division with remainder by $p^N$. In contrast, multiplying $p$-adically using radix just requires computing a truncated product. This is both faster and avoids the need to precompute $p^N$. Consider the following benchmark for $7$-adic multiplication, where we use radix with the limb radix $\beta = 7^{22}$ (i.e. $n$ limbs give $N = 22n$ digits of precision):

Multiplication mod $(7^{22})^{n}$

Limbs $n$ Time (padic) Time (fmpz_mod) Time (mpn_mod) Time (radix)
1 6.98e-08 4.77e-09 - 9.58e-09
2 8.69e-08 1.08e-08 7.74e-09 1.54e-08
4 1.27e-07 6.97e-08 3.39e-08 3.38e-08
8 2.09e-07 1.32e-07 7.29e-08 7.03e-08
16 4.48e-07 3.38e-07 2.21e-07 2.37e-07
32 1.19e-06 8.76e-07 - 6.09e-07
64 3.42e-06 2.57e-06 - 1.86e-06
256 3.26e-05 2.72e-05 - 9.67e-06
1024 0.000253 0.000146 - 4.3e-05
4096 0.00134 0.000674 - 0.000188
65536 0.0247 0.0116 - 0.0036
1048576 0.576 0.265 - 0.0845
8388608 6.08 2.71 - 0.831

The difference in speed between fmpz_mod and padic is due to the former using a precomputed inverse of $p^N$ for division, while padic only precomputes $p^N$ itself (note that precomputations for the fmpz_mod and padic contexts are omitted in this benchmark). This optimization would clearly be worthwhile for padic too. The arithmetic in mpn_mod is essentially the same as in fmpz_mod except that there is less overhead. However, radix is competitive even with mpn_mod at low precision, and more than 3x faster than fmpz_mod at high precision. This suggests that the best way to optimize high-precision $p$-adic arithmetic in FLINT going forward may be to base it on radix instead of mpn/fmpz, at least for certain operations.

Finally, we consider $p$-adic division. Neither fmpz_mod nor mpn_mod is aware of the special form of the modulus, so both compute an extended GCD when asked for a modular inverse, which costs $O(M(n) \log n)$. Both padic and radix use Newton iteration with repeatedly doubling $p$-adic precision, which runs in $O(M(n))$.

Inverse mod $(7^{22})^{n}$

Limbs $n$ Time (padic) Time (fmpz_mod) Time (mpn_mod) Time (radix)
1 2.19e-07 4.74e-08 - 3.02e-08
2 3.54e-07 3.12e-07 2.60e-07 5.1e-08
4 5.9e-07 6.29e-07 5.75e-07 1.03e-07
8 8.65e-07 1.39e-06 1.33e-06 2.07e-07
16 1.47e-06 2.89e-06 2.84e-06 3.71e-07
32 3.01e-06 6.36e-06 - 9.84e-07
64 7.69e-06 1.59e-05 - 2.55e-06
256 6.25e-05 0.000135 - 2.21e-05
1024 0.00051 0.00133 - 9.65e-05
4096 0.00329 0.0106 - 0.000423
65536 0.061 0.487 - 0.00817
1048576 1.27 15.4 - 0.193
8388608 12.7 202.0 - 1.92

Using the right algorithm here is clearly important as radix gets more than 100x faster than fmpz_mod asymptotically. We also achieve a 6x speedup over padic as the Newton iteration is more efficient when implemented using truncated multiplications in the proper radix.


