High-precision ball arithmetic

About a week ago, I started working on a C library implementing arbitrary-precision real balls, creatively titled Arb. It’s mostly based on FLINT types, using MPFR for testing and for fallback code.

Ball arithmetic is intended to allow efficient, rigorous high-precision numerical evaluation. A real ball is implemented as an interval $[x-r, x+r] \times 2^e$ where $x, e, r$ are integers. This representation supports error propagation while avoiding the ~2x speed and memory overhead of the more conventional endpoint interval representation $[a \times 2^e, b \times 2^f]$.

There are two drawbacks: balls are somewhat less flexible, and they have slightly higher overhead at low precision. At the moment multiplying two arb_t numbers is about 3x slower than multiplying two mpfr_t numbers at 100-400 bits (which would make it somewhat less efficient than MPFI arithmetic). Above 1000 bits, the difference is only a few percent.

The current code is completely naive, and with a bit of low-level hacking it should be possible to get similar performance to a single mpfr_t multiplication at 200-1000 bits. Up to perhaps 256 bits, it would be attractive to have faster, fixed-precision interval types as a complement. (There is actually some prototype code for this hidden somewhere in a branch of my FLINT repository on GitHub…)

I will soon add complex numbers, polynomial balls, and matrix balls (based on exact FLINT polynomials and matrices). In fact, the main reason that I created an arb_t scalar type at all is to simplify interfacing with and writing test code for polynomial balls. But the type can clearly be useful on its own. When the code stabilizes enough, I plan to merge all this back into FLINT.

Part of the design of the arb_t type is that it can hold small integers efficiently, only truncating them to approximate numbers when they grow very large. This is useful for performing binary splitting on infinite series, reducing memory usage (and possibly improving performance) compared to using exact integers throughout.

The arb_t type is still very incomplete (and some functions don’t quite work as they should in all cases), but the module is functional enough that I’ve been able to implement binary splitting of a couple of constants. It’s interesting to see how much of an improvement (if any) ball arithmetic gives over using plain integers, so I’m including some benchmarks below. These functions are not the goal of the Arb library; I plan to do more useful things with it later, but they’re fun to look at as a start.

First out are some timings for π using the Chudnovsky algorithm, compared to the MPFR pi function, Mathematica 7, and the slightly modified version of the gmp-chudnovsky program included in FLINT. All times are measured in seconds.

Digits MMA7 MPFR gmp-chudnovsky Arb
105 0.14 0.17 0.05 0.05
106 2.5 3.81 0.92 1.16
107 42.53 67.5 15.9 22.6
108 1380 252 339

Using ball arithmetic turns out to be pretty much exactly as fast as using plain integers (timings not included), so at least the arb_t is not adding much overhead compared to integers. The gmp-chudnovsky code is around 40% faster because it works with partially factored integers, requiring a lot of extra code (it would be interesting to turn this into a general integer type and find other uses for it) and because it uses heuristic numerical code for the final square root and division. MPFR is slower because it uses the AGM algorithm. Mathematica is probably slower because it relies on an old GMP version.

Timings for Euler’s constant (γ) using the Brent-McMillan binary splitting algorithm, compared to Mathematica, MPFR and the integer (fmpz) binary splitting code in the FLINT arith in FLINT:

Digits MMA7 MPFR fmpz Arb
104 0.18 0.20 0.12 0.07
105 6.42 5.00 3.14 2.36
106 138 174 66.5 51.7

Ball arithmetic gives a small speedup. Memory usage is reduced significantly compared to the fmpz code. I was also able to compute 107 digits, but I lost the output of how long it took (roughly half an hour, I think). MPFR is slower because it uses a different algorithm.

The speedup for the binary splitting stage is actually larger than indicated, because 1/3 of the total time is spent computing log(n) with MPFR, where n is a parameter. One can save time by choosing n to be a power of two (log(2) is fast), but this makes the binary splitting slower when n is much smaller than the next power of two. I don’t know a general way to compute log(n) faster. Perhaps one could round n up to $m 2^e$ where $m \le 7$, and try to compute log(2), log(3), log(5), log(7) quickly. It’s worth pointing out that the binary splitting for power-of-two parameter can be optimized a bit by using bit shifting; this is not implemented yet.

Riemann zeta constants $\zeta(n)$ for integer n, compared to MPFR and the fmpz binary splitting code in the FLINT arith module:

n Digits MMA7 MPFR fmpz Arb
13 104 13.7 0.24 0.58 0.19
13 105 27.3 14.0 6.54
13 106 3639 287 148
43 104 13.7 0.36 2.29 0.50
43 105 43.2 51.6 18.2
43 106 1028 417

Here ball arithmetic gives a nice speedup which increases with larger n. Mathematica and MPFR both use algorithms with complexity somewhere around $\tilde O(b^2)$ where $b$ is the number of digits, whereas the algorithm I implemented has complexity $\tilde O(nb)$. Due to its overhead, it only becomes fast for small n and b in the tens of thousands.

Finally, using a separate hypergeometric series for $\zeta(3)$, this particular constant can be computed even faster. Here ball arithmetic gives a very small speed improvement over using fmpzs:

Digits MMA7 fmpz Arb
105 1.68 0.29 0.24
106 38.9 6.52 5.30
107 772 127 107

(I’m omitting MPFR because it doesn’t have a separate algorithm implemented for this constant.)

Later on, I will add faster code for $\zeta(n)$ where n is large or the precision is small, and code for combined evaluation of $\zeta(2), \zeta(3), \ldots, \zeta(n)$ (used for series expansions of various functions).

Posted in flint, math | Comments Off

Partitions into the quintillions

One of my biggest undertakings last year was to implement the partition function $p(n)$ in FLINT. With this code, I was able to set a record by computing the number of partitions of $10^{19}$, or 10,000,000,000,000,000,000 (ten quintillion).

The number $p(10^{19})$ turns out to have 3,522,804,578 digits, starting and ending with

56469284039962075996762611156427010823552403269435
56912115036797282138529825391237281687215156415242
...
          <3522804378 digits omitted>
                                               ...
18143637106903146536175827132211646720469971909621
95008234537397439410884324996338596264493674631046

More simply, $p(10^{19}) \approx 5.65 \times 10^{3,522,804,577}$ (this kind of approximation is easy to compute; it is the least significant digits that are challenging — or the more significant digits, for the $p$-adically inclined). This number answers the following practically important combinatorial problem: in how many ways can one choose sticks whose lengths are multiples of one meter, such that they add up to $10^{19}$ meters — approximately the thickness of the Milky Way — when placed end to end?

Okay, computing $p(10^n)$ is just a frivolous benchmark problem. Perhaps more interestingly, at least for number theorists, I generated over 22 billion new Ramanujan-type congruences (identities of the form $p(Ak+B) \equiv 0 \bmod m$ for all $k$) using the algorithm of Rhiannon Weaver. This greatly extends the computation of 76,065 congruences done by Weaver in 2001. An example of a new identity is

$$p(28995244292486005245947069 k + 28995221336976431135321047) \equiv 0 \; \operatorname{mod} \; 29$$

for all values of $k$.

Finding the 22 billion new congruences required evaluating the partition function of approximately 470,000 distinct values up to $n \approx 10^{13}$. This took approximately 150 CPU days distributed over 40-48 cores on a computer at the University of Warwick (access provided courtesy of Bill Hart).

The computation of $p(10^{19})$ took just less than 100 hours of CPU time and roughly 150 GiB of RAM, running on a single core. With 2-3 months of CPU time and 2 TB of swap space, it should be possible to compute the number of partitions of one sextillion ($10^{21}$) with the FLINT code, if anyone is up for the task (unfortunately, parallelization for a single $n$ is not supported, and would be difficult to implement to save more than a factor two).

There are actually three implementations of the partition function in FLINT: one for computing $p(0), p(1), \ldots p(n-1)$, another for computing the same set of values modulo a word-size integer, and finally a function for computing the isolated value $p(n)$. The first two are straightforward applications of FLINT’s fast power series arithmetic, and there is perhaps not that much to say about them (with several gigabytes of RAM, you can comfortably compute perhaps $10^6$ exact values or $10^9$ values modulo a small integer).

The computation of isolated values uses the Hardy-Ramanujan-Rademacher formula

$$p(n)=\frac{1}{\pi \sqrt{2}} \sum_{k=1}^\infty \sqrt{k}\, A_k(n)\,
\frac{d}{dn} \left(
\frac {1} {\sqrt{n-\frac{1}{24}}}
\sinh \left[ \frac{\pi}{k}
\sqrt{\frac{2}{3}\left(n-\frac{1}{24}\right)}\right]
\right)$$

where

$$A_k(n) = \sum_{0 \le m < k, (m,k) = 1} \,e^{ \pi i \left[ s(m,\, k) \;-\; \frac{1}{k} 2 nm \right]}$$

and $s(m,k)$ denotes a Dedekind sum. This is a numerical infinite series that has to be evaluated approximately. Truncating it appropriately and using a sufficiently high numerical precision gives an approximation that is guaranteed to round to the correct integer.

The FLINT implementation has complexity $O(n^{1/2+\varepsilon})$, which is quasi-optimal since $p(n)$ has about $n^{1/2}$ digits. Implementing the Hardy-Ramanujan-Rademacher formula with this complexity is rather complicated. Ostensibly, there are $n^{3/2}$ terms in the nested sums, and on top of that you need to work with numbers up to $O(n^{1/2})$ digits.

It was only after a lot of digging in the literature that I stumbled across a paper of A. L. Whiteman that gives identities for factoring the exponential sums $A_k(n)$ into short cosine products, reducing the number of terms. This uses quite a bit of modular arithmetic, and implementing it efficiently relies on the fast routines in FLINT for computing gcd, modular square roots, Legendre symbols, factorizations, etc. of word-size integers.

The numerical computations are done using a combination of MPFR arithmetic, some custom routines for high-precision transcendental functions, and machine precision arithmetic for small terms, with carefully managed precision across the whole computation.

Until now, the largest reported computations of $p(n)$ have involved $n$ around $10^9$. Some software certainly allowed going higher than $10^9$, but this was incredibly slow. My code makes it easy to go much higher on commodity hardware, the primary limitation being available memory. For example, computing $p(10^{16})$ with FLINT takes less than two hours on my laptop, using all the available 3 GiB of RAM (plus some swap space).

Compared to the previously best software (Mathematica and Sage), the FLINT code runs around 500 times faster for large $n$. For example $p(2^{32}-1)$ takes half a second in FLINT and around 200 seconds in both Sage and Mathematica.

The image is a loglog plot of the time needed to compute $p(n)$ using Mathematica 7 (green circles), Sage 4.7 (red triangles), and FLINT (blue squares). The thin dotted line indicates the slope of the trivial lower complexity bound $\Omega(n^{1/2})$ just for writing out the result. A quick extrapolation suggests that $p(10^{19})$ would take about a decade to compute with Mathematica and a million years with Sage (after patching Sage to allow $n$ larger than 32 bits).

More details about the FLINT partition function implementation (and the computational results) will be given in a forthcoming paper.

Posted in flint, partitions, sage | Comments Off

Factorials mod n and Wilson’s theorem

Wilson’s theorem states that an integer greater than 1 is a prime if and only if $(n-1)! \equiv -1 \bmod n$. This immediately gives a simple algorithm to test primality of an integer: just multiply out $1 \times 2 \times \cdots \times (n-1)$, reducing each intermediate product modulo $n$, and check that the final result equals $n – 1$.

The running time is obviously $O(n)$ (assuming that $n$ fits in a single machine word so that arithmetic has constant cost). While elegant, this algorithm is useless in practice because even the most basic non-stupid primality test, trial division, only costs $O(n^{1/2})$ in the worst case.

Perhaps surprisingly, the complexity of primality testing using Wilson’s theorem can be brought down to about $O(n^{1/2})$ as well. The idea is to use fast polynomial arithmetic to compute the factorial faster than by the naive method. Assuming for simplicity that $n – 1$ is a perfect square, let $m = (n-1)^{1/2}$. We form the polynomial $P(x) = (x+1)(x+2) \cdots (x+m)$ and evaluate $P(x)$ simultaneously at the points $x = 0, m, \ldots m(m-1)$. Then $P(0) P(m) \cdots P(m(m-1)) = (n-1)!$. If $n – 1$ is not a perfect square, we choose $m = \lfloor n^{1/2} \rfloor$ and fill in the missing factors by repeated naive multiplication.

All the required operations can be done in roughly $O(n^{1/2})$ time using FFT-based polynomial multiplication and balanced subproduct trees. The complexity is actually slightly higher, about $O(n^{1/2} \log^3 n)$, and the constant overhead is huge, so the method is still considerably slower than trial division. But it is interesting to see how it performs in practice.

Since I recently implemented fast multipoint evaluation in FLINT, the fast factorial algorithm became easy to implement as well. In my repository, it is now enabled by default for computing factorials modulo an integer (n_factorial_mod2_preinv) when the input is large enough; the code is here. Here is how it compares to the naive algorithm for computing $(n-1)!$ modulo $n$:

n Naive factorial Fast factorial
10 12 ns 1.2 μs
102 0.46 μs 4.4 μs
103 7 μs 22 μs
104 78 μs 100 μs
105 0.89 ms 0.52 ms
106 9.8 ms 3.1 ms
107 110 ms 18 ms
108 1.2 s 0.12 s
109 12 s 0.71 s
1010 151 s 3.5 s
1011 1709 s 15 s
1012 5 h (est.) 70 s
1013 50 h (est.) 307 s
1014 500 h (est.) 1282 s

The numbers agree reasonably well with theory (the naive algorithm does not quite take a constant multiple of $n$ time because the arithmetic is done slightly faster when several factors can be accumulated in a single word, and the ratio per order of magnitude for the fast algorithm is a bit larger than 4 and not $10^{1/2} \approx 3.16$, but that is reasonable considering the extra log factors in the complexity). Around 1015, my laptop is running out of memory. To compute larger factorials, one could choose $m$ smaller than the square root of the input, performing several multipoint evaluations. Choosing a fixed large $m$ (say 107) gives an algorithm with $O(n)$ complexity, but a much smaller constant than the naive factorial algorithm.

Although we have achieved a factor 1000 speedup over the naive factorial algorithm and made Wilson’s theorem a feasible primality test for numbers as large as 15 digits without requiring special hardware or patience, it remains completely useless for practical purposes. Trial division is around 10,000 times faster (dividing by all integers up to $(10^{14})^{1/2}$ at 10 ns per iteration takes 0.1 seconds), and sophisticated algorithms are much faster still (the FLINT function n_is_prime tests primality of a number this size in about 7 μs, and n_factor factors it in about 500 μs).

Posted in flint, math, Uncategorized | Comments Off

Blog moved

I’m moving my blog from Blogger (http://fredrik-j.blogspot.com/) to a WordPress installation on my own domain (http://fredrikj.net/blog). This way I won’t be at Google’s mercy in the future, I can tinker with things more easily, and I can use cool plugins like TeX conversion $$24 \sum_{k=1}^{\infty} \frac{1}{k^5} = \int_0^{\infty} \frac{x^4}{e^x-1} dx.$$

I’ve imported the entries from the old blog. Unfortunately, the plugin I used messed up all <pre> text, so there are now a million broken code examples. I have fixed up the most recent half of the archive manually, but the oldest posts are still broken.

Not that I’ve been blogging that much recently, but you never know… it goes in waves.

Posted in blog | Tagged | 1 Comment