# Computing classical combinatorial numbers

November 23, 2021

Time to revisit an old topic! I have recently put some effort into optimizing the functions for computing (exactly) Bernoulli numbers, Euler numbers, Stirling numbers and Bell numbers in Flint and Arb. This is mostly for fun (to see how much code I wrote a long time ago can be improved), but these are number sequences that crop up in many applications in combinatorics and number theory, so perhaps someone will benefit from the faster code. This post will look at the improvements and give a quick survey of the best available algorithms.

## Bernoulli and Euler numbers

The Bernoulli numbers $B_n$ and Euler numbers $E_n$ are defined by the exponential generating functions $$\frac{x}{e^x-1} = \sum_{n=0}^{\infty} \frac{B_n}{n!} x^n, \quad \frac{1}{\cosh(x)} = \sum_{n=0}^{\infty} \frac{E_n}{n!} x^n.$$ An entirely reasonable way to compute these numbers is to perform a direct power series division. In practice, it is even more efficient to write the numbers as special L-function values $$B_{2n} = (-1)^{n+1} \frac{2 (2n)!}{(2\pi)^{2n}} \zeta(2n), \quad E_{2n} = (-1)^n \frac{4^{n+1} (2n)!}{\pi^{2n+1}} \beta(2n+1)$$ (where $\beta(s)$ is the Dirichlet beta function), evaluate the L-functions numerically to sufficiently high precision using the Dirichlet series or Euler products (e.g. $\zeta(s) = \prod_p (1-p^{-s})^{-1}$) , and recover the exact values from the numerical approximations. This is the method used until now in Pari/GP, Arb, and several other libraries.

In 2008, David Harvey introduced a multimodular algorithm for Bernoulli numbers. The idea is to evaluate evaluate $B_n \bmod p$ using many small primes $p$ and reconstruct the numerator of $B_n$ via the Chinese remainder theorem. His algorithm is based on the congruence $$B_{2n} \equiv \frac{2n}{(1-c^{2n})} \sum_{j=1}^{p-1} j^{2n-1} h_c(j) \pmod p$$ where $h_c(x) = [(x \mod p) - c (x/c \mod p)]/p + (c-1)/2$ and $c$ and $p$ are subject to certain restrictions. This summation of length $O(p)$ is further rewritten in several stages and implemented using various technical optimizations to gain a large constant factor speedup (more than 150x) over a naive loop. David's open source implementation (bernmm) performs better than the Euler product method (as implemented in Arb, for example) when $n$ is sufficiently large, although the improvement is slight.

Now, an interesting observation is that the L-function method is very efficient at recovering the high digits of $B_n$ but starts to slow down towards the low digits. The multimodular algorithm, conversely, recovers the low digits very efficiently but slows down towards the high digits. So why not combine the methods and get the best of both worlds? This does work, and it turns out to be significantly faster than either method alone. The new code for computing Bernoulli numbers exactly in Arb does precisely this, combining the existing Euler product code in Arb with an adaptation of David Harvey's modular code. Here are sample timings for computing isolated large Bernoulli numbers:

Number bernmm Arb (old) Arb (new)
$B_n, \; n = 10^4$ 0.10 s 0.012 s 0.012 s
$B_n, \; n = 10^5$ 2.7 s 1.9 s 0.83 s
$B_n, \; n = 10^6$ 222 s 224 s 38 s
$B_n, \; n = 10^7$ ? ? 2903 s

The hybrid algorithm uses an experimentally-determined tuning parameter to choose how many bits to compute via the Euler product and how many bits to compute via the multimodular algorithm. The optimal value increases with $n$ until it stabilizes around $\alpha = 0.28$, meaning that asymptotically 28% of the bits are computed using the multimodular algorithm and 72% via the Euler product.

This hybrid algorithm is unpublished as far as I'm aware, but I'm not the first person to think of it: David Harvey and Edgar Costa studied it a few years ago (private communication) and they observed that, theoretically, it actually gives a logarithmic asymptotic speedup and not just a constant factor improvement. The empirical timings seem to bear this out.

For Euler numbers, an analogous hybrid algorithm is possible. I use the congruence $$E_{2n} \equiv 4^{2+2n-p} (-1)^{(p-1)/2} \sum_{j=1}^{\lfloor p / 4 \rfloor} j^{2n} \pmod p$$

which is valid when $2n \not \equiv 0 \pmod {p-1}$. To generate the powers, I construct a divisor table (this can be done once for all primes) and compute $j^n = d^n e^n$ when $j = d e$ is composite, performing binary exponentiation only when $j$ is prime. David Harvey's algorithm for Bernoulli numbers iterates over powers of a primitive root mod $p$ instead, along with several other tricks that I did not attempt to adapt to the Euler number case (one idea I did borrow is to use Montgomery reduction since multiplications are the dominant operations). As an additional optimization, I factor out the powers of two so that the sum has half the length.

The hybrid method performs a bit better than the old, purely numerical algorithm:

Number Arb (old) Arb (new)
$E_n, \; n = 10^4$ 0.032 s 0.0282 s
$E_n, \; n = 10^5$ 6.123 s 3.816 s
$E_n, \; n = 10^6$ 765 s 406 s

The speedup is smaller than for Bernoulli numbers since the mod $p$ code is slower. Indeed, here the tuning parameter only goes up to $\alpha = 0.11$. Perhaps someone else can improve this further; besides the techniques in bernmm, there is also a recent paper by Hathi, Mossinghoff and Trudgian which presents congruences for Euler numbers requiring fewer terms.

### Subquadratic algorithms

All the above algorithms have $n^{2+o(1)}$ asymptotic bit complexity. In 2012, David Harvey discovered the first subquadratic algorithm for Bernoulli numbers. This essentially a multimodular $p$-adic algorithm, making clever use of fast polynomial arithmetic to compute $B_n$ in time $n^{4/3+o(1)}$. This algorithm is not currently used in any libraries, and I do not know precisely where it starts to win over the hybrid numerical-multimodular method. It looks like it might generalize to Euler numbers, but I have not tried to work out the details.

Last month, I discovered a second, completely different subquadratic algorithm for Bernoulli numbers, which also immediately works for Euler numbers (and many other things). This is a purely numerical algorithm based on rapid computation of Dirichlet L-functions using the approximate functional equation together with bit-burst evaluation of holonomic functions, and it achieves complexity $n^{3/2+o(1)}$. Sadly, my implementation of this subquadratic algorithm runs a thousand times slower even than the classical Euler product method for computing $B_n$ or $E_n$ when $n \le 10^6$, so it it not useful in practice for this application. (The algorithm works very well for some other things: for example, I used it to calculate $\zeta(1/2)$ to 1 million digits.)

### Vector computation

To generate $B_0, \ldots, B_n$ simultaneously, Arb uses Bloemen's power-recycling zeta function algorithm discussed in my paper on gamma function computation. This algorithm looks unbeatable in practice despite having worse asymptotic complexity than the power series method ($n^{3+o(1)}$ versus $n^{2+o(1)}$). For example, $n = 10^5$ only takes 10 minutes while the multimodular power series algorithm implemented in Flint needs 50 minutes. It remains to implement the same method for Euler numbers; there is currently no function for this in Arb, only a vector function in Flint which uses the power series method.

## Bell numbers

I discuss computing Bell numbers $B_n$ (not to be confused with Bernoulli numbers!) in a blog post from 2015. The two main algorithms for computing isolated Bell numbers are Dobinski's formula $$B_n=\frac{1}{e}\sum_{k=0}^\infty \frac{k^n}{k!}$$ and the corresponding discrete sum $$B_n = \sum_{k=1}^n \frac{k^n}{k!} \sum_{j=0}^{n-k} \frac{(-1)^j}{j!}.$$ The code for Bell numbers in Flint that I wrote 10 years ago uses Dobinski's formula when $n$ is small (up to a few thousand) and multimodular evaluation (with 64-bit primes) of the discrete sum when $n$ is large. As far as I know, this has been the fastest available implementation.

Revisiting the code, I found some things to improve. For the Dobinski formula, the old code computed the integer $P = N! \sum_{k=0}^N \frac{k^n}{k!}$ using binary splitting and then divided this by $N!$ and by a numerical approximation of $e$ (for a suitably chosen $N$). The new code uses sieving for the powers (small speedup) and performs iterative scalar updates instead of binary splitting (another small speedup). Finally, instead of approximating $e$ it computes the integer $Q = \sum_{k=0}^{N} N!/k! \approx e N!$ which gives $B_n = \lceil P / Q \rceil$. This is not necessarily faster, but it is more elegant (only integer arithmetic is used).

For the multimodular algorithm, I found some low-level optimizations as well: precomputing a divisor table that can be used for all primes, optimizing the arithmetic operations for full-word primes, and performing the outer summation as a dot product without intermediate reductions.

Sample timings:

Number Flint (old) Flint (new)
$B_n, \; n = 10^3$ 0.00382 s 0.00237 s
$B_n, \; n = 10^4$ 0.988 s 0.52 s
$B_n, \; n = 10^5$ 149 s 66.4 s

I'm pretty happy with a factor-two speedup here since there are no big algorithm changes.

### Possible improvements?

Is it possible to compute Bell numbers faster using some kind of hybrid algorithm? Some leading digits can indeed be computed very quickly using Dobinski's formula, and residues modulo small primes $p$ can be computed quickly using Touchard's congruence $B_{p^m+n} \equiv mB_n+B_{n+1} \pmod p$ (see this paper by Samuel Wagstaff for an algorithm). However, the proportion of leading or trailing digits that can be retrieved quickly by either method seems to be small compared to the size of $B_n$ and this proportion probably decreases with larger $n$. I don't see more than perhaps a 10% speedup being possible for any $n$ by these methods unless I missed a major trick somewhere. I have no idea how to save a large constant factor, let alone achieve subquadratic complexity. If someone has ideas, I'm interested.

### Vector evaluation

For computating a vector of Bell numbers, Flint uses the triangular recurrence for small $n$ and multimodular evaluation of the exponential generating function $$e^{e^x-1} = \sum_{n=0}^\infty \frac{B_n}{n!} x^n$$ for large $n$. There is not much to improve here short of radically optimizing Flint's polynomial arithmetic; I managed to shave a few percent from the running time with some low-level changes ($B_0, \ldots, B_{10000}$ now takes 44 seconds instead of 47 seconds). An implementation of the ordinary generating function $$\sum_{k=0}^{\infty} \frac{x^k}{\prod_{j=1}^k (1-jx)} = \sum_{n=0}^{\infty} B_n x^n$$ using binary splitting in $\mathbb{Z}[x]$ runs almost exactly as fast as the multimodular EGF for large $n$, but does not seem to beat it.

## Stirling numbers

For Stirling numbers of the first and second kind, there are a plethora of generating functions to choose from. A fast implementation comes down to evaluating these power series efficiently and choosing the best formula depending on the combination of parameters $n, k$.

### Of the first kind

For Stirling numbers of the first kind, we have the "forward" and "reverse" ordinary generating functions (OGFs) $$S_1(n,k) = [x^{k-1}] \prod_{j=1}^{n-1} (j + x), \quad S_1(n,k) = [x^{n-k}] \prod_{j=1}^{n-1} (1+j x).$$ The old Flint code simply used the forward OGF; the new code uses the forward OGF when $k < n / 2$ and the reverse OGF otherwise. The power series are expanded using binary splitting; here I managed to squeeze out some more performance by optimizing the basecase and by replacing the final multiplication with a dot product when only computing a single coefficient. Finally, the new code switches to using the "reverse EGF" (exponential generating function) $$S_1(n,k) = (k+1)_{n-k} [x^{n-k}] \left(-\frac{\log(1-x)}{x}\right)^k.$$ when $n$ is large and $k$ is very close to $n$ (experimentally determined cutoff: $n > 140$ and $k > 0.87n$). There is no corresponding elementary "forward EGF" allowing computation in $O(\log n)$ arithmetic operations for small $k$ and presumably such a generating function cannot exist since this would imply a speedup (in terms of arithmetic complexity) for factorials. The gamma function form of the OGF $$S_1(n,k) = [x^{k-1}] \exp\left(\log \Gamma(n+x) - \log \Gamma(1+x)\right)$$ can be used as a basis for numerical approximation algorithms, which I plan to cover at some point in the future, but I do not believe it is useful for exact computation.

Sample timings to compute $S_1(n,0), \ldots, S_1(n,k)$ at once using the vector function:

Number Flint (old) Flint (new)
$n = 10^2$ 0.000215 s 0.000118 s
$n = 10^3$ 0.0512 s 0.0296 s
$n = 10^4$ 9.76 s 6.20 s

Sample timings to compute isolated Stirling numbers $S_1(n,k)$:

Number Flint (old) Flint (new)
$n = 1000, k = 250$ 0.0214 s 0.0106 s
$n = 1000, k = 500$ 0.0342 s 0.0112 s
$n = 1000, k = 750$ 0.0418 s 0.00957 s
$n = 10000, k = 2500$ 5.21 s 3.13 s
$n = 10000, k = 5000$ 7.71 s 3.09 s
$n = 10000, k = 7500$ 7.64 s 2.86 s

### Of the second kind

For Stirling numbers of the second kind, we have the forward OGF and reverse OGFs $$S_2(n,k) = [x^k] \left( e^{-x} \sum_{k=0}^{\infty} \frac{k^n}{k!} x^k\right), \quad S_2(n,k) = [x^{n-k}] \prod_{j=1}^k \frac{1}{1-j x}$$ and the reverse EGF $$S_2(n,k) = (k+1)_{n-k} [x^{n-k}] \left(\frac{\exp(x)-1}{x}\right)^k.$$ When computing a single value, the forward OGF turns into single sum $$S_2(n,k) = \frac{1}{k!} \sum_{j=0}^{k} (-1)^{j} \binom{k}{j} (k-j)^n.$$ The old code in Flint simply used this sum to compute isolated $S_2(n,k)$ values. The new code uses the triangular recurrence, the sum, multimodular evaluation of the sum, or the EGF (using some rather messy cutoffs to determine the best method). The multimodular algorithm performs significantly better than direct evaluation of the sum when both $n$ and $k$ are large in part because the final result $S_2(n,k)$ is much smaller than the terms of the sum over $\mathbb{Z}$. Vector evaluation previously used the triangular recurrence; the new code uses the recurrence, the forward OGF (evaluated as a single convolution), or the forward OGF with multimodular evaluation. I did not find any use for the reverse OGF (the reverse EGF appears to be superior whenever $k$ is close to $n$).

Sample timings to compute $S_2(n,0), \ldots, S_2(n,k)$ at once using the vector function:

Number Flint (old) Flint (new)
$n = 10^2$ 0.000135 s 0.000115 s
$n = 10^3$ 0.0536 s 0.044 s
$n = 10^4$ 60.9 s 13.5 s

Sample timings to compute isolated Stirling numbers $S_2(n,k)$:

Number Flint (old) Flint (new)
$n = 1000, k = 250$ 0.000384 s 0.000363 s
$n = 1000, k = 500$ 0.00112 s 0.000985 s
$n = 1000, k = 750$ 0.00203 s 0.00187 s
$n = 10000, k = 2500$ 0.241 s 0.137 s
$n = 10000, k = 5000$ 0.625 s 0.186 s
$n = 10000, k = 7500$ 1.10 s 0.15 s
$n = 100000, k = 25000$ 83.9 s 14.9 s
$n = 100000, k = 50000$ 212 s 21.3 s
$n = 100000, k = 75000$ 358 s 17.1 s

I'm quite happy with these improvements.

## Partition numbers

For completeness, I will also discuss the partition function $p(n)$ although I have made no recent attempts to improve this code. Arb implements the Hardy-Ramanujan-Rademacher formula $$p(n) = \frac{2 \pi}{{\left(24 n - 1\right)}^{3 / 4}} \sum_{k=1}^{\infty} \frac{A\!\left(n, k\right)}{k} I_{3 / 2}\!\left(\frac{\pi}{k} \sqrt{\frac{2}{3} \left(n - \frac{1}{24}\right)}\right)$$ which I have already written about extensively, for example in this paper and previously on this blog. This algorithm is quasioptimal both in theory and in practice: for large $n$, the time to compute $p(n)$ is roughly twice that of just computing the leading term $\exp(\pi \sqrt{2 n / 3}) / (4 n \sqrt{3})$ to $\log_2 p(n)$ bits of precision, and this looks really tough to beat.

Some small constant factor could probably be saved by optimizing basic Arb functions. The code for roots of unity (or trigonometric constants $\cos(\pi m/n)$) is quite old and could be tightened a bit, which should speed up evaluation of the exponential sums $A_k(n)$. Some ideas: use higher-order Newton iteration, use floating-point code with optimal precision management, and optimize the selection between Taylor series, trigonometric minimal polynomials, complex arithmetic and Chebyshev polynomials (currently unused). Other low-level optimizations might be possible for small $n$: for example, choosing series truncations and precision levels more carefully.

The Arb code already uses one cute trick (which I implemented in 2016) for small $n$: for $n < 1200$, I just compute the low bits $p(n)$ mod $2^{64}$ using the Euler pentagonal recurrence and combine them with a leading-term asymptotic approximation for the high bits (note that $p(n) < 2^{128}$ for $n \le 1458$). Unfortunately, this trick does not scale because computing $p(n) \bmod m$ using modular arithmetic (even using FFT) is asymptotically slower than computing the exact value using the Hardy-Ramanujan-Rademacher formula. The cost could be amortized by precomputing a table of $p(n) \bmod m$ values for several $n$, but then one might as well tabulate the exact $p(n)$ values using a single power series inversion over $\mathbb{Z}$.