fredrikj.net / blog /

Implementing log-gamma using the Stirling series

September 19, 2012

The algorithm for the gamma function described in the previous post is only useful when the argument is a small rational (or algebraic) number and one is interested in tens of thousands of digits.

Candidates for computing the gamma (or log-gamma) function of a generic argument include Stirling’s series, Lanczos’s approximation, and Spouge’s approximation (there are some other viable methods, such as contour integration, but they likely perform poorly in arbitrary-precision arithmetic and they are harder to give error bounds for).

These methods have a common form: they approximate gamma or log-gamma by a simple exponential or logarithmic asymptotic formula, adding a correction in the form of a parameter-dependent rational function or series. The series needs to be expanded to order $O(n)$ to give an accuracy of $n$ digits, so the cost for $n$-digit accuracy is $O(n)$ multiplications, or $O(n^2 \log^{1+\varepsilon} n)$ time. One also needs to generate the coefficients of the series, which is more or less difficult depending on the choice of formula (however, the coefficients can be cached for repeated calls, so the coefficients generally only need to be computed once for a given precision).

What are the pros and cons of each formula? Spouge’s and Lanczos’s approximations have the advantage of providing uniform accuracy (even for complex arguments), whereas Stirling’s series is asymptotic, so one has to perform $O(n)$ additional multiplications to reduce small arguments to larger ones. On the other hand, Stirling’s series clearly wins when the argument is large (and this is a quite common use case). In fact, Stirling’s series is always the best choice, even for small input: with parameters chosen optimally, Stirling’s series requires somewhat less than the equivalent of $n$ multiplications even in the worst case, while Spouge’s approximation always requires $1.3n$ divisions (or $2.6$ multiplications if one clears denominators).

When accounting for coefficient generation, Spouge’s approximation is quite attractive: the coefficients are given by a simple, explicit formula containing only exponentials and square roots, and they can even be generated on the fly without sacrificing too much performance. Stirling’s series requires Bernoulli numbers, and while there are simple schemes to generate them, computing thousands of Bernoulli numbers efficiently is more involved. Fortunately, I’ve already gone through the trouble of writing such very involved code for FLINT, so this is not a problem!

Lanczos’s approximation is similar to Spouge’s approximation, and seems to require slightly fewer terms at least at low precision. Unfortunately, selecting parameters to achieve a desired level of accuracy seems difficult; I don’t think it’s even known whether the method truly scales to arbitrary precision. The Lanczos coefficients are also relatively complicated to generate.

To make a long story short, I’ve implemented the log gamma function for real $x$ using Stirling’s series in Arb. This is just a first version of the code: it needs some cleanup, tests and documentation, and bug fixes for a few special cases. It uses a couple of tricks: it caches Bernoulli numbers, it changes the precision dynamically in the series evaluation, and the rising factorial in the argument reduction is evaluated using a trick that cuts time almost in half.

I’ve already mentioned that Stirling’s series requires selecting parameters. One must choose parameters $r$ and $n$, perform argument reduction to replace the input $x$ by $x + r$, and then evaluate the asymptotic series truncated to order $n$. The cost is roughly proportional to $r + n$, so one wants to minimize this expression subject to the constraint that the asymptotic series gives desired accuracy; large $r$ gives faster convergence, requiring smaller $n$, and vice versa. It turns out that there is a very easy way to choose parameters nearly optimally: one chooses $r$ such that $x + r$ is at least a fixed constant multiple $\beta$ of the precision in bits, and then selects the smallest $n$ that gives full accuracy. It’s easy to show (by applying Stirling’s formula recursively to the error term in Stirling’s formula!) that the constant $\beta$ must satisfy $\beta > \log(2) / (2\pi) \approx 0.11$, but the optimal, implementation-dependent value is slightly larger (I found 0.2 to work well in both mpmath and Arb). Note that while the parameter is chosen heuristically, the subsequent error bounding is done rigorously.

The following table shows timings in milliseconds for evaluating $\log \Gamma(3.7)$, where 3.7 is a full-precision floating-point number (i.e. not input as the exact fraction 37/10). This is done on a 32-bit system, as I wanted to compare with Mathematica.

Digits Arb mpmath 0.17 Pari/GP 2.5.1 MPFR 3.1.0 Mathematica 8.0
10 0.048 0.027 0.18 0.040
30 0.079 0.037 0.34 0.070
100 0.22 0.093 1.3 0.19
300 0.76 0.46 12 1.0
1000
(repeated)
50
6.5
3000
3.9
240
16
240 110
3000
(repeated)
1300
80
5400
220
5500
140
4300 1400
10000
(repeated)
31000
1200
190000
4200
200000
2000
160000 24000

At 1000 digits and above, the table includes separate timings the first evaluation and repeated evaluations, showing the significance of caching Bernoulli numbers. MPFR and Mathematica apparently do not cache Bernoulli numbers; at least not that many of them. Mathematica is presumably faster from cold cache at 10000 digits because it optimizes the parameter selection for this case; tweaking the parameters in Arb to compute fewer Bernoulli numbers, I can get the first evaluation down to 10 seconds, but this also increases the time for subsequent calls to about 3 seconds. It’s an interesting tradeoff.

It’s important to point out that the computations are not entirely equivalent: Arb takes an interval as input and computes a rigorous output interval. MPFR, of course, guarantees correct rounding, and it’s not hard to do the error propagation by hand since the log gamma function is monotonic and its derivative can be bounded easily. But in Arb one gets this for free (the code could even be optimized in the future by converting to an exact number internally). Mathematica performs error propagation, but only heuristically.

Below approximately 1000 digits, the Arb timings are bogged down by overhead, and there should be a speedup of up to 10x when I get around to rewriting the relevant core functions (a multiplication is literally 10x too slow at 10-digit precision right now, and still 2x too slow at 300-digit precision). I did not time Pari/GP at low precision, because I’m too GP-illiterate to get accurate timings out of it. Anyhow, Pari/GP clearly does quite well, presumably so at low precision as well.

One can infer from the table that I optimized the mpmath implementation quite extensively :-) Actually, up to around 1000 digits, mpmath wins over the C/C++ libraries mostly thanks to using an algorithm based on Taylor series in this precision range — provided that the argument is small. The Taylor series is asymptotically superior to the Stirling/Spouge/Lanczos expansions, only requiring $O(n / \log n)$ multiplications, but requires $O(x)$ extra multiplications for large $x$, and has the drawback that the coefficient precomputation is complicated and expensive (much worse than computing Bernoulli numbers). However, Arb largely knows how to do this precomputation now, so I will soon be able to implement this algorithm and compare.

There is actually an even faster algorithm for the gamma function by Peter Borwein, requiring only $O(n^{1/2} \log^2 n)$ multiplications. Unlike the Taylor series method, the big-oh constant is large, so it’s probably only competitive at precisions well beyond 1000 or perhaps even 10000 digits. I expect that I will be able to try out this algorithm in the near future as well, as soon as I’ve implemented some needed polynomial operations.