A new gamma function implementation
February 11, 2010
The gamma function is probably the most important nonelementary special function. The gamma function or ratios of gamma functions appear in everything from the functional equation of the Riemann zeta function to the normalization factors of hypergeometric functions. Even expressions that are usually rational functions (such as binomial coefficients) are often most conveniently expressed (or in software, implemented) in terms of the gamma function since this automatically provides correct asymptotics and extension to noninteger parameters. Needless to say, a well-implemented gamma function is fundamental for a special functions library.
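To illustrate the binomial-coefficient point: the gamma-function definition extends directly to noninteger arguments. A minimal sketch, using Python's `math.gamma` for brevity (the helper name is mine; mpmath's `gamma` can be substituted for arbitrary precision):

```python
import math

def binomial(n, k):
    # C(n, k) = Gamma(n+1) / (Gamma(k+1) * Gamma(n-k+1)); works for
    # noninteger n and k as long as no gamma factor hits a pole.
    return math.gamma(n + 1) / (math.gamma(k + 1) * math.gamma(n - k + 1))

print(binomial(10, 3))    # 120.0, agrees with the integer definition
print(binomial(0.5, 2))   # -0.125, the x^2 coefficient in the series of sqrt(1+x)
```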
The gamma function implementation used in mpmath until now is one of the oldest parts of the code (I wrote it around three years ago), and although it has done its job, it has some flaws. It mainly uses Spouge’s approximation, which is fairly efficient but nonoptimal for asymptotically large arguments. The implementation also has some subtle precision issues.
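For context, Spouge's formula approximates Γ(z+1) by a single sum of about *a* terms, with *a* chosen from the target precision. A minimal double-precision sketch (my illustration, with a hypothetical `gamma_spouge` and a fixed `a = 12`; not mpmath's actual code, which selects *a* from the working precision):

```python
import math

def gamma_spouge(x, a=12):
    # Toy Spouge approximation of Gamma(x) for real x > 0. Larger a gives
    # more accuracy at the cost of more terms in the sum.
    z = x - 1.0                   # Spouge's formula approximates Gamma(z+1)
    s = math.sqrt(2 * math.pi)    # c_0
    for k in range(1, a):
        # c_k = (-1)^(k-1) / (k-1)! * (a-k)^(k-1/2) * exp(a-k)
        c_k = ((-1) ** (k - 1) / math.factorial(k - 1)
               * (a - k) ** (k - 0.5) * math.exp(a - k))
        s += c_k / (z + k)
    return (z + a) ** (z + 0.5) * math.exp(-(z + a)) * s

print(gamma_spouge(5.27), math.gamma(5.27))  # the two should agree closely
```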
The new gamma function added in this commit is the result of over a year of intermittent work. I did most of the work during last summer but didn’t have time to finish it until recently. The improvements are essentially the following:
- Removal of various overheads
- Special-casing half-integers
- Using Taylor series for small real arguments
- Using Stirling’s series for complex and/or large arguments
- Optimizing for computing loggamma
- Handling values near poles and near the real axis extremely carefully
The difference in performance is quite dramatic. The smallest speedup occurs for small complex arguments, but it turns out that even there, the Stirling series is faster than Spouge’s approximation (at least after the removal of overheads in the new implementation), despite the need to perform argument transformations. For real arguments and large complex arguments, the new implementation is consistently faster by a large factor, especially for loggamma.
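To give an idea of what the Stirling path involves, here is a minimal sketch for real x > 0 in ordinary double precision (the actual mpmath code works on arbitrary-precision numbers, supports complex arguments, and chooses the shift target and the number of Bernoulli terms from the working precision; the function name and the cutoff of 20 are illustrative):

```python
import math

# Bernoulli numbers B_2, B_4, ..., B_14, as needed by Stirling's series
_BERNOULLI = [1/6, -1/30, 1/42, -1/30, 5/66, -691/2730, 7/6]

def log_gamma_stirling(x, shift_to=20.0):
    """Toy log-gamma for real x > 0: shift the argument upward using
    log Gamma(x) = log Gamma(x+1) - log(x), then apply Stirling's series
    log Gamma(x) ~ (x - 1/2) log x - x + log(2 pi)/2
                   + sum_n B_(2n) / (2n (2n-1) x^(2n-1))."""
    if x <= 0:
        raise ValueError("this sketch handles x > 0 only")
    shift = 0.0
    while x < shift_to:          # argument transformation: make x large
        shift -= math.log(x)     # enough for the series to converge fast
        x += 1.0
    s = (x - 0.5) * math.log(x) - x + 0.5 * math.log(2 * math.pi)
    xpow = x                     # runs through x, x^3, x^5, ...
    for n, b in enumerate(_BERNOULLI, start=1):
        s += b / ((2 * n) * (2 * n - 1) * xpow)
        xpow *= x * x
    return s + shift

print(log_gamma_stirling(5.27), math.lgamma(5.27))  # should agree closely
```

In mpmath the same series is applied to complex arguments, with the reflection formula handling the left half-plane.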
I’ll show some benchmarks. The times are in milliseconds; the first value in each cell is the time for the old implementation and the second is the time for the new one, followed by the speedup. The results are on my laptop, with gmpy (I’ll try with Sage later):
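(The numbers below could be reproduced with a harness along these lines; this is my illustration, not the script actually used, and `bench` and the repeat counts are made up:)

```python
import timeit
import mpmath

def bench(z, dps, calls=100):
    # Time mpmath.gamma(z) at a given working precision, in ms per call.
    mpmath.mp.dps = dps
    mpmath.gamma(z)  # warm the coefficient caches first
    t = min(timeit.repeat(lambda: mpmath.gamma(z), number=calls, repeat=5))
    return 1000.0 * t / calls

print(bench(mpmath.mpf("5.27"), 15))
print(bench(mpmath.mpc(0, "12345678.9"), 250))
```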
Firstly, results for gamma(z):
| z | dps=15 | dps=50 | dps=250 | dps=1000 |
| --- | --- | --- | --- | --- |
| 3.0 | 0.0086 0.0077 (1.13x) | 0.0089 0.0076 (1.16x) | 0.0087 0.0075 (1.16x) | 0.0085 0.0077 (1.12x) |
| 5.27 | 0.1200 0.0342 (3.51x) | 0.2513 0.0545 (4.61x) | 2.5964 0.4220 (6.15x) | 80.7089 11.3011 (7.14x) |
| 2.5 | 0.1255 0.0158 (7.94x) | 0.2398 0.0138 (17.41x) | 2.5519 0.0144 (177.72x) | 79.0651 0.0149 (5319.87x) |
| -3.7 | 0.2518 0.0457 (5.51x) | 0.4006 0.0615 (6.52x) | 3.1073 0.4283 (7.26x) | 87.4200 11.3981 (7.67x) |
| (-3.0+4.0j) | 0.6205 0.3942 (1.57x) | 1.0095 0.6093 (1.66x) | 9.0896 5.9099 (1.54x) | 253.9219 191.3358 (1.33x) |
| (3.25-4.125j) | 0.3548 0.2864 (1.24x) | 0.7456 0.4791 (1.56x) | 8.6651 5.6367 (1.54x) | 252.7598 189.2187 (1.34x) |
| (15.0+20.0j) | 0.3980 0.1973 (2.02x) | 0.7526 0.4496 (1.67x) | 8.6583 5.4604 (1.59x) | 250.8646 188.5849 (1.33x) |
| 52.1 | 0.1661 0.0586 (2.83x) | 0.3050 0.0894 (3.41x) | 2.9862 0.5685 (5.25x) | 84.8334 12.4169 (6.83x) |
| 123.25 | 0.1643 0.0638 (2.57x) | 0.3071 0.0963 (3.19x) | 2.8908 0.9194 (3.14x) | 83.8168 14.6588 (5.72x) |
| -200.7 | 0.3011 0.1485 (2.03x) | 0.4897 0.1914 (2.56x) | 3.5145 1.1359 (3.09x) | 89.8617 17.5500 (5.12x) |
| (100.25+100.0j) | 0.4238 0.1857 (2.28x) | 0.7641 0.2757 (2.77x) | 8.6807 3.1403 (2.76x) | 249.6731 171.4369 (1.46x) |
| 12345678.9 | 0.3032 0.0701 (4.32x) | 0.4382 0.0857 (5.11x) | 3.4078 0.3257 (10.46x) | 88.4082 5.2828 (16.74x) |
| (0.0+12345678.9j) | 0.7814 0.1409 (5.54x) | 1.1671 0.1802 (6.48x) | 10.3985 0.6169 (16.86x) | 265.6192 8.3569 (31.78x) |
| 1.0e+20 | 0.8833 0.0798 (11.06x) | 1.3019 0.0899 (14.48x) | 6.4010 0.2939 (21.78x) | 107.0924 3.5141 (30.47x) |
| (0.0+1.0e+20j) | 3.5025 0.1515 (23.11x) | 3.6163 0.1834 (19.72x) | 19.5676 0.5308 (36.86x) | 322.9033 6.2148 (51.96x) |
Secondly, results for loggamma(z):
| z | dps=15 | dps=50 | dps=250 | dps=1000 |
| --- | --- | --- | --- | --- |
| 3.0 | 0.0700 0.0115 (6.07x) | 0.0677 0.0116 (5.84x) | 0.0683 0.0119 (5.73x) | 0.0683 0.0121 (5.65x) |
| 5.27 | 0.2207 0.0564 (3.91x) | 0.3548 0.0796 (4.45x) | 2.8925 0.5221 (5.54x) | 88.3556 12.6452 (6.99x) |
| 2.5 | 0.2210 0.0319 (6.93x) | 0.3565 0.0374 (9.53x) | 2.8595 0.1011 (28.28x) | 81.8333 1.3546 (60.41x) |
| -3.7 | 0.4392 0.0782 (5.62x) | 0.6418 0.0987 (6.50x) | 3.5767 0.5429 (6.59x) | 88.8484 12.7363 (6.98x) |
| (-3.0+4.0j) | 1.1235 0.4942 (2.27x) | 1.5685 0.7224 (2.17x) | 10.0879 6.0459 (1.67x) | 267.2893 198.7613 (1.34x) |
| (3.25-4.125j) | 0.8789 0.2704 (3.25x) | 1.2326 0.4752 (2.59x) | 9.7501 5.4572 (1.79x) | 264.7649 190.4007 (1.39x) |
| (15.0+20.0j) | 0.8983 0.1248 (7.20x) | 1.2923 0.4221 (3.06x) | 9.7917 5.2766 (1.86x) | 265.2711 187.9775 (1.41x) |
| 52.1 | 0.2604 0.0812 (3.21x) | 0.4186 0.1140 (3.67x) | 3.2493 0.6890 (4.72x) | 87.0111 13.9280 (6.25x) |
| 123.25 | 0.2629 0.0478 (5.50x) | 0.4227 0.0692 (6.11x) | 3.1110 1.0216 (3.05x) | 85.9028 15.9917 (5.37x) |
| -200.7 | 0.5439 0.1631 (3.34x) | 0.7465 0.2221 (3.36x) | 3.9305 1.2537 (3.14x) | 92.5380 18.8309 (4.91x) |
| (100.25+100.0j) | 0.9103 0.1179 (7.72x) | 1.3332 0.1849 (7.21x) | 10.1655 3.1083 (3.27x) | 266.3565 170.3099 (1.56x) |
| 12345678.9 | 0.4026 0.0491 (8.21x) | 0.5439 0.0525 (10.36x) | 3.7183 0.2051 (18.13x) | 89.9402 4.0044 (22.46x) |
| (0.0+12345678.9j) | 1.2628 0.0765 (16.52x) | 1.7085 0.0858 (19.91x) | 11.5023 0.2749 (41.84x) | 275.6174 4.4548 (61.87x) |
| 1.0e+20 | 0.9820 0.0495 (19.84x) | 1.3882 0.0540 (25.72x) | 6.6983 0.1481 (45.21x) | 110.1757 2.1954 (50.19x) |
| (0.0+1.0e+20j) | 4.0385 0.0788 (51.24x) | 4.3758 0.0833 (52.54x) | 20.9478 0.1936 (108.20x) | 335.9407 2.3911 (140.50x) |
All times are from a warm cache. The new implementation has slightly longer precomputation time (for the Taylor series coefficients) at high precision, and the Taylor series therefore isn’t used above a certain precision (around 1000 digits) despite being much faster on subsequent calls. I will probably add some user-visible function for controlling this.
Finally, several further optimizations are still possible. The most important would be to port the implementation to Cython. (Improving the underlying elementary functions would also speed up the gamma function.) Potential algorithmic improvements include faster handling of near-integers (using a Taylor series with a reduced number of terms), avoiding the reflection formula for complex arguments in the left half-plane that are not too close to the negative real axis, and performing the argument transformations with fewer multiplications. But I doubt I will have time (or interest; the present code was demanding enough to finish) to work on any of these things, at least any time soon.
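For reference, the reflection formula in question is the standard identity

$$\Gamma(z)\,\Gamma(1-z) = \frac{\pi}{\sin \pi z},$$

which converts an argument in the left half-plane into one in the right half-plane (where Stirling’s series is effective), at the cost of a complex sine, a division, and an extra gamma evaluation.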
The speedups naturally also improve the performance of many other functions, e.g. hypergeometric functions; I might post some benchmarks of those later.