fredrikj.net / blog /
Arb 2.23 released
June 29, 2022
Just in time to wrap up this academic semester, I've tagged version 2.23.0 of Arb (files, changes). This post will review the most notable improvements.
Contents
Lerch transcendent
Arb 2.23 adds the function acb_dirichlet_lerch_phi for computing the Lerch transcendent, the analytic continuation of the series $$\Phi(z,s,a) = \sum_{n=0}^{\infty} \frac{z^n}{(n+a)^s}.$$
See the previous post Computing the Lerch transcendent for details. A machine-precision wrapper is also available in the fpwrap module (thanks Valentin Boettcher).
Performance improvements
Arb 2.23 features some of the most notable speed improvements in the history of the library.
Multithreading
At huge precision, multiplications, mathematical constants and elementary functions are now multithreaded internally. On an 8-core laptop, it is typical to see a 3x to 5x parallel speedup at millions of digits, with smaller speedups starting from around 10,000 digits. Other operations that are now multithreaded include numerical integration (helpful in low precision as well!), construction of Hilbert class polynomials and Bernoulli number generation.
Previously discussed in Parallelizing Arb for fun and profit and Multithreaded elementary functions.
Note: the user must call flint_set_num_threads(T) to enable use of T cores. Some speedups will only appear if you use the latest FLINT git version instead of the FLINT 2.9 release (there will be parallel speedups if you use FLINT 2.9, just not as large as shown below).
New algorithm for elementary functions
Elementary functions use a new algorithm at precision between roughly 1000 and 1,000,000 digits. The new algorithm improves single-threaded performance substantially and has the nice side effect of helping parallel efficiency as well. A precomputation is involved: on one core, the first call to elementary functions will be a bit slower than before (up to about 2x slower) while all subsequent calls will be faster (up to about 2x faster than before, with the biggest speedup around 10,000 digits). The assumption here is that most applications will want to evaluate functions several times. In any case, the precomputation is multithreaded, and with multiple cores even the first call will be faster than before. I will give more details about the new algorithm in a forthcoming paper.
Arb no longer uses MPFR to compute logarithms at high precision since the new algorithm is much faster than MPFR's AGM implementation. A nice consequence is that the first call to log is faster than before even on one thread, because we avoid the overhead of MPFR precomputing log(2).
Complex squaring
A small trick had been overlooked in Arb's basic arithmetic: complex squaring now uses a Karatsuba-style formula at high precision (performing three real squarings instead of two squarings and a multiplication), making it about 10% faster.
Benchmarketing
- The tables show the elapsed time in seconds using (Arb 2.23 + git trunk of FLINT) on an AMD Ryzen 7 PRO 5850U (Zen3).
- The numbers (1.23x) in parentheses show the speedup over (Arb 2.22.1 + FLINT 2.8.5) on the same machine.
- T is the number of threads (1 or 8).
- First/repeat shows the timing for a first function call (including precomputations) versus a repeated function call (precomputations already done).
Mathematical constants benefit from multithreading:
const_pi, n digits | ||||
n | T = 1, first | T = 1, repeat | T = 8, first | T = 8, repeat |
---|---|---|---|---|
10000 | 0.001 (1.00x) | - | 0.001 (1.00x) | - |
100000 | 0.013 (1.00x) | - | 0.01 (1.30x) | - |
1000000 | 0.225 (1.06x) | - | 0.126 (1.88x) | - |
10000000 | 4.14 (1.02x) | - | 1.49 (2.87x) | - |
100000000 | 70.6 (1.04x) | - | 22.4 (3.28x) | - |
const_euler, n digits | ||||
n | T = 1, first | T = 1, repeat | T = 8, first | T = 8, repeat |
---|---|---|---|---|
10000 | 0.016 (1.06x) | - | 0.01 (1.60x) | - |
100000 | 0.397 (1.04x) | - | 0.17 (2.43x) | - |
1000000 | 7.45 (1.05x) | - | 2.43 (3.23x) | - |
10000000 | 145 (1.06x) | - | 36.1 (4.27x) | - |
Elementary functions benefit from the new algorithm as well as multithreading:
exp(x), n digits | ||||
n | T = 1, first | T = 1, repeat | T = 8, first | T = 8, repeat |
---|---|---|---|---|
100 | - | 4.67 · 10−7 (0.99x) | - | 4.69 · 10−7 (0.99x) |
1000 | - | 1.84 · 10−5 (1.56x) | - | 1.84 · 10−5 (1.56x) |
10000 | 0.004 (0.50x) | 0.0008 (2.01x) | 0.001 (2.00x) | 0.00045 (3.60x) |
100000 | 0.107 (0.50x) | 0.0322 (1.63x) | 0.028 (1.89x) | 0.0112 (4.68x) |
1000000 | 2.19 (0.50x) | 0.724 (1.52x) | 0.486 (2.26x) | 0.186 (5.91x) |
10000000 | 20.7 (1.01x) | 20.4 (1.03x) | 4.53 (4.61x) | 4.59 (4.55x) |
log(x), n digits | ||||
n | T = 1, first | T = 1, repeat | T = 8, first | T = 8, repeat |
---|---|---|---|---|
100 | - | 4.73 · 10−7 (0.98x) | - | 4.72 · 10−7 (0.98x) |
1000 | - | 2.56 · 10−5 (1.22x) | - | 2.55 · 10−5 (1.22x) |
10000 | 0.004 (1.00x) | 0.000894 (1.82x) | 0.002 (2.00x) | 0.00055 (2.95x) |
100000 | 0.116 (1.22x) | 0.0371 (1.67x) | 0.032 (4.44x) | 0.0157 (3.96x) |
1000000 | 2.35 (1.20x) | 0.798 (1.49x) | 0.543 (5.21x) | 0.227 (5.24x) |
10000000 | 22.7 (2.28x) | 21.3 (1.00x) | 5.05 (10.26x) | 4.87 (4.39x) |
sin(x), n digits | ||||
n | T = 1, first | T = 1, repeat | T = 8, first | T = 8, repeat |
---|---|---|---|---|
100 | - | 5.58 · 10−7 (0.98x) | - | 5.56 · 10−7 (0.97x) |
1000 | - | 2.28 · 10−5 (1.44x) | - | 2.26 · 10−5 (1.45x) |
10000 | 0.004 (0.50x) | 0.000915 (1.93x) | 0.002 (0.50x) | 0.000912 (1.94x) |
100000 | 0.131 (0.63x) | 0.0505 (1.63x) | 0.039 (2.13x) | 0.0216 (3.82x) |
1000000 | 2.86 (0.58x) | 1.27 (1.30x) | 0.74 (2.23x) | 0.414 (3.99x) |
10000000 | 30.9 (1.00x) | 30.6 (1.01x) | 7.14 (4.33x) | 7.31 (4.21x) |
atan(x), n digits | ||||
n | T = 1, first | T = 1, repeat | T = 8, first | T = 8, repeat |
---|---|---|---|---|
100 | - | 4.79 · 10−7 (0.99x) | - | 4.79 · 10−7 (0.99x) |
1000 | - | 3.6 · 10−5 (0.99x) | - | 3.59 · 10−5 (0.99x) |
10000 | 0.005 (0.60x) | 0.00115 (2.18x) | 0.003 (1.00x) | 0.00115 (2.18x) |
100000 | 0.139 (0.63x) | 0.0549 (1.56x) | 0.047 (1.83x) | 0.0287 (2.98x) |
1000000 | 3.04 (0.62x) | 1.37 (1.36x) | 0.802 (2.33x) | 0.468 (4.00x) |
10000000 | 32.3 (1.12x) | 30.4 (1.20x) | 8.2 (4.45x) | 8.07 (4.52x) |
Even special functions that have not been changed in this release benefit indirectly from faster core functions:
erf(x), n digits | ||||
n | T = 1, first | T = 1, repeat | T = 8, first | T = 8, repeat |
---|---|---|---|---|
100 | - | 4.73 · 10−6 (1.02x) | - | 4.74 · 10−6 (1.01x) |
1000 | - | 7.53 · 10−5 (1.14x) | - | 7.51 · 10−5 (1.14x) |
10000 | 0.008 (0.75x) | 0.00409 (1.24x) | 0.006 (1.00x) | 0.00435 (1.17x) |
100000 | 0.369 (0.87x) | 0.282 (1.09x) | 0.284 (1.13x) | 0.259 (1.19x) |
1000000 | 11.9 (1.03x) | 11.6 (1.03x) | 10.6 (1.92x) | 10.5 (1.94x) |
gamma(x), n digits | ||||
n | T = 1, first | T = 1, repeat | T = 8, first | T = 8, repeat |
---|---|---|---|---|
100 | - | 4.51 · 10−6 (1.03x) | - | 4.51 · 10−6 (1.03x) |
1000 | - | 0.000231 (1.04x) | - | 0.00023 (1.04x) |
10000 | 0.107 (1.07x) | 0.0487 (1.07x) | 0.065 (1.74x) | 0.0492 (1.05x) |
100000 | 20.4 (1.02x) | 9.18 (1.04x) | 12.3 (1.72x) | 9.19 (1.03x) |
zeta(x), n digits | ||||
n | T = 1, first | T = 1, repeat | T = 8, first | T = 8, repeat |
---|---|---|---|---|
100 | - | 7.72 · 10−5 (1.03x) | - | 7.71 · 10−5 (1.03x) |
1000 | 0.011 (1.00x) | 0.00658 (1.15x) | 0.009 (1.22x) | 0.00658 (1.15x) |
10000 | 2.33 (1.30x) | 1.49 (1.46x) | 1.4 (2.16x) | 1.21 (1.79x) |
eta(ix), n digits | ||||
n | T = 1, first | T = 1, repeat | T = 8, first | T = 8, repeat |
---|---|---|---|---|
100 | - | 4.2 · 10−6 (1.03x) | - | 4.2 · 10−6 (1.03x) |
1000 | - | 5.47 · 10−5 (1.19x) | - | 5.41 · 10−5 (1.20x) |
10000 | 0.006 (0.67x) | 0.0022 (1.44x) | 0.004 (1.00x) | 0.00187 (1.67x) |
100000 | 0.201 (0.74x) | 0.113 (1.19x) | 0.116 (1.27x) | 0.091 (1.47x) |
1000000 | 4.77 (0.86x) | 3.03 (1.26x) | 2.14 (1.90x) | 1.72 (2.22x) |
Bernoulli numbers and partitions benefit from multithreading and a little bit from faster elementary functions:
bernoulli(n) | ||||
n | T = 1, first | T = 1, repeat | T = 8, first | T = 8, repeat |
---|---|---|---|---|
100 | - | 2.11 · 10−6 (0.97x) | - | 2.11 · 10−6 (0.96x) |
1000 | - | 4.76 · 10−5 (0.99x) | - | 4.77 · 10−5 (0.99x) |
10000 | 0.009 (1.00x) | 0.0067 (0.99x) | 0.007 (1.29x) | 0.0043 (1.53x) |
100000 | 0.473 (1.00x) | 0.42 (1.00x) | 0.19 (2.48x) | 0.139 (3.02x) |
1000000 | 20.8 (1.01x) | 19.6 (1.02x) | 4.82 (4.42x) | 4.36 (4.66x) |
partitions(n^2) | ||||
n | T = 1, first | T = 1, repeat | T = 8, first | T = 8, repeat |
---|---|---|---|---|
100 | - | 5.15 · 10−5 (0.96x) | - | 2.9 · 10−5 (1.70x) |
1000 | 0.001 (1.00x) | 0.000444 (0.97x) | - | 0.000169 (2.54x) |
10000 | 0.011 (0.91x) | 0.00613 (1.13x) | 0.004 (2.50x) | 0.00192 (3.61x) |
100000 | 0.218 (0.95x) | 0.109 (1.20x) | 0.061 (2.66x) | 0.032 (5.19x) |
1000000 | 4.07 (0.75x) | 2.08 (1.32x) | 0.981 (2.13x) | 0.49 (4.29x) |
10000000 | 58.4 (1.10x) | 52.7 (1.12x) | 14.4 (2.84x) | 12.6 (3.30x) |
Numerical integration benefits from faster integrand evaluation as well as multithreading:
integral $\int_0^8 \sin(x+e^x) dx$, n digits | ||||
n | T = 1, first | T = 1, repeat | T = 8, first | T = 8, repeat |
---|---|---|---|---|
100 | 0.015 (1.00x) | 0.00662 (1.00x) | 0.005 (3.00x) | 0.0029 (2.28x) |
1000 | 2.35 (1.03x) | 0.452 (1.16x) | 0.385 (6.25x) | 0.078 (6.74x) |
10000 | 1735 (1.06x) | 65.9 (1.98x) | 309 (5.92x) | 12.4 (10.51x) |
New example programs
There are some new programs in the examples directory, mainly for benchmarking use.
The new zeta_zeros program (contributed by D.H.J. Polymath) deserves a mention as it exposes Arb's ability to compute zeros of the Riemann zeta function in a convenient way. This also provides a nice opportunity to demonstrate multithreading. Arb can now compute the first 1000 zeros to 100 digits in less than one second:
> build/examples/zeta_zeros -count 1000 -digits 100 -threads 8 1 14.13472514173469379045725198356247027078425711569924317568556746014996342980925676494901039317156101 2 21.02203963877155499262847959389690277733434052490278175462952040358759858606889079971365851418015142 3 25.01085758014568876321379099256282181865954967255799667249654200674509209844164427784023822455806244 4 30.42487612585951321031189753058409132018156002371544018096214603699332938933327792029058429390208911 5 32.93506158773918969066236896407490348881271560351703900928000344078481560863055100593884849613534872 6 37.58617815882567125721776348070533282140559735083079321833300111362214908961853726473032910494582380 7 40.91871901214749518739812691463325439572616596277727953616130366725328052872007128299600371988954688 8 43.32707328091499951949612216540680578264566837183687144687889368552108832230505362645634937106319093 9 48.00515088116715972794247274942751604168684400114442511777531251981409021641630828133033537230540100 10 49.77383247767230218191678467856372405772317829967666210078195575043351161151573927873270750740093133 (...) 990 1408.136307496189550754764109216568713802973074401596082938023910153448087197246053254726685256863068 991 1409.320681079838957249711182852555964218419074619083658769074873716876732637748098222317480124140125 992 1410.024810725802225833083299431548658209139288471966299854738251058890993466491878130140068369118039 993 1411.257056815706623860024230778695375047885978212577565439005409910857235937559194796005426158510609 994 1411.965653461773004284494177417113415026254906000578127153945897366191802397334861957341376917729516 995 1413.843148788568852998797603639966241538013462044943716176541781285872641912683298566680983174469704 996 1415.585784795495351208993038535815851149152803073425973256120338716754265388876013725524718210637033 997 1415.781581303282910082772036660394177733609861547071641851390571859089124614459378633634824095333207 998 1417.102822933822614755624964325648538469069477451958340814458614067897632037656189836070211970142432 999 1418.696963852452165675233051829874743057694774871561834399141180387991829036797761712004334342926105 1000 1419.422480945995686465989038079916819232100601064166016304690814684608676417593010417911343291179210 cpu/wall(s): 5.36 0.84 virt/peak/res/peak(MB): 525.91 525.92 12.00 12.00
Compute 10 zeros starting with index 1012:
> build/examples/zeta_zeros -n 1000000000000 -count 10 -threads 8 -digits 30 1000000000000 267653395648.625948242142649409 1000000000001 267653395648.847523129021300806 1000000000002 267653395649.362366968656043775 1000000000003 267653395649.681630916477931951 1000000000004 267653395649.861989944051275225 1000000000005 267653395650.157665479031164314 1000000000006 267653395650.434266684414753381 1000000000007 267653395650.580891300763755538 1000000000008 267653395650.834479534143654039 1000000000009 267653395651.058482002884137266 cpu/wall(s): 16.766 12.978 virt/peak/res/peak(MB): 468.57 468.57 35.19 35.44
Miscellaneous
Multiplying an infinity by a positive or negative number now yields a signed infinity instead of $[0 \pm \infty]$ (contributed by Erik Postma).
There is now a module (fmpzi) for efficient arithmetic in the ring $\mathbb{Z}[i]$ of Gaussian integers. At the moment, this module only does basic arithmetic (addition, multiplication and powering), which was needed in the implementation of elementary functions; in the future, it will be expanded with functionality like divisions, GCD and factoring.
fredrikj.net | Blog index | RSS feed | Follow me on Mastodon | Become a sponsor