fredrikj.net / blog /

Arb 2.23 released

June 29, 2022

Just in time to wrap up this academic semester, I've tagged version 2.23.0 of Arb (files, changes). This post will review the most notable improvements.

Contents

Lerch transcendent

Arb 2.23 adds the function acb_dirichlet_lerch_phi for computing the Lerch transcendent, the analytic continuation of the series $$\Phi(z,s,a) = \sum_{n=0}^{\infty} \frac{z^n}{(n+a)^s}.$$

See the previous post Computing the Lerch transcendent for details. A machine-precision wrapper is also available in the fpwrap module (thanks Valentin Boettcher).

Performance improvements

Arb 2.23 features some of the most notable speed improvements in the history of the library.

Multithreading

At huge precision, multiplications, mathematical constants and elementary functions are now multithreaded internally. On an 8-core laptop, it is typical to see a 3x to 5x parallel speedup at millions of digits, with smaller speedups starting from around 10,000 digits. Other operations that are now multithreaded include numerical integration (helpful in low precision as well!), construction of Hilbert class polynomials and Bernoulli number generation.

Previously discussed in Parallelizing Arb for fun and profit and Multithreaded elementary functions.

Note: the user must call flint_set_num_threads(T) to enable use of T cores. Some speedups will only appear if you use the latest FLINT git version instead of the FLINT 2.9 release (there will be parallel speedups if you use FLINT 2.9, just not as large as shown below).

New algorithm for elementary functions

Elementary functions use a new algorithm at precision between roughly 1000 and 1,000,000 digits. The new algorithm improves single-threaded performance substantially and has the nice side effect of helping parallel efficiency as well. A precomputation is involved: on one core, the first call to elementary functions will be a bit slower than before (up to about 2x slower) while all subsequent calls will be faster (up to about 2x faster than before, with the biggest speedup around 10,000 digits). The assumption here is that most applications will want to evaluate functions several times. In any case, the precomputation is multithreaded, and with multiple cores even the first call will be faster than before. I will give more details about the new algorithm in a forthcoming paper.

Arb no longer uses MPFR to compute logarithms at high precision since the new algorithm is much faster than MPFR's AGM implementation. A nice consequence is that the first call to log is faster than before even on one thread, because we avoid the overhead of MPFR precomputing log(2).

Complex squaring

A small trick had been overlooked in Arb's basic arithmetic: complex squaring now uses a Karatsuba-style formula at high precision (performing three real squarings instead of two squarings and a multiplication), making it about 10% faster.

Benchmarketing

Mathematical constants benefit from multithreading:

const_pi, n digits
n T = 1, first T = 1, repeat T = 8, first T = 8, repeat
10000 0.001 (1.00x) - 0.001 (1.00x) -
100000 0.013 (1.00x) - 0.01 (1.30x) -
1000000 0.225 (1.06x) - 0.126 (1.88x) -
10000000 4.14 (1.02x) - 1.49 (2.87x) -
100000000 70.6 (1.04x) - 22.4 (3.28x) -
const_euler, n digits
n T = 1, first T = 1, repeat T = 8, first T = 8, repeat
10000 0.016 (1.06x) - 0.01 (1.60x) -
100000 0.397 (1.04x) - 0.17 (2.43x) -
1000000 7.45 (1.05x) - 2.43 (3.23x) -
10000000 145 (1.06x) - 36.1 (4.27x) -

Elementary functions benefit from the new algorithm as well as multithreading:

exp(x), n digits
n T = 1, first T = 1, repeat T = 8, first T = 8, repeat
100 - 4.67 · 10−7 (0.99x) - 4.69 · 10−7 (0.99x)
1000 - 1.84 · 10−5 (1.56x) - 1.84 · 10−5 (1.56x)
10000 0.004 (0.50x) 0.0008 (2.01x) 0.001 (2.00x) 0.00045 (3.60x)
100000 0.107 (0.50x) 0.0322 (1.63x) 0.028 (1.89x) 0.0112 (4.68x)
1000000 2.19 (0.50x) 0.724 (1.52x) 0.486 (2.26x) 0.186 (5.91x)
10000000 20.7 (1.01x) 20.4 (1.03x) 4.53 (4.61x) 4.59 (4.55x)
log(x), n digits
n T = 1, first T = 1, repeat T = 8, first T = 8, repeat
100 - 4.73 · 10−7 (0.98x) - 4.72 · 10−7 (0.98x)
1000 - 2.56 · 10−5 (1.22x) - 2.55 · 10−5 (1.22x)
10000 0.004 (1.00x) 0.000894 (1.82x) 0.002 (2.00x) 0.00055 (2.95x)
100000 0.116 (1.22x) 0.0371 (1.67x) 0.032 (4.44x) 0.0157 (3.96x)
1000000 2.35 (1.20x) 0.798 (1.49x) 0.543 (5.21x) 0.227 (5.24x)
10000000 22.7 (2.28x) 21.3 (1.00x) 5.05 (10.26x) 4.87 (4.39x)
sin(x), n digits
n T = 1, first T = 1, repeat T = 8, first T = 8, repeat
100 - 5.58 · 10−7 (0.98x) - 5.56 · 10−7 (0.97x)
1000 - 2.28 · 10−5 (1.44x) - 2.26 · 10−5 (1.45x)
10000 0.004 (0.50x) 0.000915 (1.93x) 0.002 (0.50x) 0.000912 (1.94x)
100000 0.131 (0.63x) 0.0505 (1.63x) 0.039 (2.13x) 0.0216 (3.82x)
1000000 2.86 (0.58x) 1.27 (1.30x) 0.74 (2.23x) 0.414 (3.99x)
10000000 30.9 (1.00x) 30.6 (1.01x) 7.14 (4.33x) 7.31 (4.21x)
atan(x), n digits
n T = 1, first T = 1, repeat T = 8, first T = 8, repeat
100 - 4.79 · 10−7 (0.99x) - 4.79 · 10−7 (0.99x)
1000 - 3.6 · 10−5 (0.99x) - 3.59 · 10−5 (0.99x)
10000 0.005 (0.60x) 0.00115 (2.18x) 0.003 (1.00x) 0.00115 (2.18x)
100000 0.139 (0.63x) 0.0549 (1.56x) 0.047 (1.83x) 0.0287 (2.98x)
1000000 3.04 (0.62x) 1.37 (1.36x) 0.802 (2.33x) 0.468 (4.00x)
10000000 32.3 (1.12x) 30.4 (1.20x) 8.2 (4.45x) 8.07 (4.52x)

Even special functions that have not been changed in this release benefit indirectly from faster core functions:

erf(x), n digits
n T = 1, first T = 1, repeat T = 8, first T = 8, repeat
100 - 4.73 · 10−6 (1.02x) - 4.74 · 10−6 (1.01x)
1000 - 7.53 · 10−5 (1.14x) - 7.51 · 10−5 (1.14x)
10000 0.008 (0.75x) 0.00409 (1.24x) 0.006 (1.00x) 0.00435 (1.17x)
100000 0.369 (0.87x) 0.282 (1.09x) 0.284 (1.13x) 0.259 (1.19x)
1000000 11.9 (1.03x) 11.6 (1.03x) 10.6 (1.92x) 10.5 (1.94x)
gamma(x), n digits
n T = 1, first T = 1, repeat T = 8, first T = 8, repeat
100 - 4.51 · 10−6 (1.03x) - 4.51 · 10−6 (1.03x)
1000 - 0.000231 (1.04x) - 0.00023 (1.04x)
10000 0.107 (1.07x) 0.0487 (1.07x) 0.065 (1.74x) 0.0492 (1.05x)
100000 20.4 (1.02x) 9.18 (1.04x) 12.3 (1.72x) 9.19 (1.03x)
zeta(x), n digits
n T = 1, first T = 1, repeat T = 8, first T = 8, repeat
100 - 7.72 · 10−5 (1.03x) - 7.71 · 10−5 (1.03x)
1000 0.011 (1.00x) 0.00658 (1.15x) 0.009 (1.22x) 0.00658 (1.15x)
10000 2.33 (1.30x) 1.49 (1.46x) 1.4 (2.16x) 1.21 (1.79x)
eta(ix), n digits
n T = 1, first T = 1, repeat T = 8, first T = 8, repeat
100 - 4.2 · 10−6 (1.03x) - 4.2 · 10−6 (1.03x)
1000 - 5.47 · 10−5 (1.19x) - 5.41 · 10−5 (1.20x)
10000 0.006 (0.67x) 0.0022 (1.44x) 0.004 (1.00x) 0.00187 (1.67x)
100000 0.201 (0.74x) 0.113 (1.19x) 0.116 (1.27x) 0.091 (1.47x)
1000000 4.77 (0.86x) 3.03 (1.26x) 2.14 (1.90x) 1.72 (2.22x)

Bernoulli numbers and partitions benefit from multithreading and a little bit from faster elementary functions:

bernoulli(n)
n T = 1, first T = 1, repeat T = 8, first T = 8, repeat
100 - 2.11 · 10−6 (0.97x) - 2.11 · 10−6 (0.96x)
1000 - 4.76 · 10−5 (0.99x) - 4.77 · 10−5 (0.99x)
10000 0.009 (1.00x) 0.0067 (0.99x) 0.007 (1.29x) 0.0043 (1.53x)
100000 0.473 (1.00x) 0.42 (1.00x) 0.19 (2.48x) 0.139 (3.02x)
1000000 20.8 (1.01x) 19.6 (1.02x) 4.82 (4.42x) 4.36 (4.66x)
partitions(n^2)
n T = 1, first T = 1, repeat T = 8, first T = 8, repeat
100 - 5.15 · 10−5 (0.96x) - 2.9 · 10−5 (1.70x)
1000 0.001 (1.00x) 0.000444 (0.97x) - 0.000169 (2.54x)
10000 0.011 (0.91x) 0.00613 (1.13x) 0.004 (2.50x) 0.00192 (3.61x)
100000 0.218 (0.95x) 0.109 (1.20x) 0.061 (2.66x) 0.032 (5.19x)
1000000 4.07 (0.75x) 2.08 (1.32x) 0.981 (2.13x) 0.49 (4.29x)
10000000 58.4 (1.10x) 52.7 (1.12x) 14.4 (2.84x) 12.6 (3.30x)

Numerical integration benefits from faster integrand evaluation as well as multithreading:

integral $\int_0^8 \sin(x+e^x) dx$, n digits
n T = 1, first T = 1, repeat T = 8, first T = 8, repeat
100 0.015 (1.00x) 0.00662 (1.00x) 0.005 (3.00x) 0.0029 (2.28x)
1000 2.35 (1.03x) 0.452 (1.16x) 0.385 (6.25x) 0.078 (6.74x)
10000 1735 (1.06x) 65.9 (1.98x) 309 (5.92x) 12.4 (10.51x)

New example programs

There are some new programs in the examples directory, mainly for benchmarking use.

The new zeta_zeros program (contributed by D.H.J. Polymath) deserves a mention as it exposes Arb's ability to compute zeros of the Riemann zeta function in a convenient way. This also provides a nice opportunity to demonstrate multithreading. Arb can now compute the first 1000 zeros to 100 digits in less than one second:

> build/examples/zeta_zeros -count 1000 -digits 100 -threads 8
1	14.13472514173469379045725198356247027078425711569924317568556746014996342980925676494901039317156101
2	21.02203963877155499262847959389690277733434052490278175462952040358759858606889079971365851418015142
3	25.01085758014568876321379099256282181865954967255799667249654200674509209844164427784023822455806244
4	30.42487612585951321031189753058409132018156002371544018096214603699332938933327792029058429390208911
5	32.93506158773918969066236896407490348881271560351703900928000344078481560863055100593884849613534872
6	37.58617815882567125721776348070533282140559735083079321833300111362214908961853726473032910494582380
7	40.91871901214749518739812691463325439572616596277727953616130366725328052872007128299600371988954688
8	43.32707328091499951949612216540680578264566837183687144687889368552108832230505362645634937106319093
9	48.00515088116715972794247274942751604168684400114442511777531251981409021641630828133033537230540100
10	49.77383247767230218191678467856372405772317829967666210078195575043351161151573927873270750740093133

(...)

990	1408.136307496189550754764109216568713802973074401596082938023910153448087197246053254726685256863068
991	1409.320681079838957249711182852555964218419074619083658769074873716876732637748098222317480124140125
992	1410.024810725802225833083299431548658209139288471966299854738251058890993466491878130140068369118039
993	1411.257056815706623860024230778695375047885978212577565439005409910857235937559194796005426158510609
994	1411.965653461773004284494177417113415026254906000578127153945897366191802397334861957341376917729516
995	1413.843148788568852998797603639966241538013462044943716176541781285872641912683298566680983174469704
996	1415.585784795495351208993038535815851149152803073425973256120338716754265388876013725524718210637033
997	1415.781581303282910082772036660394177733609861547071641851390571859089124614459378633634824095333207
998	1417.102822933822614755624964325648538469069477451958340814458614067897632037656189836070211970142432
999	1418.696963852452165675233051829874743057694774871561834399141180387991829036797761712004334342926105
1000	1419.422480945995686465989038079916819232100601064166016304690814684608676417593010417911343291179210
cpu/wall(s): 5.36 0.84
virt/peak/res/peak(MB): 525.91 525.92 12.00 12.00

Compute 10 zeros starting with index 1012:

> build/examples/zeta_zeros -n 1000000000000 -count 10 -threads 8 -digits 30
1000000000000	267653395648.625948242142649409
1000000000001	267653395648.847523129021300806
1000000000002	267653395649.362366968656043775
1000000000003	267653395649.681630916477931951
1000000000004	267653395649.861989944051275225
1000000000005	267653395650.157665479031164314
1000000000006	267653395650.434266684414753381
1000000000007	267653395650.580891300763755538
1000000000008	267653395650.834479534143654039
1000000000009	267653395651.058482002884137266
cpu/wall(s): 16.766 12.978
virt/peak/res/peak(MB): 468.57 468.57 35.19 35.44

Miscellaneous

Multiplying an infinity by a positive or negative number now yields a signed infinity instead of $[0 \pm \infty]$ (contributed by Erik Postma).

There is now a module (fmpzi) for efficient arithmetic in the ring $\mathbb{Z}[i]$ of Gaussian integers. At the moment, this module only does basic arithmetic (addition, multiplication and powering), which was needed in the implementation of elementary functions; in the future, it will be expanded with functionality like divisions, GCD and factoring.



fredrikj.net  |  Blog index  |  RSS feed  |  Follow me on Mastodon  |  Become a sponsor