Optimizing erf in arbitrary precision
December 2, 2021
My small coding project this week has been to optimize the functions $\operatorname{erf}(x)$ and $\operatorname{erfc}(x) = 1 - \operatorname{erf}(x)$ for real arguments in Arb. The new code is asymptotically faster (quasi-optimal), typically 2-10 times faster than the old code, and hundreds of times faster than MPFR at high precision.
Hypergeometric series
The error functions are special cases of the incomplete gamma functions $\gamma(a,z)$ and $\Gamma(a,z)$, and their evaluation basically boils down to computing hypergeometric series. If we define the convergent series $$U_a(z) = \sum_{k=0}^{\infty} \frac{z^k}{(a)_k}$$ and the (divergent) asymptotic series $$V_a(z) \sim \sum_{k=0}^{N-1} \frac{(a)_k}{z^k}, \quad |z| \to \infty,$$ then $$\operatorname{erf}(x) = \frac{2 x e^{-x^2}}{\sqrt{\pi}} U_{3/2}(x^2),$$ $$\operatorname{erfc}(x) = \frac{e^{-x^2}}{x \sqrt{\pi}} V_{1/2}(-x^2), \quad x > 0.$$ There are the usual boring technical details: we want to use the asymptotic series $V$ whenever possible (i.e. when $x$ is sufficiently large) for both erf and erfc; we want to choose the number of terms $N$ and the working precision optimally (we have cancellation when computing erfc via $U$; on the other hand, we gain precision for free when computing erf via $V$). Fortunately, the parameter and algorithm selection can be done quite easily using machine-precision arithmetic, where we do not need to worry too much about rigor since the actual evaluation in ball arithmetic gives certified error bounds in the end.
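As a quick sanity check (plain mpmath here, not the Arb code), both formulas are easy to verify numerically by truncating the series; for the asymptotic series we just need $x$ large enough that the truncation error is negligible:

```python
from mpmath import *

mp.dps = 30

# convergent series: erf(x) = (2 x exp(-x^2) / sqrt(pi)) * U_{3/2}(x^2)
x = mpf("1.25")
U = sum(x**(2*k) / rf(mpf(3)/2, k) for k in range(60))
print(2*x*exp(-x**2)/sqrt(pi) * U)
print(erf(x))

# asymptotic series: erfc(x) = (exp(-x^2) / (x sqrt(pi))) * V_{1/2}(-x^2),
# with x large enough that the truncation error is negligible at 30 digits
x = mpf(9)
V = sum(rf(mpf(1)/2, k) / (-x**2)**k for k in range(60))
print(exp(-x**2)/(x*sqrt(pi)) * V)
print(erfc(x))
```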
Most of the speedup in the new code comes from implementing optimized versions of the series $U$ and $V$ for small rational parameters $a = p/q$. For typical input these series are most efficiently evaluated using a Horner-style backward recurrence combined with rectangular splitting (meaning that we compute powers $z, z^2, \ldots, z^m$ for $m \approx \sqrt{N}$ so that we can perform $N$ cheap scalar operations and only about $2 \sqrt{N}$ full nonscalar multiplications). Previously erf and erfc relied on the generic hypergeometric series code in Arb; the new, specialized code keeps the partial factorial coefficients in machine words as long as possible and calls the fast dot product code in Arb (arb_dot) to do several scalar operations at once.
The steps get a bit messy when putting everything together, so I coded up a working implementation of the series evaluation in Python before doing a final, optimized C version:
```python
from mpmath import *

mp.dps = 50

def U(p, q, z, N):
    S = 0
    m = max(int(N**0.5), 2)
    j = (N - 1) % m
    cs = [0] * (m + 1)
    # baby-step powers
    upow = [0] * (m + 1)
    upow[0] = mpf(1)
    for i in range(1, m + 1):
        upow[i] = upow[i - 1] * (z * q)
    c = 1
    k = N - 1
    # keep list of scalar terms for next dot product
    todo = []
    jlen = 0
    jbot = j
    while k >= 0:
        if k != 0:
            # partial factorial c will exceed one word, so evaluate dot
            # product, divide by current denominator, and continue
            if c * (p + (k - 1) * q) >= 2**64:
                if jlen:
                    for i in range(jlen):
                        S += upow[jbot + i] * cs[jbot + i]
                    jlen = 0
                S /= c
                c = 1
        # next scalar term
        cs[j] = c
        jlen += 1
        jbot = j
        if k != 0:
            # next partial factorial (will fit in one word)
            c = c * (p + (k - 1) * q)
        if j == 0 and k != 0:
            # giant step ahead, so evaluate current dot product
            if jlen:
                for i in range(jlen):
                    S += upow[jbot + i] * cs[jbot + i]
                jlen = 0
            # giant step
            S *= upow[m]
            j = m - 1
        else:
            j -= 1
        k -= 1
    # there are scalar terms left
    if jlen:
        for i in range(jlen):
            S += upow[jbot + i] * cs[jbot + i]
        jlen = 0
    S /= c
    return S

def V(p, q, z, N):
    S = 0
    m = max(int(N**0.5), 2)
    j = (N - 1) % m
    k = N - 1
    u = 1 / z
    # baby-step powers
    upow = [0] * (m + 1)
    upow[0] = mpf(1)
    for i in range(1, m + 1):
        upow[i] = upow[i - 1] * (u / q)
    while k >= 0:
        # we will have at least one term
        jlen = 1
        jtop = jbot = k
        # go down several terms as long as the partial product c fits
        # in a single word and as long as we do not hit a giant-step point
        if jtop > 0:
            c = p + q * (jtop - 1)
            while jlen <= j:
                # as long as there are baby steps left
                if jbot >= 2:
                    c *= (p + q * (jbot - 2))
                    if c >= 2**64:
                        # too large coefficient
                        break
                jbot -= 1
                jlen += 1
        # now build the partial factorials (everything fits in one word)
        cs = [0] * jlen
        if jbot == 0:
            cs[0] = 1
        else:
            cs[0] = (p + q * (jbot - 1))
        for i in range(1, jlen):
            cs[i] = cs[i - 1] * (p + q * (jbot + i - 1))
        # compute the dot product
        s = 0
        for i in range(jlen - 1):
            s += cs[i] * upow[j - jlen + 1 + i]
        S = s + cs[jlen - 1] * (S + upow[j])
        k -= jlen
        j -= (jlen - 1)
        # giant-step multiplications
        if j == 0 and k >= 1:
            S *= upow[m]
            j = m - 1
        else:
            j -= 1
    return S

N = 75
print(U(5, 6, mpf("0.321"), N))
print(sum(mpf("0.321")**k / rf(mpf(5)/6, k) for k in range(N)))
print(V(5, 6, mpf("123.1"), N))
print(sum(rf(mpf(5)/6, k) / mpf("123.1")**k for k in range(N)))
```
This could probably be cleaned up a bit, but it does not matter much for the C code where the bookkeeping is virtually free and most time is spent in the arbitrary-precision operations.
A last trick that I included in the final C code is to vary the precision gradually. For example, in the $V$ series we only need about $b + \log_2(k! / z^k) \approx b - (k \log(z) - k (\log(k) - 1)) / \log(2)$ bits of working precision at term $k$ for a $b$-bit final result.
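As a rough sketch of the idea (the helper below is made up for illustration and is not the exact formula in the C code):

```python
from math import log

def v_term_prec(b, z, k, guard=10):
    # approximate bits needed at term k of the V series for a b-bit result:
    # b + log2(k!/z^k) ~= b - (k*log(z) - k*(log(k) - 1)) / log(2)
    if k == 0:
        return b + guard
    p = b - (k * log(z) - k * (log(k) - 1)) / log(2)
    return max(int(p) + guard, guard)

# the working precision decays as k increases (here b = 3333 bits, z = x^2 = 1000)
print([v_term_prec(3333, 1000.0, k) for k in (1, 10, 100, 300)])
```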
Are there things left to improve? I can think of some tricks that could help at low precision (up to a few limbs, or a few hundred bits).
- Accumulating partial factorials up to two limbs instead of one limb might be better.
- For the $U$ series, it would be nice if there was a way to eliminate all the scalar divisions and only do a single division at the end, but I don't see how to do this using a Horner-style reverse recurrence. It can be done using a forward recurrence for the terms, but then it seems that one needs more nonscalar multiplications. I'm sure that the divisions can be optimized one way or another, but I'm not sure about the best way to do it.
- Using GMP's low-level mpn functions instead of Arb functions for arithmetic should be a bit faster. Arb uses such hand-written mpn code for functions like exp, log, sin, cos and atan, but such code takes a huge amount of time to write and debug, and I don't think it's worth the trouble for nonelementary functions. It could be an interesting future project to implement mpn-optimized generic hypergeometric series, but it would perhaps make more sense to do it semi-automatically using code generation.
- Computing the error functions using hypergeometric series is nice because it does not require any precomputation. However, precomputed piecewise Chebyshev or rational function approximants should perform much better up to at least a couple of limbs of precision, when used with low-overhead polynomial evaluation (for instance, using vectorized double-double and quad-double arithmetic). The drawback is that such precomputed tables add bloat, though it might be worth it for $\operatorname{erf}(x)$ and $\operatorname{erfc}(x)$ which are very commonly used functions.
Bit-burst evaluation
The classical method using hypergeometric series achieves $d$-digit precision in time $\tilde O(d^2)$, but the error function (and holonomic functions more generally) can be computed in quasilinear time $\tilde O(d)$ using the bit-burst algorithm. I describe the bit-burst algorithm for the incomplete gamma function in my most recent paper, Rapid computation of special values of Dirichlet $L$-functions, and I will not repeat the details here. In short, the idea is to numerically integrate the defining ODE (or in this case, the explicit integral representation) using the Taylor series method, following a path that converges exponentially to the target point. Each Taylor series is evaluated using binary splitting.
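As a toy illustration of the path construction (not Arb's code, just the idea): truncating the evaluation point to $2, 4, 8, \ldots$ binary digits gives a sequence of expansion points with short exact binary representations, converging exponentially fast to the target:

```python
from fractions import Fraction

def bit_burst_path(x, prec):
    # truncations of x to 2, 4, 8, ... bits after the binary point;
    # the increments shrink roughly like 2^(-2^j), so each Taylor
    # step (evaluated by binary splitting) converges rapidly
    path = [Fraction(0)]
    bits = 2
    while bits < prec:
        path.append(Fraction(int(x * 2**bits), 2**bits))
        bits *= 2
    path.append(Fraction(x))
    # drop repeated points
    return [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]

# example: path for x ~ 1.2345 at 64-bit precision
for p in bit_burst_path(Fraction(12345, 10000), 64):
    print(p, float(p))
```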
The old code for erf and erfc in Arb did not use the bit-burst algorithm; it is now used starting from about 10000 digits (the precise cutoff varies with $x$).
Benchmark results
Erf
The following benchmark compares MPFR 4.1.0, the old Arb code and the new Arb code at precisions between $d = 10$ and $d = 10^6$ decimal digits. The table shows the total time to evaluate $\operatorname{erf}(x)$ on the set of points $x = n / \pi$ for $n = 1, 4, 9, 16, 25, \ldots$ with $n^2 \le 100 d$, chosen to exercise the near-zero, transition and asymptotic regions.
Time to compute $\operatorname{erf}(x)$
Digits | # points | MPFR | Arb (old) | Arb (new) | Speedup vs MPFR | Speedup vs Arb (old) |
---|---|---|---|---|---|---|
10 | 5 | 1.979e-05 s | 7.058e-05 s | 1.058e-05 s | 1.9x | 6.7x |
100 | 10 | 0.0007202 s | 0.0005248 s | 0.000108 s | 6.7x | 4.9x |
1000 | 17 | 0.1486 s | 0.02244 s | 0.005034 s | 29.5x | 4.5x |
10000 | 31 | 51.26 s | 2.166 s | 0.545 s | 94.0x | 4.0x |
100000 | 56 | - | 400.3 s | 38.77 s | - | 10.3x |
1000000 | 30* | - | 11780 s* | 713.8 s* | - | 16.5x* |
There is a big asymptotic improvement. Arb is now also faster than MPFR even at low precision, where the old code had a lot of overhead. MPFR is clearly not asymptotically fast, and I did not time MPFR beyond 10000 digits since it would take hours. Also, at $d = 10^6$ (*) I only timed the first 30 points since the old Arb code would take hours. At a million digits, the new code should actually be more than 30 times faster if tested over the whole range.
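For reference, the evaluation points can be generated with a throwaway snippet like the following (not part of the benchmark code); the counts match the "# points" column above:

```python
from math import pi

def benchmark_points(d):
    # squares n = 1, 4, 9, ... with n^2 <= 100*d, evaluated at x = n/pi
    ns = []
    k = 1
    while (k * k) ** 2 <= 100 * d:
        ns.append(k * k)
        k += 1
    return [n / pi for n in ns]

for d in (10, 100, 1000, 10000, 100000):
    print(d, len(benchmark_points(d)))   # 5, 10, 17, 31, 56
```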
The following plot shows the running time for each evaluation point $x$:
At each level of precision, there is a peak just before the point where the asymptotic series can be used. Interestingly, the bit-burst algorithm flattens this peak significantly, providing a much bigger speedup for worst-case (transition region) input than for best-case (small or large) input.
Erfc
Results for the complementary error function are similar:
Time to compute $\operatorname{erfc}(x)$
Digits | # points | MPFR | Arb (old) | Arb (new) | Speedup vs MPFR | Speedup vs Arb (old) |
---|---|---|---|---|---|---|
10 | 5 | 0.0002405 s | 8.636e-05 s | 2.17e-05 s | 11.1x | 4.0x |
100 | 10 | 0.02013 s | 0.0007636 s | 0.0002297 s | 87.6x | 3.3x |
1000 | 17 | 1.683 s | 0.03567 s | 0.01335 s | 126.1x | 2.7x |
10000 | 31 | 1224 s | 3.79 s | 1.198 s | 1021.8x | 3.2x |
100000 | 56 | - | 740.5 s | 88.28 s | - | 8.4x |
Doing these benchmarks was useful because it helped me find and fix some performance bugs (which showed up as ugly jumps in the plots). The difficulty with this kind of code is not getting it to compute the right result, but ensuring that it really uses the fastest strategy for any combination of argument and precision.
integrals.c
The integrals example program in Arb is a nice benchmark for numerical integration of special functions; one of the test integrals is $I_{25} = \int_{-1}^1 \operatorname{erf}(x/\sqrt{0.0002} \cdot 0.5+1.5) \exp(-x) dx$ (this might look like a rather random expression, but apparently it comes from some filter design problem). With the old erf code in Arb, I get the following:
```
fredrik@agm:~/src/arb$ build/examples/integrals -i 25 -prec 3333 -twice
I25 = int_{-1}^1 erf(x/sqrt(0.0002)*0.5+1.5)*exp(-x) dx
...
cpu/wall(s): 8.432 8.459
cpu/wall(s): 5.968 5.977
I25 = [-0.999065350291922475595750121145204168350053798854304261675434236397647600528120969815393838720720375640828014003762805519591294130016120467790129722324951408781834564818093907395215333918322334982526157655702660040833458542548570010107191669210794251238099768183223483551915203561837292344886302988001649432912871441018647723481470530778134434434548725494538049265153407800600702936790278163581125702150521709773064064369278085423055313542491256388670879639234025517879247787124652921697420987793312085502816812415859065399749815612476275744949579490030785354660162320287087937779211229011984207035965966889137924252515323067615357304645140148719528950109656006409608979225636236035466068783073541110832490168186452805066330518241105692335201357486861859372592484136448983636611097029998450809071050276120529788024141927886342082068705495858291381594879770669297822945941389817084856154885756286731213143241645287710983424754201961169155341375676381697921097233384462699464470748438359477704909853811 +/- 3.83e-1000]
```
With the new code:
```
fredrik@agm:~/src/arb$ build/examples/integrals -i 25 -prec 3333 -twice
I25 = int_{-1}^1 erf(x/sqrt(0.0002)*0.5+1.5)*exp(-x) dx
...
cpu/wall(s): 4.632 4.633
cpu/wall(s): 1.944 1.964
I25 = [-0.999065350291922475595750121145204168350053798854304261675434236397647600528120969815393838720720375640828014003762805519591294130016120467790129722324951408781834564818093907395215333918322334982526157655702660040833458542548570010107191669210794251238099768183223483551915203561837292344886302988001649432912871441018647723481470530778134434434548725494538049265153407800600702936790278163581125702150521709773064064369278085423055313542491256388670879639234025517879247787124652921697420987793312085502816812415859065399749815612476275744949579490030785354660162320287087937779211229011984207035965966889137924252515323067615357304645140148719528950109656006409608979225636236035466068783073541110832490168186452805066330518241105692335201357486861859372592484136448983636611097029998450809071050276120529788024141927886342082068705495858291381594879770669297822945941389817084856154885756286731213143241645287710983424754201961169155341375676381697921097233384462699464470748438359477704909853811 +/- 3.70e-1000]
```
All in all, a 3x speedup for repeated evaluation (quadrature nodes cached) at 1000-digit precision.
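As an independent sanity check on the printed value (using mpmath rather than Arb, and only to modest precision), direct numerical integration reproduces the leading digits; splitting the interval at the sharp erf transition helps the quadrature:

```python
from mpmath import mp, mpf, erf, exp, sqrt, quad

mp.dps = 30
c = mpf("0.5") / sqrt(mpf("0.0002"))
f = lambda x: erf(c * x + mpf("1.5")) * exp(-x)
# the integrand switches from about -exp(-x) to about +exp(-x) near x = -1.5/c
print(quad(f, [-1, -mpf("1.5") / c, 1]))
```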
Future work
I might do more cases of the incomplete gamma functions in the near future since much of the erf code can be reused. And I should optimize the same functions for complex variables too, of course.
Many other special functions in Arb (exponential integrals, Bessel functions, etc.) can and should be optimized in the same way, but it's a lot of work!