# Revisiting transcendental functions

April 30, 2012

The overwhelming majority of the time when I use arbitrary-precision arithmetic, I only need precision slightly higher than hardware precision; typically 30-40 digits, occasionally perhaps 100 digits, and only very rarely 1000 digits or more. About two years ago, I started doing some experiments to see how quickly the most common transcendental functions (exp, log, cos, sin, atan, gamma) can be computed in this range (the old test code, now obsolete, is available here).

There are three main ideas. The first is simply to do arithmetic very efficiently, only using the GMP functions that are written in assembly (mpn_add_n, mpn_addmul_1, mpn_mul_basecase, etc.), avoiding divisions like the plague, eliminating temporary memory allocations and copies, and so on.

The second is to use complexity reduction techniques aggressively. For elementary (or hypergeometric) functions, one can get a complexity of $d^{2.333}$ instead of the naive $d^3$ at $d$-digit precision by combining argument reduction with baby-step/giant-step polynomial evaluation. With the proper low-level implementation, this should be effective essentially from $d = 1$.

The third is to take advantage of precomputation to speed up repeated evaluations. RAM is cheap, and in the range up to 100 digits or so, the space needed to speed up the most commonly used functions by large factors is measured in kilobytes. Of course, use of lookup tables increases cache pressure, but a cache miss is much cheaper than a whole sequence of operations on multi-limb integers, so this tradeoff is generally worth it (some benchmarking I did a while back also indicated that prefetch instructions worked wonderfully for this purpose, at least on my CPU).

I have now written a new basecase implementation for exp(), substantially improved compared to my previous experiments; the code can be found here. Using a modest 224 KB lookup table (with a small modification, this could be trimmed to about 160 KB) for 16-bit argument reduction allows going up to a precision of about 384 bits, or 115 decimal digits, on a 64-bit system. In other words plenty for most practical purposes. The lookup table itself takes just a few microseconds to generate dynamically.

Here is how it compares to some other libraries: the system double-precision exp, the double-double (dd) and quad-double (qd) types from D. H. Bailey’s QD library (measured using the supplied tests/qd_timer program), MPFR, and mpmath (using the Python implementation with GMPY types, not the code in Sage which wraps MPFR). All timings are in nanoseconds.

prec system dd qd mpfr mpmath new code
53 49 4300 20000 180
64 5500 20000 190
106 790 7200 24000 330
128 7800 26000 340
192 10000 29000 520
212 6500 10000 31000 690
256 11000 31400 780
320 12000 34000 1100
384 14000 38000 1500

A couple of things need to be pointed out.

Right now, this code is experimental. I have not tested it thoroughly, and it does not come with an error bound (however, the error is never larger than a few ulp, and a bound can be bounded quite easily, for which I intend to include code later on).

All other libraries take an arbitrary $x$, not necessarily restricted to $[0,\ln 2)$. The initial argument reduction is a division with remainder, where a low-precision value of $1/\ln 2$ can be precomputed, meaning that only a single mpn_submul_1 plus some adjustments should be necessary.

The system exponential function, MPFR and mpmath also effectively use higher internal precision to compute the exponential with 0.5 or 1 ulp error, so to be fair a few guard bits should be added.

Even after adding all remaining corrections, the code can actually probably be made faster. At 1 or 2 limb precision, replacing all calls to GMP functions with inline assembly should improve performance; the current code also contains a redundant division, which should save 60 nanoseconds or so when removed.

Among the elementary functions, exp is the easiest to implement; cos, sin, atan and log can be computed using similar principles, although there is some more overhead, so they will perhaps be a factor two slower. The next logical step is to add a second version of exp, for precisions between 384 bits and a few thousand digits. It will necessarily be a bit slower than the basecase version, but should still be a decent factor faster than the MPFR function in this range.