Faster high-precision arctangents

December 7, 2014

A quick Arb development update here. Recently, I wrote a fast implementation of the arctangent function for precision up to roughly 4096 bits. At higher precision, the code would fall back to MPFR. I have now written a custom implementation for the greater-than-4096-bit range. For an argument close to unity, the new implementation is 1.5 to 1.8 times faster than MPFR and asymptotically uses 1/3 as much memory. Some timing measurements (with $x = \exp(1)/3$):

Precision (bits)	Old (s)	New (s)	Speedup
4096	0.000191	0.000189	1.01
8192	0.00147	0.000871	1.69
16384	0.00451	0.00274	1.65
32768	0.0151	0.00861	1.75
65536	0.0554	0.0308	1.80
131072	0.162	0.0928	1.75
262144	0.492	0.278	1.77
524288	1.31	0.757	1.73
1048576	3.2	2.01	1.59
2097152	7.54	4.97	1.52
4194304	17.8	11.9	1.49
8388608	43.2	28.6	1.51
16777216	102	68.6	1.48
33554432	246	165	1.49

How is $\operatorname{atan}(x)$ computed? The first step is to replace $x$ by $1/x$ if $|x| > 1$. Secondly, we apply the argument-halving formula $$\operatorname{atan}(x) = 2 \operatorname{atan}\left(\frac{x}{1+\sqrt{1+x^2}}\right)$$ up to 8 times, that is until $|x| \le 2^{-8}$. The number 8 is a tuning parameter: it is sufficient to do a single step, but doing up to 8 steps turns out to be optimal at any precision between a few thousand bits and ten million bits. Next, we use the "bit-burst algorithm", repeatedly writing $$\operatorname{atan}(x) = \operatorname{atan}(u/2^r) + \operatorname{atan}(x')$$ where $$x' = \frac{x 2^r-u}{2^r+u x}$$ for $r = 16, 32, 64, \ldots$ to reduce the problem to evaluating $\operatorname{atan}(u/2^r)$ where $u$ is an integer with at most $r / 2$ bits. Each such arctangent is computed using binary splitting, and the power-of-two steps guarantee that this scheme has softly optimal asymptotic complexity.

The speedup is largely due to the following three differences compared to MPFR:

In the argument-halving stage, the argument is kept as a numerator-denominator pair until the end, which saves a division in each iteration. This gives a 15% speedup.
MPFR uses a different form of the bit-burst algorithm, which amounts to integrating a differential equation (the algorithm is described in Modern Computer Arithmetic) instead of using a functional equation for the arctangent. It appears that the functional equation approach is faster. I got the idea to use the functional equation from these presentation slides by Richard Brent.
The binary splitting is done recursively, rather than iteratively.

fredrikj.net | Blog index | RSS feed | Follow me on Mastodon | Become a sponsor