fredrikj.net / blog /

# Faster high-precision arctangents

*December 7, 2014*

A quick Arb development
update here. Recently, I wrote a fast implementation of the arctangent function for
precision up to roughly 4096 bits.
At higher precision, the code would fall back to MPFR.
I have now written a custom implementation for the greater-than-4096-bit range.
For an argument close to unity, the new implementation is **1.5 to 1.8 times faster** than MPFR and
asymptotically uses **1/3** as much memory. Some timing measurements (with $x = \exp(1)/3$):

Precision (bits) | Old (s) | New (s) | Speedup |

4096 | 0.000191 | 0.000189 | 1.01 |

8192 | 0.00147 | 0.000871 | 1.69 |

16384 | 0.00451 | 0.00274 | 1.65 |

32768 | 0.0151 | 0.00861 | 1.75 |

65536 | 0.0554 | 0.0308 | 1.80 |

131072 | 0.162 | 0.0928 | 1.75 |

262144 | 0.492 | 0.278 | 1.77 |

524288 | 1.31 | 0.757 | 1.73 |

1048576 | 3.2 | 2.01 | 1.59 |

2097152 | 7.54 | 4.97 | 1.52 |

4194304 | 17.8 | 11.9 | 1.49 |

8388608 | 43.2 | 28.6 | 1.51 |

16777216 | 102 | 68.6 | 1.48 |

33554432 | 246 | 165 | 1.49 |

How is $\operatorname{atan}(x)$ computed? The first step is to replace $x$ by $1/x$ if $|x| > 1$. Secondly, we apply the argument-halving formula $$\operatorname{atan}(x) = 2 \operatorname{atan}\left(\frac{x}{1+\sqrt{1+x^2}}\right)$$ up to 8 times, that is until $|x| \le 2^{-8}$. The number 8 is a tuning parameter: it is sufficient to do a single step, but doing up to 8 steps turns out to be optimal at any precision between a few thousand bits and ten million bits. Next, we use the "bit-burst algorithm", repeatedly writing $$\operatorname{atan}(x) = \operatorname{atan}(u/2^r) + \operatorname{atan}(x')$$ where $$x' = \frac{x 2^r-u}{2^r+u x}$$ for $r = 16, 32, 64, \ldots$ to reduce the problem to evaluating $\operatorname{atan}(u/2^r)$ where $u$ is an integer with at most $r / 2$ bits. Each such arctangent is computed using binary splitting, and the power-of-two steps guarantee that this scheme has softly optimal asymptotic complexity.

The speedup is largely due to the following three differences compared to MPFR:

- In the argument-halving stage, the argument is kept as a numerator-denominator pair until the end, which saves a division in each iteration. This gives a 15% speedup.
- MPFR uses a different form of the bit-burst algorithm, which amounts to integrating a differential equation (the algorithm is described in Modern Computer Arithmetic) instead of using a functional equation for the arctangent. It appears that the functional equation approach is faster. I got the idea to use the functional equation from these presentation slides by Richard Brent.
- The binary splitting is done recursively, rather than iteratively.