# Tradeoffs in ball arithmetic

October 25, 2013

The conventional way to represent real numbers in a computer is by floating-point approximations. Floating-point arithmetic has many useful properties: it’s efficient, widely available, and well understood. The drawback of floating-point arithmetic is that it generally only gives approximations of the mathematically exact quantities one really wants to compute. In many applications, it’s crucial to be able to estimate the numerical error (i.e. the difference between the computed approximation and the exact result), at least heuristically. In computer algebra and computational number theory, it’s usually desirable to be able to bound the numerical error fully rigorously, to allow algorithmically proving statements involving real numbers.

One way to obtain bounds for a given numerical algorithm is to carry out a formal analysis of the floating-point error propagation. Loosely speaking, error analysis involves estimating both the round-off error and truncation error. Here’s an example: if we approximate $\exp(1)$ by $\sum_{k=0}^{N-1} \frac{1}{k!}$, the round-off error is the total error in the approximation of this finite sum resulting from floating-point additions, multiplications and divisions (the error depends on the order of operations); the truncation error is $\sum_{k=N}^{\infty} \frac{1}{k!}$. The total error can be bounded by the sum of (an upper bound for) the round-off error and (an upper bound for) the truncation error.

Error analysis done by hand is very time-consuming and prone to human error (at least unless you are French). With a moderately complex algorithm, doing a full error analysis can be completely impractical, at least if one wants bounds that are anywhere near useful. The standard solution to this problem is interval arithmetic: rather than representing a real number $x$ by a floating-approximation $\tilde x$, we represent it by an interval $[a,b]$ such that $x \in [a,b]$. Operations on floating-point intervals are implemented such that they respect inclusions, i.e. for a function $f(x)$ and an input $[a,b]$, we compute $[c,d]$ such that $[c,d] \subseteq f([a,b]) \equiv \{ f(t) | t \in [a,b] \}$.

After a bit of careful work to implement “atomic” functions for intervals (addition, multiplication, and so on), we can evaluate arbitrarily complex formulas with guaranteed error bounds by just composing atomic functions. When approximating limit values by finite formulas, we have to add bounds for the truncation errors, but this is often straightforward.

Naturally, there are some caveats. Special care is needed when doing case distinctions: an exact number is either negative, zero or positive, but an interval can include numbers with different signs. The accuracy of an output interval is highly sensitive to the sequence of operations: for example, $x – x$ does not generally evaluate to zero. In general, an interval can be correct without being useful. Iterative algorithms that converge in exact arithmetic, or even in floating-point arithmetic, do not necessarily converge in interval arithmetic, and can produce useless output such as $[-\infty,\infty]$.

Ball arithmetic is a version of interval arithmetic in which an interval is represented as $[x-r, x+r]$, or $x \pm r$ for short (where $x$ is the midpoint and $r$ is the radius). The ball representation has advantages and disadvantages compared to the endpoint representation. It is less flexible: for example, there is no way to represent $[1,\infty]$ as a ball (the closest thing is $0 \pm \infty$). Ball arithmetic becomes advantageous at high precision: the midpoint can be stored with full precision while the radius can be rounded to a few digits. Asymptotically, ball arithmetic then costs (1+ε)X as much as arbitrary-precision floating-point arithmetic, but with the benefit of automatic error bounds, whereas endpoint-based interval arithmetic costs roughly 2X as much. At low precision, ball arithmetic can have more overhead than endpoint-based interval arithmetic, however.

My primary goal with the Arb library has been to develop and implement algorithms for rigorous numerics that scale well to extremely high precision, taking advantage of the ball representation to automate as much of the error analysis as possible. A secondary goal is to make ball arithmetic competitive with traditional floating-point or interval libraries at lower precision (up to a few hundred digits), although I’m not quite there yet.

It took me a long time to decide on the data structures to use. It was easy to decide on using binary floating-point numbers as a base type (and not decimals, fixed-point numbers, fractions, continued fractions, or something else — the only serious alternative would be radix $2^{64}$ floating-point numbers as used by the mpf type in GMP, which avoids some bit fiddling but is less economical at low precision). I decided to implement a new floating-point type (fmpr) instead of just using MPFR numbers, chiefly for the following reasons:

• MPFR allocates an array of limbs on the heap, even when the value would fit in a single word
• MPFR stores numbers zero-padded to full precision, making them inefficient for representing varying-size integers
• MPFR exponents are limited to a single word. Working around overflow/underflow in MPFR can be a bit annoying.

An fmpr is just a pair of FLINT integers (fmpz), which fixes the above issues. It’s also a quite simple and elegant solution overall. The drawback is that there is some unnecessary overhead for numbers between approximately 64 and 1000 bits (for this reason, I intend to rewrite the fmpr type in the future, sacrificing some elegance for performance). Two other small differences compared to MPFR are that negative zero is not supported (it could be supported, but I don’t really see the point), and that the precision is passed as a separate parameter rather than being stored with each number.

When it comes to balls, do you represent them as $a 2^b \pm c 2^d$ or $(a \pm c) 2^b$ or $a 2^b (1 \pm c 2^d)$? I settled for the first representation in the fmprb type after some experiments. The second representation can be a bit more efficient, but after working with it for some time, I started vigorously hating not being able to modify the midpoint and the radius independently of each other. The third representation would be convenient for multiplications, but is probably worse in all other respects.

With the ball type in place, I decided to just make complex numbers pairs of real balls, and polynomials and matrices as arrays of real or complex numbers. There would be advantages to doing it differently. Storing a complex number as a midpoint with a single radius would be more compact (three floating-point numbers instead of four), and proper complex balls (i.e. actual, circular regions in the complex plane) sometimes provide tighter enclosures rather than rectangular “balls”. But the convenience of being able to easily separate a complex number into its real and imaginary parts seemed more important (most operations are implemented by separating real and imaginary parts, e.g. $\exp(a+bi) = e^a (\cos b + i \sin b)$). It’s also convenient to be able to represent a complex number with an exact (e.g. zero) real and inexact imaginary part, or vice versa.

More or less independent of the data structures used for ball arithmetic, a vast range of choices for the semantics are then available, giving different tradeoffs (in different situations) between speed, accuracy and convenience.

The basic contract of ball arithmetic is simply the following: given an input ball $X = x \pm r$ and a function $f$, $f(X)$ is approximated by a ball $Y = y \pm s$ such that $f(X) \subseteq Y$. This is a rather weak specification, though, as it does not allow us to reason about convergence. Ideally, if $x$ is fixed, we want $r \to 0$ to imply that $s \to 0$ (or more strongly, that $s = O(r)$). Such a convergence property clearly holds for arithmetic operations and elementary functions (away from singularities), as long as those functions are not implemented completely stupidly and the working precision is at least $-\log_2 r$ bits. We can therefore often assume such behavior implicitly. But for more complex mathematical operations, convergence guarantees are difficult to provide (whether for theoretical reasons, or for reasons of implementation difficulty): if the output is useful, good, and if not, you’ll just have to try something else.

For the semantics of basic arithmetic and standard mathematical functions, I can think of several approaches, including the following:

One (“best midpoints”) is to do arithmetic operations on the midpoints (as if doing regular floating-point arithmetic), and then separately computing the radius bounds. This is deterministic and nice. In many cases, the result of a ball computation will be something like $0.999999999999467 \pm 0.001$, where the error bound is much larger than the true error. The precisely computed midpoint is useful for algorithm development, as it allows finding cases where the error bounds could be improved. And for use in actual numerical algorithms, if you guess that the midpoint is more accurate than the radius indicates, you can sometimes just disregard the radius and do something useful with the midpoint (for example, use it as an initial value to another algorithm).

This approach has two drawbacks. A: if we only care about the digits that are provably correct in the end, keeping the extra digits around is wasteful. B: insisting that the midpoint be the best possible means sacrificing overall accuracy of the ball (if we want the endpoints to be the tightest possible, we generally have to move the midpoint).

Another approach (“trimmed best midpoints”) addresses drawback A. We compute the midpoint by a floating-point operation, and the radius separately, but use a floating-point precision for the midpoint that corresponds to the accuracy implied by the radius bound (plus a few more digits). Yet another approach (“best balls”) addresses drawback B: we simultaneously compute a midpoint and radius such that the resulting ball tightly encloses the mathematically exact function image.

There are still more choices. Should one compute the “best radius”, or just a “generic radius”? A typical “generic” radius bound for $f(x \pm r)$ would be $r C$ where $C = \sup_{|t-x| \le r} |f’(t)|$, plus a rounding error bound. This converges to the best bound when $r \to 0$, but can be wildly inaccurate when $r$ is large. Computing the best radius bound (even just approximately) generally requires much more work, although there are situations where it can be worth it.

Consider evaluating $\sin(10 \pm 3)$. With “best midpoints”, the output would be something like $-0.544021110889370 \pm 3.001$ (or $\ldots \pm 1.545$ with a “best radius” bound). With “trimmed best midpoints”, the output would be something like $-0.544 \pm 3.002$ (or $-0.544 \pm 1.545$ with a “best radius” bound). With “best balls”, the output would be something like $0 \pm 1$. These results are quite different, and it’s easy to construct a sequence of calculations where one becomes vastly superior to the other.

So far in Arb, I’m using “best midpoints” with “generic radius bounds”, at least for arithmetic operations and most of the time for elementary functions. In some cases, different approaches are used. Trimming can always be done manually by the user to improve performance, but this doesn’t help when the trimming should be done in the middle of an algorithm implemented in the library itself. I’m considering switching to “trimmed best midpoints” in the future (or perhaps keeping the current behavior by default and making automatic trimming optional).