fredrikj.net / blog /

# The Arb matrix reloaded

*June 29, 2018*

Basic linear algebra in Arb is getting a significant upgrade in Arb 2.14 which is due to be released within a few weeks. The main improvements are more accurate linear solving and faster matrix multiplication.

## Linear solving

The current development version
of Arb (2.14-git)
offers a dramatic improvement for solving large well-conditioned linear systems.
To illustrate this, the following program constructs and solves
$AX = B$ where $A$ is an $n \times n$ real matrix and
$B$ is an $n \times 1$ vector.
The program starts with 53-bit precision and doubles the precision until
`arb_mat_solve` succeeds to prove that
the matrix is invertible.
The program then terminates and prints the first, middle and last entry in the solution
vector with up to 20 digits.
We take $n = 1000$ and we set
$A$ to the size-$n$ discrete cosine transform (DCT) matrix
with entries
$$A_{j,k} = \sqrt{\frac{2}{n}} \cos\left(\tfrac{\pi j}{n} \left(k+\tfrac{1}{2}\right)\right).$$
We set $B$ to the vector of all ones.
No special properties of the DCT matrix are being used here;
it is just a convenient test matrix that is well-conditioned
(the function `arb_mat_dct` used to create this matrix is a new
addition to Arb; some other functions to generate useful special
matrices have also been added).

#include "arb_mat.h" #include "flint/profiler.h" int main() { arb_mat_t A, B, X; slong prec, n; n = 1000; arb_mat_init(A, n, n); arb_mat_init(B, n, 1); arb_mat_init(X, n, 1); TIMEIT_START for (prec = 53; ; prec *= 2) { printf("prec = %ld\n", prec); arb_mat_dct(A, 0, prec); arb_mat_ones(B); if (arb_mat_solve(X, A, B, prec)) { arb_printn(X->rows[0], 20, 0); printf("\n"); arb_printn(X->rows[X->r/2], 20, 0); printf("\n"); arb_printn(X->rows[X->r-1], 20, 0); printf("\n"); break; } } TIMEIT_STOP arb_mat_clear(A); arb_mat_clear(B); arb_mat_clear(X); return 0; }

This is the before and after comparison:

Arb 2.13 | Arb 2.14-git |
---|---|

prec = 53 prec = 106 prec = 212 prec = 424 prec = 848 prec = 1696 [+/- 2.99e+14] [0.031587680085551469491 +/- 4.08e-22] [0.0092445347862472445663 +/- 2.78e-24] cpu/wall(s): 442.749 442.805 |
prec = 53 [28.47975797950 +/- 7.97e-12] [0.031587680086 +/- 5.06e-13] [0.0092445347862 +/- 9.55e-14] cpu/wall(s): 18.348 18.349 |

The old version of `arb_mat_solve`
uses direct Gaussian elimination to compute the LU factorization
of the system matrix.
This is numerically
unstable in interval arithmetic in general, even when the system is well-conditioned.
The precision needed for convergence scales like $O(n)$
and in this case exceeds 1000 bits.
In fact, even after the solving succeeds, the top entry in the solution
vector is wildly imprecise and we would have to double the precision
one more time to get it precise.

The new version of `arb_mat_solve` completes **24 times faster**,
and more importantly,
succeeds directly at 53-bit precision with **only a few digits
of precision loss**!
As a reference point, we can evaluate the matrix-vector product $A^T B$ (which is mathematically
equivalent to $A^{-1} B$ for this $A$ since the DCT matrix is orthogonal); the corresponding
entries then come out as:

[28.47975797950 +/- 2.78e-12] [0.0315876800856 +/- 6.38e-14] [0.0092445347862 +/- 6.03e-14]

We see that the error bounds from `arb_mat_solve`
are just a small factor larger than the optimal bounds.
This is a big deal in applications where the input matrices may be
expensive to generate at high precision.
In fact, `arb_mat_solve` still succeeds on the first try
if we set the starting precision to 20 bits (and this is even faster):

prec = 20 [2.8e+1 +/- 0.505] [0.032 +/- 9.06e-4] [0.009 +/- 6.61e-4] cpu/wall(s): 12.385 12.396

The main ingredient in the new `arb_mat_solve` is
the Hansen-Smith algorithm from 1967.
Instead of solving $AX = B$ directly, we compute an approximate inverse
matrix $R \approx A^{-1}$ and then solve the preconditioned system $(RA)X = RB$.
Gaussian elimination on the matrix $RA \approx I$ is numerically stable
even in interval arithmetic (in fact, in some cases we can just bound the
perturbation away from the identity matrix and skip the last solving
step entirely). Computing $R$ can be done heuristically with the usual
floating-point Gaussian elimination,
and ball arithmetic only enters when computing $RA$, $RB$ and
in the last solving step.
The credit for implementing this algorithm in Arb goes to a contributor
who wished to remain anonymous.

The drawback, of course, is that we are doing 3-4 times more work than just running Gaussian elimination once. This slowdown is offset to some extent by another improvement: I have rewritten Gaussian elimination to use a block recursive strategy which takes advantage of matrix multiplication, and matrix multiplication has been optimized (see below).

The following plot and table compare the performance on the DCT matrix benchmark as the dimension $n$ varies:

$n$ | Time, Arb 2.13 | Time, Arb 2.14-git | Speedup |
---|---|---|---|

10 | 5.68e-05 | 0.000214 | 0.26 |

30 | 0.00107 | 0.00329 | 0.33 |

100 | 0.111 | 0.0558 | 1.99 |

300 | 3.658 | 0.845 | 4.33 |

1000 | 442.749 | 18.348 | 24.1 |

For sizes from about to 6 to 50 where the old algorithm succeeds without increasing the precision (for this particular matrix), the new algorithm is about three times slower, but the improvement in accuracy is dramatic. Here showing the first entry in the output vector, for different $n$:

n Arb 2.13 Arb 2.14-git ====================================================================== 5 [2.1275693661542 +/- 5.10e-14] [2.1275693661542 +/- 5.10e-14] 6 [2.3122784967091 +/- 8.64e-14] [2.3122784967091 +/- 2.12e-14] 7 [2.482712359015 +/- 3.50e-13] [2.4827123590148 +/- 5.33e-14] 8 [2.641845987495 +/- 7.03e-13] [2.6418459874955 +/- 2.08e-14] 9 [2.79172023714 +/- 3.30e-12] [2.7917202371376 +/- 4.33e-14] 10 [2.93381472088 +/- 9.40e-12] [2.9338147208785 +/- 1.04e-14] 15 [3.55934770 +/- 1.52e-9] [3.5593476994815 +/- 3.07e-14] 20 [4.089760 +/- 5.21e-7] [4.0897599643974 +/- 6.84e-14] 25 [4.559 +/- 4.29e-4] [4.5586791661163 +/- 6.77e-14] 30 [5.0 +/- 0.0546] [4.9835836367848 +/- 5.87e-14] 35 [+/- 11.3] [5.3749570875008 +/- 5.67e-14] 40 [+/- 6.55e+3] [5.7396790611281 +/- 6.55e-14] 45 [+/- 3.93e+5] [6.082554137056 +/- 2.18e-13]

Using the slower but more accurate algorithm by
default seems like a sensible tradeoff since losing several
significant digits
may be much more costly overall when propagated through a larger computation.
It should be noted that the code
switches back to direct Gaussian elimination when the
precision is much higher than the expected precision loss, so no slowdown
will occur when solving 20 by 20 systems at 1000-digit precision,
for example.
Users also have the option of selecting an algorithm
manually by calling `arb_mat_solve_lu` (direct Gaussian elimination)
or `arb_mat_solve_precond` (the Hansen-Smith algorithm)
instead of the default `arb_mat_solve`.

### Versus the machine

With $n = 1000$ and 53-bit precision, the new `arb_mat_solve`
is "only" 150 times slower than
`numpy.linalg.solve` (LAPACK `dgesv`) which takes 0.125 seconds
to compute a non-certified solution.
Roughly one order of magnitude is due to the inherent overhead of
the certified solving algorithm,
and another order of magnitude is due to using arbitrary-precision arithmetic.
The second item is certainly improvable: a factor two could be saved
by using `double` internally for the
approximate solving, and another factor three could probably
be saved with use of BLAS and other optimizations in FLINT.
This would bring the factor down to 25 which is in the ballpark
of machine-precision interval software such as
INTLAB
which reportedly solves a size-1000 system 12 times slower than
MATLAB's backslash operator.
In principle, the solving in Arb at precision up to 53 bits
could be made nearly as fast as
INTLAB by converting the matrices to `double` format
and doing all calculations (including certifications) in this format internally.
Implementing this correctly is complicated, however,
and the design is more conservatively structured around
arbitrary-precision arithmetic for now.

### Ill-conditioned systems

For solving ill-conditioned systems, increasing the precision
(typically to $O(n)$ bits)
is unavoidable.
If we change the DCT matrix to
a Hilbert matrix, the old and new versions of `arb_mat_solve`
compare as follows:

$n$ | Time, Arb 2.13 | Time, Arb 2.14-git | Speedup |
---|---|---|---|

10 | 0.00010 | 0.000203 | 0.49 |

30 | 0.00529 | 0.0128 | 0.41 |

100 | 0.5 | 1.013 | 0.49 |

300 | 85.362 | 87.791 | 0.97 |

Again, a slight slowdown for ill-conditioned systems is a necessary tradeoff
to optimize for well-conditioned systems and for global efficiency,
and users who know that they are dealing with a system of this type
in the range where direct Gaussian elimination is faster
can always call `arb_mat_solve_lu` manually.

### Matrix inversion

The improved linear solving also extends to matrix inversion, which just amounts to solving $AX = I$. In fact, the new code performs even better in this case compared to the old code: with the old code, solving with a square matrix right-hand side is about twice as expensive as with a vector right-hand side, but with the new algorithm, a matrix right-hand side is barely more expensive than a vector.

If we change the DCT benchmark code above
to compute $A^{-1}$ (by calling `arb_mat_inv`
instead of `arb_mat_solve`), we find that Arb 2.14-git is
**50 times faster** than Arb 2.13 when $n = 1000$.

## Matrix multiplication

Matrix multiplication in Arb previously just used the classical
algorithm, executed in ball arithmetic (`arb_mat_mul_classical`).
There is also a multithreaded version (`arb_mat_mul_threaded`)
of classical multiplication that helps for large
matrices if the user has a lot of cores available.
However, the classical algorithm is not ideal for arbitary-precision
arithmetic, because there is a huge amount of overhead for each
arithmetic operation.

Polynomial multiplication in Arb uses the trick of breaking
the polynomials into blocks of coefficients that have nearly the
same magnitude, rescaling these blocks to have integer coefficients,
and multiplying the individual blocks using FLINT's blazingly
fast $\mathbb{Z}[x]$ (`fmpz_poly`) arithmetic.
The new matrix multiplication code in Arb 2.14-git uses an analogous
scheme for matrices, multiplying blocks of coefficients
with similar magnitude via FLINT's `fmpz_mat`.

### The algorithm in more detail

The code implements a scaling trick for individual blocks: if $A_i$ and $B_i$ are the submatrices to be multiplied, the algorithm inspects the exponent changes along rows of $A_i$ and along columns of $B_i$ to determine diagonal matrices (specifically, with power-of-two entries) $D_1$ and $D_2$ such that $D_1 A_i$ and $B_i D_2$ are integer matrices with nearly minimal height. It then evaluates $A_i B_i = D_1^{-1} (D_1 A_i) (B_i D_2) D_2^{-1}$.

If the input matrices $A$ and $B$ have coefficients of nearly uniform
magnitude (in individual rows of $A$ and columns of $B$),
a single block $A = A_i$ and $B = B_i$ can be used.
The complicated part of the algorithm is to split non-uniformly scaled matrices
into blocks that are optimal for efficiency.
On one hand, the blocks should be as large as possible,
since this makes `fmpz_mat` matrix multiplication more advantageous.
On the other hand, when the coefficient magnitudes vary, larger blocks result
in larger integers which slows things down.
I have implemented a fairly simple heuristic which consists of
greedily scanning forward
for maximal consecutive spans of columns of $A$ and rows of $B$ that fit within
a limited exponent range.
I make no claims that this is the best way to do it, but it works reasonably well.

Integer matrix multiplication is a bit faster already for small matrices since it has less overhead than ball arithmetic. It becomes much faster for huge matrices (say $n > 100$) since it can use a multimodular algorithm performing one or several matrix multiplications modulo single-word primes. FLINT also uses Strassen multiplication so the complexity is in fact $O(n^{2.81})$ and not $O(n^3)$.

Integer matrices are used for the multiplying the matrices
of midpoints. When the
matrices are inexact, say $A = A_m \pm A_r$ and $B = B_m \pm B_r$,
the two additional matrix products $(|A_m| + A_r) B_r$ and $A_r |B_m|$
are computed by splitting into scaled blocks
and using `double` arithmetic.
This is faster and works well since all entries are nonnegative
making it trivial to compute tight upper bounds.

One could implement Strassen multiplication and other tricks directly in Arb, but this is not a good idea in most cases since it is less numerically stable than classical matrix multiplication. The new matrix multiplication code is designed to give at least as good error bounds as the classical algorithm for every individual coefficient in the output. In fact, the error bounds are usually a bit better since fewer internal rounding operations are performed.

Future improvements are certainly possible. Matrix multiplication in FLINT could probably be sped up by about a factor three using BLAS and other optimizations (FLINT currently uses non-vectorized 64-bit integer arithmetic for modular matrix multiplications and has unoptimized basecase code for modular reduction and Chinese remaindering).

### Some benchmarks

I have done a bit of benchmarking with three families of test matrices:

- Hilbert matrices, with entries $A_{i,j} = 1/(i+j+1)$. These are (for the purposes of multiplication) generic real matrices with uniformly scaled entries, which can be multiplied using a single block.
- Small-integer matrices, with entries $A_{i,j} = i + j + 1$.
- Pascal matrices (times a constant factor), with entries $A_{i,j} = \pi \cdot {i+j \choose i}$. Since binomial coefficients grow rapidly (${2 n \choose n} \approx 4^n$), these matrices are a bad case for the multiplication algorithm, requiring splitting into many smaller blocks when $n$ is large. The factor $\pi$ makes the entries generic full-precision real numbers instead of integers.

For this test, we set both operands to the same matrix, and we compare multiplication speed at a precision of 53, 128, 333 and 1024 bits.

#### Hilbert matrices

The following plot shows the timings for Hilbert matrices. The different levels of precision are indicated with different colors and in the left subplot, the timings for the old code are shown as dashed curves while the timings for the new code are shown as solid curves:

At 1-2 words of precision, the new code is five times faster than the old code even for relatively small matrices and a factor 20 faster for matrices with dimension $n$ in the thousands. The time for an $n = 1000$ matrix multiplication is down from two minutes to 4-10 seconds. At hundreds of digits, the advantage of the new algorithm is starting to diminish but there is still a large speedup as soon as the dimension also gets large.

$p$ | $n$ | Time, Arb 2.13 | Time, Arb 2.14-git | Speedup |
---|---|---|---|---|

53 | 10 | 7.8e-05 | 3.2e-05 | 2.44 |

53 | 30 | 0.00191 | 0.00036 | 5.31 |

53 | 100 | 0.075 | 0.0103 | 7.28 |

53 | 300 | 2.014 | 0.172 | 11.7 |

53 | 1000 | 89.329 | 4.681 | 19.1 |

$p$ | $n$ | Time, Arb 2.13 | Time, Arb 2.14-git | Speedup |
---|---|---|---|---|

128 | 10 | 8.9e-05 | 8.3e-05 | 1.07 |

128 | 30 | 0.00229 | 0.00135 | 1.70 |

128 | 100 | 0.091 | 0.0231 | 3.94 |

128 | 300 | 2.428 | 0.321 | 7.56 |

128 | 1000 | 95.169 | 7.443 | 12.8 |

$p$ | $n$ | Time, Arb 2.13 | Time, Arb 2.14-git | Speedup |
---|---|---|---|---|

333 | 10 | 0.000129 | 0.000103 | 1.25 |

333 | 30 | 0.0043 | 0.00222 | 1.94 |

333 | 100 | 0.155 | 0.054 | 2.87 |

333 | 300 | 6.038 | 0.708 | 8.53 |

333 | 1000 | 257.607 | 16.122 | 16.0 |

$p$ | $n$ | Time, Arb 2.13 | Time, Arb 2.14-git | Speedup |
---|---|---|---|---|

1024 | 10 | 0.00056 | 0.00062 | 0.90 |

1024 | 30 | 0.0093 | 0.007 | 1.33 |

1024 | 100 | 0.371 | 0.218 | 1.70 |

1024 | 300 | 11.111 | 2.41 | 4.61 |

1024 | 1000 | 454.495 | 49.298 | 9.22 |

#### Small-integer matrices

The speedup is even more dramatic when multiplying Arb matrices
with small integer entries. This is now essentially as fast
as using `fmpz_mat` directly, up to the conversion overhead.
The plot below compares the performance for multiplying the matrices
with entries $A_{i,j} = B_{i,j} = i + j + 1$.
A single $n = 1000$ multiplication now takes one second, down from a minute,
and the speedup continues to grow to a factor 100 beyond $n = 2000$.

$p$ | $n$ | Time, Arb 2.13 | Time, Arb 2.14-git | Speedup |
---|---|---|---|---|

53 | 10 | 4e-05 | 1.26e-05 | 3.17 |

53 | 30 | 0.0011 | 0.000121 | 9.09 |

53 | 100 | 0.045 | 0.00188 | 23.9 |

53 | 300 | 1.495 | 0.035 | 42.7 |

53 | 1000 | 50.957 | 0.955 | 53.4 |

53 | 2000 | 568.174 | 6.412 | 88.6 |

(Here, the precision does not matter; the matrices and their product are exactly representable at 53-bit precision, and increasing the precision does not cause any slowdown.)

#### Pascal matrices

With Pascal matrices (multiplied by the factor $\pi$), the speedup is modest due to the need to use many blocks. Nevertheless, for large enough $n$, the new algorithm saves a factor 2-4 depending on the precision.

$p$ | $n$ | Time, Arb 2.13 | Time, Arb 2.14-git | Speedup |
---|---|---|---|---|

53 | 10 | 8.1e-05 | 3.8e-05 | 2.13 |

53 | 30 | 0.00218 | 0.00148 | 1.47 |

53 | 100 | 0.088 | 0.05 | 1.76 |

53 | 300 | 2.569 | 1.305 | 1.97 |

53 | 1000 | 96.37 | 40.518 | 2.38 |

$p$ | $n$ | Time, Arb 2.13 | Time, Arb 2.14-git | Speedup |
---|---|---|---|---|

128 | 10 | 8.9e-05 | 8.7e-05 | 1.02 |

128 | 30 | 0.00237 | 0.00275 | 0.86 |

128 | 100 | 0.092 | 0.064 | 1.44 |

128 | 300 | 2.819 | 1.92 | 1.47 |

128 | 1000 | 100.019 | 48.427 | 2.07 |

$p$ | $n$ | Time, Arb 2.13 | Time, Arb 2.14-git | Speedup |
---|---|---|---|---|

333 | 10 | 0.000145 | 0.000116 | 1.25 |

333 | 30 | 0.0043 | 0.00292 | 1.47 |

333 | 100 | 0.179 | 0.095 | 1.88 |

333 | 300 | 7.371 | 2.21 | 3.34 |

333 | 1000 | 288.91 | 77.335 | 3.73 |

$p$ | $n$ | Time, Arb 2.13 | Time, Arb 2.14-git | Speedup |
---|---|---|---|---|

1024 | 10 | 0.00036 | 0.00033 | 1.09 |

1024 | 30 | 0.0116 | 0.0098 | 1.18 |

1024 | 100 | 0.493 | 0.231 | 2.13 |

1024 | 300 | 16.088 | 4.965 | 3.24 |

1024 | 1000 | 453.439 | 154.31 | 2.94 |

## Complex matrices and more

Implementing fast multiplication, solving and matrix inverse
for complex matrices (`acb_mat`) is a work in progress
that will be finished before Arb 2.14 is released.

Determinants of real and complex matrices can also be improved using similar preconditioning tricks, but this requires a separate implementation which might be postponed until later.