
Further optimizing bigfloat matrix multiplication

October 17, 2024

How many tricks can we incorporate in multiprecision matrix multiplication?

The nfloat and nfloat_complex types in FLINT have recently gained fast matrix multiplication code, improving on the already-decent classical matrix multiplication reported in the last post. As a result, here is the current speedup from using nfloat_complex instead of acf as the floating-point ground type when calling gr_mat_nonsingular_solve to solve a linear system $Ax = b$, where $A$ is a random complex $n \times n$ matrix:

prec \ n      2      4      8     16     32     64    128    256    512   1024
      64  1.906  2.232  2.265  2.125  1.867  1.766  1.800  1.746  1.622  1.330
     128  1.581  2.104  2.235  2.471  2.096  1.992  2.059  1.854  1.610  1.363
     192  1.823  2.181  3.439  4.088  3.573  3.453  3.575  2.948  2.158  1.870
     256  1.679  2.043  2.864  3.326  3.045  3.121  3.392  2.605  1.785  1.634
     320  1.534  1.757  2.659  2.991  2.845  2.998  3.172  2.878  2.056  1.633
     384  1.620  1.748  2.134  2.496  2.533  2.813  2.927  2.078  1.921  1.682
     512  1.596  1.808  2.151  2.506  2.421  2.648  3.094  2.617  2.065  1.717
    1024  3.098  1.818  1.859  2.019  2.291  2.730  2.741  2.439  1.976  1.678
    1536  2.550  1.933  1.895  1.946  2.185  2.469  2.465  2.243  1.888  1.626
    2048  2.195  2.390  2.005  2.135  2.320  2.549  2.402  2.038  1.712  1.561
    2560  2.201  2.164  1.902  2.072  2.265  2.656  2.527  2.163  1.862  1.667
    3072  1.745  1.830  1.803  1.974  2.236  2.508  2.444  2.073  1.779  1.588
    4096  1.481  1.519  1.587  1.785  2.128  2.335  2.745  2.433  2.115  1.855

Over significant ranges of sizes and precisions, we now get more than a 2x speedup over acf, and in some cases more than 3x.

For large enough $n$ (beyond a cutoff which varies roughly between 50 and 500 depending on the precision), nfloat and nfloat_complex switch to multimodular integer matrix multiplication, the same trick used by arf and acf, so much of the improvement seen in the table comes from having better matrix multiplication for small and medium $n$.
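
To illustrate the idea (not FLINT's actual implementation, which operates on fixed-point limb data and chooses FFT-friendly primes), here is a minimal Python sketch of the multimodular trick: reduce the matrices modulo several primes, multiply each residue matrix, and reconstruct the exact integer product by the Chinese remainder theorem. All function names here are made up for the sketch.

```python
# Sketch of multimodular integer matrix multiplication (illustrative only).
from functools import reduce

def mat_mul_mod(A, B, p):
    """Classical matrix multiplication over Z/pZ."""
    n, m, k = len(A), len(B[0]), len(B)
    return [[sum(A[i][t] * B[t][j] for t in range(k)) % p for j in range(m)]
            for i in range(n)]

def crt_pair(r1, m1, r2, m2):
    """Combine x = r1 mod m1 and x = r2 mod m2 into x mod m1*m2."""
    t = (r2 - r1) * pow(m1, -1, m2) % m2
    return r1 + m1 * t, m1 * m2

def mat_mul_multimodular(A, B, primes):
    """Multiply integer matrices by working modulo several primes and using CRT.
    The product of the primes must exceed twice the largest |entry| of A*B."""
    images = [mat_mul_mod(A, B, p) for p in primes]
    M = reduce(lambda x, y: x * y, primes)
    n, m = len(A), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            r, mod = images[0][i][j], primes[0]
            for img, p in zip(images[1:], primes[1:]):
                r, mod = crt_pair(r, mod, img[i][j], p)
            # map from [0, M) to the symmetric range to recover signed entries
            C[i][j] = r - M if r > M // 2 else r
    return C
```

The point is that the modular products fit in machine words, so higher precision simply means more primes (and more small matrix products), rather than more expensive bignum arithmetic in the inner loop.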

Here, I've tried to put together all the time-saving tricks I'm aware of. This amalgamation of tricks deserves a catchy name, so let's call it the Fixed-point-delayed-normalization-truncating-Karatsuba-Mulders-Winograd-Waksman-Strassen-Bodrato-Karatsuba algorithm. Let's break it down:
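
To give a flavor of one of the headline ingredients, here is a hedged Python sketch of the Strassen step: each 2x2 block decomposition trades 8 recursive products for 7, at the cost of extra additions. FLINT applies this kind of recursion over fixed-point data with a much larger basecase cutoff; the code below is only illustrative.

```python
# Illustrative Strassen recursion for square matrices of power-of-two size.

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_mul_classical(A, B):
    n, m, k = len(A), len(B[0]), len(B)
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def strassen(A, B, cutoff=2):
    """Strassen multiplication; below the cutoff, fall back to the classical product."""
    n = len(A)
    if n <= cutoff:
        return mat_mul_classical(A, B)
    h = n // 2
    split = lambda M: ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
                       [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # 7 recursive products instead of 8
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22), cutoff)
    M2 = strassen(mat_add(A21, A22), B11, cutoff)
    M3 = strassen(A11, mat_sub(B12, B22), cutoff)
    M4 = strassen(A22, mat_sub(B21, B11), cutoff)
    M5 = strassen(mat_add(A11, A12), B22, cutoff)
    M6 = strassen(mat_sub(A21, A11), mat_add(B11, B12), cutoff)
    M7 = strassen(mat_sub(A12, A22), mat_add(B21, B22), cutoff)
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_sub(mat_add(M1, M3), M2), M6)
    return [C11[i] + C12[i] for i in range(h)] + [C21[i] + C22[i] for i in range(h)]
```

In practice the extra additions only pay off once the matrices are large enough, which is why a real implementation combines this recursion with a carefully tuned basecase.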

How much further can we optimize matrix multiplication? Well, there are still a few stones unturned:

Most of those tasks are for the distant future, however. A closer goal is to use the new matrix multiplication code to speed up arb, acb, arf and acf matrices as well. Even this will require a fair bit of work in order to obtain rock-solid error bounds.
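
One standard way to get such bounds (a possible approach, not necessarily the one FLINT will take) is a midpoint-radius product: multiply the midpoint matrices, and bound the radii by $\operatorname{rad}(C) \le |\operatorname{mid}(A)| \operatorname{rad}(B) + \operatorname{rad}(A) |\operatorname{mid}(B)| + \operatorname{rad}(A) \operatorname{rad}(B)$. The Python sketch below uses exact rationals so that midpoint rounding can be ignored; real arb/acb code must additionally bound the rounding error of the midpoint product.

```python
# Sketch of a ball (midpoint-radius) matrix product with a rigorous radius bound,
# using exact rationals so midpoint rounding does not enter the picture.
from fractions import Fraction

def ball_mat_mul(Am, Ar, Bm, Br):
    """Given midpoint/radius matrices of A and B, return midpoint/radius
    matrices of a ball enclosing A*B, using
    rad(C) <= |Am| Br + Ar |Bm| + Ar Br (entrywise, summed over the inner index)."""
    n, m, k = len(Am), len(Bm[0]), len(Bm)
    Cm = [[sum(Am[i][t] * Bm[t][j] for t in range(k)) for j in range(m)]
          for i in range(n)]
    Cr = [[sum(abs(Am[i][t]) * Br[t][j]
               + Ar[i][t] * abs(Bm[t][j])
               + Ar[i][t] * Br[t][j] for t in range(k)) for j in range(m)]
          for i in range(n)]
    return Cm, Cr
```

The bound follows from writing $A' = \operatorname{mid}(A) + \delta_A$ with $|\delta_A| \le \operatorname{rad}(A)$ entrywise (and similarly for $B$), so that $A'B' - \operatorname{mid}(A)\operatorname{mid}(B) = \operatorname{mid}(A)\,\delta_B + \delta_A \operatorname{mid}(B) + \delta_A \delta_B$.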



fredrikj.net  |  Blog index  |  RSS feed  |  Follow me on Mastodon  |  Become a sponsor