fredrikj.net / blog /

Benchmarking exact DFT computation

September 14, 2020

The discrete Fourier transform (DFT) is a standard numerical benchmark. What about the DFT with exact numbers? This post will compare the performance of Calcium, Sage's QQbar (exact field of algebraic numbers), Sage's SymbolicRing, SymPy, and Mathematica. The benchmark problem will be to evaluate $\textbf{x} - \operatorname{DFT}^{-1}(\operatorname{DFT}(\textbf{x}))$ where $$ \operatorname{DFT}(\textbf{x})_n = \sum_{k=0}^{N-1} \omega^{-kn} x_k, \quad \operatorname{DFT}^{-1}(\textbf{x})_n = \frac{1}{N} \sum_{k=0}^{N-1} \omega^{kn} x_k, \quad \omega = e^{2 \pi i / N}$$

for a few different input sequences $\mathbf{x} = (x_n)_{n=0}^{N-1}$. To make the benchmark meaningful, we require that $\textbf{x} - \operatorname{DFT}^{-1}(\operatorname{DFT}(\textbf{x}))$ simplifies to the zero vector. We thereby test the ability to contract (simplify) expressions and not just to expand. If a system fails to prove that some entry in the output vector is zero, it gets a (FAIL).

We test the following inputs: $x_n = n+2$, $x_n = \sqrt{n+2}$, $x_n = \log(n+2)$, $x_n = e^{2 \pi i / (n+2)}$, $x_n = \frac{1}{1+(n+2) \pi}$, and $x_n = \frac{1}{1+\sqrt{n+2} \pi}$, for lengths $N = 2,\ldots,20$ and $N = 30, 50, 100$ where reasonable.

Calcium example

This benchmark is implemented in the dft.c example program for Calcium. Here is a sample run with $N = 4$ and the sequence $x_n = \log(n+2)$:

fredrik@agm:~/src/calcium$ build/examples/dft 4 -verbose -input 2
DFT benchmark, length N = 4

[x] =
0.693147 {a where a = 0.693147 [Log(2)]}
1.09861 {a where a = 1.09861 [Log(3)]}
1.38629 {a where a = 1.38629 [Log(4)]}
1.60944 {a where a = 1.60944 [Log(5)]}

DFT([x]) =
4.78749 {a+c+3*d where a = 1.60944 [Log(5)], b = 1.38629 [Log(4)], c = 1.09861 [Log(3)], d = 0.693147 [Log(2)]}
-0.693147 + 0.510826*I {a*e-c*e-d where a = 1.60944 [Log(5)], b = 1.38629 [Log(4)], c = 1.09861 [Log(3)], d = 0.693147 [Log(2)], e = I [e^2+1=0]}
-0.628609 {-a-c+3*d where a = 1.60944 [Log(5)], b = 1.38629 [Log(4)], c = 1.09861 [Log(3)], d = 0.693147 [Log(2)]}
-0.693147 - 0.510826*I {-a*e+c*e-d where a = 1.60944 [Log(5)], b = 1.38629 [Log(4)], c = 1.09861 [Log(3)], d = 0.693147 [Log(2)], e = I [e^2+1=0]}

IDFT(DFT([x])) =
0.693147 {d where a = 1.60944 [Log(5)], b = 1.38629 [Log(4)], c = 1.09861 [Log(3)], d = 0.693147 [Log(2)], e = I [e^2+1=0]}
1.09861 {c where a = 1.60944 [Log(5)], b = 1.38629 [Log(4)], c = 1.09861 [Log(3)], d = 0.693147 [Log(2)], e = I [e^2+1=0]}
1.38629 {2*d where a = 1.60944 [Log(5)], b = 1.38629 [Log(4)], c = 1.09861 [Log(3)], d = 0.693147 [Log(2)], e = I [e^2+1=0]}
1.60944 {a where a = 1.60944 [Log(5)], b = 1.38629 [Log(4)], c = 1.09861 [Log(3)], d = 0.693147 [Log(2)], e = I [e^2+1=0]}

[x] - IDFT(DFT([x])) =
0       (= 0   T_TRUE)
0       (= 0   T_TRUE)
0       (= 0   T_TRUE)
0       (= 0   T_TRUE)

cpu/wall(s): 0.007 0.006
virt/peak/res/peak(MB): 36.28 36.28 8.75 8.75

A comment on the print format: 4.78749 {a+c+3*d where a = 1.60944 [Log(5)], b = 1.38629 [Log(4)], c = 1.09861 [Log(3)], d = 0.693147 [Log(2)]} means that the symbolic representation is $a+c+3d$ where $a = \log(5), c = \log(3), d = \log(2)$ and that the approximate numerical value is 4.78749.

The output of the inverse DFT is not exactly the same as the original vector $\mathbf{x}$, in terms of its symbolic representation: the fields of the coefficients have some "memory" of the other coefficients encountered in the computation, and $\log(4)$ has been transformed into $2 \log(2)$. Nevertheless, on subtraction of the two vectors, we get precisely zeros.

Benchmark implementation details

We evaluate the DFT by simple $O(N^2)$ summation. The purpose is to test symbolic arithmetic and simplification, and using an $O(N \log N)$ FFT algorithm does not matter much in this setting. We compute an array of roots of unity $1,\omega,\omega^2,\ldots,\omega^{2N-1}$ at the start for use in both the DFT and inverse DFT.

In Sage, SymPy and Mathematica, the result does not usually expand to 0 automatically, so we have to call .simplify_full(), FullSimplify[] or equivalent. In Calcium, it turns out that the result usually expands to 0 automatically, though in some cases we have to call ca_check_is_zero().

I don't have a copy of Mathematica (nor Maple), so for this benchmark I timed Mathematica on the free Wolfram Cloud which has a 1-minute timeout. I report (TIMEOUT) in cases where it did not finish. For other systems, I also report (TIMEOUT) where the computation did not finish in a few minutes. It should be noted that the hardware for the Wolfram Cloud is different from my own computer, so the absolute timings are not directly comparable (my CPU seems to be slower, but loads may vary). I used a single thread everywhere.

The systems cache various data, and as a result running the same code a second time can be much faster than the first time. For this benchmark, I have timed the first run (for each separate $N$) starting with an empty cache. In SymPy and Mathematica, this meant explicitly calling clear_cache() and ClearSystemCache[] at the start of the run. In Calcium, I include the time of setting up and destroying the context object.

I could have benchmarked the performance with a hot cache instead, but the data would not really be meaningful for cross-system comparison because different systems cache different kinds of information (theoretically, a system could cache the whole computation), and frankly it would have been too tedious to run the separate computations for all the systems and present the data. The downside is that Calcium looks worse than it should for small $N$, since in practice you'd often want to recycle the same context object for many calculations.

Test A: $x_n = n+2$

All timings are in seconds.

Plot of timings
N Sage QQBar Sage SymbolicRing SymPy Mathematica Calcium
2 0.0023 0.013 0.0012 0.00015 0.00040
3 0.0059 0.063 (FAIL) 0.039 0.00048
4 0.0068 0.020 0.015 0.00023 0.00044
5 0.019 0.060 (FAIL) 0.26 0.00051
6 0.020 0.047 (FAIL) 0.078 0.00049
7 0.031 (FAIL) (FAIL) 0.55 0.00050
8 0.033 0.11 1.1 0.057 0.00048
9 0.050 (FAIL) (FAIL) 0.52 0.00052
10 0.060 5.7 (FAIL) 0.49 0.00052
11 0.081 (FAIL) (FAIL) 0.87 0.00062
12 0.069 1.8 (FAIL) 0.26 0.00053
13 0.11 (FAIL) (FAIL) 1.2 0.00075
14 0.11 (FAIL) (FAIL) 0.82 0.00062
15 0.12 (FAIL) (FAIL) 26 0.00071
16 0.15 36 9.9 0.27 0.00068
17 0.22 (FAIL) (FAIL) 7.9 0.0011
18 0.18 (FAIL) (FAIL) 0.84 0.00071
19 0.27 (FAIL) (FAIL) 2.6 0.0014
20 0.22 124 (FAIL) 0.96 0.00081
30 0.49 (FAIL) (FAIL) 44 0.0013
50 1.9 (FAIL) (FAIL) (TIMEOUT) 0.0076
100 9.2 (FAIL) (FAIL) (TIMEOUT) 0.045

Comment: this is effectively a test of arithmetic in the cyclotomic field $\mathbb{Q}(\omega)$. I'm actually surprised that Sage's QQbar performs so poorly, because it is supposedly optimized for arithmetic in absolute number fields. Sage's SymbolicRing fails to simplify the result when $N$ has too many or large divisors, and SymPy only succeeds when $N$ is a power of two (this will be the case in all the other tests as well). For small $N$, Calcium's timings mostly show the overhead of precomputations: for example, if the context object were reused, $N = 4$ would take 0.000017 seconds (24 times faster). Anyway, Calcium is the clear champion when $N$ grows.

Test B: $x_n = \sqrt{n+2}$

Plot of timings
N Sage QQBar Sage SymbolicRing SymPy Mathematica Calcium
2 0.015 0.0096 0.0025 0.00017 0.00075
3 0.042 0.074 (FAIL) 0.052 0.0029
4 0.23 0.022 0.17 0.0069 0.0032
5 0.33 0.069 (FAIL) 0.45 0.011
6 8.9 0.56 (FAIL) 0.17 0.017
7 (TIMEOUT) (FAIL) (FAIL) 1.0 0.015
8 5.3 0.50 2.8 0.11 0.017
9 (TIMEOUT) (FAIL) (FAIL) 1.6 0.022
10 (TIMEOUT) 7.5 (FAIL) 0.75 0.033
11 (TIMEOUT) (FAIL) (FAIL) 2.5 0.045
12 (TIMEOUT) 3.0 (FAIL) 0.31 0.086
13 (TIMEOUT) (FAIL) (FAIL) 2.9 0.080
14 (TIMEOUT) (FAIL) (FAIL) 2.0 0.060
15 (TIMEOUT) (FAIL) (FAIL) (TIMEOUT) 0.083
16 (TIMEOUT) 46 24 0.58 0.090
17 (TIMEOUT) (FAIL) (FAIL) 12 0.14
18 (TIMEOUT) (FAIL) (FAIL) 2.8 0.14
19 (TIMEOUT) (FAIL) (FAIL) 7.8 0.20
20 (TIMEOUT) 154 (FAIL) 2.3 0.17
30 (TIMEOUT) (FAIL) (FAIL) (TIMEOUT) 0.54
50 (TIMEOUT) (FAIL) (FAIL) (TIMEOUT) 2.6
100 (TIMEOUT) (FAIL) (FAIL) (TIMEOUT) 38

Comment: this is effectively a test of multivariate algebraic number fields. It's also where Sage's QQBar starts to show its big weakness: since it only uses absolute fields internally for simplification, it becomes completely useless as soon as you mix different algebraic numbers that don't fit neatly into an absolute field. Other than that, the picture is similar to the previous: Sage's SymbolicRing and SymPy do a decent job for the few $N$ where they work, Mathematica runs just fine (although $N = 15$ takes over a minute for some reason), and Calcium handles everything easily.

Test C: $x_n = \log(n+2)$

Plot of timings
N Sage QQBar Sage SymbolicRing SymPy Mathematica Calcium
2 - 0.0094 0.0031 0.00019 0.00047
3 - 0.073 (FAIL) 0.051 0.0014
4 - 0.021 0.16 0.036 0.0021
5 - 0.60 (FAIL) 3.8 0.0050
6 - 0.51 (FAIL) 0.19 0.0028
7 - (FAIL) (FAIL) 20 0.0074
8 - 0.20 1.8 0.29 0.0059
9 - (FAIL) (FAIL) 11 0.0083
10 - 6.7 (FAIL) 7.2 0.010
11 - (FAIL) (FAIL) (TIMEOUT) 0.021
12 - 3.0 (FAIL) 0.42 0.061
13 - (FAIL) (FAIL) (TIMEOUT) 0.032
14 - (FAIL) (FAIL) 41 0.021
15 - (FAIL) (FAIL) (TIMEOUT) 0.030
16 - 44 17 0.66 0.025
17 - (FAIL) (FAIL) (TIMEOUT) 0.061
18 - (FAIL) (FAIL) 5.9 0.033
19 - (FAIL) (FAIL) (TIMEOUT) 0.082
20 - 136 (FAIL) 45 0.046
30 - (FAIL) (FAIL) (TIMEOUT) 0.19
50 - (FAIL) (FAIL) (TIMEOUT) 1.3
100 - (FAIL) (FAIL) (TIMEOUT) 26

Comment: similar to the previous but with transcendental numbers instead of algebraic numbers. Obviously, Sage's QQbar is no longer applicable. Mathematica struggles a bit here, perhaps because of relations between the logarithms.

Test D: $x_n = e^{2 \pi i / (n+2)}$

Plot of timings
N Sage QQBar Sage SymbolicRing SymPy Mathematica Calcium
2 0.0015 0.014 0.0038 0.00022 0.00051
3 0.0081 0.088 (FAIL) 0.024 0.00093
4 0.10 0.022 0.20 0.011 0.0034
5 0.16 1.1 (FAIL) 2.6 0.0040
6 (TIMEOUT) 1.1 (FAIL) 0.17 0.0078
7 (TIMEOUT) (FAIL) (FAIL) (TIMEOUT) 0.013
8 (TIMEOUT) 1.3 (FAIL) 0.10 0.019
9 (TIMEOUT) (FAIL) (FAIL) (TIMEOUT) 0.019
10 (TIMEOUT) 13 (FAIL) (TIMEOUT) 0.028
11 (TIMEOUT) (FAIL) (FAIL) (TIMEOUT) 0.093
12 (TIMEOUT) 7.9 (FAIL) 8.7 0.13
13 (TIMEOUT) (FAIL) (FAIL) (TIMEOUT) 0.12
14 (TIMEOUT) (FAIL) (FAIL) (TIMEOUT) 0.16
15 (TIMEOUT) (FAIL) (FAIL) (TIMEOUT) 0.23
16 (TIMEOUT) 78 (TIMEOUT) 0.41 0.32
17 (TIMEOUT) (FAIL) (FAIL) (TIMEOUT) 0.26
18 (TIMEOUT) (FAIL) (FAIL) (TIMEOUT) 0.26
19 (TIMEOUT) (FAIL) (FAIL) (TIMEOUT) 0.35
20 (TIMEOUT) 277 (FAIL) (TIMEOUT) 1.1
30 (TIMEOUT) (FAIL) (FAIL) (TIMEOUT) 1.8
50 (TIMEOUT) (FAIL) (FAIL) (TIMEOUT) 36
100 (TIMEOUT) (FAIL) (FAIL) (TIMEOUT) 699

Comment: not enough roots of unity in the DFT to make things interesting? Then throw in more roots of unity :-) At $N = 100$ Calcium finished just when I was about to hit CTRL+C.

Test E: $x_n = 1/(1+(n+2)\pi)$

Plot of timings
N Sage QQBar Sage SymbolicRing SymPy Mathematica Calcium
2 - 0.0093 0.0021 0.00030 0.00044
3 - 0.081 (FAIL) 0.38 0.00074
4 - 0.021 1.9 0.0081 0.00077
5 - 0.69 (FAIL) 0.67 0.0032
6 - 0.57 (FAIL) 0.11 0.0017
7 - (FAIL) (FAIL) 1.8 0.0072
8 - 0.68 17 0.21 0.0041
9 - (FAIL) (FAIL) 2.5 0.0078
10 - 8.0 (FAIL) 1.5 0.0099
11 - (FAIL) (FAIL) 6.9 0.031
12 - 3.7 (FAIL) 1.2 0.019
13 - (FAIL) (FAIL) 11.9 0.056
14 - (FAIL) (FAIL) 4.9 0.035
15 - (FAIL) (FAIL) 19 0.051
16 - 48 (TIMEOUT) 6.4 0.046
17 - (FAIL) (FAIL) 35 0.15
18 - (FAIL) (FAIL) 9.0 0.06
19 - (FAIL) (FAIL) 20 0.23
20 - 167 (FAIL) (TIMEOUT) 0.12
30 - (FAIL) (FAIL) (TIMEOUT) 0.66
50 - (FAIL) (FAIL) (TIMEOUT) 8.9
100 - (FAIL) (FAIL) (TIMEOUT) 216

Comment: throwing in a denominator so that we exercise rational function arithmetic.

Test F: $x_n = \frac{1}{1+\sqrt{n+2} \pi}$

Plot of timings
N Sage QQBar Sage SymbolicRing SymPy Mathematica Calcium
2 - 0.0095 0.0022 0.00028 0.00071
3 - 0.088 (FAIL) 0.057 0.0048
4 - 0.021 2.19 0.017 0.0049
5 - 0.93 (FAIL) 1.9 0.021
6 - 0.72 (FAIL) 0.39 0.030
7 - (FAIL) (FAIL) 9.6 0.087
8 - 0.76 22 2.6 0.082
9 - (FAIL) (FAIL) 32 0.18
10 - 178 (FAIL) 13 0.30
11 - (FAIL) (FAIL) (TIMEOUT) 0.93
12 - 11 (FAIL) (TIMEOUT) 0.83
13 - (FAIL) (FAIL) (TIMEOUT) 3.7
14 - (FAIL) (FAIL) (TIMEOUT) 2.5
15 - (FAIL) (FAIL) (TIMEOUT) 3.8
16 - (TIMEOUT) (TIMEOUT) (TIMEOUT) 8.1
17 - (FAIL) (FAIL) (TIMEOUT) 24
18 - (FAIL) (FAIL) (TIMEOUT) 23
19 - (FAIL) (FAIL) (TIMEOUT) 84
20 - (FAIL) (FAIL) (TIMEOUT) 43
30 - (FAIL) (FAIL) (TIMEOUT) 1720
50 - (FAIL) (FAIL) (TIMEOUT) (TIMEOUT)
100 - (FAIL) (FAIL) (TIMEOUT) (TIMEOUT)

Comment: similar to the previous example, except that the rational functions involve several variables.

Conclusion

Calcium turns out to do very well on this benchmark. I'm pleasantly surprised, because I expected that I'd have to do a lot more tweaking to even get it to work. I think to some extent this validates the design of Calcium. The usual disclaimer applies: Calcium is experimental, and future versions could be slower or faster. Notably, for small $N$, the performance from a cold start can still be improved; it's not really that important because most of the time you will want to reuse a context object, but it's a question of aesthetics :-)

Appendix: benchmarking code snippets

Just for reference.

check[NNN_] := Module[{NN = NNN},
ClearSystemCache[];
xs = Table[i+2, {i, 0, NN-1}];
ws = Table[Exp[2 Pi I k / NN], {k, 0, 2 NN - 1}];
ys = Table[Sum[xs[[k+1]]* ws[[1+Mod[-k*n, 2 NN]]], {k, 0, NN-1}], {n, 0, NN - 1}];
xs2 = Table[FullSimplify[Sum[ys[[k+1]]* ws[[1+Mod[k*n, 2 NN]]], {k, 0, NN-1}] / NN == xs[[n+1]]], {n, 0, NN - 1}]];
n=11; reps=1; {n, Timing[Do[check[n],{i,1,reps}]][[1]]/reps}


def benchSymPy(N, which=0, verbose=False):
    if which == 0:
        x = [S(i+2) for i in range(N)]
    elif which == 1:
        x = [sqrt(i+2) for i in range(N)]
    elif which == 2:
        x = [log(i+2) for i in range(N)]
    elif which == 3:
        x = [exp(2*pi*I/(i+2)) for i in range(N)]
    elif which == 4:
        x = [1/(1+(i+1)*pi) for i in range(N)]
    elif which == 5:
        x = [1/(1+sqrt(i+1)*pi) for i in range(N)]
    w1 = exp(2*pi*I/N)
    w = [w1**i for i in range(2*N)]
    y = [S(0) for i in range(N)]
    z = [S(0) for i in range(N)]
    for k in range(N):
        for n in range(N):
            y[k] += x[n] * w[(-k * n) % (2*N)]
    for k in range(N):
        for n in range(N):
            z[k] += y[n] * w[(k * n) % (2*N)]
        z[k] /= N
        eq = simplify(z[k] - x[k])
        if eq != 0:
            print(eq)
            raise ValueError


def benchQQbar(N, which=0):
    if which == 0:
        x = [QQbar(i+2) for i in range(N)]
    elif which == 1:
        x = [QQbar(i+2).sqrt() for i in range(N)]
    elif which == 3:
        x = [QQbar.zeta(i+2) for i in range(N)]
    w1 = QQbar.zeta(N)
    w = [w1**i for i in range(2*N)]
    y = [QQbar(0) for i in range(N)]
    z = [QQbar(0) for i in range(N)]
    for k in range(N):
        for n in range(N):
            y[k] += x[n] * w[(-k * n) % (2*N)]
    for k in range(N):
        for n in range(N):
            z[k] += y[n] * w[(k * n) % (2*N)]
        z[k] /= N
        if z[k] - x[k] != 0:
            raise ValueError

def benchSR(N, which=0, verbose=False):
    if which == 0:
        x = [SR(i+2) for i in range(N)]
    elif which == 1:
        x = [SR(i+2).sqrt() for i in range(N)]
    elif which == 2:
        x = [SR(i+2).log() for i in range(N)]
    elif which == 3:
        x = [SR(exp(2*pi*I/(i+2))) for i in range(N)]
    elif which == 4:
        x = [SR(1/(1+(i+1)*pi)) for i in range(N)]
    elif which == 5:
        x = [SR(1/(1+sqrt(i+1)*pi)) for i in range(N)]
    w1 = SR(exp(2*pi*I/N))
    w = [w1**i for i in range(2*N)]
    y = [SR(0) for i in range(N)]
    z = [SR(0) for i in range(N)]
    for k in range(N):
        for n in range(N):
            y[k] += x[n] * w[(-k * n) % (2*N)]
    for k in range(N):
        for n in range(N):
            z[k] += y[n] * w[(k * n) % (2*N)]
        z[k] /= N
        eq = (z[k] - x[k]).full_simplify()
        if eq != 0:
            raise ValueError

Update: Daniel Schultz suggested using omega = ToNumberField[(-1)^(2/NN)]; ws = omega^Range[0, 2*NN-1]; in Mathematica. This seems to help on test A (expectedly) while it slows down the others.