August 26, 2016

The preprint Short addition sequences for theta functions (arXiv:1608.06810) by Andreas Enge, Bill Hart and myself is now available. We prove some nice new theorems about integer sequences of a very simple kind, namely successive values of quadratic polynomials such as the ordinary squares ($f(k) = k^2$). The purpose is actually to use the structure present in these sequences to speed up numerical evaluation of classical modular forms and functions such as the Dedekind eta function $\eta(\tau)$ and the j-invariant $j(\tau)$. The latest versions of Arb and CM implement the methods in the paper, resulting in a 4x speedup in practice over the classical method for evaluating these functions.

## The eta function

Consider the series expansion of the eta function: $$\eta(\tau) = q^{1/24} \sum_{k=-\infty}^{\infty} (-1)^k q^{k(3k-1)/2} = q^{1/24} \left[ 1 - q - q^2 + q^5 + q^7 - q^{12} - q^{15} + \ldots \right]$$ The exponents $0, 1, 2, 5, 7, 12, 15, 22, 26, 35, 40, \ldots$ are the generalized pentagonal numbers, of Euler fame. Our goal is to compute the occurring powers of $q$ using as few complex multiplications as possible ($q = e^{2\pi i \tau}$ is a given complex number). Equivalently, we want to generate the occurring exponents using as few additions as possible: if $c = a + b$, then we get $q^c$ by computing $q^a \cdot q^b$. With different terminology, we want to find a short addition sequence that includes all the generalized pentagonal numbers as a subsequence. This is a variation of the well-known problem of finding a short addition chain to compute a single power $q^k$, but instead of considering a single exponent, we consider a sequence of exponents together.

The classical solution for the eta series is to introduce auxiliary sequences of finite differences: $$5 - 1 = 4, \quad 12 - 5 = 7, \quad 22 - 12 = 10, \ldots$$ $$7 - 2 = 5, \quad 15 - 7 = 8, \quad 26 - 15 = 11, \ldots$$ Since these sequences can be generated using one addition per term (they have a constant step size 3), we need two additions per term to construct the generalized pentagonal number sequence. In other words, we need $2N$ multiplications to compute the first $N$ terms of the eta function's $q$-series. However, inspection shows that some of the generalized pentagonal numbers can be written as the sum of two smaller generalized pentagonal numbers without relying on the auxiliary sequences. For example, $2 = 1 + 1, 7 = 2 + 5, 12 = 5 + 7, 22 = 7 + 15$. This doesn't work for all entries, but if it works for most entries, and if the remaining entries can be computed quickly enough, then asymptotically we should be able to get away with just $N + o(N)$ multiplications. We show that this, essentially, is the case:

Theorem: Assuming a standard conjecture about primes (Bateman-Horn), almost all generalized pentagonal numbers $c$ can be written as $c = a+b$ where $a,b$ are smaller generalized pentagonal numbers; the probability that no such representation exists is heuristically $O(1/\log c)$.

Theorem: Every generalized pentagonal number $c \ge 2$ can be written as $c = 2a+b$ where $a, b$ are smaller generalized pentagonal numbers.

In other words, we can compute each new term $q^c$ in the eta function's $q$-series as $q^c = (q^a)^2 \cdot q^b$, where $q^a$ and $q^b$ are previously computed terms. Moreover, if we use this only as a fallback and try $q^c = q^a \cdot q^b$ first, then (assuming Bateman-Horn) computing the first $N$ terms takes only $N + O(N / \log N)$ multiplications, compared to $2N$ with the classical method using finite differences.

You can easily check that solutions of $c=2a+b$ exist for the first few $c$. Here is some Python code and the output for $c < 1000$:

from itertools import takewhile

def pentagonal():
a = 0; b = 1; c = 2; d = 4
while 1:
yield a; a += c; c += 3
yield b; b += d; d += 3

# naive code; the search can easily be done much faster
def solutions(c):
return [(a,b) for a in takewhile(lambda a: a < c, pentagonal())
for b in takewhile(lambda b: b < c, pentagonal())
if 2*a + b == c]

for c in takewhile(lambda c: c < 1000, pentagonal()):
print c, solutions(c)

0 []
1 []
2 [(1, 0)]
5 [(2, 1)]
7 [(1, 5)]
12 [(5, 2)]
15 [(5, 5), (7, 1)]
22 [(5, 12)]
26 [(2, 22), (7, 12), (12, 2)]
35 [(15, 5)]
40 [(7, 26)]
51 [(22, 7)]
57 [(26, 5)]
70 [(15, 40), (22, 26), (35, 0)]
77 [(35, 7)]
92 [(26, 40), (35, 22), (40, 12)]
100 [(15, 70)]
117 [(51, 15)]
126 [(57, 12)]
145 [(70, 5)]
155 [(5, 145), (70, 15), (77, 1)]
176 [(77, 22)]
187 [(35, 117)]
210 [(70, 70), (92, 26)]
222 [(100, 22)]
247 [(51, 145)]
260 [(117, 26)]
287 [(126, 35)]
301 [(7, 287), (57, 187), (92, 117)]
330 [(35, 260), (77, 176), (145, 40)]
345 [(22, 301), (100, 145), (155, 35)]
376 [(77, 222), (100, 176), (187, 2)]
392 [(176, 40)]
425 [(40, 345), (187, 51), (210, 5)]
442 [(210, 22)]
477 [(26, 425), (145, 187), (210, 57)]
495 [(35, 425), (222, 51), (247, 1)]
532 [(70, 392), (155, 222), (260, 12)]
551 [(247, 57)]
590 [(260, 70)]
610 [(117, 376)]
651 [(287, 77)]
672 [(70, 532), (301, 70), (330, 12)]
715 [(145, 425)]
737 [(330, 77)]
782 [(345, 92)]
805 [(77, 651), (155, 495), (330, 145)]
852 [(35, 782), (376, 100), (425, 2)]
876 [(12, 852), (392, 92), (425, 26)]
925 [(187, 551)]
950 [(287, 376), (345, 260), (425, 100)]


It's a bit surprising that this always works. Many $c$ have just a single solution; there is no asymptotic trend to suggest that the first few cases are anything but coincidences. Meanwhile, some $c$ do have many solutions, so there is no obvious bijection at work either. We discovered the result empirically: while tracing some code that tried several different term combinations for constructing addition sequences, we noticed that the $2a+b$ branch alone worked for the first several thousand $c$. Proving the conjecture then took a bit of work (the proof uses mostly basic algebraic number theory, but it's not trivial).

## The same for theta

We also prove similar results relevant for the computation of Jacobi theta functions. Here we are mainly interested in the so-called theta constants which are given by the three series $$\sum_{k=1}^{\infty} q^{k^2}, \quad \sum_{k=1}^{\infty} (-1)^k q^{k^2}, \quad \sum_{k=1}^{\infty} (-1)^k q^{k(k+1)}.$$ These theta series are not the most general possible, but they are sufficient for computation of $\operatorname{SL}_2(\mathbb{Z})$ modular forms and functions, and for special values of elliptic functions. The exponents $k^2 = 1,4,9,\ldots$ are of course the squares, and we call $k(k+1) = 2,6,12,\ldots$ the trigonal numbers (these are twice the more well-known triangular numbers).

Just like with the pentagonal numbers, the classical algorithm to evaluate any one of the theta series uses the differences between successive exponents. This requires $2N$ multiplications for $N$ terms. We prove (again subject to Bateman-Horn) that $N + O(N / \log N)$ multiplications suffice. Our counterparts of the "$2a+b$" theorem for pentagonal numbers are the following:

Theorem: Every trigonal number $c \ge 6$ is the sum of at most three smaller ones.

Theorem: Every almost-square $c \ge 24$ is the sum of at most three smaller almost-squares (an almost-square is an integer $c = k^2 - 1$).

In practice, the three theta series are usually needed together, so it's useful to consider the quarter-squares $1, 2, 4, 6, 9, 12, 16, \ldots$ formed by interleaving squares and trigonal numbers. We show the following:

Theorem: Every quarter-square $c > 1$ is the sum of a smaller one and twice a smaller one.

We can do a little better yet if we interleave the exponents $k(k+1)$ with $(2k+1)^2-1$ and $(2k)^2$, where we have split the squares into the odd and even terms and peeled off one factor $q$ from the powers of $q$ with an odd exponent so that all exponents are even. The interleaved sequence $2, 4, 6, 8, 12, 16, 20, 24, 30, 36, \ldots$ can be written in closed form as $2 \lfloor k^2 / 8 \rfloor$. We show the following:

Theorem: Every element $c \ge 4$ in the sequence $2 \lfloor k^2 / 8 \rfloor$ is the sum of two smaller ones.

The last theorem shows that we can compute all three theta series to $N$ terms each using $2N$ multiplications altogether, in contrast to the $4N$ multiplications required with the classical finite difference method. Note that this complexity result does not depend on any conjectures.

## Baby steps

There is one more trick before we get to the end of the paper. We clearly need at least $N$ multiplications to compute $N$ distinct powers of $q$, but we might get away with fewer multiplications to compute a sum of powers if we can avoid computing every single power explicitly.

The idea of the baby-step giant-step algorithm for polynomial evaluation, also called rectangular splitting, is to write \begin{align*} \sum_{k=0}^T c_k q^k & = [c_0 + c_1 q + \ldots + c_{m-1} q^{m-1}] \\ & + [c_m + c_{m+1} q + \ldots + c_{2m-1} q^{m-1}] q^m \\ & + [c_{2m} + c_{2m+1} q + \ldots + c_{3m-1} q^{m-1}] q^{2m} \\ & \vdots \\ \end{align*} for some tuning parameter $m$. The coefficients $c_k$ are assumed to be cheap to multiply with, in our case $c_k \in \{-1,0,+1\}$. We need about $m + T / m$ complex multiplications with this scheme: $m$ to precompute the powers $q^2, q^3, \ldots q^{m-1}$ which allows evaluating all the polynomials inside brackets quickly, and $T / m$ complex multiplications to evaluate the outer polynomial in $q^m$. In the dense case, that is when most $c_k \ne 0$, it is optimal to take $m \approx T^{1/2}$, giving $2 T^{1/2}$ total complex multiplications. The truncated eta and theta series are very sparse, however: we have only $N \sim T^{1/2}$ nonzero $c_k$ for $k \le T$, and it turns out that this choice of $m$ gives no improvement over the classical approach of computing all powers explicitly.

In the sparse case, we can do better by choosing $m$ in such a way that only a subset of $q^2, q^3, \ldots, q^{m-1}$ appear with a nonzero coefficient in the inner polynomials. This means that we want to find parameters $m$ such that the squares, trigonal numbers or generalized pentagonal numbers respectively take few distinct values modulo $m$. For example, if $s(m)$ is the number of distinct square residues modulo $m$, we need about $s(m) + N^2 / m$ complex multiplications to evaluate $\sum_{k=1}^N q^{k^2}$ rather than $m + N^2 / m$. It now makes sense to look for $m$ such that $s(m) / m$ is small. The list of optimal $m$ values for the sequence of squares is OEIS A085635, also listed in Table 4 in the paper. For example, $s(3600) = 176$, so if we pick $m = 3600$, only $176 / 3600 \approx 5\%$ of the powers $q^2, q^3, \ldots, q^{m-1}$ need to be computed.

In the paper, we determine how to choose $m$ nearly optimally for an arbitrary quadratic exponent sequence $F(k) = Ak^2 + Bk + C \in \mathbb{Q}[k]$, giving squares, trigonal numbers and pentagonal numbers as a special case, and we show that $\sum_{k=1}^N q^{F(k)}$ can be computed using only $O(N / \log N)$ multiplications by this approach. Actually, the asymptotic complexity is slightly better than this: it is $N^{1 - c / \log \log N}$ for a certain constant $c > 0$, and this is better than $O(N / \log^r N)$ for any $r > 0$.

## The conclusion

To summarize, we have gone from $2N$ (the classical method) to $N + o(N)$ (short addition sequences) to $O(N / \log N)$ (baby-step giant-step) multiplications for summing the first $N$ terms in the $q$-series of the Dedekind eta function or the Jacobi theta function constants.

Asymptotically, arithmetic-geometric mean (AGM) iteration is better than evaluation of the $q$-series, as it only costs $O(\log N)$ multiplications. However, the AGM method has a large implied big-O constant, so the $q$-series are usually faster in practice, particularly when optimized.

The end of the paper presents some benchmark results. Here is a copy of the final table, which shows the time in seconds to compute one value of the Dedekind eta function to varying levels of precision:

 Bits Pari/GP CM-0.2.1 AGM New CM-0.3 New Arb $10^4$ 0.00869 0.00457 0.00789 0.00297 0.00272 $10^5$ 0.654 0.284 0.245 0.164 0.147 $10^6$ 29.9 10.9 6.78 4.77 4.85 $10^7$ 1310 440 124 150 151

Both CM-0.3 and Arb use the new baby-step giant-step algorithm, and have nearly identical timings. The old CM-0.2.1 uses a short addition sequence to evaluate the $q$-series using approximately one multiplication per term. Pari/GP uses the classical method. In practice, about a factor two is saved both when going from Pari/GP to CM-0.2.1. and from CM-0.2.1 to CM-0.3.

An implementation of the AGM method is also timed for comparison. Note that the crossover for the AGM method is less than $10^4$ with the unoptimized $q$-series evaluation in Pari/GP but close to $10^7$ bits with the optimized $q$-series evaluation in the latest CM and Arb.

In other settings, baby-step giant-step techniques can easily save a factor 100. The reason why the improvement is not so dramatic here is that the $q$-series are so sparse that it takes very little work to evaluate them even with naive methods. There just isn't much work left to remove by being clever! Nonetheless, in applications where we need tens of thousands of function values, saving a small factor can make a difference.

This paper generated a few interesting "new" integer sequences. We should probably submit them to OEIS...