Flat-packed symbolic expressions

January 15, 2021

I have added a new module to Calcium for working with symbolic expressions, called fexpr (https://fredrikj.net/calcium/fexpr.html).

This is still highly experimental, with most features yet to be added. The most experimental aspect is that I'm using a "flat-packed" internal representation (hence the "f" prefix).

Flat-packing

The fexpr_t type stores an expression as a single, flat array of words with all subexpressions packed contiguously. For example, $(-3) + (5 \cdot x)$ or Add(-3, Mul(5, x)) is stored as a 7-word array, where the words encode the following data:

[Call2, Add, -3, Call2, Mul, 5, x]

Expressions are composed of atoms (symbols, integers and strings) and function calls. In the example, Call2 means a function call $f(a,b)$ with two arguments; this structural marker is followed by the subexpression for $f$ (here the atomic symbol Add) and the subexpressions for the arguments $a$ and $b$ (here the atomic integer -3 and the composite subexpression Mul(5, x)). The first word in each subexpression encodes that subexpression's size in words, so that the tree structure is unambiguous. In the example, the sizes are as follows: [7, 1, 1, 4, 1, 1, 1]. Here is an illustration (imagine nested boxes of furniture parts):

Small integers and builtin or short symbols are encoded in single words. Function calls with a large number of arguments, long symbol names, long strings and multi-limb integers require packing into multiple words (I will omit the details). I first considered packing into bytes instead of full words to compress the data further, but this seems to be an implementation nightmare, so words it is.

The conventional way to represent symbolic expressions in memory is using pointers, with a separate heap-allocated block of memory for each subexpression. This is typically combined with automatic memory management (garbage collection or reference counting), and a global hash table is often used to maintain unique copies of subexpressions that appear with repetition.

Flat-packing has a clear benefit: the implementation is nearly trivial. Basic operations boil down to comparing, copying and concatenating flat arrays of words (plus some fiddling to encode sizes), without the need for any memory management apart from allocating sufficient space for the array. There is no need for garbage collection or hidden global state. Expression data can be shared freely in binary form between threads and even between machines (as long as all machines have the same word size and endianness).

It also has some clear drawbacks. Mutation generally requires rebuilding the whole expression, so it is expensive to do incremental transformations. Read-only access to subexpressions of a constant expression is cheap, however, as it can be done using pointers into an existing array. Repeated subexpressions will be repeated in memory, though applications can work around this by replacing repeated composite expressions with symbol aliases.

Flat-packing is somewhat analogous to representing strings using arrays of characters instead of something like linked lists or ropes which allow more efficient in-place editing. Weighing the pros and cons, here is my rationale for flat-packing:

Just like strings, symbolic expressions are typically small (of order 10 to 1000 nodes). Comparing and copying short, flat arrays of words should be cheap compared to pointer chasing, memory management and hash table operations.
The main purpose of symbolic expressions is to act as one of the following:
- an interface between the user and computational data types (e.g. Calcium field elements, matrices, polynomials),
- an intermediate layer for conversion between computational data types,
- a format for encoding static objects,
- a format for serialization or input/output.
Most computations will be done by reading expressions, carrying out the encoded operations using computational types, and then (optionally) converting back to symbolic expressions. Since kernel operations (e.g. polynomial multiplication) are not done on symbolic expressions directly, it is thus not vitally important to support efficient in-place mutation. (This is in contrast to a system like Mathematica, where symbolic expressions are the universal data type used for everything.)
A high-performance, multithreaded memory manager and hash table designed for symbolic expressions would ultimately probably be faster, but I'm too lazy to write (and especially debug) one.

It is still too early to do any benchmarking. My assumptions about performance may turn out to be completely wrong. If this representation turns out to be inefficient, it should be possible to switch to a pointer-based format in the future, but for the moment I will try this data structure and see what happens.

Operations

So far I have implemented two interesting operations on symbolic expressions.

Numerical evaluation

Rigorous numerical evaluation of constant expressions is quite straightforward using Arb and a tree traversal. To demonstrate the functionality, I have added a Python wrapper for the C code, which outputs a decimal string:

>>> Exp = fexpr("Exp"); Log = fexpr("Log")
>>> Log(-2).nstr()
'0.6931471805599453 + 3.141592653589793*I'
>>> Exp(Log(-2)).nstr()
'-2.000000000000000 + 0e-22*I'
>>> Exp(Log(-2)).nstr(30)
'-2.00000000000000000000000000000 + 0e-36*I'
>>> LambertW = fexpr("LambertW")
>>> (LambertW(3) * Exp(LambertW(3))).nstr()
'3.000000000000000'
>>> DedekindEta = fexpr("DedekindEta"); Sqrt = fexpr("Sqrt")
>>> DedekindEta(Sqrt(-1)).nstr()
'0.7682254223260567'

It just remains to wrap 100+ more Arb and Flint functions and to implement various output formats. Eventually this will be able to do more interesting things like evaluating integrals and infinite series, requiring both symbolic and numerical processing. Many optimizations are possible (clever adaptive precision, and caching, for example).

Expansion

I have implemented a basic function to expand arithmetic expressions to a normal form. The algorithm traverses the expression tree and evaluates it as a multivariate rational function (fmpz_mpoly_q), treating any non-arithmetic subexpression (for example, x or Sqrt(2)) as a formal variable. It finally converts back to a symbolic expression:

>>> x = fexpr("x"); y = fexpr("y")
>>> (x / x**2).expanded_normal_form()
Div(1, x)
>>> (1/((1/y + 1/x))).expanded_normal_form()
Div(Mul(x, y), Add(x, y))
>>> ((x**2 - y**2) / (x - y)).expanded_normal_form()
Add(x, y)
>>> ((((x ** 0) + 3) ** 5) / 12).expanded_normal_form()
Div(256, 3)
>>> ((x+y+1)**3 - (y+1)**3 - (x+y)**3 - (x+1)**3).expanded_normal_form()
Add(Mul(-1, Pow(x, 3)), Mul(6, x, y), Mul(-1, Pow(y, 3)), -1)
>>> len(str(((1+x+y)**1000 * (x+y)).expanded_normal_form()))
196884777
>>> ((x + Sqrt(2))**2).expanded_normal_form()
Add(Pow(Sqrt(2), 2), Mul(2, Sqrt(2), x), Pow(x, 2))

This is somewhat limited in scope since it only handles rational coefficients and doesn't account for algebraic relations between the formal variables (as seen in the last example, where it doesn't know that $((\sqrt{2})^2 = 2)$. Much more interesting, of course, will be to do evaluation/simplification/normal forms with symbolic relations and with Calcium numbers. The expansion algorithm can also be improved: it doesn't exploit symbolic factors in the input (consider $(x+y)^{1000} / (x+y)^{999}$), and conversion from an expanded expression to fmpz_mpoly is done term by term which is $O(n^2)$ instead of something more reasonable like $O(n \log n)$. Such things can all be fixed, of course.

Things to do

Two years ago, I started working on a simple symbolics library in Python as part of the backend code for Fungrim. That code has served its purpose reasonably well, but it has some limitations and the codebase is a mess in need of a complete rewrite. My current plan is to do a clean and robust rewrite of all Python functionality in C (within Calcium). Things I will have to reimplement in C include:

LaTeX output.
Evaluation with variable-binding (allowing things like summations, set comprehensions, local symbol definitions).
Evaluation of predicates.
Handling assumptions (e.g. $x > 0$).
Randomized testing.
Miscellaneous symbolic simplifications.

I might not be able to do everything at once, but if everything goes well there will be a new Calcium version with basic symbolic computation support in a couple of months.

fredrikj.net | Blog index | RSS feed | Follow me on Mastodon | Become a sponsor