fredrikj.net / blog /

# The symbolic formula language in Calcium and Fungrim

*January 31, 2021*

In a recent post, I wrote about my approach for the low-level representation of symbolic expressions in C. Today, I will discuss the design of the formula language for representing mathematical content on top of symbolic expressions. This language consists of predefined ("builtin") symbol names and associated calling conventions for objects, functions and operators.

I originally developed the formula language for Fungrim, for which I wrote a Python implementation to handle LaTeX output and basic symbolic evaluation. I am now repurposing the language for Calcium, which will entail translating the backend from Python to C and also greatly extending the computational functionality. I'm currently about 80% done reimplementing the expression-to-LaTeX converter in C; you can see the current test suite output for the C version here.

The formula language is inspired by the Wolfram Language
and by conventional mathematical notation,
and to a lesser extent by other
computer algebra systems, Python, and LaTeX.
It is primarily designed for expressing formulas,
and *not* for general purpose programming
(though it does allow encoding simple functional programs).
This immediately rules out many constructs that would be
convenient for programming but which just don't map well
to conventional mathematical notation.
Another constraint is that I want the LaTeX conversion
to be context-free; in general, the rendered output should
depend only on the symbols and their positions in function calls,
and not on the values assigned to variables in a surrounding context.

As part of the reimplementation work, I'm making tweaks to the language in an attempt to remove warts and improve clarity and versatility. Besides providing an overview of the formula language, this post will discuss some of the unsolved design problems.

## Expression composition

The underlying symbolic expression format consists of atoms
and composite expressions. Atoms are symbols (`x`, `Integral`), strings (`"Hello"`) or integers (`-12345`).
Composite expressions are formal function calls (`f(x1, x2, ..., xn)`).
All formulas must be represented using these primitives; complex atomic
datatypes are excluded by design to make symbolic expressions easy
to store, traverse and convert to other formats.
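As a rough illustration (my own sketch, not the actual Calcium/Fungrim implementation; the class and helper names here are hypothetical), the whole format can be modeled in a few lines of Python, with atoms restricted to symbols and integers for brevity:

```python
# Toy model of the expression format: an expression is either an atom
# (a symbol name or an integer) or a formal function call head(args...).

class Expr:
    def __init__(self, head=None, args=None, atom=None):
        self.head = head    # Expr, for composite expressions
        self.args = args    # list of Expr, for composite expressions
        self.atom = atom    # str (symbol) or int, for atoms

    def __call__(self, *args):
        # f(x1, ..., xn) builds a composite expression
        return Expr(head=self, args=list(args))

    def __repr__(self):
        if self.atom is not None:
            return str(self.atom)
        return "%r(%s)" % (self.head, ", ".join(map(repr, self.args)))

Add = Expr(atom="Add")
Mul = Expr(atom="Mul")
x = Expr(atom="x")
y = Expr(atom="y")

expr = Add(x, Mul(y, Expr(atom=3)))
print(repr(expr))   # Add(x, Mul(y, 3))
```

Because composite expressions are just (head, arguments) pairs, traversal and conversion to other formats reduce to a single recursive case plus the atom base cases.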

## Names

Builtin objects and operators have CamelCase names.
Most objects have verbose names
(for example, `BesselJ(n, x)` for $J_n(x)$, `ClosedOpenInterval(a, b)` for $[a, b)$,
`IdentityMatrix(n)` for $I_n$),
with the exception of some very common objects where an
abbreviated name is clearly recognizable (for example, `Mul(x, y, z)` for $xyz$,
`Abs(x)`
for $|x|$, and `ZZ` for $\mathbb{Z}$).

The CamelCase convention is taken from Wolfram, and of course coincides with the convention for naming classes in many object-oriented languages such as Python. The reason is mainly aesthetic; I don't prefer CamelCase in most contexts, but it feels natural for naming mathematical objects in symbolic expressions (perhaps because they look more "official" that way). It also has some practical benefits:

- It unambiguously leaves all symbol names starting with a lowercase letter for user symbols. (I am still undecided about allowing users to shadow builtin symbol names.)
- There is no confusion over whether multi-word names contain underscores (only
`IsPrime` is possible; there is no need to remember whether it is `isprime` or `is_prime`).

All lowercase letters `a-z` and uppercase letters `A-Z` are
reserved for user variables.
Names such as `alpha`, `beta`, etc. give the Greek
symbols $\alpha$, $\beta$, etc. when rendered to LaTeX.
As a special case, I decided to use the alternative spelling
`lamda` for $\lambda$ to avoid an inconvenient
clash with a keyword in Python (there are some other collisions of this kind;
for example, `True`/`False` collide with the boolean constants
in Python, but with automatic conversions this turns out
to be a very minor problem).

A cute hack I came up with is to recognize underscore-suffixed
symbol names as a way to render subscripts automatically for user variables.
For example, `a(n)` renders as $a(n)$ in LaTeX,
while `a_(n)` renders as $a_n$. You can even write `gamma_(n,m)(x,y,z)`
for $\gamma_{n,m}(x,y,z)$.
It's totally a hack (there's some risk of confusion, since `a` and `a_` are separate variables
and not implicitly related), but writing down sequences
this way felt natural the moment I started using it in Fungrim, so it will stay.
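A toy version of the rendering rule might look like the following (my own sketch, not code from the actual converter, which must also handle Greek-letter substitution such as `gamma` $\to \gamma$ and curried calls like `gamma_(n,m)(x,y,z)`):

```python
# Toy rendering rule for the underscore-subscript hack: a call whose head
# is an underscore-suffixed symbol renders its arguments as a subscript.

def render_call(name, args):
    """Render a symbolic call name(args...) to LaTeX, treating a trailing
    underscore in the head name as a request for subscript notation."""
    argstr = ", ".join(args)
    if name.endswith("_"):
        return "%s_{%s}" % (name[:-1], argstr)
    return "%s(%s)" % (name, argstr)

print(render_call("a", ["n"]))             # a(n)
print(render_call("a_", ["n"]))            # a_{n}
print(render_call("gamma_", ["n", "m"]))   # gamma_{n, m}
```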

## Naming things

Naming things is hard. Sometimes very hard. Some of the builtin names I have chosen (in many cases, copied directly from Wolfram) are perfect, but some are clumsy or mutually inconsistent and I'm sure I will continue renaming things for a long time. This is one area where feedback from users would be extremely useful.

Here is an example of a rather silly naming problem I'm wrestling with:
the naming of the imaginary unit $i$.
`ImaginaryUnit` would be perfect, except that it is extremely long
for an object that is so common.
I'm currently using `NumberI`, which is explicit enough
but still relatively long and not very standard.
`Im` would have been acceptable, but it is used for
the imaginary part function, an operation too common to be given the longer name `ImaginaryPart`.
Wolfram and many other systems reserve `I` and/or `i` to denote the imaginary unit,
but these are useful variable names.
Question: is the imaginary unit important enough to reserve `I`
as a builtin constant, forcing the user to use some nonobvious symbol name
(`upperI` or `varI` perhaps) if they want a variable named $I$?
Note that it doesn't make sense to use `I` for the imaginary
unit and also allow the user to shadow it as a variable, because the LaTeX
output should display the imaginary unit as $i$, and it is
meant to be context-free.

The same issue exists for $e$, but it is less important to have a short name
here since `Exp(x)` works well for denoting the exponential function;
the number is rarely needed on its own.
I already decided to reserve `Pi` for the number $\pi$ and `Gamma`
for the gamma function; the uppercase Greek letters $\Pi$ and $\Gamma$
(as variables) are currently named `GreekPi` and `GreekGamma`.
`N` should in any case remain a variable name; the
Wolfram language always annoys me
when I can't sum `n` from 0 to `N`.

## Arithmetic expressions

Arithmetic operations and mathematical functions
(`Add`, `Mul`, `Div`,
`Pow`, `Sqrt`, `Exp`, `Log`, `Gamma`, etc.)
are represented using simple function calls.
There is no notion of infix operators in the expression format itself,
but infix syntax can be handled at a parser level or in
wrappers. For example, typing `x + y * z` in Python generates `Add(x, Mul(y, z))`.

The LaTeX converter inserts parentheses automatically to indicate
the correct precedence in displayed formulas. It also removes some parentheses and
signs where they are redundant or would look ugly, by assuming associativity.
For example, `Add(x, y, z)` and `Add(x, Add(y, z))`
both render as $x + y + z$, and
`Add(x, Div(-2, 3))` renders as $x - \frac{2}{3}$
rather than as $x + \frac{-2}{3}$. This behavior will be configurable
for applications where it is vital to display the exact internal structure of arithmetic
expressions.
It is also possible to insert manual parentheses and brackets
to clarify grouping; for example,
`Add(x, Div(Parentheses(-2), 3))`
renders as $x + \frac{\left(-2\right)}{3}$ (semantically,
`Parentheses` can be understood as representing the identity function).
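To illustrate the sign absorption, here is a toy sketch (using plain Python tuples as stand-in expressions; this is my own illustration, not the real converter):

```python
# Toy sketch of sign absorption in the LaTeX converter: a negative term
# inside an Add is rendered with a leading minus instead of "+ (-...)".

def render_term(t):
    if isinstance(t, int):
        return str(t)
    if isinstance(t, str):
        return t
    head, *args = t
    if head == "Div":
        return r"\frac{%s}{%s}" % (render_term(args[0]), render_term(args[1]))
    raise NotImplementedError(head)

def is_negative(t):
    # A term counts as negative if it is a negative integer, or a Div
    # whose numerator is negative.
    if isinstance(t, int):
        return t < 0
    if isinstance(t, tuple) and t[0] == "Div":
        return is_negative(t[1])
    return False

def negate(t):
    if isinstance(t, int):
        return -t
    if isinstance(t, tuple) and t[0] == "Div":
        return ("Div", negate(t[1]), t[2])
    raise NotImplementedError(t)

def render_add(terms):
    out = render_term(terms[0])
    for t in terms[1:]:
        if is_negative(t):
            out += " - " + render_term(negate(t))
        else:
            out += " + " + render_term(t)
    return out

print(render_add(["x", ("Div", -2, 3)]))   # x - \frac{2}{3}
```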

The representation of arithmetic expressions is intended to be
simple, not to be as compact as possible. The monomial
`Mul(3, Pow(x, 2), Pow(y, 4))` ($3 x^2 y^4$)
could be encoded more compactly as something like `Monomial(3, 1, x, 2, y, 4)`.
An entire polynomial could be encoded even more compactly
using arrays of coefficients and exponents.
Such constructions might be added at a later date if they
turn out to be essential for performance (for the moment, I'm assuming
that Flint and Calcium polynomial types will be used for polynomial computations
and that symbolic expressions will be used only as an interface to those
types, so squeezing out the last inch of performance is not essential).

## Operators and generator expressions

*Operators* (not to be confused with arithmetic operators)
are builtin symbols that
express some transformation applied to a function or a set.
Examples include `Sum`, `DivisorSum`,
`PrimeSum`, `Product`, `Integral`,
`Limit`, `Derivative`,
`Minimum`, `Supremum`, `ArgMax`,
`Solutions`, `UniqueSolution`, `Zeros`, and others.
Rather than acting as normal functions taking constant values as input,
operators interpret some of their input
expressions as functions with respect to locally
bound variables.

All operators use the same syntax for binding variables: the special `For`-expression.
The generator expression `For(n, ...)` binds `n` as a dummy variable
in the scope of the parent expression. The additional arguments `...`
specify the range of the iteration (lower or upper bounds, iteration set, etc.)
or localization (evaluation point, etc.); the detailed interpretation
of `...` depends on the operator.
For example, `Sum` and `Product` can be called with one or two
parameters for the `For`-expression. Two parameters define
lower and upper bounds:

`Sum(f(n), For(n, a, b))`$$\sum_{n=a}^{b} f(n)$$

A single parameter specifies a set:

`Sum(f(n), For(n, ZZ))`$$\sum_{n \in \mathbb{Z}} f(n)$$

The expression for the summand and the generator expression can be followed by an optional predicate restricting the range of the iteration:

`Sum(f(n), For(n, ZZ), NotEqual(n, 0))`$$\sum_{\textstyle{n \in \mathbb{Z} \atop n \ne 0}} f(n)$$

(Another way to express the same summation is to specify
`SetMinus(ZZ, Set(0))` or $\mathbb{Z} \setminus \{0\}$ as the set.)
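To make the intended semantics concrete, here is a toy evaluator sketch in Python (not the Calcium API), with callables standing in for the symbolic summand and predicate after substitution of the bound variable:

```python
# Toy evaluation of Sum(f(n), For(n, a, b), predicate): the For-expression
# supplies the bounds, and the optional predicate filters the range.

def eval_sum(body, a, b, predicate=None):
    """body and predicate are Python callables standing in for symbolic
    expressions with the bound variable substituted."""
    total = 0
    for n in range(a, b + 1):
        if predicate is None or predicate(n):
            total += body(n)
    return total

# Sum(n^2, For(n, 1, 4))  ->  1 + 4 + 9 + 16 = 30
print(eval_sum(lambda n: n**2, 1, 4))
# Sum(1/n, For(n, -2, 2), NotEqual(n, 0))  ->  0.0
print(eval_sum(lambda n: 1/n, -2, 2, predicate=lambda n: n != 0))
```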

Operators are an example of a construct where the design is clearly
geared towards mathematical notation and not general-purpose programming.
For programming, it would have been natural to treat the summation
operator as an ordinary function which takes a function as input
(something like `Sum(f, a, b)`), but mathematical notation
calls for writing down `f(n)` as an expression
with an explicitly named bound variable.
Regarding the order of the inputs, I find *function* - *For-generator* - *predicate* very natural,
perhaps because it is similar both to Wolfram and to generator expressions
in Python (`sum(f(n) for n in range(a,b+1) if n != 0)`).
It also matches the natural English form "the sum of f for n from a to b such that n is not zero".

An instance where the order is less natural is in `All` and `Exists` expressions
(the universal and existential quantifiers).
These are written in the following way:

`All(Greater(x, 0), For(x, S))`$$x > 0 \;\text{ for all } x \in S$$ $$\forall x \in S : \, x > 0$$

(The LaTeX renderer supports two styles for logical expressions: using text, and using logical symbols. The two versions are shown above.)

With an additional predicate:

`All(Greater(x, 0), For(x, S), P(x))`$$x > 0 \;\text{ for all } x \in S \text{ with } P(x)$$ $$\forall x \in S, \,P(x) : \, x > 0$$

The *function* - *For-generator* - *predicate* order
is used for internal consistency with other operators, but it is
a bit unintuitive here.
The text version of the quantifier expression can be read
as "x is greater than zero for all x in S such that P of x",
but "for all x in S such that P of x, x is greater than zero"
is the more natural reading of the formula rendered with logical symbols.
The order here is consistent with generator expressions
and the `all` and `any`
functions in Python, however.

## Collections

The basic expression language provides only one structural operation:
the function call. Collections of objects must therefore be specified
using function calls.
For example, `Set`, `Tuple` and `List`
are functions which construct sets, tuples and lists respectively
from given elements:

`Set()`$$\{\}$$

`Set(x, y)`$$\{x, y\}$$

`Set(Tuple(x, y), List(a, b, c))`$$\{(x, y), [a, b, c]\}$$

These functions will also support generator expressions
to allow expressing variable-length collections. `Tuple(Add(a_(k), c), For(k, 1, n))`
means $(a_1 + c, \ldots, a_n + c)$, for example (not yet implemented in the C version of the
LaTeX converter).
For set-builder notation, one can simply use `Set` as an operator
with a `For` expression to declare the iteration variable and base set,
along with an optional predicate:

`Set(f(x), For(x, RR))`$$\left\{ f(x) : x \in \mathbb{R} \right\}$$

`Set(f(x), For(x, RR), GreaterEqual(x, y))`$$\left\{ f(x) : x \in \mathbb{R}\,\mathbin{\operatorname{and}}\, x \ge y \right\}$$

This should eventually allow multiple generators (set comprehension over several variables; right now, a workaround is to use a tuple as the comprehension variable).
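As a toy model of the intended semantics (my own sketch, with a finite Python iterable standing in for the base set), set-builder expressions behave like set comprehensions:

```python
# Toy model of Set(f(x), For(x, S), predicate) over a finite stand-in set.

def build_set(f, S, predicate=None):
    """f and predicate are callables standing in for symbolic expressions
    with the bound variable substituted; S is a finite iterable."""
    return {f(x) for x in S if predicate is None or predicate(x)}

# Set(x^2, For(x, S)) with S = {-3, ..., 3}
print(sorted(build_set(lambda x: x * x, range(-3, 4))))   # [0, 1, 4, 9]
# Set(x^2, For(x, S), GreaterEqual(x, 2))
print(sorted(build_set(lambda x: x * x, range(-3, 4), lambda x: x >= 2)))   # [4, 9]
```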

For constructing matrices, a double iteration may be used:

`Matrix(c_(i,j), For(i, 1, m), For(j, 1, n))`$$\displaystyle{\begin{pmatrix} c_{1, 1} & c_{1, 2} & \cdots & c_{1, n} \\ c_{2, 1} & c_{2, 2} & \cdots & c_{2, n} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m, 1} & c_{m, 2} & \ldots & c_{m, n} \end{pmatrix}}$$

(Right now, this only works in the Python version of the LaTeX converter.)
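The expansion of the double iteration can be sketched in Python (the helper name is hypothetical; the real converter works symbolically rather than over concrete indices):

```python
# Toy expansion of Matrix(c_(i,j), For(i, 1, m), For(j, 1, n)) into an
# explicit m-by-n array of entries; iteration is 1-based as in the
# For-expressions.

def expand_matrix(entry, m, n):
    """entry(i, j) plays the role of the expression c_(i,j) with the two
    bound variables substituted."""
    return [[entry(i, j) for j in range(1, n + 1)] for i in range(1, m + 1)]

rows = expand_matrix(lambda i, j: "c_{%d,%d}" % (i, j), 2, 3)
for row in rows:
    print(" & ".join(row) + r" \\")
# c_{1,1} & c_{1,2} & c_{1,3} \\
# c_{2,1} & c_{2,2} & c_{2,3} \\
```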

Writing down matrices with explicit elements is a bit less convenient;
right now, you can write `Matrix(List(List(a, b), List(c, d)))`,
but I have considered allowing something like
`Matrix(2, 2, a, b, c, d)` and maybe even
something like
`Matrix(Row(a, b), Row(c, d))` or `Matrix(Col(a, c), Col(b, d))`.
Two by two matrices are so common that I have added a special
`Matrix2x2(a, b, c, d)`,
and there are also convenient constructors for row matrices, column
matrices and diagonal matrices.

What about multidimensional arrays? The problem here is that there
is no natural mathematical notation other than using nested lists of
lists. An alternative is to represent arrays implicitly by functions,
say `c_(i,j,k)` ($c_{i,j,k}$) for a three-dimensional array.

## Local definitions

The main construct for defining local constants and functions
is the `Where`-`Def` expression.
The expression `Where(expr, Def(x1, value1), Def(x2, value2), ...)`
first assigns the symbol `x1` the value `value1`,
then assigns the symbol `x2` the value `value2`, etc.,
and finally evaluates `expr` using these local definitions.
Here are some examples:

`Where(f(x), Def(x, a))`$$f(x)\; \text{ where } x = a$$

`Where(Mul(x, y, z), Def(x, 1), Def(y, Add(x, 2)), Def(z, Add(x, y, 3)))`$$x y z\; \text{ where } x = 1,\;y = x + 2,\;z = x + y + 3$$
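The sequential semantics can be modeled in a few lines of Python (a sketch under stand-in conventions, not the planned C implementation), with callables taking an environment dict in place of substitution into symbolic expressions:

```python
# Toy model of Where-Def: each Def sees the values bound by earlier Defs,
# and the main expression is evaluated last in the accumulated environment.

def eval_where(expr, *defs):
    """expr and each definition's right-hand side are callables taking the
    environment dict."""
    env = {}
    for name, value in defs:
        env[name] = value(env)   # later Defs may use earlier bindings
    return expr(env)

# Where(x*y*z, Def(x, 1), Def(y, x + 2), Def(z, x + y + 3))
result = eval_where(
    lambda e: e["x"] * e["y"] * e["z"],
    ("x", lambda e: 1),
    ("y", lambda e: e["x"] + 2),
    ("z", lambda e: e["x"] + e["y"] + 3),
)
print(result)   # 1 * 3 * 7 = 21
```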

Local functions can be defined as follows:

`Where(f(2, 3), Def(f(x, y), Mul(x, y)))`$$f(2, 3)\; \text{ where } f(x, y) = x y$$

Here, `x` and `y` are dummy variables which
get bound only within the `Def` expression itself.
I have not yet worked out all the semantics for function definitions
(for example, how to handle recursive definitions).

Conditional values (whether inside a function definition or somewhere else
in an expression) and piecewise-defined functions can be expressed using `Cases`:

`Where(f(Div(1, x)), Def(f(x), Cases(Case(1, Greater(x, 0)), Case(-1, Less(x, 0)), Case(0, Otherwise))))`$$f\!\left(\frac{1}{x}\right)\; \text{ where } f(x) = \begin{cases} 1, & x > 0\\-1, & x < 0\\0, & \text{otherwise}\\ \end{cases}$$
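As a toy model (names here are my own, not the Calcium API), `Cases` behaves like a first-match conditional:

```python
# Toy model of Cases: evaluate to the value of the first case whose
# condition holds, with Otherwise as a catch-all sentinel.

OTHERWISE = object()

def eval_cases(x, *cases):
    for value, condition in cases:
        if condition is OTHERWISE or condition(x):
            return value
    raise ValueError("no case matched")

# Cases(Case(1, x > 0), Case(-1, x < 0), Case(0, Otherwise))
sign = lambda x: eval_cases(x, (1, lambda t: t > 0),
                               (-1, lambda t: t < 0),
                               (0, OTHERWISE))
print(sign(5), sign(-2), sign(0))   # 1 -1 0
```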

Destructuring assignments are possible. The following binds both `a`
and `b`, setting them to the components of the length-two tuple
or list `T`:

`Where(Add(a, b), Def(Tuple(a, b), T))`$$a + b\; \text{ where } \left(a, b\right) = T$$

With matrices:

`Where(Sub(Mul(a, d), Mul(b, c)), Def(Matrix2x2(a, b, c, d), M))`$$a d - b c\; \text{ where } \displaystyle{\begin{pmatrix}a & b \\ c & d\end{pmatrix}} = M$$

Variable-length destructuring assignments are also meant to be supported, making something like the following possible (but I have yet to work out the semantics in detail):

`Where(Sum(a_(i), For(i, 1, n)), Def(Tuple(a_(i), For(i, 1, n)), T))`$$\sum_{i=1}^{n} a_{i}\; \text{ where } \left(a_{1}, \ldots, a_{n}\right) = T$$

The syntax and calling convention for `Where`-expressions
are guided by mathematical notation.
As a programming construct, it would be a bit more natural to place
the definitions before the expression to be evaluated,
and `Where` is not the most obvious keyword,
but there is at least precedent in the
`where`-clause in Haskell.

I will probably add a similar construct for declaring assumptions on free variables (e.g. $x > 0$), which currently must be done separately from the expression itself.

## Mathematical semantics

Generating readable LaTeX output from symbolic expressions is easy; formalizing the semantics and implementing evaluation is much harder. There are three problems to solve:

- Relatively easy: handling the purely structural aspects of parsing special expressions, binding local variables, and so on.
- Hard: defining the semantics of the mathematical objects and operations represented by expressions.
- Very hard (apart from simple cases): implementing the algorithms for those operations.

In the simplest case, evaluating an expression just involves computing
with objects of one type:
for example, `Pow(Add(Sqrt(2), Pi, 1), 2)` ($(\sqrt{2} + \pi + 1)^2$)
can be viewed as a constant expression over $\mathbb{R}$; it can be
evaluated numerically using a traversal with Arb or exactly
using a traversal with the Calcium `ca_t` number type.
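A toy version of such a traversal, with Python floats standing in for Arb balls or `ca_t` numbers (so without the rigorous error tracking those types provide):

```python
# Toy numerical evaluation traversal over a constant real expression,
# here Pow(Add(Sqrt(2), Pi, 1), 2), encoded with plain tuples.

import math

def eval_numeric(expr):
    if isinstance(expr, (int, float)):
        return float(expr)
    head, *args = expr
    vals = [eval_numeric(a) for a in args]
    if head == "Add":
        return sum(vals)
    if head == "Mul":
        return math.prod(vals)
    if head == "Pow":
        return vals[0] ** vals[1]
    if head == "Sqrt":
        return math.sqrt(vals[0])
    raise NotImplementedError(head)

expr = ("Pow", ("Add", ("Sqrt", 2), math.pi, 1), 2)
print(eval_numeric(expr))   # (sqrt(2) + pi + 1)^2 ≈ 30.87
```

With Arb, the same traversal would propagate rigorous error bounds at each node; with `ca_t`, it would produce an exact representation instead of an approximation.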

In general, however, symbolic expressions will involve many types of objects:
numbers, booleans, tuples, matrices, polynomials, functions, sets, etc.
Defining semantics for how such objects interact runs into
all the fundamental problems with formalizing mathematics; it becomes
necessary to define some kind of type system and to deal with
the distinction between equality and isomorphism.
My plan (perhaps naive) is to work with naive set theory
and familiar atomic and algebraic types, relying on natural embeddings as far as possible.
For example, I strongly want (and currently have) $\mathbb{Z} \subset \mathbb{C}$ so that there
is no difference between the integer 1 and the complex number 1.
On the other hand, the matrix $I_n$ should be a distinct object from the number 1,
and "$I_n = 1$" must therefore be expressed
using some kind of isomorphism predicate other than the usual `Equal`
function.
The constant polynomial $1 \in \mathbb{C}[x]$ could either be viewed as
a complex number or as a distinct formal object, depending on whether
$\mathbb{C}[x]$ is viewed as an extension ring of $\mathbb{C}$
or as an entirely separate structure with a homomorphism into $\mathbb{C}$; the former should be sufficient
for complex analysis and elementary number theory,
but algebraists may need the latter. (It might be necessary, though inelegant,
to allow both constructions.)

There are many situations where standard mathematical definitions make it difficult to construct good formal semantics, or at least where it is difficult to map semantic symbolic expressions to conventional notation in a natural way. Here are some examples:

- It is extremely common in mathematical practice to identify the 1-tuple $(x)$ with the element $x$, especially when working with Cartesian products or defining univariate functions as special cases of multivariate functions. This seems like a horrible idea for symbolic computation! Unfortunately, with the 1-tuple as a distinct type of object from its element, one needs special constructions for chaining Cartesian products, defining domains of univariate versus multivariate functions, etc. (This can be viewed as a special case of the equality-vs-isomorphism problem.)
- The notation $\mathbb{Q}[x]$ is problematic as a means to construct a polynomial ring.
This is usually implicitly used to
*define* $x$, but syntactically it clearly assumes that $x$ is already defined (as a transcendental extension element of $\mathbb{Q}$, perhaps implicitly distinct from anything named $y$, etc.). You really want to express "define $x$ as a unique polynomial generator over $\mathbb{Q}$", but this sentence does not map well to any other standard notation. I don't have an elegant symbolic syntax here yet.
- It is common in mathematical notation to omit specifying the domain
of a variable where it is implicit from the context ($n$ is an integer, $p$ is a prime number,
and so on).
Writing out domains everywhere in symbolic expressions can be clunky.
Fortunately, this problem
can be mitigated with special operators; for example, Calcium/Fungrim
allow you to write
`PrimeSum(f(p), For(p))`$\sum_p f(p)$, explicitly meaning that $p$ ranges over the set of positive prime numbers (and not over the mathematical universe, say).

I'm still thinking about the best way to handle various problems of this kind; I've tried some approaches when writing down formulas and theorems in Fungrim, but I have a feeling that I will need to revise many of the constructions.