The symbolic formula language in Calcium and Fungrim

January 31, 2021

In a recent post, I wrote about my approach for the low-level representation of symbolic expressions in C. Today, I will discuss the design of the formula language for representing mathematical content on top of symbolic expressions. This language consists of predefined ("builtin") symbol names and associated calling conventions for objects, functions and operators.

I originally developed the formula language for Fungrim, for which I wrote a Python implementation to handle LaTeX output and basic symbolic evaluation. I am now repurposing the language for Calcium, which will entail translating the backend from Python to C and also greatly extending the computational functionality. I'm currently about 80% done reimplementing the expression-to-LaTeX converter in C; you can see the current test suite output for the C version here.

The formula language is inspired by the Wolfram Language and by conventional mathematical notation, and to a lesser extent by other computer algebra systems, Python, and LaTeX. It is primarily designed for expressing formulas, and not for general purpose programming (though it does allow encoding simple functional programs). This immediately rules out many constructs that would be convenient for programming but which just don't map well to conventional mathematical notation. Another constraint is that I want the LaTeX conversion to be context-free; in general, the rendered output should depend only on the symbols and their positions in function calls, and not on the values assigned to variables in a surrounding context.

As part of the reimplementation work, I'm making tweaks to the language in an attempt to remove warts and improve clarity and versatility. Besides providing an overview of the formula language, this post will discuss some of the unsolved design problems.

Expression composition

The underlying symbolic expression format consists of atoms and composite expressions. Atoms are symbols (x, Integral), strings ("Hello") or integers (-12345). Composite expressions are formal function calls (f(x1, x2, ..., xn)). All formulas must be represented using these primitives; complex atomic datatypes are excluded by design to make symbolic expressions easy to store, traverse and convert to other formats.
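To give a feel for how small this data model is, here is a minimal Python sketch of it. The names `Symbol`, `Call` and `to_text` are hypothetical and chosen only for illustration; they are not the types used in the actual C or Python implementations.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Symbol:
    """A symbol atom such as x or Integral (hypothetical class name)."""
    name: str


@dataclass(frozen=True)
class Call:
    """A composite expression: a formal function call head(arg1, ..., argn)."""
    head: "Expr"
    args: Tuple["Expr", ...]


# Atoms are symbols, strings or integers; everything else is a Call.
Expr = Union[Symbol, str, int, Call]


def to_text(expr: Expr) -> str:
    """Serialize an expression back to the f(x1, x2, ..., xn) form."""
    if isinstance(expr, Symbol):
        return expr.name
    if isinstance(expr, str):
        return '"' + expr + '"'
    if isinstance(expr, int):
        return str(expr)
    return to_text(expr.head) + "(" + ", ".join(to_text(a) for a in expr.args) + ")"


f, x = Symbol("f"), Symbol("x")
example = Call(f, (x, "Hello", -12345))
print(to_text(example))
```

Since the head of a call is itself an expression, nested heads such as gamma_(n, m)(x, y, z) need no extra machinery in this sketch.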

Names

Builtin objects and operators have CamelCase names. Most objects have verbose names (for example, BesselJ(n, x) for $J_n(x)$, ClosedOpenInterval(a, b) for $[a, b)$, IdentityMatrix(n) for $I_n$), with the exception of some very common objects where an abbreviated name is clearly recognizable (for example, Mul(x, y, z) for $xyz$, Abs(x) for $|x|$, and ZZ for $\mathbb{Z}$).

The CamelCase convention is taken from Wolfram, and of course coincides with the convention for naming classes in many object-oriented languages such as Python. The reason is mainly aesthetic; I don't prefer CamelCase in most contexts, but it feels natural for naming mathematical objects in symbolic expressions (perhaps because they look more "official" that way). It also has some practical benefits:

It unambiguously leaves all symbol names starting with a lowercase letter for user symbols. (I am still undecided about allowing users to shadow builtin symbol names.)
There is no confusion over whether multi-word names contain underscores (only IsPrime is possible; there is no need to remember whether it is isprime or is_prime).

All lowercase letters a-z and uppercase letters A-Z are reserved for user variables. Names such as alpha, beta, etc. give the Greek symbols $\alpha$, $\beta$, etc. when rendered to LaTeX. As a special case, I decided to use the alternative spelling lamda for $\lambda$ to avoid an inconvenient clash with a keyword in Python (there are some other collisions of this kind; for example True/False collide with the boolean constants in Python, but with automatic conversions this turns out to be a very minor problem).

A cute hack I came up with is to recognize underscore-suffixed symbol names as a way to render subscripts automatically for user variables. For example, a(n) renders as $a(n)$ in LaTeX, while a_(n) renders as $a_n$. You can even write gamma_(n,m)(x,y,z) for $\gamma_{n,m}(x,y,z)$. It's totally a hack (there's some risk of confusion since a and a_ are separate variables and not implicitly related), but writing down sequences this way felt natural instantly the moment I started using it in Fungrim, so it will stay.
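The subscript rule can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: expressions are modeled as nested tuples (head first, then arguments) with plain strings as symbols, the `GREEK` table is a small hypothetical excerpt, and only underscore-suffixed symbols in head position are handled.

```python
# Hypothetical excerpt of a name-to-LaTeX table for Greek letters.
GREEK = {"alpha": r"\alpha", "beta": r"\beta", "gamma": r"\gamma", "lamda": r"\lambda"}


def symbol_latex(name):
    """Render a user symbol, mapping Greek names such as lamda to LaTeX."""
    return GREEK.get(name, name)


def latex(expr):
    """Render an integer, a symbol (str) or a call (tuple: head, then arguments)."""
    if isinstance(expr, int):
        return str(expr)
    if isinstance(expr, str):
        return symbol_latex(expr)
    head, args = expr[0], expr[1:]
    rendered = ", ".join(latex(a) for a in args)
    if isinstance(head, str) and head.endswith("_"):
        # a_(n) -> a_{n}: the arguments become a subscript
        return symbol_latex(head[:-1]) + "_{" + rendered + "}"
    # ordinary call: a(n) -> a(n)
    return latex(head) + "(" + rendered + ")"


print(latex(("a", "n")))
print(latex(("a_", "n")))
print(latex((("gamma_", "n", "m"), "x", "y", "z")))
```

The last call works because the subscripted symbol gamma_(n, m) is itself the head of an ordinary function call.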

Naming things

Naming things is hard. Sometimes very hard. Some of the builtin names I have chosen (in many cases, copied directly from Wolfram) are perfect, but some are clumsy or mutually inconsistent and I'm sure I will continue renaming things for a long time. This is one area where feedback from users would be extremely useful.

Here is an example of a rather silly naming problem I'm wrestling with: the naming of the imaginary unit $i$. ImaginaryUnit would be perfect, except that it is extremely long for an object that is so common. I'm currently using NumberI, which is explicit enough but still relatively long and not very standard. Im would have been acceptable, but it is used for the imaginary part function, which seems like too common an operation to name ImaginaryPart. Wolfram and many other systems reserve I and/or i to denote the imaginary unit, but these are useful variable names. Question: is the imaginary unit important enough to reserve I as a builtin constant, forcing the user to use some nonobvious symbol name (upperI or varI perhaps) if they want a variable named $I$? Note that it doesn't make sense to use I for the imaginary unit and also allow the user to shadow it as a variable, because the LaTeX output should display the imaginary unit as $i$, and it is meant to be context-free.

The same issue exists for $e$, but it is less important to have a short name here since Exp(x) works well for denoting the exponential function; the number is rarely needed on its own. I already decided to reserve Pi for the number $\pi$ and Gamma for the gamma function; the uppercase Greek letters $\Pi$ and $\Gamma$ (as variables) are currently named GreekPi and GreekGamma. N should in any case remain a variable name; the Wolfram Language always annoys me when I can't sum n from 0 to N.

Arithmetic expressions

Arithmetic operations and mathematical functions (Add, Mul, Div, Pow, Sqrt, Exp, Log, Gamma, etc.) are represented using simple function calls. There is no notion of infix operators in the expression format itself, but infix syntax can be handled at a parser level or in wrappers. For example, typing x + y * z in Python generates Add(x, Mul(y, z)).
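As an illustration of how a wrapper can provide infix syntax without the expression format knowing about it, here is a minimal Python sketch using operator overloading. The class `Expr` and its layout are hypothetical and not the actual Fungrim wrapper; only the builtin names Add, Mul, Div and Pow come from the language itself.

```python
class Expr:
    """Hypothetical wrapper: a symbol (args is None) or a call (args is a tuple)."""

    def __init__(self, head, args=None):
        self.head = head
        self.args = args

    def __call__(self, *args):
        # Calling an expression builds a formal function call.
        return Expr(self, tuple(args))

    def __repr__(self):
        if self.args is None:
            return self.head
        return repr(self.head) + "(" + ", ".join(repr(a) for a in self.args) + ")"

    # Infix operators do no arithmetic; they only build function calls.
    def __add__(self, other):
        return Add(self, other)

    def __radd__(self, other):
        return Add(other, self)

    def __mul__(self, other):
        return Mul(self, other)

    def __rmul__(self, other):
        return Mul(other, self)

    def __truediv__(self, other):
        return Div(self, other)

    def __pow__(self, other):
        return Pow(self, other)


Add, Mul, Div, Pow = (Expr(name) for name in ("Add", "Mul", "Div", "Pow"))
x, y, z = Expr("x"), Expr("y"), Expr("z")

print(x + y * z)
print(2 * x ** 2)
```

Operator precedence comes for free from the host language: Python evaluates y * z before the addition, so the nesting of the resulting calls is already correct.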

The LaTeX converter inserts parentheses automatically to indicate the correct precedence in displayed formulas. It also removes some parentheses and signs where they are redundant or would look ugly, by assuming associativity. For example, Add(x, y, z) and Add(x, Add(y, z)) both render as $x + y + z$, and Add(x, Div(-2, 3)) renders as $x - \frac{2}{3}$ rather than as $x + \frac{-2}{3}$. This behavior will be configurable for applications where it is vital to display the exact internal structure of arithmetic expressions. It is also possible to insert manual parentheses and brackets to clarify grouping; for example, Add(x, Div(Parentheses(-2), 3)) renders as $x + \frac{\left(-2\right)}{3}$ (semantically, Parentheses can be understood as representing the identity function).
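The following Python sketch reproduces just the three examples above. It is a simplified illustration, not the converter's actual algorithm: expressions are modeled as nested tuples, only Add, Div and Parentheses are supported, and precedence-based insertion of parentheses is left out entirely.

```python
def flatten_add(args):
    """Yield the terms of a sum, expanding nested Add calls (assumes associativity)."""
    for arg in args:
        if isinstance(arg, tuple) and arg[0] == "Add":
            yield from flatten_add(arg[1:])
        else:
            yield arg


def latex(expr):
    """Render integers, symbols (str) and calls (tuples: head, then arguments)."""
    if isinstance(expr, (int, str)):
        return str(expr)
    head, args = expr[0], expr[1:]
    if head == "Add":
        out = ""
        for term in flatten_add(args):
            s = latex(term)
            if not out:
                out = s
            elif s.startswith("-"):
                out += " - " + s[1:]  # x + (-t) is displayed as x - t
            else:
                out += " + " + s
        return out
    if head == "Div":
        num, den = args
        if isinstance(num, int) and num < 0:
            # Pull the sign of a negative integer numerator out of the fraction.
            return "-" + r"\frac{%d}{%s}" % (-num, latex(den))
        return r"\frac{%s}{%s}" % (latex(num), latex(den))
    if head == "Parentheses":
        return r"\left(" + latex(args[0]) + r"\right)"
    raise ValueError("unsupported head: %r" % (head,))


print(latex(("Add", "x", ("Add", "y", "z"))))
print(latex(("Add", "x", ("Div", -2, 3))))
print(latex(("Add", "x", ("Div", ("Parentheses", -2), 3))))
```

Because Parentheses wraps the numerator in a call, the sign is no longer visible to the Div rule, which is what keeps the manual grouping intact.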

The representation of arithmetic expressions is intended to be simple, not to be as compact as possible. The monomial Mul(3, Pow(x, 2), Pow(y, 4)) ($3 x^2 y^4$) could be encoded more compactly as something like Monomial(3, 1, x, 2, y, 4). An entire polynomial could be encoded even more compactly using arrays of coefficients and exponents. Such constructions might be added at a later date if they turn out to be essential for performance (for the moment, I'm assuming that Flint and Calcium polynomial types will be used for polynomial computations and that symbolic expressions will be used only as an interface to those types, so squeezing out the last inch of performance is not essential).

Operators and generator expressions

Operators (not to be confused with arithmetic operators) are builtin symbols that express some transformation applied to a function or a set. Examples include Sum, DivisorSum, PrimeSum, Product, Integral, Limit, Derivative, Minimum, Supremum, ArgMax, Solutions, UniqueSolution, Zeros, and others. Rather than acting as normal functions taking constant values as input, operators interpret some of their input expressions as functions with respect to locally bound variables.

All operators use the same syntax for binding variables: the special For-expression. The generator expression For(n, ...) binds n as a dummy variable in the scope of the parent expression. The additional arguments ... specify the range of the iteration (lower or upper bounds, iteration set, etc.) or localization (evaluation point, etc.); the detailed interpretation of ... depends on the operator. For example, Sum and Product can be called with one or two parameters for the For-expression. Two parameters define lower and upper bounds:

Sum(f(n), For(n, a, b)) $$\sum_{n=a}^{b} f(n)$$

A single parameter specifies a set:

Sum(f(n), For(n, ZZ)) $$\sum_{n \in \mathbb{Z}} f(n)$$

The expression for the summand and the generator expression can be followed by an optional predicate restricting the range of the iteration:

Sum(f(n), For(n, ZZ), NotEqual(n, 0)) $$\sum_{\textstyle{n \in \mathbb{Z} \atop n \ne 0}} f(n)$$

(Another way to express the same summation is to specify SetMinus(ZZ, Set(0)) or $\mathbb{Z} \setminus \{0\}$ as the set.)

Operators are an example of a construct where the design is clearly geared towards mathematical notation and not general-purpose programming. For programming, it would have been natural to treat the summation operator as an ordinary function which takes a function as input (something like Sum(f, a, b)), but mathematical notation calls for writing down f(n) as an expression with an explicitly named bound variable. Regarding the order of the inputs, I find function - For-generator - predicate very natural, perhaps because it is similar both to Wolfram and to generator expressions in Python (sum(f(n) for n in range(a,b+1) if n != 0)). It also matches the natural English form "the sum of f for n from a to b such that n is not zero".
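To make the binding rule concrete, here is a minimal Python sketch of how an evaluator might interpret Sum with a For-expression and an optional predicate. It is an illustration under assumptions, not the Fungrim or Calcium evaluator: expressions are nested tuples, the helper name `bindings` is hypothetical, and the single-parameter form is only supported for finite sets.

```python
def bindings(generator, predicates, env):
    """Yield one environment per admissible value of the For-bound variable."""
    assert generator[0] == "For"
    var, params = generator[1], generator[2:]
    if len(params) == 2:
        # For(n, a, b): inclusive lower and upper bounds
        a, b = (evaluate(p, env) for p in params)
        domain = range(a, b + 1)
    else:
        # For(n, S): iteration over a set (finite sets only in this sketch)
        domain = evaluate(params[0], env)
    for value in domain:
        local = {**env, var: value}
        if all(evaluate(p, local) for p in predicates):
            yield local


def evaluate(expr, env):
    """Evaluate integers, symbols (str) and calls (tuples: head, then arguments)."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return env[expr]
    head, args = expr[0], expr[1:]
    if head == "Sum":
        # The summand is evaluated once per binding, not once up front.
        return sum(evaluate(args[0], local) for local in bindings(args[1], args[2:], env))
    values = [evaluate(a, env) for a in args]
    if head == "Add":
        return sum(values)
    if head == "Pow":
        return values[0] ** values[1]
    if head == "NotEqual":
        return values[0] != values[1]
    if head == "Set":
        return frozenset(values)
    raise ValueError("unsupported head: %r" % (head,))


example = ("Sum", ("Pow", "n", 2), ("For", "n", -3, 3), ("NotEqual", "n", 0))
print(evaluate(example, {}))
```

The key point is that Sum receives its first argument unevaluated; an ordinary function would see the value of the summand, whereas an operator sees the expression together with the variable it binds.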

An instance where the order is less natural is in All and Exists expressions (the universal and existential quantifiers). These are written in the following way:

All(Greater(x, 0), For(x, S)) $$x > 0 \;\text{ for all } x \in S$$ $$\forall x \in S : \, x > 0$$

(The LaTeX renderer supports two styles for logical expressions: using text, and using logical symbols. The two versions are shown above.)

With an additional predicate:

All(Greater(x, 0), For(x, S), P(x)) $$x > 0 \;\text{ for all } x \in S \text{ with } P(x)$$ $$\forall x \in S, \,P(x) : \, x > 0$$

The function - For-generator - predicate order is used for internal consistency with other operators, but it is a bit unintuitive here. The text version of the quantifier expression can be read as "x is greater than zero for all x in S such that P of x", but "for all x in S such that P of x, x is greater than zero" is a more natural way to read the formula as it was rendered using logical symbols. The order here is consistent with generator expressions and the all and any functions in Python, however.
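The mismatch between argument order and reading order shows up clearly in a small rendering sketch. The function `render_all` below is hypothetical; it takes already-rendered LaTeX fragments for the arguments of All(body, For(var, domain), predicate) and assembles the two styles shown above.

```python
def render_all(body, var, domain, predicate=None, style="text"):
    """Assemble the LaTeX for All(body, For(var, domain), predicate)."""
    if style == "text":
        # Text style keeps the argument order: body first, then the generator.
        out = rf"{body} \;\text{{ for all }} {var} \in {domain}"
        if predicate is not None:
            out += rf" \text{{ with }} {predicate}"
        return out
    # Symbol style moves the generator and predicate in front of the body.
    condition = rf"{var} \in {domain}"
    if predicate is not None:
        condition += rf", \,{predicate}"
    return rf"\forall {condition} : \, {body}"


print(render_all("x > 0", "x", "S", "P(x)", style="text"))
print(render_all("x > 0", "x", "S", "P(x)", style="symbols"))
```

Only the symbol style has to reorder its inputs, which matches the observation that the text rendering reads naturally in argument order.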

Collections

The basic expression language provides only one structural operation: the function call. Collections of objects must therefore be specified using function calls. For example, Set, Tuple and List are functions which construct sets, tuples and lists respectively from given elements:

Set() $$\{\}$$

Set(x, y) $$\{x, y\}$$

Set(Tuple(x, y), List(a, b, c)) $$\{(x, y), [a, b, c]\}$$

These functions will also support generator expressions to allow expressing variable-length collections. Tuple(Add(a_(k), c), For(k, 1, n)) means $(a_1 + c, \ldots, a_n + c)$, for example (not yet implemented in the C version of the LaTeX converter). For set-builder notation, one can simply use Set as an operator with a For expression to declare the iteration variable and base set, along with an optional predicate:

Set(f(x), For(x, RR)) $$\left\{ f(x) : x \in \mathbb{R} \right\}$$

Set(f(x), For(x, RR), GreaterEqual(x, y)) $$\left\{ f(x) : x \in \mathbb{R}\,\mathbin{\operatorname{and}}\, x \ge y \right\}$$

This should eventually allow multiple generators (set comprehension over several variables; right now, a workaround is to use a tuple as the comprehension variable).

For constructing matrices, a double iteration may be used:

Matrix(c_(i,j), For(i, 1, m), For(j, 1, n)) $$\displaystyle{\begin{pmatrix} c_{1, 1} & c_{1, 2} & \cdots & c_{1, n} \\ c_{2, 1} & c_{2, 2} & \cdots & c_{2, n} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m, 1} & c_{m, 2} & \ldots & c_{m, n} \end{pmatrix}}$$

(Right now, this only works in the Python version of the LaTeX converter.)

Writing down matrices with explicit elements is a bit less convenient; right now, you can write Matrix(List(List(a, b), List(c, d))), but I have considered allowing something like Matrix(2, 2, a, b, c, d) and maybe even something like Matrix(Row(a, b), Row(c, d)) or Matrix(Col(a, c), Col(b, d)). Two by two matrices are so common that I have added a special Matrix2x2(a, b, c, d), and there are also convenient constructors for row matrices, column matrices and diagonal matrices.

What about multidimensional arrays? The problem here is that there is no natural mathematical notation other than using nested lists of lists. An alternative is to represent arrays implicitly by functions, say c_(i,j,k) ($c_{i,j,k}$) for a three-dimensional array.

Local definitions

The main construct for defining local constants and functions is the Where-Def expression. The expression Where(expr, Def(x1, value1), Def(x2, value2), ...) first assigns the symbol x1 the value value1, then assigns the symbol x2 the value value2, etc., and finally evaluates expr using these local definitions. Here are some examples:

Where(f(x), Def(x, a)) $$f(x)\; \text{ where } x = a$$

Where(Mul(x, y, z), Def(x, 1), Def(y, Add(x, 2)), Def(z, Add(x, y, 3))) $$x y z\; \text{ where } x = 1,\;y = x + 2,\;z = x + y + 3$$
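The sequential nature of the definitions can be sketched in Python as follows. This is a minimal illustration, assuming a nested-tuple encoding of expressions and supporting only Add and Mul besides Where; it is not the actual evaluator.

```python
import math


def evaluate(expr, env):
    """Evaluate integers, symbols (str) and calls (tuples: head, then arguments)."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return env[expr]
    head, args = expr[0], expr[1:]
    if head == "Where":
        local = dict(env)
        for tag, name, value in args[1:]:
            assert tag == "Def"
            # Each definition may refer to the symbols defined before it.
            local[name] = evaluate(value, local)
        # The main expression comes first syntactically but is evaluated last.
        return evaluate(args[0], local)
    values = [evaluate(a, env) for a in args]
    if head == "Add":
        return sum(values)
    if head == "Mul":
        return math.prod(values)
    raise ValueError("unsupported head: %r" % (head,))


example = ("Where", ("Mul", "x", "y", "z"),
           ("Def", "x", 1),
           ("Def", "y", ("Add", "x", 2)),
           ("Def", "z", ("Add", "x", "y", 3)))
print(evaluate(example, {}))
```

For the second example above this gives x = 1, y = 3 and z = 7, so the product is 21.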

Local functions can be defined as follows:

Where(f(2, 3), Def(f(x, y), Mul(x, y))) $$f(2, 3)\; \text{ where } f(x, y) = x y$$

Here, x and y are dummy variables which get bound only within the Def expression itself. I have not yet worked out all the semantics for function definitions (for example, how to handle recursive definitions).

Conditional values (whether inside a function definition or somewhere else in an expression) and piecewise-defined functions can be expressed using Cases:

Where(f(Div(1, x)), Def(f(x), Cases(Case(1, Greater(x, 0)), Case(-1, Less(x, 0)), Case(0, Otherwise)))) $$f\!\left(\frac{1}{x}\right)\; \text{ where } f(x) = \begin{cases} 1, & x > 0\\-1, & x < 0\\0, & \text{otherwise}\\ \end{cases}$$
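One possible reading of this example is sketched below in Python. Since the semantics of function definitions are not settled, everything here is an assumption made for illustration: expressions are nested tuples, a local function is stored as a pair of parameter names and body, the body is evaluated in the caller's environment extended with the arguments, and cases are tried in order until one applies.

```python
from fractions import Fraction


def evaluate(expr, env):
    """Evaluate integers, symbols (str) and calls (tuples: head, then arguments)."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return env[expr]
    head, args = expr[0], expr[1:]
    if head == "Cases":
        # Try the cases in order; Otherwise always applies.
        for tag, value, condition in args:
            assert tag == "Case"
            if condition == "Otherwise" or evaluate(condition, env):
                return evaluate(value, env)
        raise ValueError("no case applies")
    if head == "Where":
        local = dict(env)
        for tag, target, body in args[1:]:
            assert tag == "Def"
            if isinstance(target, tuple):
                # Def(f(x, ...), body): store parameter names and unevaluated body
                local[target[0]] = (target[1:], body)
            else:
                local[target] = evaluate(body, local)
        return evaluate(args[0], local)
    values = [evaluate(a, env) for a in args]
    if head == "Div":
        return Fraction(values[0], values[1])
    if head == "Greater":
        return values[0] > values[1]
    if head == "Less":
        return values[0] < values[1]
    if head in env:
        # Call of a locally defined function: bind parameters to argument values.
        params, body = env[head]
        return evaluate(body, {**env, **dict(zip(params, values))})
    raise ValueError("unsupported head: %r" % (head,))


sign = ("Cases", ("Case", 1, ("Greater", "x", 0)),
        ("Case", -1, ("Less", "x", 0)),
        ("Case", 0, "Otherwise"))
example = ("Where", ("f", ("Div", 1, "x")), ("Def", ("f", "x"), sign))
print(evaluate(example, {"x": -4}))
```

Inside the call, the parameter x shadows the outer x, so the condition is tested on 1/x rather than on x itself.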

Destructuring assignments are possible. The following binds both a and b, setting them to the components of the length-two tuple or list T:

Where(Add(a, b), Def(Tuple(a, b), T)) $$a + b\; \text{ where } \left(a, b\right) = T$$

With matrices:

Where(Sub(Mul(a, d), Mul(b, c)), Def(Matrix2x2(a, b, c, d), M)) $$a d - b c\; \text{ where } \displaystyle{\begin{pmatrix}a & b \\ c & d\end{pmatrix}} = M$$

Variable-length destructuring assignments are also meant to be supported, making something like the following possible (but I have yet to work out the semantics in detail):

Where(Sum(a_(i), For(i, 1, n)), Def(Tuple(a_(i), For(i, 1, n)), T)) $$\sum_{i=1}^{n} a_{i}\; \text{ where } \left(a_{1}, \ldots, a_{n}\right) = T$$

The syntax and calling convention for Where-expressions are guided by the mathematical notation. As a programming construct, it would be a bit more natural to place the definitions before the expression to be evaluated, and Where is not the most obvious keyword, but there is at least precedent in the where-clause in Haskell.

I will probably add a similar construct for declaring assumptions on free variables (e.g. $x > 0$), which currently must be done separately from the expression itself.

Mathematical semantics

Generating readable LaTeX output from symbolic expressions is easy; formalizing the semantics and implementing evaluation is much harder. There are three problems to solve:

Relatively easy: handling the purely structural aspects of parsing special expressions, binding local variables, and so on.
Hard: defining the semantics of the mathematical objects and operations represented by expressions.
Very hard (apart from simple cases): implementing the algorithms for those operations.

In the simplest case, evaluating an expression just involves computing with objects of one type: for example, Pow(Add(Sqrt(2), Pi, 1), 2) ($(\sqrt{2} + \pi + 1)^2$) can be viewed as a constant expression over $\mathbb{R}$; it can be evaluated numerically using a traversal with Arb or exactly using a traversal with the Calcium ca_t number type.
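The traversal itself is short. The Python sketch below uses ordinary double-precision floats as a stand-in for Arb, so unlike a real Arb evaluation it gives no error bounds; the tables `CONSTANTS` and `FUNCTIONS` are hypothetical and cover only what this one expression needs.

```python
import math

# Hypothetical lookup tables mapping builtin names to float implementations.
CONSTANTS = {"Pi": math.pi}
FUNCTIONS = {
    "Add": lambda *values: math.fsum(values),
    "Mul": lambda *values: math.prod(values),
    "Pow": lambda base, exponent: base ** exponent,
    "Sqrt": math.sqrt,
}


def evaluate_float(expr):
    """Evaluate a constant expression bottom-up (tuples: head, then arguments)."""
    if isinstance(expr, int):
        return float(expr)
    if isinstance(expr, str):
        return CONSTANTS[expr]
    head, args = expr[0], expr[1:]
    # Evaluate the arguments first, then apply the function named by the head.
    return FUNCTIONS[head](*(evaluate_float(a) for a in args))


expr = ("Pow", ("Add", ("Sqrt", 2), "Pi", 1), 2)
print(evaluate_float(expr))
```

Swapping the two tables for ones that produce a different number type is, in this simplified picture, all that distinguishes a numerical traversal from an exact one.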

In general, however, symbolic expressions will involve many types of objects: numbers, booleans, tuples, matrices, polynomials, functions, sets, etc. Defining semantics for how such objects interact runs into all the fundamental problems with formalizing mathematics; it becomes necessary to define some kind of type system and to deal with the distinction between equality and isomorphism. My plan (perhaps naive) is to work with naive set theory and familiar atomic and algebraic types, relying on natural embeddings as far as possible. For example, I strongly want (and currently have) $\mathbb{Z} \subset \mathbb{C}$ so that there is no difference between the integer 1 and the complex number 1. On the other hand, the matrix $I_n$ should be a distinct object from the number 1, and "$I_n = 1$" must therefore be expressed using some kind of isomorphism predicate other than the usual Equal function. The constant polynomial $1 \in \mathbb{C}[x]$ could either be viewed as a complex number or as a distinct formal object, depending on whether $\mathbb{C}[x]$ is viewed as an extension ring of $\mathbb{C}$ or as an entirely separate structure with a homomorphism into $\mathbb{C}$; the former should be sufficient for complex analysis and elementary number theory, but algebraists may need the latter. (It might be necessary, though inelegant, to allow both constructions.)

There are many situations where standard mathematical definitions make it difficult to construct good formal semantics, or at least where it is difficult to map semantic symbolic expressions to conventional notation in a natural way. Here are some examples:

It is extremely common in mathematical practice to identify the 1-tuple $(x)$ with the element $x$, especially when working with Cartesian products or defining univariate functions as special cases of multivariate functions. This seems like a horrible idea for symbolic computation! Unfortunately, with the 1-tuple as a distinct type of object from its element, one needs special constructions for chaining Cartesian products, defining domains of univariate versus multivariate functions, etc. (This can be viewed as a special case of the equality-vs-isomorphism problem.)
The notation $\mathbb{Q}[x]$ is problematic as a means to construct a polynomial ring. This is usually implicitly used to define $x$, but syntactically it clearly assumes that $x$ is already defined (as a transcendental extension element of $\mathbb{Q}$, perhaps implicitly distinct from anything named $y$, etc.). You really want to express "define $x$ as a unique polynomial generator over $\mathbb{Q}$", but this sentence does not map well to any other standard notation. I don't have an elegant symbolic syntax here yet.
It is common in mathematical notation to omit specifying the domain of a variable where it is implicit from the context ($n$ is an integer, $p$ is a prime number, and so on). Writing out domains everywhere in symbolic expressions can be clunky. Fortunately, this problem can be mitigated with special operators; for example, Calcium/Fungrim allow you to write PrimeSum(f(p), For(p)) $\sum_p f(p)$, explicitly meaning that $p$ ranges over the set of positive prime numbers (and not over the mathematical universe, say).

I'm still thinking about the best way to handle various problems of this kind; I've tried some approaches when writing down formulas and theorems in Fungrim, but I have a feeling that I will need to revise many of the constructions.
