
FLINT development in the age of AI

May 14, 2026

Well, 2026 seems to be the year when "AI" actually became useful, and I guess I'm in good company saying so. Here I'll report about my recent experience using AI (mainly Claude Pro) for FLINT development.


Explaining FLINT code

A very conservative application of AI is to just explain existing code. This ability should be especially useful for new users and developers, but it's probably also going to be helpful for me as a maintainer, since FLINT contains a fair amount of code outside my own areas of expertise, written by people who are long gone from the project.

I've tried to generate explanations for dozens of FLINT functions with different AIs, and despite occasional errors, the results are by now generally good. To give a random example, here is Claude's explanation of the function qqbar_sgn_re, which I asked it to generate from the code without giving any other hints:

Overview

qqbar_sgn_re computes the sign of the real part of an algebraic number $x$, returning $-1$, $0$, or $+1$. In FLINT, a qqbar_t represents an element of $\overline{\mathbb{Q}}$ — an exact algebraic number — stored as two things together: the minimal polynomial over $\mathbb{Z}$ whose root it is, and a ball in $\mathbb{C}$ (an acb_t enclosure) that isolates it from all other roots of that polynomial.

The challenge is that this must be an exact decision despite the number being represented only implicitly. The function works through a hierarchy of cases, using fast shortcuts first and falling back to a rigorous iterative refinement loop.

Case 1: Rational numbers (degree 1)

if (qqbar_degree(x) == 1)
    return -fmpz_sgn(QQBAR_COEFFS(x));

When $x$ is rational, its minimal polynomial is linear: $c_1 x + c_0 = 0$, so $x = -c_0/c_1$. The coefficients are stored as QQBAR_COEFFS(x) in order from lowest to highest degree, so the leading coefficient is QQBAR_COEFFS(x)[1] and the constant is QQBAR_COEFFS(x)[0]. Since the minimal polynomial is primitive with positive leading coefficient by convention, $\operatorname{sgn}(x) = -\operatorname{sgn}(c_0)$, which is $-\operatorname{sgn}(\mathtt{QQBAR\_COEFFS}(x)[0])$. This is computed by fmpz_sgn on the zeroth coefficient.

Case 2: Enclosure already certifies zero

else if (arb_is_zero(acb_realref(QQBAR_ENCLOSURE(x))))
    return 0;

If the stored enclosure ball has real part that is exactly the zero ball (radius zero, midpoint zero), then $\operatorname{Re}(x) = 0$ is already certified. This happens when the enclosure was previously refined to lie on the imaginary axis.

Case 3: Enclosure already separates from zero

else if (!arb_contains_zero(acb_realref(QQBAR_ENCLOSURE(x))))
    return arf_sgn(arb_midref(acb_realref(QQBAR_ENCLOSURE(x))));

If zero is not contained in the real part of the enclosure ball, then the sign is determined immediately: it is the sign of the midpoint of that ball (since the entire ball is on one side of zero). This is the common fast path for numbers that are not purely imaginary.

Case 4: Enclosure straddles zero — iterative refinement

When the enclosure ball for $\operatorname{Re}(x)$ contains zero but is not certified to be zero, the sign is genuinely ambiguous at the current precision. The function must work harder.

Parity shortcut: could $\operatorname{Re}(x)$ be zero?

maybe_zero = 1;
for (i = 1; i < d && maybe_zero; i += 2)
    if (!fmpz_is_zero(QQBAR_COEFFS(x) + i))
        maybe_zero = 0;

Before entering the refinement loop, the code checks a necessary condition for $\operatorname{Re}(x) = 0$: if $x$ is purely imaginary, then $x$ satisfies a polynomial with only even-degree terms (because $p(iy) = 0$ implies $p$ has only even powers). Equivalently, all odd-degree coefficients of the minimal polynomial must vanish. If any odd coefficient is nonzero, then $\operatorname{Re}(x) \ne 0$ is guaranteed by algebra alone, and the code sets maybe_zero = 0, skipping the more expensive zero-certification test inside the loop.

The refinement loop

for (prec = QQBAR_DEFAULT_PREC / 2; ; prec *= 2)
{
    _qqbar_enclosure_raw(t, QQBAR_POLY(x), t, prec);

    if (!arb_contains_zero(acb_realref(t)) || arb_is_zero(acb_realref(t)))
    {
        res = arf_sgn(arb_midref(acb_realref(t)));
        break;
    }

    if (maybe_zero)
    {
        acb_set(u, t);
        arb_zero(acb_realref(u));
        if (_qqbar_validate_existence_uniqueness(u, QQBAR_POLY(x), u, prec * 2))
        {
            res = 0;
            break;
        }
    }
}

The loop doubles the working precision on each iteration. Two exit conditions exist:

  1. Nonzero certified. _qqbar_enclosure_raw refines the enclosure ball t to the requested precision using interval Newton iterations on the minimal polynomial. If after refinement the real part of t no longer contains zero (or is the exact zero ball), the sign is read off from the midpoint and the loop exits. As precision doubles, the ball radius shrinks exponentially, so if $\operatorname{Re}(x) \ne 0$ this will eventually succeed.
  2. Zero certified. When maybe_zero is set, the code constructs a modified ball u from t but with its real part forced to zero — i.e., a vertical strip on the imaginary axis. It then calls _qqbar_validate_existence_uniqueness, which uses an interval arithmetic argument (essentially a rigorous version of Rouché's theorem or a Krawczyk/Newton test) to verify that the minimal polynomial has exactly one root inside u. Since $x$ is already known to be the unique root in the original enclosure, and the imaginary-axis sub-ball is contained in it, this proves $\operatorname{Re}(x) = 0$ exactly.

Termination

The loop always terminates. If $\operatorname{Re}(x) \ne 0$, the enclosure radius halves (relative to the distance from zero) on each doubling of precision, so the sign is eventually resolved. If $\operatorname{Re}(x) = 0$, the parity check guarantees maybe_zero is set, and the uniqueness test will eventually succeed once the enclosure is tight enough for the interval Newton contractor to confirm the root lies exactly on the imaginary axis.

Summary

Condition | Method | Cost
$x$ rational (degree 1) | Read sign of constant coefficient | $O(1)$
Enclosure already on imaginary axis | Direct read | $O(1)$
Enclosure separated from imaginary axis | Sign of enclosure midpoint | $O(1)$
Enclosure straddles imaginary axis, $\operatorname{Re}(x) \ne 0$ | Iterative precision doubling until separation | $O(\text{prec})$ per iteration
Enclosure straddles imaginary axis, $\operatorname{Re}(x) = 0$ | Parity check + uniqueness certification on imaginary axis | $O(\text{prec})$ per iteration

The overall design reflects a standard pattern in exact real/algebraic computation: try cheap numerical tests first, and fall back to a rigorous (but more expensive) algebraic or interval-analytic certificate only when the numerical evidence is genuinely ambiguous.

The explanation is spot on apart from the bogus claimed costs of $O(\text{prec})$, an error which is not really important in the context of this particular function. I would remove the useless summary section/table (one of the telltale signs of being AI-written), but other than that I don't think I could have written it better myself. It seems that Claude has some context from preexisting training on the FLINT codebase and documentation, but the specific algorithm here has not been documented and probably doesn't occur in any textbooks. This function is rather simple, but one can get good explanations of much more complex functions too. Claude is also able to generate standalone example programs that demonstrate how to actually use FLINT functions; here is what it came up with for qqbar_sgn_re.
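
For flavor, here is a minimal standalone program of the same kind (my own sketch, not Claude's output, using the qqbar API as I remember it): it computes the sign of the real part of $x = \sqrt{-3}$, which is purely imaginary, so the hard zero-certification path discussed above is exercised.

#include <flint/qqbar.h>

int main(void)
{
    qqbar_t x;
    qqbar_init(x);
    qqbar_set_si(x, -3);
    qqbar_sqrt(x, x);   /* principal square root: x = sqrt(3)*i */
    flint_printf("sgn(Re(x)) = %d\n", qqbar_sgn_re(x));   /* prints 0 */
    qqbar_clear(x);
    return 0;
}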

What can we do with this? FLINT has decent API documentation, but the documentation of algorithms is spotty, as is the selection of examples and tutorials. I can now imagine using AI to help extend these aspects of the documentation. I'm guessing that letting AI do the first draft and then manually proofreading and editing to a publication-quality standard would require an order of magnitude less work than writing everything by hand. I'm not sure such a documentation project would really be worthwhile, but it seems perfectly feasible.

Proving technical lemmas

A big mathematical software library like FLINT relies on hundreds if not thousands of implicit (often undocumented) technical lemmas. By this I mean equalities, bounds or invariants, often concerned with implementation details, which typically don't require any deep mathematical insight, but which may require tedious case-by-case analysis to prove rigorously. It seems that AIs are now becoming capable of routinely proving such results, which is potentially going to be a big time saver for development. Here I'll share one small success story.

One of the most important low-level operations in FLINT is 3-to-1 word modular reduction: given a single-word modulus $n < 2^{64}$ and a three-word integer $y = y_2 B^2 + y_1 B + y_0$ where $B = 2^{64}$ is the machine-word radix, compute the canonical residue $y \bmod n$ in $[0, n)$. This operation occurs as the final output conversion in bilinear operations like polynomial multiplication and matrix multiplication over $\mathbb{Z} / n \mathbb{Z}$ where we accumulate a three-word result over $\mathbb{Z}$ before reducing.

The best algorithm I've come up with so far is to precompute $\alpha_1 = B \bmod n$ and $\alpha_2 = B^2 \bmod n$, form the 128-bit integer $y' = \alpha_2 y_2 + \alpha_1 y_1 + y_0$, and reduce this using a version of Barrett reduction if $n < 2^{63}$ and Granlund-Möller (GM) division if $n \ge 2^{63}$. The reason for this case distinction is that the Barrett variant is most efficient with one bit of slack for overflow detection, while GM works natively with moduli that have the most significant bit set and requires extra shifts otherwise.
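
In code, the skeleton looks as follows (an illustrative sketch just to fix ideas; the function name is mine, and FLINT would replace the final % by Barrett or GM division):

#include <stdint.h>

typedef unsigned __int128 u128;

/* Sketch of the 3-to-1 reduction. alpha1 = B mod n and alpha2 = B^2 mod n
   are precomputed once per modulus n. */
uint64_t red_3to1(uint64_t y2, uint64_t y1, uint64_t y0,
                  uint64_t n, uint64_t alpha1, uint64_t alpha2)
{
    /* y' = alpha2*y2 + alpha1*y1 + y0 is congruent to y mod n; the lemma
       discussed below guarantees that this fits in 128 bits. */
    u128 yp = (u128) alpha2 * y2 + (u128) alpha1 * y1 + y0;
    return (uint64_t) (yp % n);
}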

An example of the kind of technical lemma I mentioned above is the fact that $y'$ does not overflow 128 bits, even when $n$ is close to a full word (e.g. $n = 2^{64}-1$). This follows from some extremely fortunate cancellations in the relation between the values of $\alpha_1$, $\alpha_2$ and $n$. That lemma is easy, but here is a trickier one: GM requires a priori that the high word of $y'$ is already reduced mod $n$, i.e. $y' < n B$, so in general one needs to conditionally subtract $n$ from the high word of $y'$ before calling GM. While experimenting with the 3-to-1 reduction code I noticed that the algorithm seems to work correctly even without the conditional subtraction if $n = 2^{63} + k$ for sufficiently small $k$, conjecturally for all $k \le 2^{30}$. Removing this branch entirely for such moduli should save ~1 CPU cycle per modular reduction. Primes of this form are used extensively within FLINT for multimodular algorithms (stepping up from $2^{63}$ with n_nextprime), which is a significant enough use case to micro-optimize for.

The precise result we need is that the GM division implemented by FLINT's NMOD_RED2 macro is correct for all inputs $0 \le x < B^2$ and moduli $n = B/2 + k$ where $0 \le k \le \lfloor \sqrt{B / 16 } \rfloor$, for any binary word radix $B$. The upper bound here is tight; one can easily find counterexamples for $k = \lfloor \sqrt{B / 16 } \rfloor + 1$. Courtesy of Bill Hart, here is an automatically generated proof by Gemini 3.1 Pro, and a cleaned-up version by GPT 5.4.

I have reviewed this proof as carefully as I would review a human-made proof, and it looks correct. I've sanity-checked it by stepping through the algorithm for various inputs and verifying the case distinctions described in the proof. There is one small potential point of confusion: the variable $q_1$ in the proof refers to an unwrapped value, while the variable q1 in the code actually represents $q_1 \bmod B$ with wraparound, and the proof does its case distinctions with respect to the unwrapped value. The proof is nevertheless correct; one just has to keep this distinction in mind when matching it against the code.

In combination with the observation that the code works as intended for billions of randomly generated inputs, and a brute-force verification of the lemma for radixes $B = 2^b$ with $b \le 17$, I'm now confident enough to use this optimization in FLINT. It turned out to give a 2% speedup for FFT polynomial multiplication in certain ranges (just to give an example; this modular reduction is also used elsewhere), which is not negligible for a core routine.
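
The brute-force check is easy to replicate for toy radixes with a standalone program like the following. This is my own harness, with the division rendered as the textbook Möller-Granlund steps as I read them from NMOD_RED2 for normalized moduli (treat that correspondence as an assumption); if the tightness claim holds at this radix, it should print "ok" for $k \le 16$ and "FAIL" for $k = 17$.

#include <stdint.h>
#include <stdio.h>

/* Exhaustively check the GM remainder computation for a toy radix
   B = 2^b, with the high input word deliberately left unreduced mod n.
   All intermediate arithmetic is masked to b bits to mimic machine
   wraparound, including the wrapped q1 mentioned above. */
int gm_red_ok(int b, uint64_t k)
{
    uint64_t B = UINT64_C(1) << b, M = B - 1;
    uint64_t n = B / 2 + k;
    uint64_t v = (B * B - 1) / n - B;   /* precomputed reciprocal */
    uint64_t x1, x0;

    for (x1 = 0; x1 < B; x1++)          /* high word, possibly >= n */
        for (x0 = 0; x0 < B; x0++)
        {
            uint64_t q = v * x1 + (x1 << b) + x0;   /* exact for b <= 31 */
            uint64_t q1 = ((q >> b) + 1) & M;       /* wrapped high word */
            uint64_t q0 = q & M;
            uint64_t r = (x0 - q1 * n) & M;
            if (r > q0) r = (r + n) & M;
            if (r >= n) r -= n;
            if (r != ((x1 << b) + x0) % n)
                return 0;               /* counterexample found */
        }
    return 1;
}

int main(void)
{
    int b = 12;    /* B = 4096, so floor(sqrt(B/16)) = 16 */
    uint64_t k;
    for (k = 0; k <= 17; k++)
        printf("k = %llu: %s\n", (unsigned long long) k,
               gm_red_ok(b, k) ? "ok" : "FAIL");
    return 0;
}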

The proof involves only elementary inequalities but has many steps. I could surely have proved the result by hand, but it might have taken me a few days of work with possible false starts, distractions, double checking and fixing my own mistakes, and writing things up carefully. The result is too uninteresting to publish and the proof does not really yield insight into anything, so it's terrific if this kind of work can be automated.

When asked to come up with proofs, AIs still seem to hallucinate often enough that independent verification remains necessary. Ideally, we'd just have the AI generate a completely formal proof which can be checked by something like Lean. Various other people have discussed doing formal verification of parts of FLINT for a while now, but I'm not sure what the latest progress is on that front.

Implementing algorithms

This is where things get interesting: can we use AI for new feature development in FLINT? A good place to start experimenting seems to be functions which I know how to write myself (so that I can review the generated code with high confidence) but where it would be time-consuming to actually do so (so that we're actually getting some benefit out of it). A good candidate was the Kinoshita-Li algorithm for composition of power series in near-linear time, a breakthrough complexity result from just two years ago which is not yet used in any computer algebra system but clearly ought to be in FLINT. The algorithm is pretty simple, but somewhat fiddly to implement efficiently as it involves translating back and forth between univariate and bivariate polynomials of different shape. There are also some special power series compositions I've wanted to improve for a long time, namely trigonometric function evaluations.

In the span of a few days, I worked with Claude to do the following which have now all been merged in FLINT (PR #2658, #2663, #2664, #2672):

I prompted Claude to write most of the algorithm code, though I did some final editing by hand. I did not use Claude for profiling, final test code, wrappers, or high-level algorithm selection (ensuring that "default" functions like gr_poly_compose_series and gr_poly_tan_series automatically choose an appropriate algorithm depending on the size and type of input).

In the interest of science, here are the full chats with Claude (my own mistakes and tangents included):

Since these chats are quite long, I'm going to attempt to summarize what happened and provide some highlights.

Kinoshita-Li power series composition

My initial prompt for the Kinoshita-Li algorithm was as follows:

Your job is to write a FLINT implementation of the quasilinear-time algorithm for power series composition discussed in https://github.com/flintlib/flint/issues/1911

The algorithm itself was published in https://arxiv.org/pdf/2404.05177.pdf. There appears to be a C++ implementation at https://github.com/hly1204/library/blob/master/fps_composition.hpp and also a Sage implementation in the file composition.zip linked in the issue.

The implementation should work with FLINT's generic rings, be in the style of existing FLINT gr_poly methods, and have an interface similar to https://github.com/flintlib/flint/blob/main/src/gr_poly/compose_series_brent_kung.c. Note that there should be a "non-underscore method" which is a simple wrapper around the "underscore method" working on raw coefficient arrays, where the actual action happens.

To compose in R[x] / x^n where we are given a context object ctx representing the coefficient ring R, the algorithm crucially involves bivariate polynomial arithmetic in R[x][y] which needs to be asymptotically fast. One option would be to work in (R[x])[y] with gr_poly coefficients (gr_ctx_init_gr_poly(ctx2, ctx) creates a context object for such polynomials). FLINT already implements fast multiplication of such polynomials using Kronecker substitution (KS); see https://github.com/flintlib/flint/blob/main/src/gr_poly/mullow_bivariate_KS.c.

However, it will probably be more efficient to avoid creating recursive objects; instead, implement the algorithm entirely using 1D arrays of R-coefficients where you stride and zero-pad appropriately for bivariate KS multiplication with a single call to  _gr_poly_mul (or _gr_poly_mulmid if you want truncations).
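To make the flattening in the last paragraph concrete, the packing step in isolation looks something like this (an illustrative sketch with plain word coefficients, not the generated code):

#include <flint/flint.h>

/* A bivariate polynomial sum_{i < m, j < d} c[j*m + i] x^i y^j is mapped
   to a univariate one by the substitution y = x^s with s = 2m-1. Since
   each product of x-blocks has degree < s, blocks in the univariate
   product cannot overlap, so a single _gr_poly_mul call performs the
   bivariate multiplication. The output array u has length (d-1)*s + m. */
void pack_bivariate(ulong * u, const ulong * c, slong m, slong d)
{
    slong i, j, s = 2 * m - 1;
    for (j = 0; j < d; j++)
    {
        for (i = 0; i < m; i++)
            u[j * s + i] = c[j * m + i];
        if (j < d - 1)
            for (i = m; i < s; i++)
                u[j * s + i] = 0;   /* zero padding between blocks */
    }
}

Unpacking after the multiplication simply reads the blocks back out at the same stride.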

Claude's first effort had the right structure, but failed the tests. When I asked Claude to debug, it asked me to share the proof-of-concept Sage code by Vincent Neiger for reference; given this, Claude was able to identify and fix the bugs, producing a correct version which I could benchmark and which clearly exhibited the expected asymptotic speedup over Brent-Kung. After that, I worked with Claude to make several changes before preparing the final version:

Claude was then able to translate the algorithm from gr_poly to fmpq_poly without much difficulty. Note that this is not just a trivial renaming operation as the data structures are different (fmpq_poly represents a polynomial over $\mathbb{Q}$ as an integer polynomial with a common denominator, not as a vector of rational coefficients).
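
For instance, $x/2 + x^2/3$ is stored as the integer polynomial $3x + 2x^2$ together with the common denominator $6$, which coefficient-level code has to track explicitly. A tiny illustration (my own, assuming the standard fmpq_poly interface):

#include <flint/fmpq.h>
#include <flint/fmpq_poly.h>

/* f = x/2 + x^2/3 is held internally as (3x + 2x^2)/6: one integer
   coefficient vector {0, 3, 2} plus a single denominator 6. */
int main(void)
{
    fmpq_poly_t f;
    fmpq_t c;
    fmpq_poly_init(f);
    fmpq_init(c);
    fmpq_set_si(c, 1, 2);
    fmpq_poly_set_coeff_fmpq(f, 1, c);   /* f = x/2 */
    fmpq_set_si(c, 1, 3);
    fmpq_poly_set_coeff_fmpq(f, 2, c);   /* f = x/2 + x^2/3 */
    fmpq_poly_print_pretty(f, "x");
    flint_printf("\n");
    fmpq_poly_clear(f);
    fmpq_clear(c);
    return 0;
}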

Timings for power series composition

I can now present some real performance results for Kinoshita-Li composition, verifying that the algorithm actually performs as expected. There are more benchmark results in the PRs linked above.

The following timings (in seconds) are for composing two random power series of length $n$ over $\mathbb{Z} / p \mathbb{Z}$, $p = 2^{63} + 29$, with the Brent-Kung baby-step giant-step algorithm as a baseline:

n         Brent-Kung    Kinoshita-Li    Speedup
100       7.79e-05      0.000162        0.481
200       0.000285      0.000347        0.821
400       0.000869      0.000839        1.036
800       0.00254       0.00193         1.316
1600      0.00763       0.00434         1.758
3200      0.0238        0.00974         2.444
6400      0.0776        0.0218          3.560
12800     0.257         0.0487          5.277
25600     0.882         0.115           7.670
51200     3.036         0.266           11.414
102400    10.82         0.646           16.749
204800    37.76         1.556           24.267
409600    140.573       3.736           37.627
819200    497.694       10.274          48.442

The speedup is excellent. Also, Brent-Kung runs out of RAM around $n = 10^6$ while Kinoshita-Li is far more memory-efficient.

It works well with multiprecision and algebraic coefficients too. The following timings are for computing $f(f(x))$ over $\mathbb{Z}$ where $f = \sum_{k \ge 1} F_k x^k$ is the Fibonacci number generating function:

n         Brent-Kung    Kinoshita-Li    Speedup
100       0.000298      0.000443        0.673
200       0.00243       0.00284         0.856
400       0.0153        0.0144          1.062
800       0.108         0.0683          1.581
1600      0.682         0.339           2.012
3200      4.651         1.716           2.710
6400      33.263        8.462           3.931
12800     266.59        43.33           6.152

The following timings are for computing $f(f(x))$ where $f = \log(1+x)$ using arb with 3333-bit precision. The rightmost columns show the computed enclosure of the last coefficient with the respective method.

n       Brent-Kung   Kinoshita-Li   Speedup   [x^(n-1)] (Brent-Kung)     [x^(n-1)] (Kinoshita-Li)
100     0.0175       0.023          0.761     [5e+17 +/- 1.53e-984]      [5e+17 +/- 1.61e-984]
200     0.0621       0.0575         1.080     [2e+37 +/- 1.23e-964]      [2e+37 +/- 1.17e-964]
400     0.225        0.154          1.461     [8e+76 +/- 8.11e-925]      [8e+76 +/- 7.63e-925]
800     0.827        0.39           2.121     [2e+156 +/- 3.75e-845]     [2e+156 +/- 3.55e-845]
1600    2.538        1.008          2.518     [2e+315 +/- 8.40e-686]     [2e+315 +/- 8.05e-686]
3200    8.114        2.715          2.989     [5e+633 +/- 4.32e-367]     [5e+633 +/- 4.20e-367]
6400    27.16        8.036          3.380     [8e+1270 +/- 1.18e+271]    [8e+1270 +/- 1.15e+271]
12800   100.991      24.848         4.064     [3e+2545 +/- 8.82e+1545]   [3e+2545 +/- 8.76e+1545]

A significant observation is that Kinoshita-Li seems to be as numerically stable as Brent-Kung. This was not clear a priori; some "asymptotically fast" algorithms actually turn out to be useless numerically, but that is fortunately not the case here.

FLINT now uses Kinoshita-Li for power series composition and reversion for all types for sufficiently large input. Brent-Kung is still used for small input where it is faster. The cutoff between Brent-Kung and Kinoshita-Li depends on the type but usually lies somewhere between 100 and 1000, which seems pretty reasonable given the asymptotics.
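
From the user's perspective this dispatch is transparent. A typical call through the generics interface looks like the following sketch (my own; double-check the exact signatures against the current headers):

#include <flint/gr.h>
#include <flint/gr_poly.h>

/* Compose h = f(g) mod x^n over Z/pZ; gr_poly_compose_series picks
   Brent-Kung or Kinoshita-Li automatically depending on the size. */
int main(void)
{
    gr_ctx_t ctx;
    gr_poly_t f, g, h;
    slong n = 1000;
    int status = GR_SUCCESS;

    gr_ctx_init_nmod(ctx, UWORD(9223372036854775837));   /* p = 2^63 + 29 */
    gr_poly_init(f, ctx);
    gr_poly_init(g, ctx);
    gr_poly_init(h, ctx);

    status |= gr_poly_set_coeff_si(f, 0, 1, ctx);   /* f = 1 + x + 2x^2 */
    status |= gr_poly_set_coeff_si(f, 1, 1, ctx);
    status |= gr_poly_set_coeff_si(f, 2, 2, ctx);
    status |= gr_poly_set_coeff_si(g, 1, 3, ctx);   /* g = 3x + x^2 */
    status |= gr_poly_set_coeff_si(g, 2, 1, ctx);   /* (g(0) must be 0) */

    status |= gr_poly_compose_series(h, f, g, n, ctx);

    gr_poly_clear(f, ctx);
    gr_poly_clear(g, ctx);
    gr_poly_clear(h, ctx);
    gr_ctx_clear(ctx);
    return status == GR_SUCCESS ? 0 : 1;
}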

Trigonometric functions of power series

For the trigonometric functions, I first prompted Claude to work out the basecase recurrence relations for tangent coefficients, which it did without difficulty (surely having seen the problem before in the mathematical literature). I then showed it _gr_poly_exp_series_basecase and asked it to implement the tangent recurrence analogously. Claude's first version was correct but missed a symmetry optimization and needlessly recomputed some coefficients instead of storing them in an array; it optimized both aspects when I pointed them out.
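
For reference, the recurrence in question can be derived from the ODE $g' = f'(1 + g^2)$ satisfied by $g = \tan(f)$ when $f(0) = 0$: writing $h = g^2$ and comparing coefficients of $x^{k-1}$ gives $k g_k = k f_k + \sum_{j=1}^{k-1} j f_j h_{k-j}$, where $h_m = \sum_{i=0}^{m} g_i g_{m-i}$ and the symmetry of the terms $g_i g_{m-i}$ is what halves the work in accumulating $h$. (This is my sketch of the derivation; Claude's formulation may have differed in details.)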

My next goal was to optimize the Newton iteration for tangents. My initial prompt was as follows:

Attached is the FLINT implementation of the tangent of a power series implemented using Newton iteration. Some relevant helper functions are included in the same file for reference. I would like you to optimize _gr_poly_tan_series_newton: currently in each Newton step, the arctangent integral(f'(x)/(1+f(x)^2)) is computed from scratch. Instead, compute it incrementally. That is, compute the underlying power series inverse Q = 1/(1+f(x)^2) along with the tangent; when extending the precision of the tangent from m to n, do a Newton step to extend the precision of Q from m to n as well, then do a mulmid and a partial integral to extract only the high n - m coefficients of the arctangent actually needed to update the tangent.
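
For context, the underlying Newton step inverts the arctangent: if $t$ agrees with $\tan(u)$ through $m$ coefficients, then $t - (\operatorname{atan}(t) - u)(1 + t^2)$ agrees through $2m$ coefficients, where $\operatorname{atan}(t) = \int t'/(1 + t^2)$. Since $\operatorname{atan}(t) - u$ has valuation at least $m$, only the high coefficients of the arctangent are ever needed, which is what the incremental scheme exploits.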

This got rather interesting: Claude's first version had multiple bugs and it took several prompts to get it to sort them all out. Claude also missed some algebraic optimizations that I had to point out. Then I asked Claude if it could find any further optimizations to the algorithm, looking at _gr_poly_exp_series_newton for ideas. This led Claude to propose variations which were both buggy and slower. I also asked it to generalize the code to allow computing the tangent variants (tanh, cot, coth, tan_pi, tanh_pi). It almost got there, but was unable to figure out the cause of a sign error in the cotangent. After an hour or so of Claude proposing incorrect fixes and producing a long stream of analysis in which it mostly seemed to confuse itself in a loop, I just fixed the sign error myself. When finally shown the fix, Claude at least seemed to "get it".

Implementing Newton iteration for sine and cosine went a bit more smoothly. My first prompt was as follows:

Now consider the problem of computing sine and/or cosine of a power series. One way is to apply Newton iteration to the arcsine/arccosine, but those functions involve a square root of a power series which is a bit more expensive than a reciprocal. Another way to do it is by first computing the tangent, then doing a division (sin(f) = 2u/(1+u^2) where u = tan(f/2) etc). This costs an extra division. Yet another idea is to compute exp(f*i) where i = sqrt(-1), but this may require a ring extension; we would like to stay with real numbers when f is real, for example. Can you derive an efficient simultaneous Newton iteration for cos and sin that doesn't require a ring extension, either starting from exp(f*i) and simplifying things algebraically, or working from the ODE/functional equations for sine/cosine and reasoning by analogy with the ordinary exponential function?

This led Claude very correctly to work out a version of the Newton iteration for $\exp(if)$ representing the sine and cosine "algebraically" without use of complex coefficients. Claude's first implementation was buggy, but it took just a few prompts to fix it. Claude had no difficulty implementing the Karatsuba multiplication trick, and translating the gr_poly version to fmpq_poly also went pretty smoothly.
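
Concretely (my own rendition; the merged code may differ in details), the iteration solves $\log e = if$ for $e = \exp(if)$ via the Newton step $e \leftarrow e(1 + if - \log e)$. Writing $e = c + is$ with real series $c, s$, we have $\log(c + is) = \tfrac{1}{2} L + i g$ where $L = \log(c^2 + s^2)$ and $g = \int (c s' - s c')/(c^2 + s^2)$, so the update becomes $c \leftarrow c(1 - \tfrac{1}{2} L) - s(f - g)$ and $s \leftarrow s(1 - \tfrac{1}{2} L) + c(f - g)$, entirely over the real coefficient ring. The final pair of products is a complex multiplication in disguise, which the Karatsuba trick performs with three real multiplications instead of four.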

Finally integrating all this code in FLINT (with appropriate cutoffs between the various algorithms) resulted in some nice speedups for trigonometric power series functions in FLINT in many cases; see the PRs linked above for various timing examples.

Verdict

Overall, I've found some aspects of Claude's performance very impressive. It is able to perform tasks which require mixing mathematical reasoning with low-level implementation details, sometimes at the level of an experienced developer. It is able to perform tasks requiring complex reasoning by analogy along the lines of "here is algorithm A to compute function F for type T; design and implement an analogous algorithm B to compute function G for type U". It seems to have no problem understanding and adapting to FLINT's coding conventions.

Claude does make mistakes of every kind: sign errors, attempting to call functions that don't exist, getting sizes mixed up, doing operations in the wrong order, doing redundant operations, forgetting to update comments. But to be fair, I make all the same mistakes myself, and I only get things right by meticulously testing and self-reviewing. Interestingly, Claude seems quite good at finding bugs in its own code when one simply points out their existence, so it could conceivably perform more reliably with some kind of improved self-review feedback loop. This may simply be a question of limited resources; I've just been trying out the €20/month version of Claude, and I did hit usage limits during my experiments.

Despite the long sessions reviewing and debugging Claude's output, I think I spent less time implementing the new functions using Claude than I would have spent doing everything myself. Perhaps half the time? I spent as much time testing, profiling and integrating the final code in FLINT as I would have done normally, but certainly leaving some of the most technical work to the machine feels more "ergonomic".

I will probably try more experiments along these lines in the near future. In addition, I'm curious to examine how AIs perform for the following tasks:

I have a hunch that letting the AI loose on very broad and vaguely defined missions ("refactor all of FLINT") is going to result in more noise than signal; my experience so far is that it helps to formulate a specific task where I have a decent idea in advance of what the result ought to look like.

Does FLINT require a specific AI policy going forward? Currently, I don't think so. Reviewing AI-written code doesn't seem fundamentally different from reviewing human-written code, and FLINT has so far been spared from waves of "AI slop" that have hit higher-profile projects. In fact, I would encourage people interested in using or developing for FLINT to try AI assistance. The latest models seem to understand the FLINT API and internals well enough to help out with all the C boilerplate and memory management, if nothing else. I would just recommend reviewing and testing any generated code before sharing it.


