Citing papers in software
June 22, 2021
Acknowledging previous work is a fundamental part of scientific communication, and there is growing awareness of the importance of properly citing software in research papers (see, for example, Nature; The Journal of Open Source Software; swMATH; Software citation principles).
However, there is also a reverse problem, which I don't see being discussed nearly as much: properly citing research papers in software. I will not mention any names here, but I have run across projects that provide virtually no explicit pointers to the literature, despite clearly building on algorithms, theorems, heuristics, implementation techniques and application considerations from hundreds of papers (and indeed, from other software too). I've been frustrated sufficiently many times by such code that this topic deserves a mini-rant.
Why and how to provide references
There seem to be two schools of thought when it comes to commenting and documenting software. The literate programming school holds that software should be written like books, where any code is fully explained using surrounding natural language narrative. The opposite school holds that code should be self-documenting (having logical structure, following consistent conventions, using clear type signatures, and so on), with clarifying comments thrown in only exceptionally. Regardless of which school you follow, providing relevant references to external literature in comments and/or documentation is good practice. (If you have a religious aversion to verbose documentation, you can think of it this way: someone already did the job of spelling out the details somewhere else, so a citation is the most economical way to provide the same information!)
If your software uses the results of paper X (or software X), what are the benefits of explicitly citing X in your code or documentation?
- First of all, it benefits users and developers of your software, because it helps them understand how the software works and it can serve as a starting point for further research.
- Second, it benefits people who are interested in the general domain. If they search for X, they may find your citation and discover your software (and other references therein) as a result.
- Third, it benefits the authors of X since they get the recognition they deserve.
- Fourth, it will save your future self some trouble when you go back to the code years later and need to figure out what the heck it does.
Any form of citation is better than no citation, but the form also matters, especially as far as points two and three above are concerned. A brief comment in the source code like
/* using an algorithm by Brent */
may be enough for developers and inquisitive users; I can probably figure out which Brent and which paper with a bit of digging. An explicit citation like
/* Implements the algorithm in Section 5 in R. P. Brent, Fast multiple-precision evaluation of elementary functions, Journal of the ACM 23 (1976), 242-251. https://doi.org/10.1145/321941.321944 */
is better. Even better is if the citation appears in publicly visible documentation, where casual users can find it easily and where it can be discovered through search engines. Scientific software with excellent documentation will typically provide explicit references on a function-by-function basis (though providing references at the level of classes or modules may be appropriate in some circumstances). Best of all is if the citation also appears in a formal bibliography section in a PDF, either as part of a self-published reference manual or in a formal paper about the software. This makes it more likely that the citations will be picked up by indices like Google Scholar. Software papers published in standard journals are probably the most reliable place to put citations for this purpose, though such documents typically have the disadvantage of only providing static snapshots whereas self-published software manuals can be updated with new references as the software evolves.
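To make the function-by-function idea concrete, here is a minimal sketch in Python of a reference placed in a docstring, where documentation generators and search engines can pick it up. The function name kahan_sum and the "References" heading are my own choices for illustration, not a convention taken from any particular project; the cited paper is the usual reference for compensated summation.

```python
def kahan_sum(values):
    """Sum floating-point numbers using compensated (Kahan) summation.

    References
    ----------
    W. Kahan, "Further remarks on reducing truncation errors",
    Communications of the ACM 8 (1965), no. 1, 40.
    https://doi.org/10.1145/363707.363723
    """
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for x in values:
        y = x - compensation            # apply the correction from the previous step
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost
        total = t
    return total
```

Because the citation lives in the docstring rather than in an ordinary comment, it shows up in help(kahan_sum) and in any generated reference manual without extra work.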
Why software often lacks citations
I suppose citations in source code and software documentation tend to be omitted mostly due to a combination of the following factors:
- The general tendency to neglect software documentation (out of laziness or because of lack of incentives to write good documentation).
- Viewing literature as "folklore", implicitly assuming (often quite incorrectly) that anyone who is interested in the software already understands the theory and knows where to find the references.
- Considering software disjoint from or less formal than other forms of scientific communication, and thus not subject to standards that apply e.g. to journal papers.
The third viewpoint seems to be unfortunately prevalent. For example, the Academic Integrity at MIT handbook says the following:
Whenever you take information from a source, whether that source is published on paper, presented in a lecture or broadcast, or made available online, you must tell your reader where the information came from: that is, you must cite your source.
In writing a computer program, it means: You use comments to credit the source of any code you adapted from an open source site or other external sources. Generally, providing a URL is sufficient. You also need to follow the terms of any open source license that applies to the code you are using.
The omission here is rather remarkable: according to the MIT handbook, when it comes to software, the only thing that matters is providing attribution when you copy source code; the handbook makes absolutely no mention of a need for attribution when copying (or building on) ideas in software. This seems to reflect the antiquated view of software and source code as mechanical artifacts (instructions for the machine to execute) and not as a medium in its own right for communicating scientific ideas between humans.
This also touches on the larger cultural problem of software being under-appreciated in academia. Since software documentation counts far less than (say) Nature publications in the academic ranking game, there are few external incentives to invest time in documenting software well. Journal and conference papers about software are becoming increasingly common, but as mentioned above, they typically become outdated as the software develops. Moreover, such papers often give high-level overviews without being able to go into detail, especially regarding implementation details. Unfortunately, there are very few good publication venues (outside of self-published manuals) for discussing software implementation; at least in my experience, "describes implementation details" is a standard reason for rejecting papers, not for accepting them.
Deliberate fraud (i.e. purposefully omitting references to give the impression that something is your own invention) could be added to the list, but this is probably comparatively rare. If you find some poorly-documented software and contact the developers, they will probably be happy to tell you that such-and-such function actually uses this-and-that result, but that there just weren't enough hours in the day, joules in the laptop battery or empty lines in the source file left to write it down.
My own track record
I'm personally quite guilty of under-citing, though I have become more concerned about the issue over time. When I started mpmath in 2007, I did not care about references at all, in large part because it was just a hobby project and not part of any academic work. The bibliography in the mpmath documentation has some 20 references. This covers some essentials, but considering the scope and variety of functions and algorithms in mpmath, this bibliography is probably too small by an order of magnitude. If mpmath were a paper or a book being submitted for publication today, any competent editor ought to reject it outright!
My more recent projects hold up better: the bibliography for Arb has some 70 entries; this still omits a lot of direct references, but it does include self-citations for some 15 conference and journal papers I have written about algorithms in Arb, and these papers provide more detailed references and attribution.
In short, if we expect software to be a first-class citizen of science, then high standards of scientific communication (formal attribution of previous work) should apply to the software itself. The burden here is ultimately on the developers, but other parties may be able to help by improving the incentives for good documentation.
The sheer hassle of writing down sources should not be underestimated. This is usually my least favorite part of writing papers, and not much fun when documenting software either. Better tooling could help. For example, I could imagine having a tool that automatically detects citations in source code comments (including references that are not in a particular format) and builds a bibliography. Such tools could even be run to build indices across multiple git repositories without requiring expensive per-project documentation maintenance. However, this presupposes that developers at least make the effort to put in citations in some form in the first place.
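As a very rough sketch of what the simplest version of such a tool might look like, the Python below collects DOIs from source files and records which files mention them. Everything here is an assumption made for illustration: the names find_dois and collect_dois, the DOI regular expression, and the default file suffixes are all hypothetical. It also takes two shortcuts: it scans the whole file text rather than parsing comments, and it only recognizes DOIs, so free-form references like "an algorithm by Brent" would need something much smarter.

```python
import re
from pathlib import Path

# Heuristic pattern for modern DOIs: "10.", a registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"'<>]+")


def find_dois(text):
    """Return the sorted, de-duplicated DOIs found anywhere in text."""
    found = set()
    for match in DOI_PATTERN.finditer(text):
        # Drop trailing punctuation and comment terminators such as "*/".
        found.add(match.group(0).rstrip(".,;:)*/"))
    return sorted(found)


def collect_dois(root, suffixes=(".c", ".h", ".py")):
    """Map each DOI to the list of files under root that mention it."""
    index = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            for doi in find_dois(text):
                index.setdefault(doi, []).append(str(path))
    return index
```

Running find_dois on the explicit Brent comment quoted earlier picks out 10.1145/321941.321944, which could then be resolved to full bibliographic metadata by some other step; the brief "algorithm by Brent" comment yields nothing, which is exactly the gap a better tool would have to close.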