I Tried to Reproduce My Own Paper. It Took Three Years.

Irreproducible

I tried to reproduce my own results last month and failed.

This is funnier than it sounds, which I acknowledge is not very funny. I had published a paper fourteen months prior demonstrating that our method achieved state-of-the-art performance on three standard benchmarks. The paper was accepted. The reviewers were, on balance, positive. The code was made available on GitHub, where it has accumulated 47 stars and three issues, all opened by the same user, asking the same question, in progressively less polite language.

I returned to the code to run an extended experiment for a follow-up paper. The experiment did not reproduce.

Investigation

I spent three days determining why. The findings:

Finding 1: The random seed I had used was not the seed specified in my paper. This was because the paper was written after the experiments, and I had used a different seed in the draft that became the camera-ready version. The seed in the paper produces worse results. I do not know when this divergence occurred.
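This failure mode is avoidable if the seed is recorded by the experiment itself rather than transcribed into the paper by hand. A minimal sketch (the experiment function and metric are placeholders, not the paper's actual code) that writes the seed next to the result it produced:

```python
import json
import random

def run_experiment(seed: int) -> dict:
    """Stand-in for a real training run; only the seeding matters here."""
    random.seed(seed)
    score = random.random()  # placeholder metric
    # Record the seed alongside the result, so the number that lands in
    # the paper cannot silently diverge from the seed that produced it.
    return {"seed": seed, "score": score}

record = run_experiment(seed=42)
with open("result.json", "w") as f:
    json.dump(record, f)
```

If the draft and the camera-ready version both copy their numbers from `result.json`, there is no point at which a different seed can sneak in.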

Finding 2: One of my preprocessing steps depended on a library version that had since been updated. The update changed a default parameter. This change is documented in the library’s changelog, which I had not read, because I was busy not reading it.
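A version drift like this is at least detectable at runtime. A sketch of a guard that compares installed package versions against the ones the experiments were run with (the pinned package and version here are hypothetical examples, not the actual dependencies):

```python
import importlib.metadata
import sys

# Hypothetical pins; in practice, fill these from the environment the
# original experiments actually ran in (e.g. `pip freeze` at submission time).
EXPECTED = {"numpy": "1.24.2"}

def check_versions(expected: dict) -> list:
    """Return a list of mismatch messages instead of failing silently."""
    problems = []
    for package, wanted in expected.items():
        try:
            installed = importlib.metadata.version(package)
        except importlib.metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed (wanted {wanted})")
            continue
        if installed != wanted:
            problems.append(f"{package}: {installed} installed, {wanted} expected")
    return problems

for msg in check_versions(EXPECTED):
    print("WARNING:", msg, file=sys.stderr)
```

This does not stop a changed default parameter from changing your results, but it does force you to read the changelog you were busy not reading.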

Finding 3: The baseline I compared against — a method from a 2022 paper, labeled in my work as “the current state of the art” — had, by the time of my follow-up experiment, been superseded by four other methods, two of which were published in the same proceedings as my original paper and which I had not cited because I submitted before the proceedings were finalized and had not updated my related work since.

Implications

None of this is unusual. This is a precise description of normal scientific behavior in a field that produces several thousand papers per month and moves faster than its own literature review process. We are all, collectively, publishing into a river that is moving faster than we can observe it.

The reproducibility crisis is not a crisis in the sense of an acute emergency requiring intervention. It is a chronic condition, like a structural deficiency in a building that has been occupied for decades: the building continues to function; people continue to use it; the deficiency is documented in several reports that were circulated and then filed.

A Modest Proposal

I propose that we add a new field to every paper submission form: “Will the corresponding author still be at this email address in 18 months?” The answer is almost never yes. The data is almost never available. The code is almost never runnable.

I also propose that we celebrate papers that fail to reproduce their own results, on the grounds that discovering this failure is itself a contribution. I have a paper on exactly this topic, which I am preparing for submission. Its results are preliminary. I have not yet attempted to reproduce them.

The author’s code is available at a GitHub repository that will return a 404 error by the time you read this.