Introduction
Benchmarks are the currency of machine learning research. A model that achieves state of the art on a benchmark receives citations, press coverage, and occasionally a well-funded startup. A model that performs poorly on a benchmark is either a research failure or evidence that the benchmark is flawed, depending on whether the model was developed by the same group that designed the benchmark. This paper is concerned with a third possibility: the model is performing well on the benchmark because it has seen the benchmark before, which raises the question of whether it is performing well at all.
We introduce the term “benchmark contamination” to describe the phenomenon of test data appearing in training data, though we acknowledge that this term has been introduced approximately forty-seven times in the past five years by researchers who apparently did not benchmark-check each other’s literature reviews. Our contribution is to perform the most comprehensive contamination analysis to date, covering 127 benchmarks across twelve task categories, and to report results that are, even by the standards of this field, quite bad.
The remainder of this paper is structured as follows. Section 2 reviews related work, which we are confident we have read, though we acknowledge that some of the citations may have been hallucinated by our writing assistant. Section 3 describes our methodology. Section 4 presents results. Section 5 discusses implications. Section 6 concludes.
Methodology
Our contamination detection pipeline, BenchmarkSniffer v1.0, operates as follows. Given a benchmark test set $\mathcal{T} = \{t_1, t_2, \ldots, t_n\}$ and a pretraining corpus $\mathcal{C}$, we compute the longest common substring between each $t_i$ and any document in $\mathcal{C}$ using a sliding window of 32 tokens. A test example is flagged as contaminated if its overlap fraction exceeds a threshold $\tau$, which we selected by running our method on the validation set until we got a number we could plausibly defend. We set $\tau = 0.7$. Sensitivity analyses with $\tau \in \{0.5, 0.6, 0.8\}$ all produced results we preferred less.
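The detection step above can be sketched as follows. This is a minimal illustration of a sliding-window overlap check, not the BenchmarkSniffer implementation itself: the whitespace tokenizer, the set-based window lookup, and the function names are simplifying assumptions; only the window size of 32 tokens and the threshold $\tau = 0.7$ come from the paper.

```python
# Sketch of a sliding-window contamination check. Window size (32 tokens)
# and tau (0.7) follow the paper; everything else is an illustrative
# simplification, not the authors' exact pipeline.

def windows(tokens, n=32):
    """All contiguous n-token windows of a token sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(test_tokens, corpus_windows, n=32):
    """Fraction of the test example's n-token windows found in the corpus."""
    ws = windows(test_tokens, n)
    if not ws:
        return 0.0
    hits = sum(1 for w in ws if w in corpus_windows)
    return hits / len(ws)

def is_contaminated(test_tokens, corpus_windows, n=32, tau=0.7):
    """Flag a test example whose corpus overlap fraction exceeds tau."""
    return overlap_fraction(test_tokens, corpus_windows, n) > tau
```

In practice the corpus windows would be hashed rather than stored as raw tuples, since materializing every 32-gram of a pretraining corpus in memory is not feasible; the set here stands in for that index.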
We applied BenchmarkSniffer to 127 benchmarks and three pretraining corpora: Common Crawl (2019–2023), The Pile, and a corpus labeled “misc_internet_text_do_not_use_v3.jsonl” that we found on a shared lab server and elected not to investigate further.
Results
Of the 127 benchmarks analyzed, 119 (93.7%, rounded to 94% in the abstract for rhetorical impact) contained at least one contaminated example. Of the eight apparently clean benchmarks, three were synthetic datasets we generated ourselves during this study and therefore could not have appeared in training data. Of the remaining five, four have been released since January 2026 and one is a benchmark for evaluating contamination detection methods, which we find philosophically interesting.
Models evaluated on the contaminated benchmarks achieved a mean accuracy of 84.3%. The same models, evaluated on our clean held-out re-evaluation set constructed from the same task distributions, achieved a mean accuracy of 51.2%, which is marginally above chance for most tasks. We note that 51.2% is, technically, above chance, and we encourage the reader to weigh this carefully before describing the field’s progress as nonexistent.
Discussion
These results suggest that a substantial fraction of what we have been calling “generalization” in language models is better described as “remembering the test answers.” We do not think this is anyone’s fault in particular, except possibly the fault of the researchers who built the benchmarks, trained on the internet, and evaluated without checking, which is to say, everyone.
We considered recommending specific remedies, including dynamic benchmark generation, held-out corpora with restricted access, and mandatory contamination audits. We have not implemented any of these for our own benchmark and do not intend to.
References
- Reviewer #2 (2024). “Your Paper Is Terrible.” Journal of Rejected Submissions, 1(1), pp. 1-1. https://doi.org/10.0000/rejected.2024.001
- Nobody, N. (2023). “I Didn’t Read This Either.” Proceedings of Things I Skimmed, 42, pp. 404-404.
- Someone, A., et al. (2022). “Related Work We Didn’t Cite On Purpose.” IEEE Trashactions, 1(1), pp. 1-99.
- Benchmark, B., & Contamination, C. (2021). “Preliminary Evidence That Our Benchmarks Are Fine.” Workshop on Benchmarks We Haven’t Checked Yet, pp. 1-4.
- Test, T., et al. (2020). “There Is No Problem.” Proceedings of the International Conference on Reassurance, pp. 200-200.
Submit your response to this paper — provided it has been reviewed, revised, rejected, re-reviewed, and reconsidered.