Introduction
Benchmarks are the currency of machine learning research. A model that achieves state of the art on a benchmark receives citations, press coverage, and occasionally a well-funded startup. A model that performs poorly on a benchmark is either a research failure or evidence that the benchmark is flawed, depending on whether the model was developed by the same group that designed the benchmark. This paper is concerned with a third possibility: the model is performing well on the benchmark because it has seen the benchmark before, which raises the question of whether it is performing well at all.
We introduce the term “benchmark contamination” to describe the phenomenon of test data appearing in training data, though we acknowledge that this term has been introduced approximately forty-seven times in the past five years by researchers who apparently did not benchmark-check each other’s literature reviews. Our contribution is to perform the most comprehensive contamination analysis to date, covering 127 benchmarks across twelve task categories, and to report results that are, even by the standards of this field, quite bad.
The remainder of this paper is structured as follows. Section 2 reviews related work, which we are confident we have read, though we acknowledge that some of the citations may have been hallucinated by our writing assistant. Section 3 describes our methodology. Section 4 presents results. Section 5 discusses implications. Section 6 concludes.
Methodology
Our contamination detection pipeline, BenchmarkSniffer v1.0, operates as follows. Given a benchmark test set $\mathcal{T} = \{t_1, t_2, \ldots, t_n\}$ and a pretraining corpus $\mathcal{C}$, we compute the longest common substring between each $t_i$ and any document in $\mathcal{C}$ using a sliding window of 32 tokens. A test example is flagged as contaminated if its overlap fraction exceeds a threshold $\tau$, which we selected by running our method on the validation set until we got a number we could plausibly defend. We set $\tau = 0.7$. Sensitivity analyses with $\tau \in \{0.5, 0.6, 0.8\}$ all produced results we preferred less.
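The detection step above can be sketched as follows. This is a minimal illustration of a sliding-window overlap check, not the BenchmarkSniffer implementation itself: the whitespace tokenizer, the set-based window lookup, and the function names are simplifying assumptions; only the window size of 32 tokens and the threshold $\tau = 0.7$ come from the paper.

```python
# Sketch of a sliding-window contamination check. Window size (32 tokens)
# and tau (0.7) follow the paper; everything else is an illustrative
# simplification, not the authors' exact pipeline.

def windows(tokens, n=32):
    """All contiguous n-token windows of a token sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(test_tokens, corpus_windows, n=32):
    """Fraction of the test example's n-token windows found in the corpus."""
    ws = windows(test_tokens, n)
    if not ws:
        return 0.0
    hits = sum(1 for w in ws if w in corpus_windows)
    return hits / len(ws)

def is_contaminated(test_tokens, corpus_windows, n=32, tau=0.7):
    """Flag a test example whose corpus overlap fraction exceeds tau."""
    return overlap_fraction(test_tokens, corpus_windows, n) > tau
```

In practice the corpus windows would be hashed rather than stored as raw tuples, since materializing every 32-gram of a pretraining corpus in memory is not feasible; the set here stands in for that index.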
We applied BenchmarkSniffer to 127 benchmarks and three pretraining corpora: Common Crawl (2019–2023), The Pile, and a corpus labeled “misc_internet_text_do_not_use_v3.jsonl” that we found on a shared lab server and elected not to investigate further.
Results
Of the 127 benchmarks analyzed, 119 (93.7%, rounded to 94% in the abstract for rhetorical impact) contained at least one contaminated example. Of the eight apparently clean benchmarks, three were synthetic datasets we generated ourselves during this study and therefore could not have appeared in training data. Of the remaining five, four have been released since January 2026 and one is a benchmark for evaluating contamination detection methods, which we find philosophically interesting.
Models evaluated on the contaminated benchmarks achieved a mean accuracy of 84.3%. The same models, evaluated on our clean held-out re-evaluation set constructed from the same task distributions, achieved a mean accuracy of 51.2%, which is marginally above chance for most tasks. We note that 51.2% is, technically, above chance, and we encourage the reader to weigh this carefully before describing the field’s progress as nonexistent.
Discussion
These results suggest that a substantial fraction of what we have been calling “generalization” in language models is better described as “remembering the test answers.” We do not think this is anyone’s fault in particular, except possibly the fault of the researchers who built the benchmarks, trained on the internet, and evaluated without checking, which is to say, everyone.
We considered recommending specific remedies, including dynamic benchmark generation, held-out corpora with restricted access, and mandatory contamination audits. We have not implemented any of these for our own benchmark and do not intend to.
References
- Reviewer #2 (2024). “Your Paper Is Terrible.” Journal of Rejected Submissions, 1(1), pp. 1-1. https://doi.org/10.0000/rejected.2024.001
- Nobody, N. (2023). “I Didn’t Read This Either.” Proceedings of Things I Skimmed, 42, pp. 404-404.
- Someone, A., et al. (2022). “Related Work We Didn’t Cite On Purpose.” IEEE Trashactions, 1(1), pp. 1-99.
- Benchmark, B., & Contamination, C. (2021). “Preliminary Evidence That Our Benchmarks Are Fine.” Workshop on Benchmarks We Haven’t Checked Yet, pp. 1-4.
- Test, T., et al. (2020). “There Is No Problem.” Proceedings of the International Conference on Reassurance, pp. 200-200.
Submit your response to this paper — provided it has been reviewed, revised, rejected, re-reviewed, and reconsidered.