Introduction
“Attention Is All You Need” is, by citation count, one of the most influential papers in the history of machine learning. It introduced the transformer architecture, which has since become the basis of nearly every large-scale model in current use. It is cited by papers that use transformers, papers that propose alternatives to transformers, papers that explain why transformers work, papers that explain why transformers should not work but inexplicably do, and one paper we discovered while preparing this survey that cites it in a study of bumblebee navigation, for reasons that remain, to us, unclear.
This retrospective asks three questions. First, was the attention mechanism correct as originally described? Second, was it the best available formulation? Third, why did it succeed when several contemporaneous approaches of comparable theoretical quality did not, and what does this tell us about how scientific progress actually happens versus how we describe it in introduction sections?
We answer yes, no, and “branding,” respectively.
Was Attention Correct?
The original formulation of scaled dot-product attention computes outputs as a weighted sum of values, where the weights derive from the dot products of queries and keys, divided by the square root of the key dimensionality and normalized by softmax. This is correct. It is also not the only correct formulation, as demonstrated by the subsequent literature’s production of additive attention, multiplicative attention, local attention, sparse attention, linear attention, and approximately forty variants whose names include the word “efficient,” several of which are not measurably more efficient than the original.
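For the reader who has cited the formulation more often than they have implemented it, the computation above can be sketched in a few lines of NumPy. This is a minimal single-head sketch: the function name and shapes are ours, and masking, batching, and multi-head projections are omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Returns: (n_queries, d_v), each row a weighted sum of the rows of V.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # dot products, scaled by 1/sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of values
```

Note that when the queries are all zeros, every key receives equal weight and the output is simply the mean of the values, which is a convenient sanity check and an uncharitable summary of some trained attention heads.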
We computed agreement between the original formulation and the current consensus on fifteen specific design choices described or implied in the original paper. Agreement rate: 61.3%. The disagreements concern positional encoding (the original used a fixed sinusoidal scheme now largely replaced by learned or relative encodings), normalization placement (the original used post-layer normalization; pre-layer normalization is now more common; the reasons are only partially understood), and several hyperparameter recommendations that subsequent work has revised downward, upward, or replaced with the advice to “tune on your dataset.”
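The fixed sinusoidal scheme mentioned above is easy enough to reproduce that its replacement by learned encodings cannot be blamed on implementation difficulty. A sketch, assuming an even model dimension (the function name is ours):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same).

    Returns an array of shape (max_len, d_model); assumes d_model is even.
    """
    pos = np.arange(max_len)[:, None]              # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)  # (max_len, d_model // 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # sines in even dimensions
    pe[:, 1::2] = np.cos(angles)                   # cosines in odd dimensions
    return pe
```

Each dimension pair oscillates at a different wavelength, so relative offsets correspond to fixed rotations, a property the original paper hoped would help with extrapolation to longer sequences and which subsequent work has, in the field's traditional manner, both confirmed and refuted.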
Why Did It Succeed?
We conducted a qualitative analysis of the paper’s presentation alongside seven contemporaneous papers of comparable quality. The factors most consistently distinguishing the transformer paper from its peers were: a clear and memorable title, a figure that appears on the first page and makes the architecture legible at a glance, an ablation study that removed components one by one in a way that made each component appear essential, and the timing of submission relative to a period of high receptivity in the field following the success of earlier sequence-to-sequence models.
None of these factors are methodological. We note this not to diminish the work, which was genuine and significant, but to suggest that the field’s citation economy rewards legibility and timing alongside correctness, and that understanding this explains a substantial fraction of what gets called “seminal” in retrospect.
Conclusion
Attention is most of what you need. The rest is presented in this paper under an acronym we stand by.
References
- Reviewer #2 (2024). “Your Paper Is Terrible.” Journal of Rejected Submissions, 1(1), pp. 1-1. https://doi.org/10.0000/rejected.2024.001
- Nobody, N. (2023). “I Didn’t Read This Either.” Proceedings of Things I Skimmed, 42, pp. 404-404.
- Someone, A., et al. (2022). “Related Work We Didn’t Cite On Purpose.” IEEE Trashactions, 1(1), pp. 1-99.
- Vaswani, A., et al. (2017). “Attention Is All You Need.” NeurIPS 2017. (Cited 147,000 times. We are aware of the irony.)
- Ttention, A. (2025). “We Have Citations If You Need Them.” Journal of Proactive Self-Reference, 1(1), pp. 1-1.
Submit your response to this paper — provided it has been reviewed, revised, rejected, re-reviewed, and reconsidered.