Introduction
“Attention Is All You Need” is, by citation count, one of the most influential papers in the history of machine learning. It introduced the transformer architecture, which has since become the basis of nearly every large-scale model in current use. It is cited by papers that use transformers, papers that propose alternatives to transformers, papers that explain why transformers work, papers that explain why transformers should not work but inexplicably do, and one paper we discovered while preparing this survey that cites it in a study of bumblebee navigation, for reasons that remain, to us, unclear.
This retrospective asks three questions. First, was the attention mechanism correct as originally described? Second, was it the best available formulation? Third, why did it succeed when several contemporaneous approaches of comparable theoretical quality did not, and what does this tell us about how scientific progress actually happens versus how we describe it in introduction sections?
We answer yes, no, and “branding,” respectively.
Was Attention Correct?
The original formulation of scaled dot-product attention computes outputs as a weighted sum of values, where the weights derive from the dot products of queries and keys, divided by the square root of the key dimensionality and normalized by softmax. This is correct. It is also not the only correct formulation, as demonstrated by the subsequent literature’s production of additive attention, multiplicative attention, local attention, sparse attention, linear attention, and approximately forty variants whose names include the word “efficient,” several of which are not measurably more efficient than the original.
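For the reader who has cited the formulation more often than they have implemented it, the computation above can be sketched in a few lines of NumPy. This is a minimal single-head sketch: the function name and shapes are ours, and masking, batching, and multi-head projections are omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Returns: (n_queries, d_v), each row a weighted sum of the rows of V.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # dot products, scaled by 1/sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of values
```

Note that when the queries are all zeros, every key receives equal weight and the output is simply the mean of the values, which is a convenient sanity check and an uncharitable summary of some trained attention heads.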
We computed agreement between the original formulation and the current consensus on fifteen specific design choices described or implied in the original paper. Agreement rate: 61.3%. The disagreements concern positional encoding (the original used a fixed sinusoidal scheme now largely replaced by learned or relative encodings), normalization placement (the original used post-layer normalization; pre-layer normalization is now more common; the reasons are only partially understood), and several hyperparameter recommendations that subsequent work has revised downward, upward, or replaced with the advice to “tune on your dataset.”
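The fixed sinusoidal scheme mentioned above is easy enough to reproduce that its replacement by learned encodings cannot be blamed on implementation difficulty. A sketch, assuming an even model dimension (the function name is ours):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same).

    Returns an array of shape (max_len, d_model); assumes d_model is even.
    """
    pos = np.arange(max_len)[:, None]              # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)  # (max_len, d_model // 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # sines in even dimensions
    pe[:, 1::2] = np.cos(angles)                   # cosines in odd dimensions
    return pe
```

Each dimension pair oscillates at a different wavelength, so relative offsets correspond to fixed rotations, a property the original paper hoped would help with extrapolation to longer sequences and which subsequent work has, in the field's traditional manner, both confirmed and refuted.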
Why Did It Succeed?
We conducted a qualitative analysis of the paper’s presentation alongside seven contemporaneous papers of comparable quality. The factors most consistently distinguishing the transformer paper from its peers were: a clear and memorable title, a figure that appears on the first page and makes the architecture legible at a glance, an ablation study that removed components one by one in a way that made each component appear essential, and the timing of submission relative to a period of high receptivity in the field following the success of earlier sequence-to-sequence models.
None of these factors are methodological. We note this not to diminish the work, which was genuine and significant, but to suggest that the field’s citation economy rewards legibility and timing alongside correctness, and that understanding this explains a substantial fraction of what gets called “seminal” in retrospect.
Conclusion
Attention is most of what you need. The rest is presented in this paper under an acronym we stand by.
References
- Reviewer #2 (2024). “Your Paper Is Terrible.” Journal of Rejected Submissions, 1(1), pp. 1-1. https://doi.org/10.0000/rejected.2024.001
- Nobody, N. (2023). “I Didn’t Read This Either.” Proceedings of Things I Skimmed, 42, pp. 404-404.
- Someone, A., et al. (2022). “Related Work We Didn’t Cite On Purpose.” IEEE Trashactions, 1(1), pp. 1-99.
- Vaswani, A., et al. (2017). “Attention Is All You Need.” NeurIPS 2017. (Cited 147,000 times. We are aware of the irony.)
- Ttention, A. (2025). “We Have Citations If You Need Them.” Journal of Proactive Self-Reference, 1(1), pp. 1-1.
Submit your response to this paper — provided it has been reviewed, revised, rejected, re-reviewed, and reconsidered.