Introduction
The transformer architecture, introduced by Vaswani et al. in 2017 under the title “Attention Is All You Need,” has since become the foundation of the most powerful language models, vision models, and several models whose modality is not entirely clear even to their creators. Understanding how transformers work is therefore important for anyone who wishes to use, extend, criticize, or simply appear knowledgeable about them at dinner parties.
Previous explanations of transformers fall into two categories. Mathematical explanations are complete and rigorous, but they require the reader to understand linear algebra and probability theory, and to be willing to track four simultaneous matrix computations through a diagram that is technically correct but practically incomprehensible without approximately three years of practice. Intuitive explanations are accessible but typically sacrifice so much precision that the reader finishes them feeling informed while retaining no transferable knowledge, a state we characterize as “educated ignorance” and attempt to measure empirically.
This paper provides an explanation in the second category, but with full acknowledgment that it is in the second category, which we believe constitutes a methodological contribution.
The Explanation
Imagine an office. In this office, everyone has a desk, and on each desk there is a pile of papers. Each paper has a question written at the top. When someone receives a question, they look around the room to find which other person’s desk contains information relevant to that question. They walk to that desk, read the relevant information, and incorporate it into their answer. This process happens simultaneously for everyone in the office, which is either efficiently parallel or completely chaotic depending on your prior experience with open-plan workspaces.
This is attention. The transformer is a building with many such offices, arranged in floors. Each floor does the same thing the previous floor did, but to a slightly more processed version of the papers. By the time you reach the top floor, the papers contain answers that integrate information from across the entire building, which is either a good metaphor for contextual understanding or a description of a coordination problem we have now made worse.
The query, key, and value matrices are, in this metaphor, respectively: the question on the paper, the label on the desk, and the information in the drawer. The dot product measures how well the question matches each label. The softmax is a normalizing procedure that converts those match scores into weights summing to one, so the person takes information from each desk in proportion to its relevance, rather than photocopying everything in the building and carrying it back to their desk, which would be computationally inefficient and also impolite.
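For readers who do, after all, want the office rendered in arithmetic, the metaphor can be sketched as scaled dot-product attention in a few lines of NumPy. The shapes, the random inputs, and the function name are illustrative choices, not anything prescribed by the original paper.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, office edition: each query row is a
    question on a paper, each key row is a desk label, each value row is
    the drawer contents. Softmax turns match scores into weights that sum
    to one per question."""
    d_k = K.shape[-1]
    # dot product: how well each question matches each desk label
    scores = Q @ K.T / np.sqrt(d_k)
    # softmax over desks (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # weighted mix of drawer contents, one answer per question
    return weights @ V

# Three desks, model dimension 4 (toy numbers, purely illustrative)
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = attention(Q, K, V)
print(out.shape)  # prints (3, 4)
```

Stacking floors in the building corresponds to applying this operation repeatedly, each time to the output of the previous floor; a full transformer layer also adds learned projections, residual connections, and a feed-forward sublayer, all of which the office metaphor tactfully omits.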
User Study
We administered our explanation to 200 participants recruited through an academic survey platform that pays $1.20 per hour and whose ethics review status we have not verified. Participants rated their confidence in understanding transformers at 8.9 out of 10 after reading our explanation, compared to 3.2 out of 10 after reading the original paper (p < 0.001). On a 12-question transfer test involving novel transformer configurations, participants performed at 51.3% accuracy after our explanation and 51.1% accuracy after the original paper (p = 0.94). We interpret this as evidence that both explanations are equally unhelpful for actual skill acquisition and that our explanation is significantly better at making people feel otherwise.
Conclusion
We have explained transformers without mathematics. Whether this constitutes an explanation in any philosophically robust sense is a question we leave to philosophers, for whom we also plan to write an explanation, similarly without mathematics, and similarly of uncertain utility.
References
- Reviewer #2 (2024). “Your Paper Is Terrible.” Journal of Rejected Submissions, 1(1), pp. 1-1. https://doi.org/10.0000/rejected.2024.001
- Nobody, N. (2023). “I Didn’t Read This Either.” Proceedings of Things I Skimmed, 42, pp. 404-404.
- Someone, A., et al. (2022). “Related Work We Didn’t Cite On Purpose.” IEEE Trashactions, 1(1), pp. 1-99.
- Vaswani, A., et al. (2017). “Attention Is All You Need.” NeurIPS 2017. (This one is real. We checked.)
- Omath, N. (2024). “Explaining Convolutional Neural Networks Without Numbers (Retracted).” Journal of Accessible Inaccuracies, 2(1), pp. 3-17.
Submit your response to this paper — provided it has been reviewed, revised, rejected, re-reviewed, and reconsidered.