Letter

Explaining Transformers Without Any Mathematics Whatsoever

IEEE Trashactions on Pattern Analysis and Machine Intelligence · Volume 1, No. 1 · pp. 46-57
DOI: 10.1234/trashactions.2026.005
128 Citations

Editor's Summary

Omath and Ibes have written a paper that many readers will find immediately accessible and eventually realize they cannot use. The editors commend this as a faithful model of most machine learning pedagogy. We note that the paper contains no mathematics and also no theorems, proofs, or falsifiable claims, which simplified the review process considerably.

Abstract

We present a complete explanation of the transformer architecture that uses no equations, no Greek letters, no subscripts, and no notation that could not appear in a children’s illustrated book about attention. Our explanation relies entirely on analogies, metaphors, narrative, and one extended comparison to a bureaucratic office that we are increasingly uncertain was helpful. User studies confirm that readers who complete our explanation feel they understand transformers (confidence: 8.9/10) while performing at chance on transfer tasks involving actual transformer architectures (accuracy: 51.3%). We conclude that our explanation is very good at producing the feeling of understanding, which we argue is what most readers wanted anyway.

Article

Introduction

The transformer architecture, introduced in 2017 under the title “Attention Is All You Need,” has since become the foundation of the most powerful language models, vision models, and several models whose modality is not entirely clear even to their creators. Understanding how transformers work is therefore important for anyone who wishes to use, extend, criticize, or simply appear knowledgeable about them at dinner parties.

Previous explanations of transformers fall into two categories. Mathematical explanations are complete and rigorous but demand linear algebra, probability theory, and the willingness to track four simultaneous matrix computations through a diagram that is technically correct but practically incomprehensible without approximately three years of practice. Intuitive explanations are accessible but typically sacrifice so much precision that the reader finishes them feeling informed while retaining no transferable knowledge, a state we characterize as "educated ignorance" and attempt to measure empirically.

This paper provides an explanation in the second category, but with full acknowledgment that it is in the second category, which we believe constitutes a methodological contribution.

The Explanation

Imagine an office. In this office, everyone has a desk, and on each desk there is a pile of papers. Each paper has a question written at the top. When someone receives a question, they look around the room to find which other person’s desk contains information relevant to that question. They walk to that desk, read the relevant information, and incorporate it into their answer. This process happens simultaneously for everyone in the office, which is either efficiently parallel or completely chaotic depending on your prior experience with open-plan workspaces.

This is attention. The transformer is a building with many such offices, arranged in floors. Each floor does the same thing the previous floor did, but to a slightly more processed version of the papers. By the time you reach the top floor, the papers contain answers that integrate information from across the entire building, which is either a good metaphor for contextual understanding or a description of a coordination problem we have now made worse.

The query, key, and value matrices are, in this metaphor, respectively: the question on the paper, the label on the desk, and the information in the drawer. The dot product is how much the question matches the label. The softmax is a normalizing procedure that ensures the person only takes information from the desks that are most relevant, rather than photocopying everything in the building and carrying it back to their desk, which would be computationally inefficient and also impolite.
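In flagrant violation of our own title, readers who will tolerate a short code sketch may find the metaphor easier to audit as plain scaled dot-product attention. Everything below is an illustrative assumption of this sketch rather than part of our metaphor-only method: the shapes, the random seed, the three floors, and the use of NumPy. Real transformers additionally use learned projections, multiple heads, feed-forward blocks, and residual connections, none of which the office has budgeted for.

```python
# A sketch of the office metaphor in plain NumPy. All shapes, seeds,
# and floor counts here are assumptions of this sketch, not the
# configuration of any real model.
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Dot product: how much each question matches each desk label.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax: take information mostly from the most relevant desks,
    # rather than photocopying everything in the building.
    weights = softmax(scores)
    # Carry back a weighted blend of the information in the drawers.
    return weights @ V

rng = np.random.default_rng(0)
papers = rng.normal(size=(4, 8))  # 4 desks, 8 numbers per paper

# The building: each floor repeats the process on the previous floor's
# slightly more processed papers (self-attention only, for brevity).
for floor in range(3):
    papers = attention(papers, papers, papers)

print(papers.shape)  # still (4, 8): same desks, updated papers
```

Note that each row of the softmax output sums to one, which is the code-level statement of the rule that a person's attention budget, however misallocated, is finite.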

User Study

We administered our explanation to 200 participants recruited through an academic survey platform that pays $1.20 per hour and whose ethics review status we have not verified. Participants rated their confidence in understanding transformers at 8.9 out of 10 after reading our explanation, compared to 3.2 out of 10 after reading the original paper (p < 0.001). On a 12-question transfer test involving novel transformer configurations, participants performed at 51.3% accuracy after our explanation and 51.1% accuracy after the original paper (p = 0.94). We interpret this as evidence that both explanations are equally unhelpful for actual skill acquisition and that our explanation is significantly better at making people feel otherwise.

Conclusion

We have explained transformers without mathematics. Whether this constitutes an explanation in any philosophically robust sense is a question we leave to philosophers, for whom we also plan to write an explanation, similarly without mathematics, and similarly of uncertain utility.

References

  1. Reviewer #2 (2024). “Your Paper Is Terrible.” Journal of Rejected Submissions, 1(1), pp. 1-1. https://doi.org/10.0000/rejected.2024.001
  2. Nobody, N. (2023). “I Didn’t Read This Either.” Proceedings of Things I Skimmed, 42, pp. 404-404.
  3. Someone, A., et al. (2022). “Related Work We Didn’t Cite On Purpose.” IEEE Trashactions, 1(1), pp. 1-99.
  4. Vaswani, A., et al. (2017). “Attention Is All You Need.” NeurIPS 2017. (This one is real. We checked.)
  5. Omath, N. (2024). “Explaining Convolutional Neural Networks Without Numbers (Retracted).” Journal of Accessible Inaccuracies, 2(1), pp. 3-17.

Author Affiliations

1. Department of Imaginary Sciences, University of Nowhere

eLetters

Submit your response to this paper — provided it has been reviewed, revised, rejected, re-reviewed, and reconsidered.