A short comedy of ideas.
For Christy
Sequel to Grammagraph: How Typed AI Compresses Syntax Into Semantics
The opening scene is Vaswani having proudly written in chalk: “Attention Is All You Need.”
ChatGPT Prompt
Then Noether comes and replaces Attention with Symmetry.
As chaos, hilarity and enlightenment ensue, Hinton (carrying GPUs) and Taco (carrying the burden of being correct) help defuse the conflict.
⸻
The blackboard reads:
ATTENTION IS ALL YOU NEED
VASWANI stands beneath it like a man who has successfully invented fire.
Enter NOETHER.
She studies the board.
A long pause.
She removes a piece of chalk.
Adds three words.
The board now reads:
SYMMETRY IS ALL ATTENTION NEEDS
She steps back.
VASWANI stares.
NOETHER waits.
VASWANI:
You changed my paper.
NOETHER:
I corrected your dependency graph.
VASWANI:
That isn’t what the title means.
NOETHER:
No, but it’s what the title requires.
VASWANI:
Attention discovers what matters.
NOETHER:
Among what possibilities?
VASWANI:
All possibilities.
NOETHER:
That is an alarming answer.
VASWANI:
Why?
NOETHER:
Because “all possibilities” is another name for ignorance.
Pause.
VASWANI:
Attention learns structure.
NOETHER:
Attention spends computation looking for structure.
There is a difference.
VASWANI:
A distinction without a difference.
NOETHER:
On the contrary. One is exploration.
The other is geography.
Enter HINTON carrying several GPUs.
HINTON:
What’s happening?
VASWANI:
She’s trying to replace attention with symmetry.
NOETHER:
No.
I’m trying to explain where attention gets its lunch.
HINTON freezes.
HINTON:
Oh dear.
VASWANI:
What?
HINTON:
That’s a very dangerous sentence.
NOETHER:
Suppose I give you a million parameters.
HINTON:
Reasonable.
NOETHER:
And suppose I tell them nothing about the world.
HINTON:
Also reasonable.
NOETHER:
Then they spend their lives discovering that a cat remains a cat when moved three pixels to the left.
VASWANI:
That’s learning.
NOETHER:
That’s bookkeeping.
HINTON:
She’s got a point.
VASWANI:
You’ve changed sides already?
HINTON:
I was a neural net guy.
Then I became a representation guy.
I’m trying to stay ahead of events.
Enter TACO.
TACO:
Why are you all yelling?
VASWANI points.
NOETHER points.
Both begin speaking simultaneously.
TACO raises a hand.
Silence.
TACO approaches the board.
Reads it.
Nods.
TACO:
Looks right.
VASWANI:
What?
TACO:
The sentence.
VASWANI:
Attention changed AI.
TACO:
Certainly.
VASWANI:
Then how can that sentence be right?
TACO:
Because changing AI and being fundamental are different achievements.
VASWANI:
Explain.
TACO:
A sail changes a ship.
The ocean remains more fundamental.
VASWANI:
Attention is not a sail.
TACO:
No.
It is an extraordinarily clever way of navigating.
NOETHER smiles.
VASWANI notices.
VASWANI:
Stop encouraging him.
TACO:
Her.
VASWANI:
Her.
TACO:
Also, stop making my point.
NOETHER:
The question is simple.
What allows generalization?
VASWANI:
Attention.
NOETHER:
Then why do convolutional networks work?
VASWANI:
Translation symmetry.
NOETHER:
And equivariant networks?
VASWANI:
Symmetry.
NOETHER:
And weight sharing?
VASWANI:
…
NOETHER:
And compression?
VASWANI:
…
NOETHER:
And abstraction?
VASWANI:
This feels rigged.
NOETHER:
Reality often is.
A long pause.
VASWANI studies the board.
HINTON studies the board.
TACO studies everyone studying the board.
VASWANI:
Suppose you’re right.
Suppose symmetry comes first.
What does attention do?
NOETHER considers.
Then she erases a small space beneath the sentence and writes:
Attention discovers where symmetry should break.
Silence.
Even TACO looks pleased.
HINTON:
That’s annoyingly elegant.
VASWANI:
I hate it.
NOETHER:
That is often the first symptom.
Blackout.
Appendix I: Who’s Arguing What, and Why
The comedy works because nobody is actually wrong.
They are arguing at different layers of explanation.
Vaswani: Intelligence Is Attention
Vaswani represents the transformer revolution.
His claim is:
Intelligence emerges from dynamically determining what matters.
The central insight is that fixed structure is often less useful than learned relationships.
Given enough data and enough capacity, attention can discover the relevant patterns for itself.
His fear is that symmetry enthusiasts want to hard-code assumptions that should instead be learned.
When Noether edits the title, he hears:
Attention was a mistake.
Which is not what she is saying.
Noether: Intelligence Runs on Symmetry
Noether represents the deeper mathematical view.
Her claim is:
Structure exists before learning.
Every successful learning system exploits invariances:
Without these symmetries, learning becomes an expensive rediscovery of obvious facts.
Her position is not:
Attention is useless.
Her position is:
Attention is downstream.
Attention is valuable because the world already contains structure worth attending to.
Her objection to the original title is that it mistakes a mechanism for a foundation.
Hinton: Intelligence Is Representation
Hinton occupies the middle ground.
His claim is:
The real question is how the world gets represented.
He is sympathetic to both sides:
- Attention matters
- Symmetry matters
But both ultimately serve representation.
He functions as the audience’s guide through the argument, because he can understand both revolutions simultaneously.
Taco: The Stack Exists
Taco is the only person who is not surprised.
His claim is:
You are all arguing over adjacent layers of the same system.
He sees:
- symmetry as the source of inductive bias
- representation as the learned internal geometry
- attention as the routing mechanism
His role in the play is not to win.
His role is to explain why nobody has won.
He is the mediator between ontology and implementation.
What The Audience Thinks At First
The audience initially assumes:
Intelligence → Attention → Everything Else
This is roughly the worldview implied by the title on the blackboard.
What Noether Claims
Noether proposes a different hierarchy:
Intelligence → Symmetry → Representation → Attention
Attention remains important.
But it is no longer fundamental.
This is why her chalk correction feels like an attack.
She is not editing the sentence.
She is editing the dependency graph.
The Hidden Resolution
The final line reveals that the characters have been talking past each other.
The real claim is neither:
Attention is all you need.
Nor:
Symmetry is all you need.
Instead:
Attention discovers where symmetry should break.
This preserves both insights.
- Symmetry provides compression.
- Attention provides adaptation.
- Intelligence emerges from their interaction.
Or, in Noether’s preferred formulation:
Symmetry is all attention needs.
Except where it isn’t.
Appendix II: How Symmetry Hid In Plain Sight
The strange thing about symmetry is not that it is important.
The strange thing is how often it appears under different names.
The history of machine learning is full of researchers discovering symmetry, using symmetry, depending on symmetry, and then discussing something else.
It is the most successful idea that rarely gets top billing.
Convolution Was Symmetry
Convolutional neural networks were presented as a practical engineering solution.
- Images contain local structure.
- Weights should be shared.
- Translation should not matter.
This worked spectacularly.
But what was really being exploited?
The network succeeded because it knew in advance that moving a cat three pixels to the left does not create a different animal.
Millions of parameters were spared the burden of rediscovering this fact.
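The point about the cat can be checked on a toy example. The following is a minimal sketch, assuming a one-dimensional signal that wraps around at the edges; the names `shift` and `circular_conv` and the particular numbers are illustrative choices, not anything from a specific library. It shows that sliding one shared set of weights over every position commutes with moving the input.

```python
def shift(signal, k):
    """Cyclically shift a list k positions to the right (negative k shifts left)."""
    n = len(signal)
    k %= n
    return signal[-k:] + signal[:-k] if k else list(signal)

def circular_conv(signal, kernel):
    """Apply the same kernel at every position, wrapping around at the edges."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]

signal = [0, 0, 1, 3, 1, 0, 0, 0]  # a toy one-dimensional "cat"
kernel = [1, -2, 1]                # three weights, reused at every position

# Move the cat three positions to the left, then convolve ...
shift_then_conv = circular_conv(shift(signal, -3), kernel)
# ... or convolve first and move the result three positions to the left.
conv_then_shift = shift(circular_conv(signal, kernel), -3)

assert shift_then_conv == conv_then_shift  # same answer either way

# Pooling over all positions turns this into outright invariance.
assert max(shift_then_conv) == max(circular_conv(signal, kernel))
```

The three weights never had to learn that position is irrelevant. Under the wrap-around assumption, that fact is built into how they are applied.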
Compression Was Symmetry
Compression works because many things are effectively the same.
The more symmetry a domain contains, the more aggressively it can be compressed.
A dictionary, a grammar, a scientific law, and a JPEG all perform the same miracle:
They discover equivalence classes.
They find transformations that preserve meaning.
In retrospect, compression was always symmetry bookkeeping.
Abstraction Was Symmetry
A child learns that many different dogs are all “dog.”
Different attributes:
- Different colors
- Different sizes
- Different positions
- Different lighting conditions
Same concept.
Abstraction is the discovery of invariance.
It is symmetry viewed from the inside.
Generalization Was Symmetry
The central mystery of learning is not memorization.
The central mystery is generalization.
Why does experience with one thing help us understand another?
The answer is surprisingly simple:
Because the two situations share structure.
Every successful generalization is a claim that some transformation preserves meaning.
In other words:
Generalization is symmetry recognition.
Physics Learned This Earlier
Physics began with objects.
It gradually became a science of transformations.
The great discoveries of the twentieth century repeatedly followed the same pattern:
What appeared fundamental turned out to be downstream of symmetry.
Examples:
Again and again the deeper explanation was found in the transformations that left reality unchanged.
The objects survived.
The ontology changed.
Machine Learning May Be Repeating The Pattern
The early years focused on neurons.
The scaling years focused on parameters.
The transformer years focused on attention.
Each step was productive.
Each step was revolutionary.
But each step may also have been moving toward a deeper realization.
The important question is not:
How many parameters are there?
Nor:
What receives attention?
But:
What transformations preserve meaning?
The first two questions are implementation details.
The last question is about structure.
Why Nobody Noticed
Because symmetry is invisible when it works.
- Nobody notices translation symmetry when reading a sentence.
- Nobody notices rotational symmetry when recognizing a face.
- Nobody notices conservation laws while catching a ball.
Symmetry removes effort.
It removes surprise.
It removes computation.
The better it works, the less visible it becomes.
Like gravity, it disappears into normality.
The Noether Test
A useful thought experiment is to ask:
If this architecture works, what symmetry is it exploiting?
If the answer is clear, the model may be efficient because it understands the domain.
If the answer is unclear, the model may simply be large enough to compensate.
This is not a proof.
But it is often a clue.
The Suspicion
The suspicion behind the play is not that attention is unimportant.
It is that attention may have received credit that properly belongs to symmetry.
Attention is spectacularly useful.
But perhaps its greatest achievement was revealing how much structure was already present.
The possibility is that intelligence is not primarily the art of attending.
It is the art of discovering what remains the same while everything else changes.
And that, for several centuries now, has been called symmetry.
Appendix III: Neural Networks As Symmetry Encoders
The standard description of a neural network is that it is a function approximator.
This is true.
It is also incomplete.
- A lookup table is a function approximator.
- A spreadsheet is a function approximator.
- A sufficiently patient accountant is a function approximator.
What makes neural networks interesting is not merely that they approximate functions.
It is that they generalize.
The mystery is not:
How can a machine memorize?
The mystery is:
How can a machine learn something once and apply it somewhere else?
The answer appears to be symmetry.
The Generalization Problem
Suppose a neural network sees ten thousand cats.
No two are identical.
Variations:
- Different lighting
- Different colors
- Different poses
- Different backgrounds
- Different viewpoints
Yet somehow the network learns:
These are all cats.
This is a remarkable achievement.
The network has discovered that many differences are irrelevant.
In mathematical language, it has discovered an invariance.
In ordinary language, it has discovered a symmetry.
What Learning Actually Learns
From this perspective, learning is not primarily the accumulation of facts.
Learning is the discovery of transformations that preserve meaning.
A successful network learns:
- which differences matter
- which differences do not
- which transformations preserve identity
- which transformations create something new
This is precisely the work performed by symmetry.
The Three Tasks
Every learning system faces three related problems.
1. Invariance
What can change without changing meaning?
Examples:
- location
- lighting
- phrasing
- handwriting
These are symmetries.
2. Equivariance
What should change predictably when the input changes?
A rotated object should produce a correspondingly rotated representation.
A translated image should produce a correspondingly translated activation pattern.
These are structured symmetries.
3. Symmetry Breaking
Where do differences suddenly matter?
Distinctions that matter:
These cannot be explained by symmetry alone.
Intelligence requires knowing where symmetry ends.
Meaning lives in the break.
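The first two tasks have standard formal statements. As a sketch, assume a model f, an input x, and a transformation g drawn from a group G acting on inputs, with ρ(g) denoting the matching action on outputs. This notation is introduced here for illustration and does not appear elsewhere in the essay.

```latex
\begin{aligned}
\text{Invariance:}   \quad & f(g \cdot x) = f(x)                 && \text{for all } g \in G \\
\text{Equivariance:} \quad & f(g \cdot x) = \rho(g) \cdot f(x)   && \text{for all } g \in G
\end{aligned}
```

Invariance is the special case in which ρ(g) does nothing. In the same notation, symmetry breaking covers the transformations for which neither equation should hold, because the change alters the meaning.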
Attention Revisited
This perspective suggests a different interpretation of attention.
The transformer revolution demonstrated that dynamic routing is extraordinarily powerful.
But routing what?
Not raw reality.
Routing among learned structures.
- Attention does not create symmetry.
- Attention exploits symmetry.
Or more precisely:
Attention discovers where learned symmetries should be applied and where they should be broken.
This makes attention essential.
It does not make attention fundamental.
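One concrete case can be sketched in code. Dot-product self-attention with no positional information treats its input as an unordered set: reordering the tokens only reorders the outputs. Adding a positional signal breaks that symmetry, which is how word order comes to matter. The sketch below assumes a single head with identity projections and a made-up positional signal (`add_positions`); these are simplifications for illustration, not a description of any production transformer.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head dot-product self-attention with identity projections (an assumption)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def add_positions(tokens):
    """Hypothetical positional signal: add 0.5 * index to the first feature."""
    return [[t[0] + 0.5 * i] + t[1:] for i, t in enumerate(tokens)]

def permute(rows, perm):
    """Reorder rows so that output position i holds rows[perm[i]]."""
    return [rows[p] for p in perm]

def close(a, b):
    """True when two lists of vectors agree up to floating-point rounding."""
    return all(math.isclose(x, y, abs_tol=1e-9) for u, v in zip(a, b) for x, y in zip(u, v))

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]

# Without positions: reordering the input just reorders the output.
equivariant = close(
    self_attention(permute(tokens, perm)),
    permute(self_attention(tokens), perm),
)

# With positions: the same reordering now changes the result, so order carries meaning.
still_equivariant = close(
    self_attention(add_positions(permute(tokens, perm))),
    permute(self_attention(add_positions(tokens)), perm),
)

print(equivariant, still_equivariant)  # prints: True False
```

Read this way, the attention mechanism starts from a symmetry (order does not matter) and is told where to break it (position).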
Why Symmetry Hides
The irony is that successful symmetries become invisible.
- Nobody notices that language is invariant with respect to font choice.
- Nobody notices that faces remain faces under modest rotations.
- Nobody notices that scientific laws remain true tomorrow.
The symmetries disappear because they are working.
Like good infrastructure, they become background assumptions.
The Strong Form
The strongest version of this idea is not:
Neural networks use symmetry.
That has been obvious for decades.
The stronger claim is:
Neural networks are symmetry encoders.
Their primary task is not function approximation.
Function approximation is merely the mechanism.
The deeper objective is discovering and representing the invariances, equivariances, and symmetry-breakings that structure a domain.
A Noetherian Interpretation
From this viewpoint, the history of machine learning begins to resemble the history of physics.
Physics gradually discovered that many apparently fundamental phenomena were consequences of symmetry.
Machine learning may be undergoing a similar transition.
The future may not belong primarily to larger models.
It may belong to models that better align their internal symmetries with the symmetries of the world.
If so, neural networks are not best understood as collections of parameters.
They are better understood as machines for encoding symmetry.
The weights are simply where the symmetries live.
