Symmetry Is All Attention Needs

A short comedy of ideas.

For Christy

Sequel to Grammagraph: How Typed AI Compresses Syntax Into Semantics

The opening scene is Vaswani having proudly written in chalk: “Attention Is All You Need.”
Then Noether comes and replaces Attention with Symmetry.
As chaos, hilarity and enlightenment ensue, Hinton (carrying GPUs) and Taco (carrying the burden of being correct) help defuse the conflict.
ChatGPT Prompt

⸻

The blackboard reads:

ATTENTION IS ALL YOU NEED

VASWANI stands beneath it like a man who has successfully invented fire.

Enter NOETHER.

She studies the board.

A long pause.

She removes a piece of chalk.

Adds three words.

The board now reads:

SYMMETRY IS ALL ATTENTION NEEDS

She steps back.

VASWANI stares.

NOETHER waits.

VASWANI:

You changed my paper.

NOETHER:

I corrected your dependency graph.

VASWANI:

That isn’t what the title means.

NOETHER:

No, but it’s what the title requires.

VASWANI:

Attention discovers what matters.

NOETHER:

Among what possibilities?

VASWANI:

All possibilities.

NOETHER:

That is an alarming answer.

VASWANI:

Why?

NOETHER:

Because all possibilities is another name for ignorance.

Pause.

VASWANI:

Attention learns structure.

NOETHER:

Attention spends computation looking for structure.

There is a difference.

VASWANI:

A distinction without a difference.

NOETHER:

On the contrary. One is exploration.

The other is geography.

Enter HINTON carrying several GPUs.

HINTON:

What’s happening?

VASWANI:

She’s trying to replace attention with symmetry.

NOETHER:

No.

I’m trying to explain where attention gets its lunch.

HINTON freezes.

HINTON:

Oh dear.

VASWANI:

What?

HINTON:

That’s a very dangerous sentence.

NOETHER:

Suppose I give you a million parameters.

HINTON:

Reasonable.

NOETHER:

And suppose I tell them nothing about the world.

HINTON:

Also reasonable.

NOETHER:

Then they spend their lives discovering that a cat remains a cat when moved three pixels to the left.

VASWANI:

That’s learning.

NOETHER:

That’s bookkeeping.

HINTON:

She’s got a point.

VASWANI:

You’ve changed sides already?

HINTON:

I was a neural net guy.

Then I became a representation guy.

I’m trying to stay ahead of events.

Enter TACO.

TACO:

Why are you all yelling?

VASWANI points.

NOETHER points.

Both begin speaking simultaneously.

TACO raises a hand.

Silence.

TACO approaches the board.

Reads it.

Nods.

TACO:

Looks right.

VASWANI:

What?

TACO:

The sentence.

VASWANI:

Attention changed AI.

TACO:

Certainly.

VASWANI:

Then how can that sentence be right?

TACO:

Because changing AI and being fundamental are different achievements.

VASWANI:

Explain.

TACO:

A sail changes a ship.

The ocean remains more fundamental.

VASWANI:

Attention is not a sail.

TACO:

No.

It is an extraordinarily clever way of navigating.

NOETHER smiles.

VASWANI notices.

VASWANI:

Stop encouraging him.

TACO:

Her.

VASWANI:

Her.

TACO:

Also, stop making my point.

NOETHER:

The question is simple.

What allows generalization?

VASWANI:

Attention.

NOETHER:

Then why do convolutional networks work?

VASWANI:

Translation symmetry.

NOETHER:

And equivariant networks?

VASWANI:

Symmetry.

NOETHER:

And weight sharing?

VASWANI:

…

NOETHER:

And compression?

VASWANI:

…

NOETHER:

And abstraction?

VASWANI:

This feels rigged.

NOETHER:

Reality often is.

A long pause.

VASWANI studies the board.

HINTON studies the board.

TACO studies everyone studying the board.

VASWANI:

Suppose you’re right.

Suppose symmetry comes first.

What does attention do?

NOETHER considers.

Then she erases a small space beneath the sentence and writes:

Attention discovers where symmetry should break.

Silence.

Even TACO looks pleased.

HINTON:

That’s annoyingly elegant.

VASWANI:

I hate it.

NOETHER:

That is often the first symptom.

Blackout.

Appendix I: Who’s Arguing What, and Why

The comedy works because nobody is actually wrong.

They are arguing at different layers of explanation.

Vaswani: Intelligence Is Attention

Vaswani represents the transformer revolution.

His claim is:

Intelligence emerges from dynamically determining what matters.

The central insight is that fixed structure is often less useful than learned relationships.

Given enough data and enough capacity, attention can discover the relevant patterns for itself.

His fear is that symmetry enthusiasts want to hard-code assumptions that should instead be learned.

When Noether edits the title, he hears:

Attention was a mistake.

Which is not what she is saying.

Noether: Intelligence Runs on Symmetry

Noether represents the deeper mathematical view.

Her claim is:

Structure exists before learning.

Every successful learning system exploits invariances.

Without these symmetries, learning becomes an expensive rediscovery of obvious facts.

Her position is not:

Attention is useless.

Her position is:

Attention is downstream.

Attention is valuable because the world already contains structure worth attending to.

Her objection to the original title is that it mistakes a mechanism for a foundation.

Hinton: Intelligence Is Representation

Hinton occupies the middle ground.

His claim is:

The real question is how the world gets represented.

He is sympathetic to both sides:

Attention matters
Symmetry matters

But both ultimately serve representation.

He functions as the audience’s guide through the argument, because he can understand both revolutions simultaneously.

Taco: The Stack Exists

Taco is the only person who is not surprised.

His claim is:

You are all arguing over adjacent layers of the same system.

He sees:

symmetry as the source of inductive bias
representation as the learned internal geometry
attention as the routing mechanism

His role in the play is not to win.

His role is to explain why nobody has won.

He is the mediator between ontology and implementation.

What The Audience Thinks At First

The audience initially assumes:

Intelligence → Attention → Everything Else

This is roughly the worldview implied by the title on the blackboard.

What Noether Claims

Noether proposes a different hierarchy:

Intelligence → Symmetry → Representation → Attention

Attention remains important.

But it is no longer fundamental.

This is why her chalk correction feels like an attack.

She is not editing the sentence.

She is editing the dependency graph.

The Hidden Resolution

The final line reveals that the characters have been talking past each other.

The real claim is neither:

Attention is all you need.

Nor:

Symmetry is all you need.

Instead:

Attention discovers where symmetry should break.

This preserves both insights.

Symmetry provides compression.
Attention provides adaptation.
Intelligence emerges from their interaction.

Or, in Noether’s preferred formulation:

Symmetry is all attention needs.

Except where it isn’t.

Appendix II: How Symmetry Hid In Plain Sight

The strange thing about symmetry is not that it is important.

The strange thing is how often it appears under different names.

The history of machine learning is full of researchers discovering symmetry, using symmetry, depending on symmetry, and then discussing something else.

It is the most successful idea that rarely gets top billing.

Convolution Was Symmetry

Convolutional neural networks were presented as a practical engineering solution.

Images contain local structure.
Weights should be shared.
Translation should not matter.

This worked spectacularly.

But what was really being exploited?

Translation symmetry.

The network succeeded because it knew in advance that moving a cat three pixels to the left does not create a different animal.

Millions of parameters were spared the burden of rediscovering this fact.
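The point can be checked on a toy signal. What follows is a minimal Python sketch rather than anything from a real vision library: the eight-number `signal`, the three-number `kernel`, and the helper names `circular_conv` and `shift` are all invented for illustration, and the convolution wraps around at the edges to keep the arithmetic exact.

```python
def circular_conv(signal, kernel):
    """1-D circular cross-correlation: one small kernel reused at every position."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def shift(signal, k):
    """Cyclically shift a signal k positions to the right (negative k moves left)."""
    n = len(signal)
    return [signal[(i - k) % n] for i in range(n)]


signal = [0, 0, 0, 1, 3, 1, 0, 0]  # a toy one-dimensional "cat"
kernel = [1, -2, 1]                # hypothetical shared weights

# Move the cat three pixels to the left, then filter ...
filtered_after_move = circular_conv(shift(signal, -3), kernel)
# ... or filter first, then move the result three pixels to the left.
moved_after_filter = shift(circular_conv(signal, kernel), -3)

is_equivariant = filtered_after_move == moved_after_filter  # True

# Weight sharing: a dense layer from n inputs to n outputs needs n * n weights,
# while the convolution needs only len(kernel).
dense_params = len(signal) ** 2  # 64
conv_params = len(kernel)        # 3
```

The dense layer would have to learn the same three numbers at every position separately. The convolution is told once, by construction.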

Compression Was Symmetry

Compression works because many things are effectively the same.

The more symmetry a domain contains, the more aggressively it can be compressed.

A dictionary, a grammar, a scientific law, and a JPEG all perform the same miracle:

They discover equivalence classes.

They find transformations that preserve meaning.

In retrospect, compression was always symmetry bookkeeping.
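A toy version of that bookkeeping, as a hedged Python sketch: it assumes, purely for illustration, that letter case and spacing are meaning-preserving transformations, and the `canonical` helper and the sample phrases are made up.

```python
def canonical(phrase):
    """Map a phrase to one representative of its equivalence class.

    Assumption of this toy: case and spacing never change the meaning.
    """
    return " ".join(phrase.lower().split())


observations = ["The cat", "the  cat", "THE CAT", "a dog", "A Dog "]

# Group every observation under its canonical representative.
classes = {}
for phrase in observations:
    classes.setdefault(canonical(phrase), []).append(phrase)

distinct_before = len(set(observations))  # 5 surface forms
distinct_after = len(classes)             # 2 equivalence classes
```

Five things to store become two, and the saving comes entirely from knowing which differences do not matter.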

Abstraction Was Symmetry

A child learns that many different dogs are all “dog.”

Different attributes:

Different colors
Different sizes
Different positions
Different lighting conditions

Same concept.

Abstraction is the discovery of invariance.

It is symmetry viewed from the inside.

Generalization Was Symmetry

The central mystery of learning is not memorization.

The central mystery is generalization.

Why does experience with one thing help us understand another?

The answer is surprisingly simple:

Because the two situations share structure.

Every successful generalization is a claim that some transformation preserves meaning.

In other words:

Generalization is symmetry recognition.

Physics Learned This Earlier

Physics began with objects.

It gradually became a science of transformations.

The great discoveries of the twentieth century repeatedly followed the same pattern:

What appeared fundamental turned out to be downstream of symmetry.

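Examples:

Conservation of energy follows from symmetry under time translation.
Conservation of momentum follows from symmetry under translation in space.
Conservation of angular momentum follows from symmetry under rotation.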

Again and again the deeper explanation was found in the transformations that left reality unchanged.

The objects survived.
The ontology changed.

Machine Learning May Be Repeating The Pattern

The early years focused on neurons.

The scaling years focused on parameters.

The transformer years focused on attention.

Each step was productive.
Each step was revolutionary.
But each step may also have been moving toward a deeper realization.

The important question is not:

How many parameters are there?

Nor:

What receives attention?

But:

What transformations preserve meaning?

The first two questions are implementation details.

The last question is about structure.

Why Nobody Noticed

Because symmetry is invisible when it works.

Nobody notices translation symmetry when reading a sentence.
Nobody notices rotational symmetry when recognizing a face.
Nobody notices conservation laws while catching a ball.

Symmetry removes effort.

It removes surprise.
It removes computation.

The better it works, the less visible it becomes.

Like gravity, it disappears into normality.

The Noether Test

A useful thought experiment is to ask:

If this architecture works, what symmetry is it exploiting?

If the answer is clear, the model may be efficient because it understands the domain.

If the answer is unclear, the model may simply be large enough to compensate.

This is not a proof.

But it is often a clue.
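One way to make the test concrete is to measure how far a model is from commuting with a transformation. The sketch below is an assumption-laden illustration in plain Python: `equivariance_gap`, `roll`, and the two toy "models" are hypothetical names, and the only symmetry tried is a cyclic shift.

```python
def equivariance_gap(model, act_on_input, act_on_output, samples):
    """Largest elementwise difference between model(g.x) and g.model(x)."""
    gap = 0.0
    for x in samples:
        left = model(act_on_input(x))
        right = act_on_output(model(x))
        gap = max(gap, max(abs(a - b) for a, b in zip(left, right)))
    return gap


def roll(xs):
    """Cyclic shift by one position to the right."""
    return xs[-1:] + xs[:-1]


def circular_diff(xs):
    """Applies one shared rule at every position, so it respects shifts."""
    n = len(xs)
    return [xs[(i + 1) % n] - xs[i] for i in range(n)]


def position_weighted(xs):
    """Treats every position differently, so it has no shift symmetry."""
    return [i * x for i, x in enumerate(xs)]


samples = [[1.0, 2.0, 4.0, 8.0], [0.0, 3.0, 0.0, -1.0]]

gap_shared = equivariance_gap(circular_diff, roll, roll, samples)          # 0.0
gap_positional = equivariance_gap(position_weighted, roll, roll, samples)  # 24.0
```

A gap of zero says the architecture has the symmetry built in. A large gap says that whatever the model knows about shifts, it had to learn the hard way.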

The Suspicion

The suspicion behind the play is not that attention is unimportant.

It is that attention may have received credit that properly belongs to symmetry.

Attention is spectacularly useful.

But perhaps its greatest achievement was revealing how much structure was already present.

The possibility is that intelligence is not primarily the art of attending.

It is the art of discovering what remains the same while everything else changes.

And that, for several centuries now, has been called symmetry.

Appendix III: Neural Networks As Symmetry Encoders

The standard description of a neural network is that it is a function approximator.

This is true.

It is also incomplete.

A lookup table is a function approximator.
A spreadsheet is a function approximator.
A sufficiently patient accountant is a function approximator.

What makes neural networks interesting is not merely that they approximate functions.

It is that they generalize.

The mystery is not:

How can a machine memorize?

The mystery is:

How can a machine learn something once and apply it somewhere else?

The answer appears to be symmetry.

The Generalization Problem

Suppose a neural network sees ten thousand cats.

No two are identical.

Variations:

Different lighting
Different colors
Different poses
Different backgrounds
Different viewpoints

Yet somehow the network learns:

These are all cats.

This is a remarkable achievement.

The network has discovered that many differences are irrelevant.

In mathematical language, it has discovered an invariance.

In ordinary language, it has discovered a symmetry.

What Learning Actually Learns

From this perspective, learning is not primarily the accumulation of facts.

Learning is the discovery of transformations that preserve meaning.

A successful network learns:

which differences matter
which differences do not
which transformations preserve identity
which transformations create something new

This is precisely the work performed by symmetry.

The Three Tasks

Every learning system faces three related problems.

1. Invariance

What can change without changing meaning?

Examples:

location
lighting
phrasing
handwriting

Changes in these are symmetries.

2. Equivariance

What should change predictably when the input changes?

A rotated object should produce a correspondingly rotated representation.

A translated image should produce a correspondingly translated activation pattern.

These are structured symmetries.

3. Symmetry Breaking

Where do differences suddenly matter?

Distinctions that matter:

6 and 9
friend and enemy
signal and noise

These cannot be explained by symmetry alone.

Intelligence requires knowing where symmetry ends.

Meaning lives in the break.
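The 6 and 9 case fits in a few lines. This is a toy Python sketch with invented glyphs and invented feature names (`ink_count`, `bottom_heavy`); it assumes nothing beyond a five-row ASCII drawing of a 6.

```python
def rotate180(grid):
    """Rotate a grid (a list of equal-length strings) by 180 degrees."""
    return [row[::-1] for row in reversed(grid)]


SIX = [
    "###",
    "#..",
    "###",
    "#.#",
    "###",
]
NINE = rotate180(SIX)  # the same ink, turned upside down


def ink_count(grid):
    """Rotation-invariant feature: total number of filled cells."""
    return sum(row.count("#") for row in grid)


def bottom_heavy(grid):
    """Rotation-sensitive feature: is there more ink below the middle row than above?"""
    half = len(grid) // 2
    top = sum(row.count("#") for row in grid[:half])
    bottom = sum(row.count("#") for row in grid[-half:])
    return bottom > top


same_ink = ink_count(SIX) == ink_count(NINE)          # True: invariance cannot tell them apart
told_apart = bottom_heavy(SIX) != bottom_heavy(NINE)  # True: the broken symmetry can
```

A fully rotation-invariant reader would be very efficient and unable to tell 6 from 9. The useful reader is invariant to small rotations and deliberately not invariant to this one.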

Attention Revisited

This perspective suggests a different interpretation of attention.

The transformer revolution demonstrated that dynamic routing is extraordinarily powerful.

But routing what?

Not raw reality.

Routing among learned structures.

Attention does not create symmetry.
Attention exploits symmetry.

Or more precisely:

Attention discovers where learned symmetries should be applied and where they should be broken.

This makes attention essential.

It does not make attention fundamental.
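There is a small, concrete version of this. Dot-product self-attention with no position information treats its input as an unordered set: shuffle the tokens and the outputs are shuffled the same way. The Python sketch below is a simplification under stated assumptions: it uses identity projections instead of learned query, key and value matrices, and `add_position` is a hypothetical stand-in for a positional encoding that simply appends the token index as one extra number.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention where each token is its own query, key and value."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def add_position(tokens):
    """Break the symmetry: append the token index as an extra feature (toy encoding)."""
    return [tok + [float(i)] for i, tok in enumerate(tokens)]


def permute(rows, order=(2, 0, 1)):
    return [rows[i] for i in order]


def close(a, b, tol=1e-9):
    return all(abs(x - y) <= tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# No positions: reordering the input only reorders the output.
symmetric = close(self_attention(permute(tokens)), permute(self_attention(tokens)))  # True

# With positions: the same reordering now changes what each token computes.
still_symmetric = close(
    self_attention(add_position(permute(tokens))),
    permute(self_attention(add_position(tokens))),
)  # False
```

The permutation symmetry comes for free with the mechanism. Word order only starts to matter once something breaks that symmetry on purpose.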

Why Symmetry Hides

The irony is that successful symmetries become invisible.

Nobody notices that the meaning of a sentence is invariant to font choice.
Nobody notices that faces remain faces under modest rotations.
Nobody notices that scientific laws remain true tomorrow.

The symmetries disappear because they are working.

Like good infrastructure, they become background assumptions.

The Strong Form

The strongest version of this idea is not:

Neural networks use symmetry.

That has been obvious for decades.

The stronger claim is:

Neural networks are symmetry encoders.

Their primary task is not function approximation.

Function approximation is merely the mechanism.

The deeper objective is discovering and representing the invariances, equivariances, and symmetry-breakings that structure a domain.

A Noetherian Interpretation

From this viewpoint, the history of machine learning begins to resemble the history of physics.

Physics gradually discovered that many apparently fundamental phenomena were consequences of symmetry.

Machine learning may be undergoing a similar transition.

The future may not belong primarily to larger models.

It may belong to models that better align their internal symmetries with the symmetries of the world.

If so, neural networks are not best understood as collections of parameters.

They are better understood as machines for encoding symmetry.

The weights are simply where the symmetries live.
