[Verse 1]
Start with queries, keys, and values in your hand
Dot product matrix multiply, the neural pathways planned
Each query finds its matching key through multiplication's dance
Transpose and calculate the weights, give memories their chance
But here's the twist that makes it work, divide by square root dee-kay
Without this scaling factor, friend, your gradients fade away
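The recipe in this verse can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names and the toy `queries`, `keys`, and `values` are assumptions made up for the example, and it uses nested lists instead of a tensor library.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    # queries: n_q rows of length d_k; keys: n_k rows of length d_k;
    # values: n_k rows of length d_v.
    d_k = len(keys[0])
    scale = math.sqrt(d_k)
    outputs, weights = [], []
    for q in queries:
        # Query times key transpose, divided by the square root of d_k.
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(logits)
        weights.append(w)
        # Weighted sum of the value rows.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
    return outputs, weights

# Toy inputs (hypothetical): each query matches exactly one key.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
outputs, weights = scaled_dot_product_attention(queries, keys, values)
```

Each row of `weights` sums to one, and each query puts most of its weight on the key it matches.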
[Chorus]
Attention mechanism, scaled dot-product style
Query times key transpose, softmax with a smile
Divided by the square root, keeps the variance tight
Content-addressable lookup, patterns burning bright
From Hopfield to transformers, same retrieval game
Different math, same magic, neural networks claim their fame
[Verse 2]
Code it from the ground up, matrix operations clean
Temperature controls the sharpness of your softmax scene
When dee-kay grows enormous, watch the chaos unfold
Softmax saturates, gradients vanish, and the training runs go cold
That's why we scale by square root, mathematical salvation
Keeps the logits well-behaved across each computation
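The scaling argument in this verse can be checked numerically. The sketch below assumes query and key entries drawn independently from a standard normal distribution, which is the usual textbook setup and not something the lyrics state. Under that assumption the raw dot product has variance close to d_k, and large logits push softmax into a saturated region where its gradient is tiny. All names here are hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_variance(d_k, samples=2000, seed=0):
    # Sample variance of q . k for random q, k with unit-variance entries.
    rng = random.Random(seed)
    dots = []
    for _ in range(samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / samples
    return sum((d - mean) ** 2 for d in dots) / samples

d_k = 256
raw_var = dot_product_variance(d_k)
# Dividing the logits by sqrt(d_k) divides their variance by d_k.
scaled_var = raw_var / d_k

# Same logits at two magnitudes: modest, and blown up by sqrt(d_k).
logits = [0.0, 1.0, 2.0]
mild = softmax(logits)
sharp = softmax([x * math.sqrt(d_k) for x in logits])

# p * (1 - p) is the diagonal entry of the softmax Jacobian for the top
# class; it shrinks toward zero as the distribution saturates.
mild_grad = max(mild) * (1.0 - max(mild))
sharp_grad = max(sharp) * (1.0 - max(sharp))
```

Multiplying the logits by a constant acts like lowering the softmax temperature, so `sharp` is close to one-hot while `mild` stays spread out.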
[Chorus]
Attention mechanism, scaled dot-product style
Query times key transpose, softmax with a smile
Divided by the square root, keeps the variance tight
Content-addressable lookup, patterns burning bright
From Hopfield to transformers, same retrieval game
Different math, same magic, neural networks claim their fame
[Bridge]
Hopfield networks store and retrieve through energy descent
Attention heads retrieve in one pass, softmax weights well-spent
Each head learns specialized patterns, syntax or semantics
Probe the trained transformer weights, see linguistic acrobatics
Some heads track syntactic structure, others semantic meaning
Content-addressable memory with purpose intervening
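For the Hopfield half of the comparison, here is a rough sketch of content-addressable retrieval. It assumes the classic binary Hopfield network with Hebbian weights and asynchronous sign updates; the lyrics do not say which variant is meant, and the two stored patterns are made up for illustration.

```python
def hebbian_weights(patterns):
    # Sum of outer products of the stored patterns, with a zero diagonal.
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w

def energy(w, state):
    # E(s) = -1/2 * sum_ij w_ij * s_i * s_j
    n = len(state)
    return -0.5 * sum(w[i][j] * state[i] * state[j]
                      for i in range(n) for j in range(n))

def retrieve(w, probe, sweeps=5):
    # Asynchronous updates: each unit takes the sign of its local field.
    # With symmetric weights and a zero diagonal, no update raises the energy.
    state = list(probe)
    for _ in range(sweeps):
        changed = False
        for i in range(len(state)):
            field = sum(w[i][j] * state[j] for j in range(len(state)))
            new = state[i] if field == 0 else (1 if field > 0 else -1)
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            break
    return state

# Two orthogonal patterns (hypothetical) stored in one weight matrix.
stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
w = hebbian_weights(stored)
probe = [-1, 1, 1, 1, -1, -1, -1, -1]  # first pattern with one bit flipped
retrieved = retrieve(w, probe)
```

The corrupted probe settles back onto the first stored pattern, ending at a lower energy than it started with.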
[Verse 3]
Empirical investigation shows what heads have learned
Each layer captures different traits, patterns they've discerned
Visualize attention maps, see where focus lands
Specific linguistic phenomena, coded by trained hands
Position encoding, word relationships, grammar rules encoded
Distributed computation where intelligence is loaded
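One low-tech way to see where focus lands is to print an attention matrix as a text heat map. The sketch below is only an illustration: the `tokens` and the `weights` matrix are invented numbers, not values read from any trained model, and the helper names are hypothetical.

```python
SHADES = " .:*#"  # darker character means a larger attention weight

def render_attention_map(tokens, weights):
    # One output line per query token, one shaded cell per key token.
    width = max(len(t) for t in tokens)
    lines = []
    for token, row in zip(tokens, weights):
        cells = "".join(
            SHADES[min(int(w * len(SHADES)), len(SHADES) - 1)] for w in row
        )
        lines.append(f"{token.rjust(width)} |{cells}|")
    return "\n".join(lines)

def strongest_focus(tokens, weights):
    # For each query token, the key token that receives the most weight.
    return {
        tokens[i]: tokens[max(range(len(row)), key=row.__getitem__)]
        for i, row in enumerate(weights)
    }

# Made-up attention weights for a three-token input; each row sums to 1.
tokens = ["the", "cat", "sat"]
weights = [
    [0.8, 0.1, 0.1],
    [0.3, 0.6, 0.1],
    [0.1, 0.7, 0.2],
]
heatmap = render_attention_map(tokens, weights)
focus = strongest_focus(tokens, weights)
print(heatmap)
```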
[Verse 4]
Multi-head attention splits the load across dimensions
Parallel processing power with focused intentions
Each head gets its slice of hidden state representation
Different perspectives combining through concatenation
Like a symphony orchestra, each section plays its part
Mathematics orchestrating the computational art
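The split-and-concatenate idea in this verse can be sketched as follows. This is a simplified illustration under stated assumptions: it omits the learned query, key, value, and output projections that a real multi-head layer would include, and every head runs self-attention on its own slice of the hidden state. The function names and the toy input `x` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention over lists of row vectors.
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(logits)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs

def split_heads(rows, num_heads):
    # Give each head a contiguous slice of every token's hidden vector.
    d_model = len(rows[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in rows]
            for h in range(num_heads)]

def multi_head_self_attention(x, num_heads):
    # No learned projections here (assumption made to keep the sketch short).
    head_outputs = [attention(h, h, h) for h in split_heads(x, num_heads)]
    # Concatenate the per-head outputs back together, token by token.
    return [sum((head_out[t] for head_out in head_outputs), [])
            for t in range(len(x))]

# Three tokens with a hidden size of four, split across two heads.
x = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
out = multi_head_self_attention(x, num_heads=2)
```

Because each head only sees its own slice, changing the last two dimensions of `x` leaves the first two dimensions of `out` untouched.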
[Chorus]
Attention mechanism, scaled dot-product style
Query times key transpose, softmax with a smile
Divided by the square root, keeps the variance tight
Content-addressable lookup, patterns burning bright
From Hopfield to transformers, same retrieval game
Different math, same magic, neural networks claim their fame
[Outro]
Cortical columns computing, attention heads aligned
Mathematics of the mind