Unit 4.2 — The Transformer Architecture

barbershop balkan brass band, ambient house p-funk, coptic flamenco, alt-country disco · 4:47

Lyrics

[Verse 1]
In twenty-seventeen the paper dropped like thunder
"Attention Is All You Need" tore RNNs asunder
No more sequential chains that crawl from left to right
Every token speaks to all in parallel delight

Query, key, and value vectors dance in matrix space
Scaled dot-product attention puts each word in place
Softmax weights the relevance, then weighted sums emerge
Self-attention mechanisms let the meanings converge
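
For readers who want the verse in code, here is a minimal NumPy sketch of the scaled dot-product attention it describes: queries score against keys, softmax turns scores into weights, and each output is a weighted sum of value vectors. The toy shapes and random inputs are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # relevance of every key to every query
    weights = softmax(scores, axis=-1)               # each query's weights sum to 1
    return weights @ V                               # weighted sum of the value vectors

# Toy example: 4 tokens with 8-dimensional queries, keys, and values (illustrative sizes).
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # -> (4, 8)
```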

[Chorus]
Multi-head attention splits the representation
Eight heads see different angles, rich interpretation
Add and normalize the layers, feed-forward in between
Positional encoding tells us where each token's been

Transformer architecture, encoder-decoder strong
But variants evolved beyond the original song
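
A sketch of the multi-head split the chorus mentions, reusing the `softmax` and `scaled_dot_product_attention` functions from the previous sketch. The head count and d_model = 512 match the original paper; the random weights and tiny sequence are illustrative assumptions only.

```python
def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=8):
    # Project the input once, then split d_model across num_heads smaller heads.
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):                    # (seq, d_model) -> (heads, seq, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    heads = scaled_dot_product_attention(Qh, Kh, Vh)   # each head attends independently
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                # final output projection

# Illustrative sizes: d_model = 512 split across 8 heads of 64 dimensions each.
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 512))
W_q, W_k, W_v, W_o = (rng.normal(scale=0.02, size=(512, 512)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o).shape)   # -> (4, 512)
```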

[Verse 2]
BERT reads bidirectional, encoder-only beast
Masked language modeling serves a prediction feast
GPT flows left-to-right, decoder autoregressive
Causal masks prevent the cheating, training so progressive

T5 treats everything as text-to-text translation
Encoder-decoder hybrid sparks new innovation
From five-twelve context windows to millions now we scale
Chinchilla laws guide compute so models never fail
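
The causal mask from the verse can be shown in a few lines: additive -inf entries above the diagonal mean each position attends only to itself and earlier tokens, so after softmax the "future" weights are exactly zero. The sequence length and zero scores below are stand-in assumptions for illustration.

```python
import numpy as np

def causal_mask(seq_len):
    # Upper-triangular -inf mask: token i may attend only to tokens 0..i.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

seq_len = 5
scores = np.zeros((seq_len, seq_len))             # stand-in for raw attention scores
masked = scores + causal_mask(seq_len)            # future positions become -inf
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # softmax: future weights are exactly 0
print(np.round(weights, 2))
```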

[Chorus]
Multi-head attention splits the representation
Eight heads see different angles, rich interpretation
Add and normalize the layers, feed-forward in between
Positional encoding tells us where each token's been

Transformer architecture, encoder-decoder strong
But variants evolved beyond the original song

[Bridge]
Sinusoidal waves or learned embeddings show position
RoPE rotates queries and keys, ALiBi's new tradition
Flash Attention tames the quadratic memory curse
KV caching stores the past so inference won't rehearse

Mixture of Experts routes to specialized domains
Emergent abilities bloom when scale remains
From nanoGPT implementations to giants in the cloud
The attention mechanism makes intelligence loud
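
The bridge's first line refers to the sinusoidal encoding from the original paper; a small sketch of it follows, assuming the standard PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) formulation. The d_model of 64 in the usage line is an illustrative assumption.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# One fixed vector per position, added to the token embeddings.
print(sinusoidal_positional_encoding(512, 64).shape)   # -> (512, 64)
```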

[Outro]
Layer norm stabilizes gradients through the stack
Residual connections help the information track back
Scaled attention is the secret, parallelism the key
Transformers changed everything in AI's symphony
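
The outro's "add and normalize" wiring, sketched under the post-norm arrangement of the original paper; the learnable scale and shift of layer norm are omitted, and the stand-in sub-layers and toy input are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean and unit variance
    # (learnable gain and bias omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def transformer_block(x, attention_fn, ffn_fn):
    # Post-norm "Add & Norm": the residual path gives gradients a
    # straight route back through the stack.
    x = layer_norm(x + attention_fn(x))   # sub-layer 1: self-attention, add, normalize
    x = layer_norm(x + ffn_fn(x))         # sub-layer 2: feed-forward, add, normalize
    return x

# Toy usage with stand-in sub-layers (no learned weights).
x = np.random.default_rng(2).normal(size=(4, 16))
print(transformer_block(x, attention_fn=lambda t: t, ffn_fn=np.tanh).shape)   # -> (4, 16)
```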

โ† Unit 4.1 โ€” NLP Foundations | Unit 4.3 โ€” Working with Large Language Models โ†’