[Verse 1]
In twenty-seventeen the paper dropped like thunder
"Attention Is All You Need" tore RNNs asunder
No more sequential chains that crawl from left to right
Every token speaks to all in parallel delight
Query, key, and value vectors dance in matrix space
Scaled dot-product attention puts each word in place
Softmax weights the relevance, then weighted sums emerge
Self-attention mechanisms let the meanings converge

[Chorus]
Multi-head attention splits the representation
Eight heads see different angles, rich interpretation
Add and normalize the layers, feed-forward in between
Positional encoding tells us where each token's been
Transformer architecture, encoder-decoder strong
But variants evolved beyond the original song

[Verse 2]
BERT reads bidirectional, encoder-only beast
Masked language modeling makes predictions feast
GPT flows left-to-right, decoder autoregressive
Causal masks prevent the cheating, training so progressive
T5 treats everything as text-to-text translation
Encoder-decoder hybrid sparks new innovation
From five-twelve context windows to millions now we scale
Chinchilla laws guide compute so models never fail

[Chorus]
Multi-head attention splits the representation
Eight heads see different angles, rich interpretation
Add and normalize the layers, feed-forward in between
Positional encoding tells us where each token's been
Transformer architecture, encoder-decoder strong
But variants evolved beyond the original song

[Bridge]
Sinusoidal waves or learned embeddings show position
RoPE rotates the queries, ALiBi's new tradition
Flash Attention tames the quadratic memory curse
KV caching stores the past so inference won't rehearse
Mixture of Experts routes to specialized domains
Emergent abilities bloom when scale remains
From nanoGPT implementations to giants in the cloud
The attention mechanism makes intelligence loud

[Outro]
Layer norm stabilizes gradients through the stack
Residual connections help the information track back
Scaled attention is the secret, parallelism the key
Transformers changed everything in AI's symphony
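
For readers who want to hear Verse 1 in code, here is a minimal NumPy sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. The toy sizes and the self-attention call at the end are illustrative, not from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # query-key relevance
    weights = softmax(scores, axis=-1)              # rows sum to one
    return weights @ V                              # weighted sum of values

# Toy self-attention: 4 tokens, dimension 8; Q, K, V would normally be
# separate learned projections of the same token embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)  # (4, 8)
```

The division by √d_k is the "scaled" part: it keeps the dot products from growing with dimension, so the softmax doesn't saturate.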
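The chorus's eight heads, sketched under the same assumptions; the matrices W_q, W_k, W_v, W_o are random stand-ins for learned projections. Each head attends in its own d_model / num_heads subspace, and the heads are then concatenated and projected back:

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads=8):
    """Project x, split d_model into num_heads subspaces, attend in each
    head independently, then concatenate and project back out."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq, d_model) -> (num_heads, seq, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = (split_heads(x @ W) for W in (W_q, W_k, W_v))
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # per-head scores
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)             # softmax per head
    heads = weights @ V                                   # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                   # output projection

# Toy usage: 10 tokens, d_model = 64, 8 heads of size 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 64))
Ws = [rng.normal(size=(64, 64)) * 0.1 for _ in range(4)]
print(multi_head_attention(x, *Ws).shape)  # (10, 64)
```

Splitting rather than running eight full-width attentions keeps the cost roughly constant while letting each head specialize, the "different angles" of the lyric.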
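Verse 2's causal mask, in a hedged single-head sketch: future positions get a score of negative infinity before the softmax, which becomes a weight of exactly zero, so a decoder token can only attend to itself and earlier tokens and training can't "cheat" by peeking ahead:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Decoder-style self-attention with a causal (look-behind-only) mask."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Strictly upper-triangular mask marks the future positions.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)   # -inf -> softmax weight of 0
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(causal_attention(x, x, x).shape)  # (4, 8)
```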
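The bridge's sinusoidal option, roughly as in the original paper: fixed sine and cosine waves at geometrically spaced frequencies, added to the token embeddings so the otherwise order-blind attention layers can tell positions apart. A minimal sketch, assuming an even d_model:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(seq_len)[:, None]          # (seq, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # (1, d_model/2)
    angle = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # even dims: sine
    pe[:, 1::2] = np.cos(angle)                # odd dims: cosine
    return pe

# These fixed waves are simply added to the embeddings before layer one.
print(sinusoidal_positional_encoding(512, 64).shape)  # (512, 64)
```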
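And the bridge's KV cache, sketched for a single head with hypothetical names (decode_step is not a real API): during autoregressive inference, keys and values for past tokens are stored rather than recomputed, so each step only projects the newest token, which is why "inference won't rehearse":

```python
import numpy as np

def decode_step(new_emb, k_cache, v_cache, W_q, W_k, W_v):
    """One decoding step: project only the newest token, append its key
    and value to the cache, and attend over everything cached so far."""
    q = new_emb @ W_q                                   # (1, d)
    k_cache = np.concatenate([k_cache, new_emb @ W_k])  # append new key
    v_cache = np.concatenate([v_cache, new_emb @ W_v])  # append new value
    scores = q @ k_cache.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache, k_cache, v_cache

# Toy loop: the cache grows by one row per generated token.
rng = np.random.default_rng(0)
d = 16
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
k_cache = v_cache = np.zeros((0, d))
for _ in range(5):
    out, k_cache, v_cache = decode_step(rng.normal(size=(1, d)),
                                        k_cache, v_cache, W_q, W_k, W_v)
print(k_cache.shape)  # (5, 16)
```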
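Finally, the outro's "add and normalize", in the post-norm arrangement of the original Transformer; the learned gain and bias of a real layer norm are omitted here for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean, unit variance.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def add_and_norm(x, sublayer):
    """Residual connection plus layer norm: the sublayer's output is added
    back to its input, then normalized, keeping gradients stable through
    a deep stack."""
    return layer_norm(x + sublayer(x))

# Usage: wrap any sublayer, e.g. self-attention or the feed-forward net.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(add_and_norm(x, lambda t: t @ rng.normal(size=(8, 8)) * 0.1).shape)
```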