[Verse 1]
Back in twenty-seventeen, attention changed the game
No more recurrent networks, sequential processing came to shame
Query, key, and value vectors dancing in the light
Self-attention mechanisms bringing context into sight
Scaled dot-product attention, softmax makes it smooth
Parallel computation, that's the Transformer groove

[Chorus]
Multi-head attention, split and recombine
Feed-forward networks, layer norm in line
Positional encoding tells us where we are
Transformer architecture, shining like a star
From encoder-decoder to the variants we see
BERT and GPT and T5, the family tree

[Verse 2]
Sinusoidal positions, learned embeddings too
RoPE and ALiBi, different ways to pursue
Context understanding without the sequence chain
Multi-head splits the space, then merges once again
Eight heads or sixteen heads, attending to each part
Residual connections keep the gradients smart

[Chorus]
Multi-head attention, split and recombine
Feed-forward networks, layer norm in line
Positional encoding tells us where we are
Transformer architecture, shining like a star
From encoder-decoder to the variants we see
BERT and GPT and T5, the family tree

[Bridge]
Chinchilla laws guide us, scaling compute right
Emergent abilities appear when models reach new height
Five-twelve tokens started small, now millions in the span
Flash Attention optimizes, KV caching helps it stand
Mixture of Experts routing, efficiency refined
From nanoGPT to giants, evolution by design

[Verse 3]
Encoder-only BERT learns bidirectional flow
Decoder-only GPT makes the next tokens grow
Encoder-decoder T5 translates and generates
Layer norm before attention, that's what research indicates
Implement from scratch in PyTorch, see how all the pieces fit
Attention is all you need, and now you've mastered it

[Chorus]
Multi-head attention, split and recombine
Feed-forward networks, layer norm in line
Positional encoding tells us where we are
Transformer architecture, shining like a star
From encoder-decoder to the variants we see
BERT and GPT and T5, the family tree

[Outro]
Self-attention revolution, parallel and clean
The most important architecture the world has ever seen
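The third verse invites you to implement the pieces from scratch in PyTorch. As a starting point, here is a minimal sketch of scaled dot-product attention and the multi-head "split and recombine" step from the chorus; the class name, the sizes (d_model=512, eight heads), and the omission of dropout, positional encoding, layer norm, and residual connections are illustrative simplifications, not the full architecture.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v, mask=None):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # (..., seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                       # each query's weights sum to 1
    return weights @ v


class MultiHeadAttention(nn.Module):
    """Multi-head attention: project, split into heads, attend, recombine (illustrative sketch)."""

    def __init__(self, d_model=512, num_heads=8):  # sizes chosen for illustration
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch, seq, d_model = x.shape

        def split(t):
            # Split the model dimension across heads: (batch, heads, seq, d_head)
            return t.view(batch, seq, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        out = scaled_dot_product_attention(q, k, v, mask)        # attend per head in parallel
        out = out.transpose(1, 2).reshape(batch, seq, d_model)   # merge heads back together
        return self.out_proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 10, 512)       # (batch, sequence length, model dim)
    attn = MultiHeadAttention()
    print(attn(x).shape)              # torch.Size([2, 10, 512])
```

The same attention routine is reused for every head; only the projections differ, which is why the computation stays fully parallel across positions, in contrast to the recurrent networks mentioned in Verse 1.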