Gradient descent

acoustic, folk, soulful, warm
Lyrics

[Verse 1]
Started with a function, peaks and valleys scattered wide
Loss landscape stretching out, nowhere for errors to hide
Pick a random spot to land, that's initialization
Compute the slope beneath your feet, that's differentiation
Negative gradient points the way to lower elevation
Take a step in that direction, call it optimization
Learning rate controls your stride, too big you'll overshoot
Too small and you'll crawl for days, gotta find that sweet pursuit

[Chorus]
Descend, descend, follow the slope down
Gradient vector shows you which way to go
Step size matters, don't jump around
Converge to minimum, watch that loss flow
Partial derivatives, chain rule bound
Backprop feeding signals to and fro
Descend, descend, till optimal's found

[Verse 2]
Stochastic brings the noise, mini-batches keep it lean
Instead of full dataset, just a sample in between
Momentum builds velocity, smooths out the jagged path
Exponential moving average helps you avoid the wrath
Of saddle points and plateaus where gradients disappear
Adam optimizer adapts, keeps your progress clear
Learning rate decay schedules, start fast then take it slow
Batch normalization helps the signals smoothly flow

[Chorus]
Descend, descend, follow the slope down
Gradient vector shows you which way to go
Step size matters, don't jump around
Converge to minimum, watch that loss flow
Partial derivatives, chain rule bound
Backprop feeding signals to and fro
Descend, descend, till optimal's found

[Bridge]
Local minimum traps you, global's what you seek
Convex functions guarantee no false valleys, so to speak
Non-convex landscapes hide multiple solutions deep
Random restarts help you find the valley you can keep

[Verse 3]
Weight decay adds penalty, keeps parameters in check
L2 regularization prevents the model wreck
Gradient clipping saves you when explosions start to build
Learning rate schedules and warm restarts keep you skilled
Convergence criteria tell you when the work is done
Tolerance thresholds signal that the race is finally won

[Chorus]
Descend, descend, follow the slope down
Gradient vector shows you which way to go
Step size matters, don't jump around
Converge to minimum, watch that loss flow
Partial derivatives, chain rule bound
Backprop feeding signals to and fro
Descend, descend, till optimal's found

[Outro]
From random initialization to convergence tight
Gradient descent guides neural networks through the night
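
A minimal sketch of the loop the verses describe, for anyone who wants to follow along in code: pick a starting point, compute the slope, step against it, and stop once a tolerance threshold is met. The function name `gradient_descent`, its parameters, and the example function f(x) = (x - 3)^2 are assumptions made up for illustration, not anything the song specifies.

```python
def gradient_descent(grad, x0, learning_rate=0.1, tolerance=1e-8, max_steps=10_000):
    """Minimize a one-dimensional function given its derivative `grad`.

    All names and defaults here are illustrative assumptions.
    """
    x = x0  # initialization: the spot where we land
    for _ in range(max_steps):
        g = grad(x)  # differentiation: the slope beneath our feet
        if abs(g) < tolerance:  # convergence criterion: slope is nearly flat
            break
        x = x - learning_rate * g  # step along the negative gradient
    return x


def grad_f(x):
    """Derivative of the example function f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)


# A modest learning rate walks down to the minimum at x = 3.
x_min = gradient_descent(grad_f, x0=-5.0)

# Too big a stride overshoots: each step lands farther from the minimum.
x_overshoot = gradient_descent(grad_f, x0=-5.0, learning_rate=1.1, max_steps=50)
```

With this convex example there is only one valley, so any starting point works; on a non-convex function the same loop can settle in a local minimum, which is where the random restarts from the bridge come in.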