Unit 2.3 — Reinforcement Learning

acid techno, Korean afro-funk · 4:12
Lyrics

[Verse 1]
An agent wanders through environments unknown
Taking actions as each new state is shown
Each decision yields a numerical prize
Reward signals guide where wisdom lies
Markov processes hold the golden key
Future depends on present, not history

[Chorus]
RL framework spinning round and round
Agent, environment, reward profound
Policy maps the states to actions clear
Bellman equations make the path appear
Model-free or model-based you choose
Q-learning wins when optimal's what you cruise
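
Study note: the Bellman equations named in the chorus can be written out as below. This is a sketch in standard textbook notation; the symbols (discount factor \gamma, transition probabilities P, reward function R) are assumptions, since the lyrics do not define any.

```latex
% Bellman expectation equation for the value of a policy \pi
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)
             \left[ R(s, a, s') + \gamma \, V^{\pi}(s') \right]

% Bellman optimality equation for action values (the target Q-learning aims at)
Q^{*}(s, a) = \sum_{s'} P(s' \mid s, a)
              \left[ R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a') \right]
```

Here "model-based" methods use P and R directly, while "model-free" methods such as Q-learning estimate Q^{*} from sampled transitions alone.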

[Verse 2]
Value functions estimate the future gain
Q-tables store what actions will obtain
SARSA learns on-policy's steady beat
While Q-learning off-policy can't be beat
Deep networks handle spaces way too vast
DQN bridges neural futures with the past
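
Study note: the on-policy versus off-policy contrast in this verse comes down to one term in the update target. The sketch below is a minimal tabular illustration, not a full training loop; the function names, the dict-of-dicts Q-table, and the default `alpha` and `gamma` values are all assumptions made for the example.

```python
def make_q():
    """Tiny hypothetical Q-table: Q[state][action] -> estimated return."""
    return {0: {"L": 0.0, "R": 0.0}, 1: {"L": 1.0, "R": 2.0}}


def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best-valued action in s_next."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action actually chosen in s_next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


if __name__ == "__main__":
    q1 = make_q()
    q_learning_update(q1, 0, "R", 1.0, 1)   # target uses max over Q[1]
    q2 = make_q()
    sarsa_update(q2, 0, "R", 1.0, 1, "L")   # target uses Q[1]["L"]
    print(q1[0]["R"], q2[0]["R"])
```

Given the same transition, the two rules disagree whenever the next action taken is not the greedy one, which is exactly the case an exploring policy produces.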

[Chorus]
RL framework spinning round and round
Agent, environment, reward profound
Policy maps the states to actions clear
Bellman equations make the path appear
Model-free or model-based you choose
Q-learning wins when optimal's what you cruise

[Bridge]
Epsilon-greedy balances the scale
Exploration versus exploit without fail
UCB algorithms pull the bandit's arm
REINFORCE gradients work their policy charm
RLHF aligns the language models tight
Human feedback keeps the outputs right
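
Study note: the two exploration strategies named in the bridge can be sketched for a multi-armed bandit as follows. This assumes per-arm value estimates and pull counts are kept in plain lists; the function names and the exploration constant `c` are illustrative choices, and the bonus term follows the common UCB1 form.

```python
import math
import random


def epsilon_greedy(values, epsilon, rng=random):
    """With probability epsilon pick a random arm, otherwise the best-looking arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])


def ucb1(values, counts, t, c=2.0):
    """Pick the arm with the highest estimate plus an uncertainty bonus.

    values[i] is the mean reward seen for arm i, counts[i] its pull count,
    and t the total number of pulls so far.
    """
    for i, n in enumerate(counts):
        if n == 0:
            return i  # pull every arm once before comparing bonuses
    return max(
        range(len(values)),
        key=lambda i: values[i] + math.sqrt(c * math.log(t) / counts[i]),
    )
```

Epsilon-greedy explores uniformly at random, whereas UCB directs exploration toward arms that have been pulled least relative to the total.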

[Verse 3]
CartPole swinging in Gymnasium's frame
Training agents in the balance game
Robotics, games, and recommendations flow
Resource allocation makes the profits grow
Temporal difference learning updates each step
Neural networks hold the knowledge deeply kept
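
Study note: "updates each step" refers to temporal difference learning adjusting a value estimate after every transition rather than at the end of an episode. The sketch below uses a small random-walk chain instead of CartPole so it runs with the standard library alone; the environment, the function name, and the hyperparameters are assumptions chosen for illustration.

```python
import random


def td0_random_walk(episodes=2000, alpha=0.05, gamma=1.0, seed=0):
    """Estimate state values on a 5-state random walk with TD(0).

    States 1..5 are non-terminal; 0 and 6 are terminal. Stepping into
    state 6 pays reward 1, everything else pays 0.
    """
    rng = random.Random(seed)
    n = 5
    V = [0.0] * (n + 2)  # terminal states keep value 0
    for _ in range(episodes):
        s = (n + 1) // 2  # start in the middle state
        while 0 < s < n + 1:
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == n + 1 else 0.0
            # TD(0): move V[s] toward the one-step bootstrapped target
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V[1:-1]


if __name__ == "__main__":
    print([round(v, 2) for v in td0_random_walk()])
```

For this chain the true values are 1/6 through 5/6, so the estimates should rise from left to right, with some noise from the constant step size.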

[Chorus]
RL framework spinning round and round
Agent, environment, reward profound
Policy maps the states to actions clear
Bellman equations make the path appear
Model-free or model-based you choose
Q-learning wins when optimal's what you cruise

[Outro]
From bandits to the deepest neural maze
Reinforcement learning lights tomorrow's ways