[Verse 1]
An agent learns by trial and error's way
In environments where actions have their say
Each state transition brings a new reward
The policy maps what actions move us forward
Markov decisions make the future clear
Based only on the present state we're here

[Chorus]
Agent, environment, reward, policy
These four foundations set the learning free
Explore or exploit, that's the choice we face
Q-values guide us through the action space
Reinforcement learning finds the optimal way
Through Bellman's truth that guides us every day

[Verse 2]
Model-based knows the world's transition laws
While model-free learns without knowing cause
Q-learning updates values step by step
SARSA stays on-policy, never swept
Deep networks approximate the Q-function's might
DQN brings the pixels into sight

[Chorus]
Agent, environment, reward, policy
These four foundations set the learning free
Explore or exploit, that's the choice we face
Q-values guide us through the action space
Reinforcement learning finds the optimal way
Through Bellman's truth that guides us every day

[Bridge]
Epsilon-greedy takes a random chance
UCB gives confidence a dance
Policy gradients climb the reward hill
REINFORCE samples with gradient skill
Multi-armed bandits teach us how to choose
Between exploration and the wins we use

[Verse 3]
From robotics arms to games we play
Recommendation systems light the way
Resource allocation finds the best divide
While RLHF keeps language models aligned
CartPole swings with OpenAI's gym
Teaching DQN agents how to win

[Chorus]
Agent, environment, reward, policy
These four foundations set the learning free
Explore or exploit, that's the choice we face
Q-values guide us through the action space
Reinforcement learning finds the optimal way
Through Bellman's truth that guides us every day

[Outro]
From states to actions, rewards flow back
The learning loop stays on track
Reinforcement learning shows the way
To optimize another day
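
The lyrics name Q-learning, epsilon-greedy exploration, and the Bellman update without showing them in action. The sketch below is one minimal way to put those three ideas together, not a reference implementation. It assumes a made-up toy environment (a five-state corridor where moving right eventually reaches a goal worth reward 1), and the function name `q_learning_chain` and all hyperparameter values are illustrative choices, not something taken from the song or from any library.

```python
import random


def q_learning_chain(n_states=5, episodes=500, alpha=0.1, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical corridor environment.

    States are 0..n_states-1; actions are 0 (left) and 1 (right).
    Entering the rightmost state yields reward 1 and ends the episode.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # Q-table: q[state][action]
    goal = n_states - 1

    for _ in range(episodes):
        state = 0
        for _ in range(100):  # step cap so an episode cannot run forever
            # Epsilon-greedy: explore with probability epsilon, else exploit.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                best = max(q[state])
                # Break ties at random so early episodes still move around.
                action = rng.choice([a for a in (0, 1) if q[state][a] == best])

            # Deterministic transition; the left wall blocks movement below 0.
            next_state = max(0, state - 1) if action == 0 else state + 1
            done = next_state == goal
            reward = 1.0 if done else 0.0

            # Bellman-style target: reward plus discounted best next value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])

            state = next_state
            if done:
                break
    return q


q = q_learning_chain()
# Greedy action for each non-terminal state (1 means "move right").
greedy_policy = [max((0, 1), key=lambda a: q[s][a]) for s in range(len(q) - 1)]
print(greedy_policy)
```

With these settings the greedy policy should choose "right" in every non-terminal state, and the learned values should shrink by roughly a factor of `gamma` for each step away from the goal. Replacing `max(q[next_state])` in the target with the value of the action actually taken next would turn this into the on-policy SARSA variant the second verse mentions.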