What Reinforcement Learning Is
In 2016, DeepMind's AlphaGo defeated Lee Sedol, one of the world's best Go players, 4-1 in a five-game match. Go has more legal positions than there are atoms in the observable universe, so no one could label every board state with the 'correct' move. AlphaGo was bootstrapped from records of human expert games, but it was never told what to do in every position: its strength came from millions of games of self-play, that is, trial, error, and feedback, which is the core of reinforcement learning. RL is the third major machine learning paradigm, alongside supervised and unsupervised learning, and in many ways the most ambitious: learning not from examples but from consequences.
The Trial-Error-Reward Loop
Reinforcement learning is defined by a specific structure of interaction:
1. An agent exists in an environment.
2. At each time step t, the agent observes the current state sₜ of the environment.
3. Based on that state, the agent selects an action aₜ.
4. The environment transitions to a new state sₜ₊₁ and emits a scalar reward signal rₜ.
5. The agent uses this feedback to update its behavior, trying to act in ways that maximize cumulative future reward.
This loop is repeated across many time steps and episodes. The key point: reward comes after the fact. The agent does not receive instruction ('do this in this situation'); it receives evaluation ('here is how much what you just did was worth'). The challenge is then attributing credit: which of the many actions taken before a reward arrived actually caused it? This credit assignment problem is one of the central technical challenges of reinforcement learning.
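The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a standard API: `CorridorEnv`, `run_episode`, `random_policy`, and `always_right` are hypothetical names invented for this example, and the environment is a toy corridor in which reward arrives only when the agent reaches the last cell.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..length-1, with the goal at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.state = 0
        return self.state

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (next_state, reward, done)."""
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # evaluative feedback, only at the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode of the observe/act/reward loop; return (total_reward, steps)."""
    state = env.reset()                          # observe the initial state
    total_reward = 0.0
    steps = 0
    for _ in range(max_steps):
        action = policy(state)                   # select an action from the state
        state, reward, done = env.step(action)   # environment transitions and emits reward
        total_reward += reward                   # a learning agent would update itself here
        steps += 1
        if done:
            break
    return total_reward, steps


def random_policy(state):
    """Ignore the state and explore uniformly at random."""
    return random.choice([-1, 1])


def always_right(state):
    """A fixed policy that always moves toward the goal."""
    return 1


if __name__ == "__main__":
    print(run_episode(CorridorEnv(), always_right))
    print(run_episode(CorridorEnv(), random_policy))
```

Neither policy here learns anything; the sketch only shows where observation, action, and reward sit in the loop. A learning agent would use the reward inside the loop to change `policy` over time.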
- Supervised learning: the correct output is provided for each input. The model imitates.
- Unsupervised learning: no labels, no feedback, just structure to discover.
- Reinforcement learning: feedback exists (reward), but it is often delayed and sparse, and it is always evaluative rather than instructive. The agent must discover what to do through exploration.
This is fundamentally the structure of learning to ride a bicycle, play chess, or run a business.
A concrete analogy: teaching a dog to sit. You do not hand the dog a manual describing the correct muscle movements. You say 'sit,' the dog does something, and if it sits you give a treat (positive reward). If it does nothing or jumps, no treat. Over many repetitions, the dog associates the action 'sit' in the context of hearing the command with the reward of a treat. In RL terms:
- Environment: the room, including you and the command you issue.
- State: the current situation (the command 'sit' just spoken).
- Agent: the dog.
- Action: what the dog does (sit, stay standing, jump).
- Reward: treat (+1) or no treat (0).
- Policy: the dog's learned behavior, that is, what action to take given the current command.
The dog does not receive labels saying 'the correct action here is sit.' It receives evaluations of the actions it actually chose.
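The dog analogy can be turned into a small simulation. The sketch below is an illustration only and makes several assumptions not in the text: there is a single state (the command 'sit' has just been spoken), the dog keeps a running average of the reward each action has earned, and it chooses actions with an epsilon-greedy rule (mostly the best-looking action, occasionally a random one). The names `treat` and `train_dog` are hypothetical.

```python
import random

ACTIONS = ["sit", "stand", "jump"]


def treat(action):
    """The trainer's evaluation: +1 for sitting, 0 for anything else."""
    return 1.0 if action == "sit" else 0.0


def train_dog(trials=500, epsilon=0.1, seed=0):
    """Learn an estimated reward for each action from evaluative feedback alone."""
    rng = random.Random(seed)
    value = {a: 0.0 for a in ACTIONS}  # estimated reward per action
    count = {a: 0 for a in ACTIONS}    # how often each action was tried
    for _ in range(trials):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore: try something at random
        else:
            best = max(value.values())    # exploit: pick a best-looking action
            action = rng.choice([a for a in ACTIONS if value[a] == best])
        reward = treat(action)            # no label, only an evaluation
        count[action] += 1
        # Running average of the rewards observed for this action.
        value[action] += (reward - value[action]) / count[action]
    return value


if __name__ == "__main__":
    values = train_dog()
    policy = max(values, key=values.get)  # the learned behavior for this command
    print(values, policy)
```

At no point is the dog told that 'sit' is correct. It only finds out by trying actions and seeing which ones earn a treat, which is why some exploration is needed.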
Contrast with supervised learning: In supervised learning for image classification, the training set contains (image, label) pairs, so the correct category for every image is known. The loss function tells the model exactly how wrong each prediction was; backpropagation carries that error back through the network, and gradient descent adjusts the weights accordingly. In reinforcement learning for a chess-playing agent, there is no label for every board position. The agent plays a full game (potentially hundreds of moves), receives a single reward at the end (+1 for win, -1 for loss, 0 for draw), and must somehow distribute that terminal signal back across all the moves that led to it. Was move 47 good or bad? If the game ends at move 248, the reward at that point is the only signal, and it is entangled with the effects of all intervening moves. This is why RL algorithms typically need vastly more experience than supervised algorithms to achieve comparable performance: they are solving a much harder inference problem.
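One common way to spread a terminal reward back over earlier moves is the discounted return, where each step is credited with its own reward plus a discounted share of everything that follows. The sketch below is a minimal illustration; the function name `discounted_returns` and the discount factor `gamma` are assumptions introduced here, not something the text prescribes.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every time step t in one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward from the end of the episode so that each step
    # can reuse the return already computed for the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # A four-move episode with a single reward at the very end.
    print(discounted_returns([0, 0, 0, 1], gamma=0.5))
```

Discounting credits moves purely by how close they were to the reward, so it does not by itself say whether move 47 was good or bad. Learning algorithms build on returns like these and rely on many episodes to separate helpful moves from harmful ones.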
Prompt Challenge
Write a paragraph that explains reinforcement learning to someone who has only heard of supervised learning. Use a concrete real-world analogy and explicitly identify the agent, environment, action, and reward in your analogy.
Your prompt should…
- Explain what makes RL different from supervised learning in one precise sentence
- Identify a concrete real-world analogy and name its agent, environment, action, and reward explicitly
- Acknowledge one key difficulty of RL such as delayed reward or the credit assignment problem
What RL Can Learn That Other Methods Cannot
Reinforcement learning is uniquely suited to problems with three properties:
1. Sequential decision-making: the outcome depends on a series of actions, not a single prediction. Examples include games, robot navigation, and supply-chain management.
2. No labeled optimal solution: no human expert has enumerated the right action for every possible state. In Go, the state space is astronomically large; human expert labels would never cover it.
3. Interactive environment: the agent's actions change the environment, which generates new observations that inform future actions. The agent shapes its own training data through its choices, a form of active learning not possible in passive supervised settings.
Applications already deployed: game-playing AI (chess, Go, Atari games, StarCraft II), robot locomotion and manipulation, drug dosing optimization in ICUs, personalized recommendation systems, data center cooling control (DeepMind reported cutting the energy used for cooling Google's data centers by up to 40%), and large language model fine-tuning via RLHF (Reinforcement Learning from Human Feedback).
The ChatGPT you use was not trained purely by predicting text. It was fine-tuned using Reinforcement Learning from Human Feedback (RLHF): humans rated which AI responses they preferred, a reward model was trained on those ratings, and RL was used to steer the language model toward higher-rated responses. RL is thus a key ingredient in making LLMs helpful rather than merely fluent.
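To make the reward-model step more concrete, here is one common way to turn "humans preferred response A over response B" into a training signal: a pairwise (Bradley-Terry style) loss on the two scores the reward model assigns. This is a sketch under assumptions. The text above does not say which loss was actually used, and `preference_loss` is a hypothetical name for illustration.

```python
import math


def preference_loss(score_preferred, score_rejected):
    """Negative log-probability that the preferred response wins.

    Computes -log(sigmoid(score_preferred - score_rejected)) in a form that
    avoids overflow when the score difference is large.
    """
    diff = score_preferred - score_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


if __name__ == "__main__":
    # The loss is small when the model already ranks the pair correctly...
    print(preference_loss(2.0, 0.0))
    # ...and larger when it ranks the pair the wrong way round.
    print(preference_loss(0.0, 2.0))
```

Minimizing a loss of this kind pushes the reward model to score human-preferred responses higher. The RL step then treats that score as the reward for the language model's outputs.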
An RL agent plays 10,000 games of chess and adjusts its behavior based on win/loss outcomes. Which statement best describes why this differs from supervised learning?
What is the credit assignment problem in reinforcement learning?
Identify the RL Structure
- Step 1: For each of the following scenarios, identify: (a) the agent, (b) the environment, (c) what the actions are, (d) what the reward signal is, and (e) what state information the agent observes.
- Scenario A: An AI controlling a traffic light at an intersection to minimize average wait time.
- Scenario B: A recommendation system choosing which video to show a user next, optimizing for watch time.
- Scenario C: A robot arm learning to assemble a circuit board component by component.
- Step 2: For each scenario, identify whether the reward is immediate (given right after each action) or delayed (only given after many actions). How does delay affect the difficulty of learning?
- Step 3: For Scenario A, describe one action that might earn a high reward in the short term but cause congestion 10 minutes later. What does this reveal about reward design?