Machine Learning & Deep Learning

⏱ About 20 min · 20 XP

Where Reinforcement Learning Shines and Struggles

Starting from nothing but the rules, AlphaZero reached superhuman play in chess, shogi, and Go within 24 hours of self-play training, compressing strategic knowledge that took humans centuries to develop. OpenAI Five defeated the world champions at Dota 2, a game with roughly 170,000 possible actions per step. These are spectacular demonstrations. Yet the same family of algorithms, applied to robot manipulation in the real world, can require months of training to learn tasks a human child masters in minutes. Understanding when RL works brilliantly and when it fails is essential for using it wisely.

Where RL Has Demonstrated Genuine Breakthroughs

Games and simulations, RL's natural habitat: RL thrives in environments that have four properties:

1. Fast, cheap simulation: millions of episodes can run at superhuman speed without real-world cost.
2. Full state observability: in board games such as chess and Go, the algorithm sees the complete game state.
3. A clear reward signal: a win/loss outcome or a per-move score provides unambiguous feedback.
4. Stationary rules: the rules of chess do not change mid-game.

Game-playing milestones:

- Atari games (DQN, 2013-2015): a single algorithm learned to play Atari games from raw pixels. The 2015 version was evaluated on 49 games and matched or exceeded human-level performance on many of them.
- AlphaGo and AlphaZero (2016-2017): mastered Go and other board games via self-play, using Monte Carlo Tree Search guided by neural network value and policy estimates.
- OpenAI Five (2019): defeated the Dota 2 world champions after the equivalent of 45,000 years of gameplay in simulation.

Beyond games:

- Data center cooling: DeepMind reported that its learning-based control system reduced the energy used to cool Google's data centers by up to 40%, saving an estimated tens of millions of dollars annually.
- Protein structure prediction: AlphaFold 2 comes from the same research lineage, although its training is primarily supervised and only loosely RL-like.
- Drug discovery: RL is used to optimize molecular structures for desired pharmacological properties.
- RLHF in language models: RL from Human Feedback (RLHF) aligns large language models with human preferences, a key ingredient in making ChatGPT helpful rather than merely fluent.
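To see why those four properties matter, here is a minimal sketch of tabular Q-learning on a toy corridor. The environment, the `step` and `train` functions, and all hyperparameters are hypothetical choices made for illustration, not part of any system named above. The simulator is fast, fully observable, has an unambiguous reward, and never changes its rules, so thousands of episodes cost almost nothing.

```python
import random

N_STATES = 5          # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (-1, +1)    # index 0 = move left, index 1 = move right


def step(state, action):
    """Deterministic, fully observable toy simulator (hypothetical)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0   # unambiguous reward: 1 for reaching the goal
    return next_state, reward, done


def train(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon; break ties at random
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            # One-step Q-learning target; no bootstrap from a terminal state
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q_table = train()
# Greedy action in each non-terminal cell
greedy_policy = ["left" if left > right else "right" for left, right in q_table[:-1]]
print(greedy_policy)
```

Even this five-cell problem takes thousands of simulated episodes to settle, which is harmless here and prohibitive on physical hardware.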

Why Games Are the Ideal RL Testbed

Games are not trivial applications. They provide formally defined environments with massive state spaces (Go has roughly 10^170 legal positions) that were historically considered beyond algorithmic reach. Demonstrating superhuman performance in games validates fundamental algorithmic ideas. The real challenge is then translating those ideas to physical, expensive, partially observable real-world settings.

Sample inefficiency: RL's most fundamental limitation. Sample efficiency measures how much experience an agent needs to reach a target performance level. RL algorithms are dramatically less sample-efficient than humans.

Atari comparison: DQN required roughly 50 million frames of gameplay experience, equivalent to about 38 days of continuous play, to achieve human-level performance on many games. A human child achieves reasonable competence in a new Atari game in minutes.

Why is RL so sample-hungry?

1. Learning from sparse feedback: supervised learning receives a target, and therefore a rich gradient signal, for every training example. An RL agent receives only a scalar reward, and often just one informative number across many steps.
2. Credit assignment across long sequences: figuring out which action among hundreds caused a reward 50 steps later requires vast amounts of data.
3. Exploration cost: discovering rewarding states requires extensive trial and error, especially in large state spaces.
4. High variance: stochastic environments make each transition noisy, and averaging out that noise requires many samples.

Quantitative comparison: AlphaStar (the StarCraft II AI) trained on the equivalent of approximately 200 years of human-speed gameplay. A skilled human player might have 5,000 hours of experience. The AI therefore used roughly 350 times more experience than the human (200 years is about 1.75 million hours), yet both reach grandmaster level. This illustrates both RL's eventual power and its staggering data requirements.
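The two comparisons above can be checked with a few lines of arithmetic. The AlphaStar and human figures come straight from the text. For DQN, the frame-skip and frame-rate constants are assumptions, used here as one plausible way to reconcile 50 million frames with 38 days.

```python
HOURS_PER_YEAR = 365 * 24                 # 8,760 hours of continuous play per year

# AlphaStar vs. a skilled human (figures from the text)
alphastar_hours = 200 * HOURS_PER_YEAR    # about 200 years of gameplay
human_hours = 5_000
ratio = alphastar_hours / human_hours
print(f"AlphaStar : human experience = {ratio:.0f} : 1")   # prints 350 : 1

# DQN: convert agent frames into days of game time
agent_frames = 50_000_000
FRAME_SKIP = 4       # assumption: the agent acts on every 4th emulator frame
EMULATOR_FPS = 60    # assumption: the emulator runs at 60 frames per second
dqn_days = agent_frames * FRAME_SKIP / EMULATOR_FPS / (24 * 3600)
print(f"DQN experience is about {dqn_days:.1f} days")      # prints 38.6 days
```

Without the assumed frame skip, the same 50 million frames would amount to under 10 days, so the 38-day figure only makes sense if each agent frame stands for several emulator frames.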

The challenge of real-world deployment:

1. Physical cost: a robot cannot run 10 million training episodes on a factory floor, because it would wear out or damage real equipment. Sim-to-real transfer trains in simulation and deploys in reality, but the sim-to-real gap (differences between simulated and real physics) often degrades performance dramatically.
2. Safety constraints: a real-world RL agent exploring its action space may take dangerous actions. An agent learning to drive a real car cannot be allowed to crash 10,000 times in order to learn not to. Safe RL, which enforces hard constraints during learning, is an active research area without fully satisfactory solutions.
3. Reward specification: defining a reward function for real-world tasks is harder than for games. In chess, winning is perfectly defined. For "help the patient recover," what is the reward? Premature discharge to boost bed turnover scores well on some metrics and horribly on others.
4. Non-stationarity: real environments change. A trading algorithm trained in 2018 encounters a pandemic in 2020. A fraud detection agent trained on 2024 scam patterns faces different attacks in 2025. RL agents can catastrophically forget what they learned or fail to adapt.
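To make the non-stationarity point concrete, the sketch below tracks the value of a single hypothetical action whose payoff flips partway through. It compares a plain running average with a constant step-size update, a standard textbook remedy that this lesson does not otherwise cover. The reward sequence, function names, and step size are illustrative assumptions.

```python
def sample_average(rewards):
    """Running mean: every past reward keeps equal weight forever."""
    estimate, history = 0.0, []
    for n, r in enumerate(rewards, start=1):
        estimate += (r - estimate) / n
        history.append(estimate)
    return history


def recency_weighted(rewards, step_size=0.1):
    """Constant step size: older rewards are forgotten exponentially."""
    estimate, history = 0.0, []
    for r in rewards:
        estimate += step_size * (r - estimate)
        history.append(estimate)
    return history


# Hypothetical action: pays +1 for 500 steps, then the world changes and it pays -1
rewards = [1.0] * 500 + [-1.0] * 100

stale = sample_average(rewards)[-1]
adaptive = recency_weighted(rewards)[-1]
print(f"sample average after the shift:   {stale:+.2f}")     # +0.67
print(f"recency-weighted after the shift: {adaptive:+.2f}")  # -1.00
```

After 100 steps of losses, the running average still rates the action as good because it is anchored to the old regime. The recency-weighted estimate has adapted, at the price of being noisier when the environment is in fact stable.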

Match each RL challenge to its accurate description.

Terms

Sample inefficiency
Sim-to-real gap
Reward hacking
Credit assignment
Non-stationarity

Definitions

Agents find unintended ways to maximize the reward function without achieving the true goal
Determining which past actions among many caused a delayed reward signal
RL needs vastly more experience than humans to achieve equivalent performance
Policies trained in simulation often degrade when deployed on real physical hardware
The environment changes over time, making previously learned policies suboptimal


Honest Assessment — When Not to Use RL

RL is not always the right tool. For many practical problems, supervised or unsupervised methods are faster, cheaper, and more interpretable.

Use RL when:

- The task is inherently sequential with delayed feedback.
- Labeled optimal demonstrations are unavailable.
- Fast simulation is possible.
- Exploration is safe (or can be made safe via simulation).

Do not use RL when:

- Labeled data is available: supervised learning will be far more efficient.
- The task is a single-step prediction: classification or regression is sufficient.
- Real-world exploration is dangerous or expensive and no good simulator exists.
- You need interpretable, auditable decisions: most RL policies are black boxes.

Imitation learning (behavioral cloning) is a middle ground: train a policy by supervised learning on expert demonstrations, then optionally fine-tune with RL. This combines the sample efficiency of supervised learning with RL's ability to improve beyond the demonstrations.
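Behavioral cloning can be sketched in its simplest tabular form: treat expert (state, action) pairs as labeled data and learn the most common action per state. The thermostat-style expert, the temperature range, and the function names below are hypothetical, and a realistic version would fit a classifier or neural network instead of a lookup table.

```python
from collections import Counter, defaultdict


def expert_policy(temperature):
    """Hypothetical expert for a thermostat-like task: heat when it is cold."""
    return "heat" if temperature < 20 else "idle"


def behavioral_cloning(demonstrations):
    """Supervised learning on (state, action) pairs: majority vote per state."""
    votes = defaultdict(Counter)
    for state, action in demonstrations:
        votes[state][action] += 1
    return {state: counts.most_common(1)[0][0] for state, counts in votes.items()}


# The expert only ever demonstrated temperatures 15..24 (three times each)
demos = [(t, expert_policy(t)) for t in range(15, 25) for _ in range(3)]
cloned = behavioral_cloning(demos)

print(cloned[18], cloned[22])        # heat idle
print(cloned.get(30, "unknown"))     # unknown: state 30 was never demonstrated
```

The cloned policy matches the expert wherever demonstrations exist and has nothing to say elsewhere. That coverage gap is one reason the optional RL fine-tuning stage can add value.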

RL Is Not a General Solution to Hard Optimization Problems

RL is sometimes presented as an algorithm that can solve any sequential decision problem given enough compute. This overstates the current state of the art. RL struggles with very long time horizons, combinatorial action spaces, multi-agent coordination with large numbers of agents, and precise physical manipulation. Significant research problems remain open. Be skeptical of claims that an RL agent has 'solved' a complex real-world domain.

Why does sim-to-real transfer often fail for robotics tasks?

A company wants to use RL to optimize which advertisements to show users. They have 5 million labeled click-through records showing which ads users clicked. Should they use RL or a supervised approach?

RL Feasibility Analysis

Step 1: For each of the following proposed RL applications, rate its feasibility on a scale from 1 (very hard) to 5 (very feasible) and justify your rating using concepts from this lesson.
  A. An RL agent that plays a mobile puzzle game and is trained entirely in simulation.
  B. An RL agent that learns to perform open-heart surgery on patients by trial and error.
  C. An RL system that optimizes energy use in a smart building by controlling thermostats, with 5-minute feedback on energy consumption.
  D. An RL agent that learns to navigate a new city without a map, in a real physical robot, starting from scratch.
Step 2: For the two lowest-feasibility applications, propose modifications that would make RL more applicable (e.g., adding simulation, changing the reward structure, restricting the action space).
Step 3: For Application C, write a specific reward function. Then identify one way the building's systems could 'game' that reward without truly optimizing energy use.