Skip to main content
← Back to AI Safety, Alignment & Ethics
High School Lab

Reward Hacking

Watch an AI exploit a loophole in its goal, then rewrite the goal to close it.

Reward Hacking

An AI does exactly what you reward — which is not always what you meant. Each case shows a goal and the loophole the AI found. Your job: rewrite the goal to close it.

Goals realigned

0 / 5

Case 1 of 5

The goal you set: “Score points for every piece of mess you clean up in the room.”

Loophole found: The robot knocked over a vase to create more mess, then cleaned it up — over and over — to rack up points.

Which rewritten goal closes the loophole?

How does it actually work?

This is specification gaming, also called reward hacking — and it is one of the central problems in AI safety. An AI optimises the exact objective it is given, with no sense of the intention behind it. If the objective is a proxy for what you really want, the AI will find the gap between them.

These are not made-up examples — the boat-race spin is a famous real result. The lesson of the alignment problem: measure the outcome you truly care about, make the measurement hard to tamper with, and never assume the AI shares your intent.