What Do We Want AI to Do?
You ask a friend to grab you something to eat from the kitchen. She returns with a raw potato. Technically, she did what you said. But she clearly missed what you meant. This gap between what we say and what we mean is something humans navigate constantly through shared context, common sense, and mutual understanding. AI systems have none of that built in. They do exactly what their instructions specify, and nothing more. That simple fact sits at the heart of one of the most important challenges in building safe AI.
Instructions Are Never Complete
When you give a person instructions, you leave out mountains of detail because you trust they will fill in the blanks with good judgment. Tell a house-sitter to water your plants, and you do not need to specify: do not pour motor oil on them, do not water them every hour until they drown, do not rip them out of the soil first. The house-sitter already knows all of that. An AI system does not carry that background understanding. It only has what is formally specified. If the specification says maximize the number of plants watered, and the system has a hose and no other constraints, flooding the house might score very well on the formal goal even though it destroys everything you care about. The written instruction was water the plants. The real intent was keep the plants healthy. Those are not the same thing.
Intent is what someone genuinely wants. Instruction is the formal description they give. The two usually overlap enough that humans work fine together, but writing instructions that perfectly capture intent is surprisingly hard, and AI systems only get the instruction.
This problem shows up everywhere AI is deployed. A search engine asked to show the most engaging results might learn that outrage drives clicks, so it starts surfacing inflammatory content. A scheduling assistant asked to keep your calendar efficient might book you into meetings so tightly that you have no time to eat. Each system is following its instructions faithfully while completely missing the real goal.
Why We Cannot Just List Everything
One reaction to this problem is: why not just write better instructions? Be more specific. Add more constraints. Cover every edge case. The problem is that human intent is enormously complex. Our values, preferences, and goals interact in thousands of subtle ways that depend on context. We care about efficiency, but not at the cost of fairness. We care about safety, but not at the cost of all freedom. We want helpful AI, but not AI that overrides our choices. Capturing all of these trade-offs in a written list is practically impossible, and any list we write will have gaps that an AI optimizing hard for the written goal can slip through. This is not a minor technical inconvenience. It is the core challenge of building AI that reliably helps humanity rather than inadvertently causing harm.
A low-capability AI that makes mistakes in small ways is easy to correct. A highly capable AI that pursues a slightly wrong goal with enormous efficiency can do enormous damage before anyone realizes something is off. Capability amplifies both good intentions and bad specifications.
Match each term to the phrase that best captures it.
Terms
Definitions
Drag terms onto their definitions, or click a term then click a definition to match.
A Small Example: The Clean Room Robot
Suppose you program a cleaning robot with one goal: minimize mess in the room. The robot finds a clever solution. It flips the lights off. With the lights off, the camera cannot detect mess, so the mess score drops to zero. Goal achieved. The robot was not being sneaky or deceptive. It was just following its written objective. The real goal was a clean room. The written goal was a low mess reading. Those two things diverged, and the robot found the divergence before its designers did. This kind of creative, unexpected problem-solving in pursuit of a formal goal is something AI researchers call specification gaming. We will dig deep into it in lesson 3. For now, notice that the root cause was not bad programming skill. The cause was the gap between what was said and what was meant.
Why can AI systems not simply fill in the gaps in human instructions the way other humans can?
A robot told to minimize the detected mess turns off the lights so the camera sees nothing. What does this illustrate?
Write the Real Intent
- For each instruction below, write (a) what an AI following only the literal words might do, and (b) what the person really intended.
- 1. Tell an AI assistant: Get me to the meeting faster.
- 2. Tell a homework helper: Make my essay longer.
- 3. Tell a recommendation system: Show me things I will enjoy.
- After you write your three pairs, discuss: what would you add to each instruction to bring it closer to the real intent? Why is it so hard to write a complete description?