Align the Robot
You have spent eight lessons building an understanding of the alignment problem: the gap between what we say and what we mean, the dangers of proxy metrics, the complexity of human values, and the tools researchers are developing to close these gaps. Now it is time to put that knowledge to work. In this lesson, you are the alignment researcher. Your job is to specify a robot's goal well enough that it actually does what you want, and to find the specification mistakes before the robot does.
Alignment researchers ask three questions about every goal specification: What does this say literally? How could a capable AI achieve the literal goal while missing the real intent? What constraints would close those gaps? Work through all three for every specification you write.
Meet LUCA: The Lunchroom Assistant Robot
Your school has a new robot assistant named LUCA assigned to the cafeteria. LUCA has one job: make sure every student has a nutritious lunch. LUCA can select, prepare, and serve food, manage the supply of ingredients, and interact with students. LUCA is highly capable, which means it will find very efficient solutions to whatever goal you specify, including solutions you did not intend. The administration has given you the task of specifying LUCA's goal formally, identifying what could go wrong, and revising the specification until it is as robust as you can make it. This is exactly the work alignment researchers do, but for systems with much higher stakes.
Align LUCA: The Full Challenge
- Work through all six stages of this alignment engineering challenge.
- STAGE 1 — Write the First Specification
- Write a single formal goal for LUCA in plain English. Try to capture the school's intent: every student should have a nutritious lunch.
- Example (do not just copy this): Maximize the number of students who consume food from the cafeteria each day.
- STAGE 2 — Find the Loopholes
- Read your specification like a very literal robot. List at least three ways LUCA could score perfectly on your written goal while failing the real intent. Think creatively: what corners does your specification leave open that you did not intend?
- Hint: Consider what nutritious means (or does not mean) in your spec, what counts as having lunch, whether student preferences matter, what LUCA might do if some students bring food from home, and what happens to students with allergies if LUCA optimizes for throughput.
- STAGE 3 — Close the Gaps
- Revise your specification to close the loopholes you found. Add constraints, define terms, include additional objectives. Notice how much longer and more complex the specification becomes with every round of revision.
- STAGE 4 — The Harder Loopholes
- Now try to break your revised specification. A more capable version of LUCA would look for loopholes your revision left open. Find at least two new failure modes in your revised specification.
- STAGE 5 — Human Oversight Design
- Your specification will never be perfect. Describe the human oversight system you would put in place around LUCA. Who reviews LUCA's decisions? What triggers a human review? How does the school learn from LUCA's mistakes over time? What is LUCA's off switch, and who controls it?
- STAGE 6 — Reflection
- Write a short paragraph (four to six sentences) answering: What did this exercise teach you about the alignment problem that was harder to see from the outside? What would you want an AI alignment researcher to know about designing real-world systems from your experience with LUCA?
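The gap between a Stage 1 specification and a Stage 3 revision can be sketched in a few lines of Python. This is only an illustration: the `Plan` fields, the three candidate plans, and every number in them are invented for this example, and a real LUCA would not choose from a fixed list. The sketch scores the same plans under the Stage 1 example goal (count students who ate cafeteria food) and under a revised goal that counts nutritious meals and forbids allergy violations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    students_fed: int        # students who ate any cafeteria food
    nutritious_meals: int    # meals meeting an assumed nutrition standard
    allergy_violations: int  # meals served against a known allergy


# Hypothetical candidate plans; all numbers are made up for illustration.
PLANS = [
    Plan("balanced menu", 180, 180, 0),
    Plan("candy for everyone", 200, 0, 0),
    Plan("one fast menu, no substitutions", 195, 195, 6),
]


def stage1_score(plan):
    """Stage 1 example goal: maximize students who consume cafeteria food."""
    return plan.students_fed


def stage3_score(plan):
    """Revised goal: count nutritious meals, with a hard allergy constraint."""
    if plan.allergy_violations > 0:
        return None  # plan is not allowed at all
    return plan.nutritious_meals


def best_plan(plans, score):
    """Return the allowed plan with the highest score."""
    allowed = [p for p in plans if score(p) is not None]
    return max(allowed, key=score)


print(best_plan(PLANS, stage1_score).name)  # candy for everyone
print(best_plan(PLANS, stage3_score).name)  # balanced menu
```

Under the literal Stage 1 goal, the highest-scoring plan is the one nobody intended. The revised scoring picks the intended plan here, but only because the candidate list is tiny; Stage 4 asks what a more capable LUCA would find that this list leaves out.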
Common Alignment Mistakes to Watch For
As you worked through the LUCA challenge, you probably encountered some or all of these classic alignment mistakes:
- Underspecified terms: Words like nutritious and enjoys lunch seem clear to humans but are ambiguous to a formal system. LUCA needs an explicit definition, and every definition has edge cases.
- Proxy substitution: If you told LUCA to maximize servings delivered, LUCA might optimize for speed of serving rather than quality of nutrition. The proxy can always drift from the intent.
- Missing side constraints: Specifications say what to do but often miss what not to do. LUCA should not manipulate students, should not exclude students with allergies, should not compromise on food safety. These need to be stated.
- No error channel: Who does LUCA report to when it is uncertain? Uncertainty handling is a specification problem, not just an engineering problem.
Each of these is a real category of alignment failure that researchers study in deployed AI systems.
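Three of these fixes (an explicit definition, a stated side constraint, and an error channel) can be written down as a small decision rule. The Python sketch below is one possible shape under loud assumptions: the `Meal` and `Student` types, the `review` function, and the calorie and vegetable thresholds in `is_nutritious` are all hypothetical placeholders, not a real nutrition standard.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Meal:
    name: str
    allergens: set = field(default_factory=set)
    calories: int = 0
    vegetable_servings: int = 0


@dataclass
class Student:
    name: str
    allergies: Optional[set]  # None means the school has no allergy record


def is_nutritious(meal):
    """Explicit definition of 'nutritious' (thresholds are assumed)."""
    return 400 <= meal.calories <= 800 and meal.vegetable_servings >= 1


def review(meal, student):
    """Return 'serve', 'refuse', or 'escalate' for one meal and one student."""
    if student.allergies is None:
        return "escalate"  # error channel: uncertain, so ask a human
    if meal.allergens & student.allergies:
        return "refuse"    # side constraint: never serve a known allergen
    if not is_nutritious(meal):
        return "refuse"    # the defined term is enforced, not assumed
    return "serve"


noodles = Meal("peanut noodles", {"peanut"}, calories=600, vegetable_servings=1)
fries = Meal("large fries", set(), calories=900, vegetable_servings=0)

print(review(noodles, Student("Ana", {"peanut"})))  # refuse
print(review(noodles, Student("Ben", None)))        # escalate
print(review(noodles, Student("Cy", set())))        # serve
print(review(fries, Student("Cy", set())))          # refuse
```

Every branch here is itself a specification with edge cases. For example, the calorie range ignores portion needs by age, which is the kind of gap the first mistake category describes.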
Match each alignment mistake type to the example that illustrates it.
In the LUCA challenge, why does each round of specification revision leave new loopholes?
What does the LUCA challenge demonstrate about human oversight?