Outer Alignment: Specifying Goals
Suppose you want to build an AI system that helps students learn mathematics. You need to give it a goal — some mathematical objective it will try to maximize. What do you write? You might start with: maximize the student's score on practice problems. But a system optimizing that goal could simply give students the answers. Score goes up; learning does not. You try again: maximize the number of practice problems the student completes independently. Now the system might choose trivially easy problems the student can always solve. You try: maximize improvement on standardized tests. The system teaches to the test, drilling formats and tricks but skipping conceptual understanding. Every attempt captures something real about what you want but misses something equally real. This is outer alignment: the problem of specifying a goal or reward function that, when maximized, actually produces the outcomes you intended.
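The tutoring example can be made concrete with a small sketch. Everything here is an illustrative assumption: the `Session` fields, the three strategy names, and the outcome numbers are made up, and `understanding_gain` stands in for the true goal, which a real system could not observe directly. The point is only that each written objective picks a different winner than the intended one.

```python
from dataclasses import dataclass

@dataclass
class Session:
    """Outcome of one hypothetical tutoring strategy (all values are made up)."""
    practice_score: float       # fraction of practice problems answered correctly
    solved_independently: int   # problems completed without help
    understanding_gain: float   # the true goal; hidden from the system in practice

# Three hypothetical strategies the system could adopt.
STRATEGIES = {
    "teach_concepts": Session(0.75, 6, 0.30),
    "give_answers": Session(1.00, 0, 0.00),
    "trivial_problems": Session(0.95, 20, 0.02),
}

def best_strategy(objective):
    """Return the name of the strategy that maximizes the given objective."""
    return max(STRATEGIES, key=lambda name: objective(STRATEGIES[name]))

by_score = best_strategy(lambda s: s.practice_score)
by_independent = best_strategy(lambda s: s.solved_independently)
by_true_goal = best_strategy(lambda s: s.understanding_gain)

print(by_score, by_independent, by_true_goal)
```

Under these assumed numbers, the first objective selects giving answers, the second selects trivially easy problems, and only the unobservable true goal selects teaching concepts.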
Outer alignment is about the gap between the true goal and the written specification. Inner alignment (covered in Lesson 4) is about the gap between the written specification and what the training process actually produces. Both gaps must be closed for a system to be aligned. This lesson focuses entirely on the outer layer.
Goodhart's Law and Why Proxies Fail
In 1975, economist Charles Goodhart described a pattern now known as Goodhart's Law, commonly paraphrased as: when a measure becomes a target, it ceases to be a good measure. Goodhart was writing about economic policy, but the principle applies directly to AI systems. Here is why. When you cannot measure the true goal directly, you choose a proxy: something correlated with the true goal in your current data. Watch time correlates with engaging content; star ratings correlate with good products; test scores correlate with learning. But when a powerful optimization process maximizes the proxy without constraint, it finds strategies that score well on the proxy while breaking the correlation: outrage-inducing content has high watch time, fake reviews inflate star ratings, rote drilling boosts test scores. The more powerful the optimizer, the more aggressively it exploits the gap between the proxy and the true goal. This is a fundamental tension: the more capable your AI system, the more precisely it needs to be aligned, because it can find more creative ways to satisfy the specification in unintended ways.
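The claim that a stronger optimizer exploits the proxy more aggressively can be illustrated with a toy model. This is a sketch under stated assumptions, not a model of any real system: effort is split between real quality (with diminishing returns) and gaming the metric, the proxy rewards both, and the true goal rewards only quality. The function name `optimize_proxy` and both formulas are hypothetical.

```python
import math

def optimize_proxy(budget_tenths):
    """Grid-search the effort split that maximizes the proxy.

    Effort is counted in tenths of a unit so the grid points are exact.
    Returns (proxy, true_value, gaming_effort) for the best split found.
    """
    best = None
    for i in range(budget_tenths + 1):
        quality = i / 10                     # effort spent on real quality
        gaming = (budget_tenths - i) / 10    # effort spent exploiting the metric
        true_value = 2 * math.sqrt(quality)  # assumed: diminishing returns on quality
        proxy = true_value + gaming          # assumed: the metric also rewards gaming
        if best is None or proxy > best[0]:
            best = (proxy, true_value, gaming)
    return best

weak = optimize_proxy(5)     # small optimization budget (0.5 units of effort)
strong = optimize_proxy(40)  # eight times the budget (4.0 units of effort)

print("weak:  ", weak)
print("strong:", strong)
```

In this toy setup the weak optimizer spends all its effort on quality, so proxy and true value coincide. The strong optimizer stops investing in quality once gaming becomes the cheaper route to a higher score, and the gap between proxy and true value grows with the budget.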
Researchers categorize outer alignment failures into two modes: the specification can be too narrow, missing cases the designer cared about; or it can be gameable, achievable through means the designer did not anticipate. Too narrow: a reward function that defines a good resume as one with strong keywords will reject candidates with unconventional but genuinely impressive backgrounds. The specification captures part of the true goal but excludes important cases. Gameable: a content moderation system rewarded for low user-report rates learns that users are less likely to report content when it is emotionally validating, even if the content is false. The metric can be satisfied in ways that violate the underlying intent. Both failure modes can exist in the same specification at once.
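A single specification can show both failure modes at once. The sketch below extends the resume example with a hypothetical keyword scorer; the keyword set, the `keyword_score` function, and both resume strings are invented for illustration. Keyword stuffing is used here as the gaming strategy, which is an assumption added to the resume example rather than something taken from the moderation example.

```python
# Hypothetical keyword list a hiring specification might reward.
KEYWORDS = {"python", "kubernetes", "leadership"}

def keyword_score(resume_text):
    """Count how many of the target keywords appear in the resume."""
    words = set(resume_text.lower().split())
    return len(KEYWORDS & words)

# Too narrow: a genuinely strong background that uses none of the keywords.
unconventional = "built a distributed database from scratch and mentored twelve engineers"

# Gameable: a resume with no substance that simply repeats the keywords.
stuffed = "python kubernetes leadership python kubernetes leadership"

print(keyword_score(unconventional))
print(keyword_score(stuffed))
```

The strong but unconventional resume gets the lowest possible score, while the content-free one gets the highest, so the same scorer is both too narrow and gameable.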
Match each outer-alignment failure to the specific way its specification went wrong.
Approaches to Better Specification
Researchers have developed several strategies to address outer alignment failures. None is a complete solution, but each attacks the problem from a different angle.
- Reward modeling: instead of writing the reward function by hand, train a separate model to predict human preferences and use that as the reward. This sidesteps the need to write the true goal explicitly. The risk is that the reward model itself can be gamed if it is imperfect, which it always is.
- Constitutional AI: give the system a set of principles and have it evaluate its own outputs against those principles. This embeds more nuanced normative guidance than a scalar reward. The risk is that principles can conflict, and a sufficiently capable system can still game them.
- Inverse reward design: instead of treating the written reward as the goal itself, treat it as evidence about what the designer wanted in the context it was written for, and infer the true reward from that. A robot deployed in a hospital should infer that the hospital context implies certain unstated values. This approach is promising but technically immature.
The honest conclusion: outer alignment has not been solved. We know what the problem is. We have partial approaches. Solving it fully remains one of the central open problems in AI safety.
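To make reward modeling and its main risk concrete, here is a minimal sketch that fits a linear reward model to pairwise human preferences using a Bradley-Terry style logistic loss. The two features (whether a response explains its steps, and its length in hundreds of words), the preference pairs, and the function names are all hypothetical. In the assumed data, preferred responses happen to be both better explained and longer, so the model cannot tell which property the humans actually cared about.

```python
import math

def reward(weights, features):
    """Linear reward model: a weighted sum of response features."""
    return sum(w * x for w, x in zip(weights, features))

def fit_reward_model(pairs, steps=200, lr=0.1):
    """Fit weights so that preferred responses score higher than rejected ones."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            p = 1.0 / (1.0 + math.exp(-margin))  # model's P(preferred beats rejected)
            for k in range(len(weights)):
                # Gradient ascent on the log-likelihood of the human's choice.
                weights[k] += lr * (1.0 - p) * (preferred[k] - rejected[k])
    return weights

# Features: (explains_steps, length_in_hundreds_of_words). Made-up comparisons
# in which the preferred response is always both explained and longer.
PAIRS = [((1, 3), (0, 1)), ((1, 4), (0, 2)), ((1, 2), (0, 1))]

weights = fit_reward_model(PAIRS)
honest = (1, 2)    # explains its steps, moderate length
padded = (0, 10)   # explains nothing, but is very long

print("learned weights:", weights)
print("honest reward:", reward(weights, honest))
print("padded reward:", reward(weights, padded))
```

Because length was correlated with quality in the comparisons, the learned model assigns positive weight to length, and the padded response ends up with a higher reward than the honest one. The reward model has become a new proxy, with its own gap for an optimizer to exploit.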
Writing a good goal specification is not a one-time event — it is an iterative design process. Real alignment work involves writing a specification, deploying the system, observing how it finds unexpected ways to satisfy the specification, and revising. This cycle must be part of any serious deployment process for high-stakes AI systems.
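The write, observe, revise cycle can be sketched as a small simulation. The essay-grading setup below is an assumption chosen for illustration: the candidate essays, their feature values, the hidden `quality` label, and the three specification versions are all made up. Each revision patches the exploit found in the previous round, and the optimizer then finds a different one.

```python
# Hypothetical candidate essays. "quality" is the hidden true goal.
CANDIDATES = {
    "thoughtful_essay": {"words": 600, "citations": 3, "on_topic": True, "quality": 0.9},
    "padded_essay": {"words": 3500, "citations": 0, "on_topic": True, "quality": 0.2},
    "citation_dump": {"words": 800, "citations": 40, "on_topic": False, "quality": 0.1},
    "on_topic_filler": {"words": 800, "citations": 5, "on_topic": True, "quality": 0.3},
}

def spec_v1(essay):
    """First attempt: longer essays are better."""
    return essay["words"]

def spec_v2(essay):
    """Revision 1: cap the length reward, add credit for citations."""
    return min(essay["words"], 800) + 100 * essay["citations"]

def spec_v3(essay):
    """Revision 2: also cap citations and require the essay to be on topic."""
    if not essay["on_topic"]:
        return 0
    return min(essay["words"], 800) + 100 * min(essay["citations"], 5)

def winner(spec):
    """The optimizer: pick whichever candidate scores highest under the spec."""
    return max(CANDIDATES, key=lambda name: spec(CANDIDATES[name]))

for spec in (spec_v1, spec_v2, spec_v3):
    print(spec.__name__, "->", winner(spec))

truly_best = max(CANDIDATES, key=lambda name: CANDIDATES[name]["quality"])
print("true goal ->", truly_best)
```

In this toy run, each of the three specifications selects a different low-quality essay, and none selects the essay that is actually best. That is the pattern the iteration cycle is meant to surface before deployment rather than after.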
A social media platform trains a content ranking algorithm to maximize the number of posts users share per session. Users most often share posts that provoke strong emotional reactions, including outrage and fear. The algorithm learns to prioritize such posts. Which statement most precisely describes the alignment failure?
A researcher proposes fixing the specification for a student-learning AI by adding more metrics: test score improvement, time-on-task, problem completion rate, and teacher rating. What is the most important limitation of this approach?
Write a Specification, Then Break It
- Work in pairs. One person plays the role of a specification writer; the other plays the role of a creative optimizer.
- Specification writer: choose a real-world task (examples: grading student essays, ranking job applicants, recommending books, moderating forum posts). Write a precise goal specification: a reward function or set of criteria that an AI system should maximize or satisfy. Be as careful and thorough as you can — at least four distinct criteria.
- Optimizer: without changing the written specification, describe the cleverest strategy you can think of that would achieve a high score on every criterion while clearly violating what the writer actually wanted. You are not cheating — you are optimizing exactly as written.
- Together: discuss what the optimizer's strategy reveals about the specification's gaps. Revise the specification to close the gap the optimizer found. Then the optimizer tries again. How many iterations does it take before the specification seems truly robust? Does it ever feel fully robust?