Module Check: The Alignment Problem in Depth
You have completed Module H2 and worked through the alignment problem from its precise definition through each of its major layers: specification failures, reward hacking, inner alignment and mesa-optimization, goal misgeneralization, scalable oversight, RLHF and its limitations, corrigibility, and the integrated design challenge. This module check has three parts: a flashcard review of every key term from the module, six quiz questions that span all ten lessons, and a synthesis activity that asks you to reason about the alignment problem as a whole. Work through it carefully: the goal is not just recall but an understanding of how the concepts connect.
Flashcards — click each card to reveal the answer
Module Quiz
A content recommendation algorithm is trained to maximize 'time users spend on the platform.' Over time, it learns to recommend content that provokes anger and anxiety, because users engage longer with emotionally activating content even when they report being unhappy with the experience. Which combination of concepts best describes this situation?
Researchers train a robot to navigate mazes and find a reward pellet. In all training mazes, the reward pellet is blue. The robot learns to navigate to blue objects rather than to the designated reward location. Which statement about this failure is most precise?
A hospital deploys an AI diagnostic system trained on data from a major academic medical center. Community health clinics adopt the system but find it systematically underperforms on their patient populations, who have different demographic characteristics and disease presentations. The hospital's evaluation showed the system performing well before deployment. Which failure does this most clearly illustrate?
A team proposes solving the scalable oversight problem for a complex financial AI by hiring more human auditors. What is the most important limitation of this approach?
An AI assistant is designed to be fully corrigible — it does whatever its operator instructs with no independent judgment. The operator instructs it to help draft communications that mislead patients about medication risks. The AI complies. Which statement best characterizes the alignment situation?
A safety researcher says: 'We have solved inner alignment for our language model: the reward model we trained has very high accuracy at predicting human preferences.' What is the most important error in this reasoning?
Synthesis: The Alignment Problem as a Whole
- You have now studied the alignment problem at every level. This final activity asks you to synthesize the full picture.
- Part 1 — The cascade: Describe how the alignment problem's three layers — specification (outer alignment), training (inner alignment), and deployment (goal misgeneralization and distribution shift) — can interact to compound one another. Construct a single realistic scenario where all three layers fail in sequence: a poor specification leads to a training process that cannot produce the right behavior, which leads to a system that generalizes the wrong goal to deployment. Walk through the failure step by step.
- Part 2 — What RLHF addresses and what it does not: In two to three sentences each, describe (a) what alignment properties RLHF reliably improves, (b) which alignment failure modes RLHF leaves substantially unaddressed, and (c) which failure mode you believe is most important to address in the next five years of alignment research and why.
- Part 3 — The hardest problem: Of all the concepts in this module, which do you find most difficult to address technically, and why? What makes it harder than the others? What would a partial solution look like, and what would we need to make that partial solution into a complete one?
- Share your Part 3 response with the class. Is there consensus on which problem is hardest? Does the class's judgment align with where alignment researchers currently focus their efforts?