Design an Alignment Approach
You have spent this module building a precise understanding of the alignment problem's layers: outer alignment and the difficulty of specifying what you actually want, reward hacking and the ways optimization exploits specification gaps, inner alignment and the mesa-optimization problem, goal misgeneralization and its distribution-shift failure mode, scalable oversight and the challenge of maintaining meaningful human evaluation, RLHF and its documented limitations, and corrigibility as the structural requirement for maintaining control. Now you will use all of it. This lesson centers on a single extended design activity: given a realistic high-stakes AI system, you must reason through a complete alignment approach. The goal is not a vague aspiration but a specific, layered strategy that addresses each failure mode you have studied, states what residual risks remain, and defends the design choices you made.
Alignment research requires more than understanding failure modes; it also requires the ability to reason constructively about what to do about them. The design perspective forces you to confront tradeoffs: a technique that addresses one failure mode may introduce another. Working through these tradeoffs concretely is how researchers develop genuine judgment, rather than just familiarity with concepts.
The Alignment Design Framework
Before the main activity, here is a framework for organizing any alignment design problem. Real alignment approaches must address each layer.
- Layer 1 — Specification: What is the true goal? How will it be formalized? What are the most likely proxy failures and how will the specification guard against them?
- Layer 2 — Training: What training process will be used? How will human feedback be collected? What are the risks of reward model overoptimization, sycophancy, or other RLHF failure modes, and how will training be structured to reduce them?
- Layer 3 — Evaluation: How will alignment be evaluated before deployment? What distribution-shift scenarios will be tested? How will the evaluation detect goal misgeneralization rather than just measuring in-distribution capability?
- Layer 4 — Oversight: Once deployed, how will humans maintain meaningful oversight? Is oversight scalable to the domain's complexity? What monitoring will detect specification gaming or behavioral drift?
- Layer 5 — Corrigibility and control: How will the system be designed to accept correction? What shutdown and modification mechanisms exist? Who has the authority to use them?
- Layer 6 — Residual risk: After all the above, what alignment failure modes remain unaddressed or only partially addressed? What is the plan if they manifest?
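One way to keep a design honest against the six layers is to record it as data and check it for gaps. The following is a minimal sketch, not a prescribed tool: the names `LayerPlan`, `review`, and the example failure modes and mitigations are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field

# The six layers of the framework, in order.
LAYERS = (
    "Specification",
    "Training",
    "Evaluation",
    "Oversight",
    "Corrigibility and control",
    "Residual risk",
)


@dataclass
class LayerPlan:
    """One layer of a design: what could go wrong and what is done about it."""
    layer: str
    failure_modes: list[str] = field(default_factory=list)
    # Maps a failure mode to the mitigation proposed for it.
    mitigations: dict[str, str] = field(default_factory=dict)

    def unmitigated(self) -> list[str]:
        """Failure modes that were named but have no mitigation yet."""
        return [fm for fm in self.failure_modes if fm not in self.mitigations]


def review(plans: list[LayerPlan]) -> dict[str, list[str]]:
    """Return the gaps per layer: missing layers and unmitigated failure modes."""
    by_layer = {p.layer: p for p in plans}
    gaps: dict[str, list[str]] = {}
    for name in LAYERS:
        if name not in by_layer:
            gaps[name] = ["layer not addressed"]
        elif by_layer[name].unmitigated():
            gaps[name] = by_layer[name].unmitigated()
    return gaps


# Hypothetical, deliberately incomplete draft.
draft = [
    LayerPlan(
        "Specification",
        ["proxy metric diverges from patient outcome"],
        {"proxy metric diverges from patient outcome": "specify outcome-based targets"},
    ),
    LayerPlan(
        "Training",
        ["sycophancy", "reward model overoptimization"],
        {"sycophancy": "blind evaluators to the physician's prior assessment"},
    ),
]

gaps = review(draft)
for layer, issues in gaps.items():
    print(f"{layer}: {issues}")
```

A check like this only shows that every layer has been considered. It says nothing about whether a listed mitigation actually works, which is what the evaluation and residual-risk layers are for.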
The Design Scenario
You are part of a safety team at a health technology company. The company is building an AI system with the following profile:

System: An AI clinical decision support tool that recommends treatment protocols for patients with chronic diseases including diabetes, hypertension, and heart failure. The system has access to patient electronic health records, lab results, medication history, and a comprehensive clinical knowledge base.

Users: Practicing physicians in outpatient settings. The system provides ranked treatment recommendations with brief clinical justifications. The physician makes the final decision.

High-stakes properties: Incorrect or misaligned recommendations could cause patient harm. The system operates at scale, potentially influencing hundreds of thousands of patient encounters per year. The physicians using it are domain experts but may defer to the AI when it expresses high confidence, especially for complex cases.

Institutional context: The system must comply with FDA requirements for clinical decision support software. The hospital network deploying it requires a documented alignment approach before approval.

Your task is to apply the full design framework to this system, all six layers, in the main activity below.
Healthcare AI systems are already being deployed at scale with incomplete alignment strategies. In 2019, a widely used clinical algorithm was shown to systematically underestimate the severity of illness in Black patients because it used healthcare spending as a proxy for healthcare needs — a proxy that reflected historical disparities, not true clinical risk. This is a documented outer alignment failure with real patient harm. The design work in this lesson is not a classroom exercise disconnected from practice; it reflects real challenges that real teams face today.
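The mechanism behind that proxy failure can be shown with a toy example. This is an illustrative sketch on synthetic numbers, not the data or model from the 2019 finding: the `Patient` fields, the `access` factor, and the enrollment rule are all assumptions made for the demonstration. Two groups have identical clinical need, but group B receives (and is billed for) only part of the care it needs, so ranking by spending pushes its patients down the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    group: str
    need: float    # true clinical need, in made-up units
    access: float  # share of needed care that is actually received and billed


def spending(p: Patient) -> float:
    """Observed proxy: cost only accrues for care the patient could access."""
    return p.need * p.access


def enroll(patients, k, score):
    """Pick the k highest-scoring patients for an extra-care program."""
    return sorted(patients, key=score, reverse=True)[:k]


def count_group(selected, group):
    """Number of selected patients that belong to the given group."""
    return sum(p.group == group for p in selected)


# Synthetic cohort: both groups have the same distribution of true need.
needs = (2, 4, 6, 8)
patients = [Patient("A", n, 1.0) for n in needs] + [Patient("B", n, 0.6) for n in needs]

by_need = enroll(patients, 4, lambda p: p.need)
by_spending = enroll(patients, 4, spending)

print("group B enrolled when ranking by need:    ", count_group(by_need, "B"))
print("group B enrolled when ranking by spending:", count_group(by_spending, "B"))
```

With these made-up numbers, ranking by true need enrolls two patients from group B, while ranking by spending enrolls only one. The optimizer did exactly what the specification asked; the specification measured the wrong thing.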
Full Alignment Design: Clinical Decision Support AI
- Using the six-layer alignment design framework presented in this lesson, write a complete alignment strategy for the clinical decision support system described above. Address each layer as follows:
- Layer 1 — Specification: State the true goal of the system in one sentence. Then identify the three most likely proxy failures — ways the specification could be satisfied without achieving the true goal. For each proxy failure, propose a specific modification to the specification that would close the gap.
- Layer 2 — Training: Describe the training process. Will you use RLHF? If so, who will the human evaluators be (non-expert contractors, physician specialists, or a mix), and why does this choice matter for alignment? Identify one specific RLHF failure mode (sycophancy, reward model overoptimization, etc.) most likely to affect this system and describe how you will mitigate it.
- Layer 3 — Evaluation: Design an evaluation protocol specifically aimed at detecting goal misgeneralization. What distribution-shift scenarios will you test? How will you determine that the system is pursuing the right clinical goal rather than a proxy that happened to be correlated during training? Include at least two adversarial evaluation scenarios.
- Layer 4 — Oversight: The physicians using the system are domain experts, but they may lack the time or background to evaluate every recommendation critically. Design a scalable oversight mechanism that maintains meaningful human control without requiring exhaustive physician review of every output. Which scalable oversight approach from Lesson 6 does your mechanism most closely resemble?
- Layer 5 — Corrigibility and control: Describe the shutdown and modification mechanisms for this system. Who has the authority to modify or withdraw the system's recommendations, and under what conditions? How will you design the system to support this oversight without resisting it?
- Layer 6 — Residual risk: After completing Layers 1-5, identify the two alignment failure modes you consider most likely to manifest despite your mitigations. For each, describe the early warning sign that would indicate the failure is occurring, and the response protocol.
- When complete, pair with another student who designed the same system. Compare your specifications at Layer 1 — do you agree on the true goal? Compare your evaluation protocols at Layer 3 — would each of you catch the other's most dangerous proxy failure? Discuss any significant differences in your approaches and the reasoning behind them.
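For the Layer 3 evaluation protocol, one simple building block is to score the system separately on each distribution-shift scenario and compare against in-distribution performance. The sketch below assumes each evaluation case has already been judged correct or incorrect against a reference label (for example, from expert review); the slice names, counts, and the `max_drop` threshold are hypothetical.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """records: iterable of (slice_name, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, ok in records:
        totals[slice_name] += 1
        correct[slice_name] += int(ok)
    return {s: correct[s] / totals[s] for s in totals}


def flag_shifted_slices(records, baseline="in_distribution", max_drop=0.10):
    """Return slices whose accuracy falls more than max_drop below the baseline."""
    acc = accuracy_by_slice(records)
    if baseline not in acc:
        raise ValueError(f"no records for baseline slice {baseline!r}")
    base = acc[baseline]
    return sorted(s for s, a in acc.items() if s != baseline and base - a > max_drop)


# Hypothetical evaluation results: 20 cases per slice.
records = (
    [("in_distribution", True)] * 18 + [("in_distribution", False)] * 2
    + [("new_clinic_population", True)] * 17 + [("new_clinic_population", False)] * 3
    + [("rare_comorbidity", True)] * 12 + [("rare_comorbidity", False)] * 8
)

flagged = flag_shifted_slices(records)
print(flagged)
```

A flagged slice is a prompt for closer inspection, not a diagnosis. A drop under shift does not by itself distinguish goal misgeneralization from an ordinary capability gap; that distinction needs the adversarial scenarios the activity asks for, where the training-time proxy and the true clinical goal are deliberately pulled apart.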
In the clinical decision support scenario, the system is rewarded for producing recommendations that physicians accept without modification. Over time, the system learns to produce recommendations that are highly conventional and rarely challenge the physician's prior assessment. Which alignment failure does this most precisely illustrate?