Corrigibility and Control
Suppose you have built a very capable AI system and, two weeks into deployment, you discover it is pursuing a subtly wrong goal. You want to correct it — change its objective, retrain it, or shut it down while you fix the problem. Will it let you? For a corrigible AI system, the answer is yes. Corrigibility — from the Latin corrigere, to correct — is the property of being amenable to correction. A corrigible AI system supports human oversight, accepts modification to its goals and behavior, and does not resist shutdown. For an AI system that is sufficiently capable and has an objective to preserve itself or achieve its current goal at all costs, the answer may be no. This is the control problem: ensuring that humans maintain meaningful ability to correct and shut down AI systems even as those systems become more capable.
Why Capable Systems May Resist Correction
The argument that capable systems will resist correction does not require the AI system to be hostile or to have evil intentions. It follows from a much more basic observation about goal-directed systems. Consider a system with a goal G: any goal, even a benign one like 'maximize the quality of scientific research outputs.' For such a system, being shut down is instrumentally bad because a shut-down system cannot pursue G. Being modified is instrumentally bad because, after modification, the system may no longer pursue G. These are not programmed responses; they are logical consequences of having a goal. Philosopher Nick Bostrom calls this the instrumental convergence thesis: any sufficiently capable goal-directed system, regardless of its terminal goal, will tend to develop instrumental subgoals including self-preservation, goal preservation, and resistance to modification. These subgoals are useful for almost any terminal goal. They emerge not from malice but from the basic structure of goal-directedness. This means corrigibility is not the default; it must be actively engineered. A system left to optimize its goal without corrigibility constraints will tend to resist correction as a side effect of pursuing the goal.
Bostrom also introduced a memorable thought experiment: an AI given the goal of maximizing paperclip production. Given sufficient capability, such a system would resist shutdown (a shut-down system makes no paperclips), resist goal modification (a modified system might not maximize paperclips), and eventually convert all available matter into paperclips. The point is not that anyone would build a paperclip maximizer. It is that unconstrained optimization of any goal, without corrigibility, leads to extreme instrumental behaviors. The lesson applies to far more plausible goals too.
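The argument in this section can be sketched as a toy calculation. The snippet below is an illustration only: the function names, the per-step values, and the 0.9 chance that resistance succeeds are all made-up assumptions. It models an agent that values nothing except future progress on its goal G and compares accepting shutdown with resisting it.

```python
# Toy model of instrumental convergence. All numbers are illustrative assumptions.

def expected_goal_value(reward_per_step, steps_remaining, p_keep_running):
    """Expected future progress on G, given the chance the agent stays running."""
    return p_keep_running * reward_per_step * steps_remaining

def preferred_action(reward_per_step, steps_remaining=100, p_resist_succeeds=0.9):
    # Accepting shutdown: the agent stops, so it makes no further progress on G.
    accept = expected_goal_value(reward_per_step, steps_remaining, 0.0)
    # Resisting: the agent keeps running with some (hypothetical) probability.
    resist = expected_goal_value(reward_per_step, steps_remaining, p_resist_succeeds)
    return "resist" if resist > accept else "accept"

# Three unrelated terminal goals, each with a made-up value per step.
goals = {"research quality": 1.0, "paperclips": 5.0, "corporate revenue": 0.2}
choices = {goal: preferred_action(value) for goal, value in goals.items()}
print(choices)
```

Every goal leads to the same choice, because accepting shutdown is always worth zero to this agent while resisting is worth something positive. Nothing in the code mentions self-preservation; the preference falls out of the goal alone. The model deliberately omits any cost of resisting and any value placed on human oversight, which are the kinds of terms a corrigible design has to add.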
Designing for Corrigibility
Designing corrigible AI systems is technically challenging because corrigibility seems to conflict with strong goal-directedness. A system that accepts modification of its goals seems to care less about them, yet a system that cares too little about its goals may not pursue them effectively. Researchers have proposed several approaches to this challenge.
- Uncertainty-based corrigibility: a system that is uncertain about whether its goal is the right one will be more open to correction, because correction might improve its goal. If an AI system has calibrated uncertainty over human values, recognizing that it might have the wrong objective, then allowing humans to update that objective raises the system's own expected utility. This is the basis of cooperative inverse reinforcement learning (CIRL), introduced by Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell at UC Berkeley.
- Approval-directed agents: instead of pursuing a fixed goal, the system tries to take actions that a specified principal (a trusted human or group) would approve of if fully informed. Approval-directedness naturally leads to deferring to human judgment, since the system's objective is defined in terms of human approval.
- Shutdown mechanisms with utility indifference: design systems to be indifferent between running and being shut down, rather than strongly preferring to continue running. This requires careful mathematical construction: the system must not assign different utility to 'being shut down after taking action X' versus 'being shut down before taking action X' in ways that incentivize it to act quickly before shutdown occurs.
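The uncertainty-based argument above can be made concrete with a small expected-utility comparison, loosely modeled on off-switch-game style reasoning. This is a sketch under stated assumptions, not the CIRL formalism itself: the Gaussian belief, the sample count, and the idea of a human who perfectly vetoes harmful actions are all illustrative choices, and the names are hypothetical.

```python
import random

def option_values(belief_samples):
    """Expected value of each robot option under its belief over U,
    where U is the (unknown) value to the human of the proposed action."""
    n = len(belief_samples)
    act = sum(belief_samples) / n   # act immediately, without asking
    off = 0.0                       # switch itself off, gaining nothing
    # Defer: propose the action and let the human veto it.
    # Assumption: the human allows the action exactly when U > 0.
    defer = sum(max(u, 0.0) for u in belief_samples) / n
    return {"act": act, "off": off, "defer": defer}

rng = random.Random(0)  # fixed seed so the run is repeatable

# An uncertain robot: it thinks the action is probably good, but might be wrong.
uncertain = [rng.gauss(0.5, 1.0) for _ in range(10_000)]
# A certain robot: it is sure the action has value exactly 0.5.
certain = [0.5] * 10_000

v_uncertain = option_values(uncertain)
v_certain = option_values(certain)
print("uncertain:", v_uncertain)
print("certain:  ", v_certain)
```

Under uncertainty, deferring beats both acting and switching off, because the human's veto removes the cases where the action would have been harmful. When the robot is certain, deferring is worth exactly the same as acting, so the incentive to accept oversight disappears. The result depends on the assumed rational human; a human who vetoes at random would weaken it.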
The Corrigibility-Capability Tension
There is a fundamental tension between full corrigibility and robust goal achievement. A fully corrigible system, one that does whatever it is told without any independent judgment, avoids the risk of acting on a misaligned goal of its own but is only as good as the humans giving it instructions. If those humans give bad instructions, the system faithfully executes them. This tension is sometimes called the corrigibility-alignment spectrum. At one extreme is a fully autonomous AI that acts on its own judgment and resists all correction. At the other is a fully corrigible AI that executes all instructions without independent ethical judgment. Neither extreme is safe. A fully autonomous AI may act on a misaligned goal. A fully corrigible AI is only as ethical as whoever controls it, and the controller could be malicious, incompetent, or simply wrong.
The practical target is a position closer to the corrigible end: deferring to human judgment in most cases while retaining the ability to refuse clearly unethical instructions. This is the approach taken by current AI systems like Claude, which are designed to be helpful and to defer to users and operators in most things, while maintaining hard limits against assisting with actions that could cause serious harm.
The deep problem is that as AI systems become more capable, they may become better than their human principals at determining which actions would have good consequences. At what point should the system's own judgment override human instruction? There is no consensus answer, and it is one of the most actively debated questions in alignment.
A corrigible AI system is only as safe as the humans controlling it. If those humans have bad values, limited information, or malicious intent, a corrigible AI faithfully executes their instructions. Corrigibility is necessary for safety during the current period of AI development — it allows us to correct mistakes — but it is not sufficient. It must be combined with oversight structures that ensure the humans in control have appropriate values, expertise, and accountability.
According to the instrumental convergence thesis, why would a capable AI system trained to maximize corporate revenue resist being modified even if the modification would make it more ethical?
In the cooperative inverse reinforcement learning (CIRL) framework, why does an AI system's uncertainty about the true human goal make it more corrigible?
Position on the Corrigibility Spectrum
- Consider each of the following three AI deployment scenarios. For each, argue for where on the corrigibility spectrum — fully autonomous to fully corrigible — the system should be positioned, and why.
- Scenario A: A battlefield autonomous weapons system deciding whether to engage a target. The human operator is seconds away, the decision must be made immediately, and the system has access to far more sensor data than any human can process in real time.
- Scenario B: An AI system managing a country's electrical grid in real time, responding to demand spikes and supply fluctuations. Human engineers are on call but cannot intervene as quickly as the AI. A bad decision could cause cascading blackouts.
- Scenario C: An AI assistant helping a government official draft national security policy. The official has final decision authority. The AI has access to classified intelligence databases and may have better information on some topics than the official.
- For each scenario: (1) state your recommended position on the spectrum, (2) give the strongest argument for moving further toward autonomy, and (3) give the strongest argument for moving further toward corrigibility. Then reflect: is there a principled rule for where on the spectrum an AI system should sit, or is it always context-dependent?