Interpretability: The Understanding Gap
A frontier language model has hundreds of billions of parameters — numerical values distributed across thousands of matrices, processed through dozens of layers of computation. When it produces an output, the computation that generated that output has involved billions of floating-point multiplications and additions. No human can trace that computation in detail. We can observe the input and the output; the middle is a black box. This is the interpretability problem: we build and deploy powerful AI systems whose internal computations we do not understand. The problem is not merely academic. Deployed AI systems make consequential decisions — in healthcare, criminal justice, credit, hiring, and military applications — and the inability to explain why a decision was made creates accountability, safety, and trust problems that have no easy technical resolution.
What Interpretability Means and Why It Matters
Interpretability research aims to answer questions of the form: why did this model produce this output? Which parts of the input were most influential? What internal representations does the model form? What is the model 'doing' computationally when it solves a task?
These questions have different answers depending on the level of analysis. At the coarsest level, input attribution methods try to identify which input tokens or pixels most influenced the output. For example, a saliency map highlights which words in a sentence most affected a sentiment classification. At a finer level, probing classifiers test whether specific model layers contain linearly decodable information about specific concepts (does layer 12 of a language model represent syntactic structure?). At the finest level, mechanistic interpretability attempts to reverse-engineer specific circuits: subgraphs of the neural network that implement specific algorithms.
Why does this matter practically? Consider a hospital deploying a model that reads patient notes and flags high readmission risk. The model achieves 85% accuracy, substantially better than the clinical baseline. The hospital wants to deploy it. A clinician asks: for this specific patient, why is the model predicting high risk? If the model is a black box, the answer is 'because the model says so.' That is not an acceptable clinical justification. The clinician cannot verify whether the model is tracking genuine medical predictors or spurious signals. They cannot explain the decision to the patient. And without seeing the model's reasoning, they cannot judge when their own medical knowledge should override it.
In medicine, law, finance, and public administration, decisions must be explainable to the people they affect. An AI system whose reasoning cannot be articulated poses more than a technical limitation: it creates an accountability gap with legal, ethical, and practical consequences.
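To make the coarsest level of analysis concrete, the following is a minimal sketch of occlusion-based input attribution, one simple member of the input-attribution family described above. It assumes a toy bag-of-words risk model loosely modeled on the readmission scenario. The vocabulary, weights, and note text are all hypothetical and invented for illustration; they are not clinical findings and not how any real system works.

```python
import math

# Hypothetical weights for a toy bag-of-words readmission-risk model.
# These numbers are invented for illustration only.
WEIGHTS = {
    "dyspnea": 1.2,
    "noncompliant": 0.9,
    "discharged": 0.1,
    "stable": -0.8,
    "friday": 0.7,  # a spurious signal the model might have picked up
}
BIAS = -1.0


def predict_risk(tokens):
    """Return the toy model's readmission probability for a tokenized note."""
    score = BIAS + sum(WEIGHTS.get(tok, 0.0) for tok in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def occlusion_attribution(tokens):
    """Score each token by how much the prediction drops when it is removed."""
    full = predict_risk(tokens)
    scores = []
    for i, tok in enumerate(tokens):
        without = tokens[:i] + tokens[i + 1:]
        scores.append((tok, full - predict_risk(without)))
    return scores


note = "patient noncompliant with dyspnea discharged friday".split()
attributions = occlusion_attribution(note)

# Print tokens from most to least influential.
for tok, delta in sorted(attributions, key=lambda pair: -abs(pair[1])):
    print(f"{tok:>14s}  {delta:+.3f}")
```

In this sketch the attribution surfaces the day of discharge as an influential token, which is the kind of spurious signal a clinician would want to catch. Note that the scores describe how the output responds to deleting a token, not the computation the model actually performs.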
Current interpretability methods fall into two broad categories: post-hoc and intrinsic. Post-hoc interpretability applies to a trained model after the fact. The model is treated as a fixed function, and tools analyze its behavior. LIME (Local Interpretable Model-agnostic Explanations) fits a simple interpretable model (typically a linear function) to a model's behavior in a local neighborhood around a specific input, using that linear approximation as an explanation. SHAP (SHapley Additive exPlanations) uses Shapley values from cooperative game theory to assign each input feature a contribution score for a specific prediction.
These methods have significant limitations. They describe behavior in a local neighborhood, not the model's actual computation. Two very different internal mechanisms can produce the same local explanation. Adversarial examples (inputs slightly modified to cause misclassification) reveal that post-hoc explanations can be stable even as the model's output flips. If the explanation does not change when the output changes, the explanation is not capturing the actual decision mechanism.
Intrinsic interpretability trains models that are interpretable by design, such as decision trees, rule lists, and linear models. These are interpretable because their structure is human-readable. The limitation is a capability-interpretability tradeoff: for most complex tasks, interpretable models are substantially less capable than deep neural networks. Choosing full interpretability means accepting a lower performance ceiling.
Mechanistic interpretability is a newer research direction, pioneered partly by Anthropic's interpretability team and researchers like Chris Olah. It attempts to understand what neural networks compute at the circuit level by identifying the specific neurons, attention heads, and interactions between them that implement specific algorithms.
Early results are striking: circuits have been found that implement induction (detecting repeated patterns), indirect object identification in language, and modular arithmetic. But these circuits have been found in small models or specific small components of large models; scaling mechanistic understanding to an entire frontier model remains an open problem.
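The game-theoretic attribution behind SHAP, described earlier in this section, can be made concrete with a small worked example. The sketch below computes exact Shapley values by brute force for a three-feature toy scorer. It does not use the `shap` library; the `model` function, its coefficients, the feature names, and the all-zeros baseline are assumptions chosen for illustration.

```python
from itertools import permutations


def model(x):
    """Toy credit-style scorer with an interaction term (hypothetical)."""
    income, debt, years = x
    return 2.0 * income - 1.0 * debt + 0.5 * income * years


def shapley_values(f, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every order in which features can be switched from baseline to x."""
    n = len(x)
    totals = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        prev = f(current)
        for i in order:
            current[i] = x[i]  # reveal feature i at its real value
            new = f(current)
            totals[i] += new - prev
            prev = new
    return [t / len(orders) for t in totals]


x = [3.0, 1.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)

print("contributions:", phi)
print("sum of contributions:", sum(phi))
print("f(x) - f(baseline):  ", model(x) - model(baseline))
```

The contributions always sum to the difference between the prediction and the baseline prediction, and the interaction term is split evenly between the two features involved. The loop visits every ordering of the features, so the cost grows factorially with the number of features, and practical tools rely on approximations. As with other post-hoc methods, the scores summarize input-output behavior relative to a chosen baseline rather than the model's internal mechanism.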
The Stakes and Open State of the Problem
The interpretability gap is widening: models are becoming more powerful faster than interpretability tools can explain them. This creates a growing accountability deficit. As AI systems take on more consequential roles, our ability to audit their reasoning decreases relative to the complexity of that reasoning.
From a safety perspective, interpretability is critical for detecting misalignment: the concern that a model has learned objectives or values that differ from what its designers intended. Without the ability to examine what a model is optimizing for internally, detecting misalignment requires purely behavioral testing, which is subject to exactly the distribution shift and brittleness problems discussed in previous lessons. A model might behave correctly under all tested conditions while pursuing a subtly different goal that only manifests on out-of-distribution inputs or in novel situations.
The EU AI Act and similar regulatory frameworks impose transparency and explanation obligations on automated decisions in high-stakes contexts. A 'right to explanation' is widely read into the GDPR's provisions on automated decision-making, although its exact scope is debated. These legal requirements create practical pressure: you cannot deploy a black-box model for certain decisions if you are legally required to explain them. Current post-hoc interpretability methods do not reliably produce explanations faithful enough to the model's actual decision process to meet such standards.
Mechanistic interpretability remains one of the most important open problems in the field. The reason is not a fear of what bad actors might do with a solution; it is that without a solution, the field cannot verify that its most powerful systems are doing what they are supposed to be doing.
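The limits of purely behavioral testing described above can be illustrated with a toy sketch. It assumes two hand-written rules: `intended_rule` stands for what the designers wanted, and `learned_rule` stands for a proxy the model hypothetically learned instead. Both rules, the feature names, and the data are invented for illustration and do not represent any real system.

```python
def intended_rule(sample):
    """What the designers wanted: flag a sample when the true signal is high."""
    return sample["signal"] > 0.5


def learned_rule(sample):
    """What the model (hypothetically) learned: a proxy feature that tracked
    the signal perfectly in the data used for training and testing."""
    return sample["proxy"] > 0.5


def agreement(rule_a, rule_b, samples):
    """Fraction of samples on which the two rules give the same answer."""
    same = sum(rule_a(s) == rule_b(s) for s in samples)
    return same / len(samples)


# In-distribution test set: the proxy always equals the signal.
in_dist = [{"signal": v / 10, "proxy": v / 10} for v in range(11)]

# Out-of-distribution set: the correlation breaks (the proxy is inverted).
out_dist = [{"signal": v / 10, "proxy": 1 - v / 10} for v in range(11)]

in_agree = agreement(intended_rule, learned_rule, in_dist)
out_agree = agreement(intended_rule, learned_rule, out_dist)

print("in-distribution agreement:    ", in_agree)
print("out-of-distribution agreement:", out_agree)
```

On the in-distribution set the two rules are behaviorally indistinguishable, so no amount of testing on that set reveals which one the model implements. Here the difference is visible only because the source of each rule can be read directly, which is the kind of access interpretability research aims to provide for neural networks.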
A bank is legally required to give applicants a reason when its loan model denies their application. The bank uses LIME to generate explanations. A researcher finds that LIME's explanation for a rejected applicant stays the same even when the applicant's input is changed in ways that flip the model's decision. What does this reveal?
Which of the following best describes the capability-interpretability tradeoff?
The Interpretability Stakeholder Map
- Consider an AI model used to recommend which prison inmates should be granted parole. The model takes as input: offense history, behavior reports, age, sentence length, program participation, and psychological assessment scores. The output is a risk score from 1 to 10 that informs the parole board's decision.
- Step 1: List five stakeholders who have a legitimate interest in interpreting or explaining this model's decisions. For each, state what specific question they need the model to answer and why.
- Step 2: For each stakeholder, evaluate whether current interpretability methods (SHAP, LIME, probing, mechanistic circuits) could realistically satisfy their need. Explain your reasoning.
- Step 3: Identify which stakeholder has the most urgent interpretability need and which has the least. Justify your ranking.
- Step 4: Write a one-paragraph policy recommendation: under what conditions, if any, should this model be permitted to inform parole decisions given the current state of interpretability research?
- Step 5: Discuss as a class: Is 'it works better than humans on average' a sufficient justification for deploying an uninterpretable model in this context? What would you need to believe to answer yes?