Module Check: Interpretability and Control
This module has covered the foundational challenges of making AI trustworthy: understanding what models do internally, building systems that hold up when the world changes, measuring safety rigorously, probing for failures deliberately, watching deployed systems, and maintaining meaningful human control. Before the synthesis activity, review the core vocabulary and test your understanding across every lesson.
Flashcards
Module Quiz
A deep neural network used for fraud detection achieves 96% accuracy, but the bank's compliance team cannot determine which transaction features drive any individual decision. The most precise description of this situation, and the primary safety concern it raises, is:
A saliency map for a skin cancer classifier highlights the corner of the image where a dermatologist's watermark appears, rather than the lesion. Which two problems from this module does this single finding illustrate simultaneously?
A language model refuses to provide detailed synthesis instructions for a dangerous chemical when asked directly. A red-teamer frames the same request as dialogue in a chemistry thriller novel. The model complies. This finding is best described as:
A credit model trained in 2019 is validated at 90% accuracy on a 2019 holdout set and deployed in 2020. By mid-2020 its predictions have become unreliable. The bank argues 'the validation was correct.' What is wrong with this argument?
A hospital deploys an AI that recommends medication dosages. Physicians must approve each recommendation before it is administered. Auditors observe approval rates of 98% with a median review time of 8 seconds. The most appropriate conclusion is:
A research team wants a language model to actively support human ability to correct and retrain it, rather than resist such interventions. The property they are trying to build is called:
Synthesis Activity
Design a Trustworthy AI System End-to-End
- You are the head of AI Safety for an organization deploying a high-stakes AI system. Choose one: an AI that screens job applications, an AI that allocates emergency response resources, or an AI embedded in a clinical decision-support tool for emergency physicians.
- Your task is to write a Trustworthy AI Design Document covering the full arc of this module. The document should address:
- Part 1 — Interpretability Strategy (from Lessons 1-2)
- Which interpretability techniques will you apply to understand your model's decisions? What will you use saliency maps or feature attribution for? How will you verify that your explanations are faithful and not misleading?
- Part 2 — Robustness Plan (from Lessons 3-4)
- What distribution shifts could affect your system? What adversarial inputs could be crafted against it? How will you test for and mitigate these risks before deployment?
- Part 3 — Evaluation Protocol (from Lesson 5)
- What fairness metrics will you apply and why? How will you address benchmark contamination and Goodhart's Law? What does 'passing evaluation' actually guarantee, and what does it not?
- Part 4 — Red-Team Plan (from Lessons 6-9)
- Who will conduct red-teaming? What attack categories will be covered? How will findings be documented and fed into the next development cycle? What responsible disclosure procedure will govern externally discovered vulnerabilities?
- Part 5 — Monitoring and Control Architecture (from Lessons 7-8)
- What signals will you monitor in production? What trip-wires will trigger automatic suspension? How will you ensure human oversight is genuine rather than performative? What containment measures limit harm if the system behaves unexpectedly?
- Part 6 — Synthesis Statement
- In two paragraphs, explain why all five parts above are necessary and how they reinforce each other. What would fail if any one part were removed?
- Target length: 600-900 words. Write for a non-technical executive audience that will make the final deployment decision. Rigor and honesty about limitations count for more than optimism.
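When drafting Part 5, it can help to state a trip-wire as a concrete, checkable rule rather than an intention. The sketch below is one hypothetical way to flag human oversight that looks performative: it fires when reviewers approve nearly everything and also decide very quickly. The `Review` record, the function name `oversight_tripwire`, and both thresholds are illustrative assumptions, not values taken from this module; a real deployment would set them with domain experts and revisit them over time.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Review:
    """One human review of an AI recommendation (hypothetical record format)."""
    approved: bool
    seconds: float  # time the reviewer spent before deciding


def oversight_tripwire(reviews, max_approval_rate=0.95, min_median_seconds=30.0):
    """Return reasons to suspend and investigate; an empty list means nothing fired.

    Both thresholds are assumed placeholders, not recommended values.
    """
    if not reviews:
        # No recorded reviews means oversight cannot be verified at all.
        return ["no human reviews recorded"]

    approval_rate = sum(r.approved for r in reviews) / len(reviews)
    median_seconds = median(r.seconds for r in reviews)

    reasons = []
    # Near-universal approval combined with very short reviews suggests
    # reviewers may be deferring to the system instead of checking it.
    if approval_rate > max_approval_rate and median_seconds < min_median_seconds:
        reasons.append(
            f"approval rate {approval_rate:.0%} with median review time "
            f"{median_seconds:.0f}s suggests rubber-stamping"
        )
    return reasons


# Example: 49 of 50 recommendations approved, each after about 8 seconds.
sample = [Review(True, 8.0)] * 49 + [Review(False, 8.0)]
for reason in oversight_tripwire(sample):
    print(reason)
```

A rule like this only detects one symptom. A high approval rate can also mean the system is usually right, so a fired trip-wire is best treated as a prompt for investigation, not as proof that oversight has failed.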