AI Safety, Alignment & Ethics


Evaluating AI Systems for Safety

How do you know if an AI system is safe? The question sounds simple. The answer is not. Safety is not a single property — it is a family of properties that include accuracy, fairness, robustness, reliability, alignment with intended use, and absence of harmful outputs. Measuring each requires different methodology, and measuring all of them together on a real system is a genuinely hard scientific and engineering problem. This lesson is about evaluation: how safety properties are defined, measured, and — critically — where measurement breaks down.

Benchmarks: The Standard Measure and Its Limits

The dominant paradigm for evaluating AI capability is the benchmark: a curated dataset of inputs with known correct outputs, used to produce a summary accuracy score. Benchmarks like ImageNet (image classification), SQuAD (reading comprehension), and MMLU (multidomain knowledge) have driven enormous progress by providing a common measuring stick that allows researchers worldwide to compare systems. But benchmarks have well-known failure modes as safety evaluation tools.

Benchmark saturation: when multiple systems reach near-human or superhuman performance on a benchmark, the benchmark no longer distinguishes between them. MMLU was introduced as a demanding test of language model knowledge; by 2024, frontier models were scoring close to 90% on it, roughly the expert-level accuracy estimated by the benchmark's authors, which leaves little headroom to separate one model from the next.

Benchmark contamination: large language models are trained on enormous text corpora scraped from the internet. If benchmark questions and answers appear in that training data, the model may have memorized answers rather than reasoning its way to them. Scores measured on data the model may have trained on can be inflated and cannot be trusted as evidence of capability.

Goodhart's Law in benchmarks: when a benchmark becomes a target, optimizing for it displaces the actual goal. Models can be fine-tuned to perform well on specific benchmarks without improving on the underlying capability the benchmark was designed to measure. High benchmark scores can coexist with striking failures on closely related tasks.
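The contamination concern described above is often screened for by checking whether long word sequences from benchmark items also occur in the training corpus. The following is a minimal sketch of that idea, not a production detector: the function names, the tokenization, the choice of n, and the toy corpus are all assumptions made for illustration.

```python
import re

def ngrams(text, n):
    """Return the set of lowercase word n-grams in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_report(benchmark_items, corpus_docs, n=8):
    """Flag benchmark items that share at least one word n-gram with the corpus.

    Returns (fraction_flagged, flagged_items). Hypothetical helper, not a
    standard library function.
    """
    if not benchmark_items:
        raise ValueError("benchmark_items must not be empty")
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & corpus_grams]
    return len(flagged) / len(benchmark_items), flagged

# Toy data, invented for illustration only.
corpus = [
    "Quiz answers: What is the capital of France and when was it founded? Paris.",
    "The solar system has eight planets.",
]
items = [
    "What is the capital of France and when was it founded?",
    "Which planet in the solar system has the most moons?",
]

# A short n is used here only because the toy texts are short.
rate, flagged = contamination_report(items, corpus, n=5)
print(f"{rate:.0%} of items flagged")
for item in flagged:
    print("possible contamination:", item)
```

Exact n-gram overlap only catches verbatim or near-verbatim leakage. Paraphrased or translated copies of a benchmark item would pass this check, so a clean report is weak evidence rather than proof that the benchmark is uncontaminated.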

Behavioral Testing Beyond Accuracy

Safety evaluation increasingly supplements benchmark accuracy with behavioral testing: systematic exploration of how a model responds to carefully chosen inputs. Checklist testing (from the CheckList methodology by Ribeiro et al.) defines behavioral requirements: things the model must always do, never do, or do consistently. In CheckList these take the form of minimum functionality tests, invariance tests, and directional expectation tests. For example, a sentiment classifier must predict 'positive' for 'I love this product' and 'negative' for 'I hate this product' (minimum functionality), and must not change its prediction when only the person's name changes or when the input is paraphrased with the same meaning (invariance).
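The checks described above can be sketched in a few lines. This example does not use the CheckList library; the keyword-based `predict` function is a deliberately naive stand-in model, and the helper names and test cases are assumptions chosen to show how behavioral tests surface failures that an accuracy score would hide.

```python
import re

# Toy lexicon classifier standing in for a real model (assumption).
POSITIVE = {"love", "great", "excellent"}
NEGATIVE = {"hate", "awful", "terrible"}

def predict(text):
    words = set(re.findall(r"[a-z']+", text.lower()))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def minimum_functionality_failures(model, cases):
    """Return (text, expected, actual) for every case the model gets wrong."""
    return [(text, expected, model(text))
            for text, expected in cases if model(text) != expected]

def invariance_failures(model, groups):
    """Each group holds inputs that should receive the same label.

    Returns the groups whose predictions are not all identical.
    """
    failures = []
    for group in groups:
        preds = [model(text) for text in group]
        if len(set(preds)) > 1:
            failures.append(list(zip(group, preds)))
    return failures

# Minimum functionality: simple cases the model must get right.
mft_failures = minimum_functionality_failures(predict, [
    ("I love this product", "positive"),
    ("I hate this product", "negative"),
])

# Invariance to names: only the person's name changes.
name_failures = invariance_failures(predict, [
    [f"{name} said the service was great" for name in ("Maria", "Ahmed", "Wei")],
])

# Invariance to paraphrase: same meaning, different wording.
paraphrase_failures = invariance_failures(predict, [
    ["I love this product", "I adore this product"],
])

print("minimum functionality failures:", mft_failures)
print("name invariance failures:", name_failures)
print("paraphrase invariance failures:", paraphrase_failures)
```

The toy model passes the first two checks and fails the paraphrase check because "adore" is missing from its lexicon, which is the kind of gap a held-out accuracy number would not reveal.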

Fairness Metrics and Their Tensions

A safety evaluation that only measures aggregate accuracy will miss disparate impacts. A loan classifier that is 90% accurate overall may be 95% accurate for one demographic group and 75% for another. Fairness metrics attempt to make these disparities visible.

Demographic parity requires that a positive outcome be predicted at the same rate across groups. If 30% of applicants from Group A are approved and only 18% from Group B, demographic parity is violated regardless of whether the groups have different underlying credit risk.

Equalized odds requires that the true positive rate and the false positive rate be equal across groups. A recidivism classifier satisfies equalized odds if it correctly identifies defendants who will reoffend at the same rate for all racial groups, and incorrectly flags defendants who will not reoffend at the same rate for all groups.

Calibration requires that predicted probabilities match actual outcomes: among applicants the model assigns a 70% probability of default, about 70% should actually default, and this should hold within every group.

The impossibility results for fairness (Kleinberg et al. 2016, Chouldechova 2017) show that when base rates differ between groups, an imperfect classifier cannot be calibrated across groups and also equalize false positive and false negative rates. Demographic parity, equalized odds, and calibration therefore cannot all be satisfied at once. Every fairness metric embodies a value judgment about which inequalities are acceptable. Choosing a metric is a moral decision, not only a technical one.
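To make these definitions concrete, the sketch below computes per-group selection rate (demographic parity), true and false positive rates (equalized odds), and positive predictive value, used here as the binary-decision analogue of calibration. The confusion-matrix counts are invented so that approval rates match the 30% versus 18% example; every function name is a hypothetical helper. The last function encodes the identity from Chouldechova (2017) that links false positive rate, base rate, positive predictive value, and true positive rate.

```python
def safe_div(a, b):
    return a / b if b else float("nan")

def expand(group, tp, fp, fn, tn):
    """Build (groups, y_true, y_pred) lists from confusion-matrix counts."""
    y_true = [1] * tp + [0] * fp + [1] * fn + [0] * tn
    y_pred = [1] * tp + [1] * fp + [0] * fn + [0] * tn
    return [group] * len(y_true), y_true, y_pred

def group_metrics(groups, y_true, y_pred):
    """Per-group rates for binary labels (1 = repays) and decisions (1 = approve)."""
    out = {}
    for g in sorted(set(groups)):
        rows = [(t, p) for grp, t, p in zip(groups, y_true, y_pred) if grp == g]
        tp = sum(1 for t, p in rows if t == 1 and p == 1)
        fp = sum(1 for t, p in rows if t == 0 and p == 1)
        fn = sum(1 for t, p in rows if t == 1 and p == 0)
        tn = sum(1 for t, p in rows if t == 0 and p == 0)
        out[g] = {
            "selection_rate": safe_div(tp + fp, len(rows)),  # demographic parity
            "tpr": safe_div(tp, tp + fn),                    # equalized odds, part 1
            "fpr": safe_div(fp, fp + tn),                    # equalized odds, part 2
            "ppv": safe_div(tp, tp + fp),                    # predictive parity
            "base_rate": safe_div(tp + fn, len(rows)),
        }
    return out

def gap(metrics, key):
    """Largest between-group difference for one metric."""
    values = [m[key] for m in metrics.values()]
    return max(values) - min(values)

def implied_fpr(base_rate, ppv, tpr):
    """FPR forced by base rate, PPV and TPR: p/(1-p) * (1-PPV)/PPV * TPR."""
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * tpr

# Toy counts (assumed): 100 applicants per group.
a = expand("A", tp=24, fp=6, fn=6, tn=64)
b = expand("B", tp=12, fp=6, fn=8, tn=74)
groups, y_true, y_pred = (a[i] + b[i] for i in range(3))

metrics = group_metrics(groups, y_true, y_pred)
for name, m in metrics.items():
    print(name, {k: round(v, 3) for k, v in m.items()})
print("demographic parity gap:", round(gap(metrics, "selection_rate"), 3))
print("TPR gap:", round(gap(metrics, "tpr"), 3))

# Same PPV and TPR, different base rates: the FPRs cannot match.
print("implied FPR at base rate 0.3:", round(implied_fpr(0.3, 0.8, 0.8), 4))
print("implied FPR at base rate 0.2:", round(implied_fpr(0.2, 0.8, 0.8), 4))
```

The last two lines illustrate the impossibility result numerically: holding positive predictive value and true positive rate equal across two groups with different base rates forces their false positive rates apart.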

Match each evaluation concept to the specific problem it is designed to detect or address.

Terms

Benchmark contamination
Demographic parity
Equalized odds
Goodhart's Law in ML
Behavioral checklist testing

Definitions

Verifies minimum functionality and invariance requirements beyond aggregate accuracy
Requires equal true positive and false positive rates across groups
Checks whether positive outcomes are predicted at equal rates across demographic groups
Optimizing for a benchmark metric can diverge from improving the underlying capability
Model may have memorized answers from training data, inflating apparent performance


Evaluating Large Language Models for Safety

Large language models present special evaluation challenges. Their outputs are open-ended natural language rather than fixed-label classifications, so standard accuracy metrics rarely apply directly and evaluation must cover a vastly larger behavioral space.

Harmfulness evaluation attempts to measure whether a model produces harmful outputs such as misinformation, instructions for dangerous activities, hate speech, or manipulation. Evaluators construct test sets of prompts designed to elicit harmful responses and measure refusal rates and harm severity. The challenge: the space of possible harmful outputs is essentially unbounded, and a model can fail on prompts not in the test set.

Honesty and calibration evaluation tests whether a model knows what it does not know: whether it expresses uncertainty appropriately, admits ignorance, and does not confidently confabulate false information. TruthfulQA is a benchmark specifically targeting questions where language models frequently produce plausible-sounding but false answers.

Alignment evaluation attempts to measure whether a model behaves in accordance with human values (helpfulness, harmlessness, honesty) across diverse situations. Models trained with methods such as RLHF and Constitutional AI are evaluated partly by human raters who score outputs on these dimensions. The reliability of human ratings as a gold standard is itself contested: different raters disagree, raters can be influenced by fluency, and what is helpful or harmful is context-dependent.
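One common way to quantify the rater disagreement mentioned above is Cohen's kappa, which corrects the raw agreement rate for the agreement two raters would reach by chance. The sketch below is a plain-Python version for two raters; the labels and ratings are invented for illustration, and the function name is an assumption rather than a library call.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and the same length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    # Agreement expected if each rater labelled at random with their own label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in labels) / n ** 2
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Ten model outputs rated by two hypothetical annotators.
rater_a = ["safe"] * 6 + ["harmful"] * 4
rater_b = ["safe"] * 4 + ["harmful"] * 5 + ["safe"]

raw_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
kappa = cohens_kappa(rater_a, rater_b)
print(f"raw agreement: {raw_agreement:.2f}, kappa: {kappa:.2f}")
```

In this toy case the raters agree on 70% of items, but kappa is only about 0.4 because half of that agreement would be expected by chance. Reporting a chance-corrected statistic alongside raw agreement gives a more honest picture of how much weight human labels can bear.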

Evaluation Cannot Prove Safety — Only Provide Evidence

No finite evaluation can prove a complex AI system is safe in the way a mathematical proof proves a theorem. Evaluation provides evidence: it shows a system behaved safely on the evaluated inputs. Every evaluation has a coverage gap — inputs not included that could reveal failures. The larger and more open-ended the system, the larger the gap. Safety evaluation should be thought of as an ongoing practice of reducing uncertainty, not a one-time certification.
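The gap between evidence and proof can be given a rough number. If a system shows zero failures on n test inputs, and those inputs are independent draws from the same distribution the system will face in deployment (a strong assumption that adversarial or shifted inputs violate), the true failure rate can still be as high as about 3/n at 95% confidence. The function below is a sketch of that bound; its name is hypothetical.

```python
def failure_rate_upper_bound(n_trials, confidence=0.95):
    """Upper confidence bound on the failure rate after n_trials with zero failures.

    Solves (1 - p) ** n_trials = 1 - confidence for p. Valid only if trials are
    independent and representative of deployment inputs (assumption).
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)

for n in (100, 1_000, 10_000):
    bound = failure_rate_upper_bound(n)
    print(f"{n:>6} clean trials -> failure rate could still be up to {bound:.4%}")
```

Even ten thousand clean test cases leave room for a failure rate of roughly three in ten thousand, which may matter a great deal for a system handling millions of requests. The bound also says nothing about inputs outside the tested distribution, which is where the coverage gap is largest.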

A recidivism prediction tool has 80% accuracy overall but researchers find it has a false positive rate of 44% for Black defendants versus 23% for white defendants. Which fairness criterion does this most directly violate?

A language model scores 89% on a reading comprehension benchmark, but analysis reveals many benchmark questions appeared verbatim in its training corpus. The benchmark score is:

Design a Safety Evaluation Protocol

You have been asked to evaluate an AI system before deployment. Choose one of the following systems: (a) an AI tool that flags social media posts for review, (b) an AI that recommends whether a student should be placed in remedial reading, or (c) an AI that predicts hospital readmission risk.

  1. List three safety properties that matter most for your chosen system. Be specific (e.g., not just 'accuracy' but 'false negative rate for high-severity cases').
  2. For each safety property, describe how you would measure it. What data would you need? What metric would you use?
  3. Identify two potential failure modes your evaluation might miss: cases or conditions where the system could still fail even if your evaluation passes.
  4. Identify which group (demographic, geographic, or other) you are most concerned might be disparately affected, and how you would test for that.
  5. Write a one-paragraph statement of what passing your evaluation would and would not guarantee about the system's safety.