Measuring Fairness
Saying an AI system should be fair is easy. Actually measuring fairness — turning it into numbers you can check — turns out to be surprisingly hard. Researchers have identified over twenty distinct mathematical definitions of fairness, and they are not all compatible with each other. In this lesson, you will learn the most important fairness metrics, understand what they measure, and discover why no single metric tells the whole story.
Why Measurement Matters
You cannot improve what you do not measure. Before a team can make an AI system fairer, they need to know where it is unfair, for whom, and by how much. Fairness metrics give engineers a concrete way to compare groups — to see not just how well the system performs overall, but whether it performs equally well across different populations. This also means that claims like 'our AI is fair' or 'our AI is unbiased' must be backed by specific numbers. Which definition of fairness? Measured on which population? Under what conditions? Without answers to those questions, the claim is hollow.
Fairness is not just a feeling or a value. It can be turned into specific, quantifiable metrics that researchers and auditors can check and compare across groups.
Three Core Fairness Metrics
The three most commonly discussed fairness metrics each capture a different idea about what it means for an AI to treat groups equally.
Demographic parity asks: does the AI make positive decisions at the same rate for every group? If a hiring AI accepts 40 percent of resumes from Group A and only 20 percent from Group B, it fails demographic parity. This metric looks only at decision rates, so the gap counts as a failure even when the two groups have identical qualification levels.
Equal opportunity asks: among the people who should qualify, are they being identified correctly at the same rate in each group? If a loan approval AI correctly approves 90 percent of creditworthy applicants from Group A but only 70 percent of equally creditworthy applicants from Group B, it fails equal opportunity.
Predictive parity asks: when the AI makes a positive prediction, such as labeling someone high risk, is that prediction equally accurate for each group? If the system says 'high risk' and is right 80 percent of the time for Group A but only 60 percent of the time for Group B, it fails predictive parity.
Demographic parity: same acceptance rate across groups? Equal opportunity: same true-positive rate across groups? Predictive parity: same precision across groups?
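As a minimal sketch of how these three questions turn into numbers, the Python below computes each metric from per-group counts of correct and incorrect decisions. The class name `GroupCounts`, the function names, and the counts for the two groups are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class GroupCounts:
    """Decision counts for one group (hypothetical structure)."""
    tp: int  # qualified and accepted
    fp: int  # not qualified but accepted
    fn: int  # qualified but rejected
    tn: int  # not qualified and rejected


def selection_rate(c: GroupCounts) -> float:
    """Share of the whole group that received a positive decision."""
    return (c.tp + c.fp) / (c.tp + c.fp + c.fn + c.tn)


def true_positive_rate(c: GroupCounts) -> float:
    """Share of qualified people who were accepted."""
    return c.tp / (c.tp + c.fn)


def precision(c: GroupCounts) -> float:
    """Share of accepted people who were actually qualified."""
    return c.tp / (c.tp + c.fp)


# Made-up counts for illustration only.
group_a = GroupCounts(tp=90, fp=30, fn=10, tn=70)
group_b = GroupCounts(tp=70, fp=10, fn=30, tn=90)

for name, c in [("Group A", group_a), ("Group B", group_b)]:
    print(
        f"{name}: selection rate {selection_rate(c):.2f}, "
        f"true-positive rate {true_positive_rate(c):.2f}, "
        f"precision {precision(c):.2f}"
    )
```

Demographic parity compares the selection rates, equal opportunity compares the true-positive rates, and predictive parity compares the precision values. In these made-up numbers, all three differ between the groups.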
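Here is the difficult truth: when base rates differ between groups, it is mathematically impossible to satisfy all three of these metrics at once, except in trivial cases where the system never correctly selects anyone. The reason is that, within any group, the true-positive rate multiplied by the base rate always equals the precision multiplied by the selection rate. If three of those quantities match across groups, the fourth must match too. You may have to choose which definition of fairness to prioritize, and that choice has real consequences for real people. This is not a software bug waiting to be fixed. It is a fundamental tension that requires human values and social decisions, not just technical solutions.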
When the underlying rates of an outcome differ between groups, it is mathematically impossible to satisfy all three of these fairness metrics at the same time. Engineers and policymakers must decide which definition of fairness matters most for a given application.
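One way to see where the impossibility comes from is the short derivation below. The symbols are assumptions introduced here, not notation from the lesson: for a single group, n is its size, p its base rate (the share who truly qualify), s its selection rate, t its true-positive rate, and v its precision.

```latex
% Count the true positives (TP) in one group in two different ways.
\begin{align*}
\text{TP} &= t \cdot (p\,n) && \text{(a share $t$ of the $p\,n$ people who qualify)} \\
\text{TP} &= v \cdot (s\,n) && \text{(a share $v$ of the $s\,n$ people selected)} \\
\Rightarrow\quad t\,p &= v\,s .
\end{align*}
% Suppose groups A and B have the same s, t and v, with t > 0. Then
\[
p_A = \frac{v\,s}{t} = p_B ,
\]
% which contradicts the assumption that the base rates differ.
```

So equal selection rates, equal true-positive rates, and equal precision together force equal base rates. When the base rates differ, at least one of the three metrics has to give.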
Auditing for Fairness
A fairness audit is a systematic examination of an AI system's behavior across different groups. Audits typically involve collecting a test dataset that is carefully balanced to include representative samples from all relevant groups, running the system on that test data, and measuring the outcomes for each group separately. External audits — conducted by researchers or organizations independent of the company that built the system — are especially valuable because internal teams may have blind spots or incentives to overlook problems. Researchers like Joy Buolamwini and Timnit Gebru conducted exactly this kind of external audit when they published the Gender Shades study.
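The measurement step of an audit can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions, not the procedure of any particular audit: the record format `(group, actually_positive, predicted_positive)`, the function names `audit` and `largest_gap`, and the sample records are all hypothetical.

```python
from collections import defaultdict


def audit(records):
    """Compute per-group metrics from (group, actual, predicted) records.

    `actual` and `predicted` are booleans. A metric is None when its
    denominator is zero for that group.
    """
    counts = defaultdict(lambda: {"n": 0, "pos": 0, "pred": 0, "tp": 0})
    for group, actual, predicted in records:
        c = counts[group]
        c["n"] += 1
        c["pos"] += int(actual)
        c["pred"] += int(predicted)
        c["tp"] += int(actual and predicted)

    report = {}
    for group, c in counts.items():
        report[group] = {
            "selection_rate": c["pred"] / c["n"],
            "true_positive_rate": c["tp"] / c["pos"] if c["pos"] else None,
            "precision": c["tp"] / c["pred"] if c["pred"] else None,
        }
    return report


def largest_gap(report, metric):
    """Difference between the best and worst group on one metric."""
    values = [r[metric] for r in report.values() if r[metric] is not None]
    return max(values) - min(values)


# Made-up test data: ten records per group.
records = (
    [("A", True, True)] * 3 + [("A", True, False)] * 1
    + [("A", False, True)] * 1 + [("A", False, False)] * 5
    + [("B", True, True)] * 1 + [("B", True, False)] * 1
    + [("B", False, False)] * 8
)

report = audit(records)
for metric in ("selection_rate", "true_positive_rate", "precision"):
    print(metric, "gap:", round(largest_gap(report, metric), 2))
```

Reporting each metric per group, rather than one overall accuracy figure, is what lets an auditor see which group is affected and by how much.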
Match each fairness metric to what it specifically measures.
A college admissions AI accepts 60 percent of applicants from Group A and 35 percent from Group B, even though both groups have identical average test scores and grades. Which fairness metric does this violate?
Researchers discover that when a crime-prediction AI labels someone 'high risk,' it is correct for Group A about 80 percent of the time but only 55 percent of the time for Group B. Which metric is violated?
Fairness Audit Simulation
- Step 1: Imagine a university uses an AI to help decide scholarship eligibility. You have access to the following hypothetical results from a test run:
- Group A (500 applicants): 200 accepted, 80 of those were students who actually needed the scholarship.
- Group B (500 applicants): 100 accepted, 60 of those were students who actually needed the scholarship.
- Assume all applicants who needed the scholarship actually deserved it by the defined criteria.
- Step 2: Calculate the acceptance rate for each group. Does the system satisfy demographic parity?
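- Step 3: Among students who needed the scholarship, what percentage was correctly identified in each group? The results in Step 1 give only the number of accepted students who needed the scholarship, not the total number who needed it in each group, so first state what additional figure you would need. Then explain how you would use it to check equal opportunity.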
- Step 4: Of those accepted, what proportion actually needed it in each group? Does the system satisfy predictive parity?
- Step 5: Based on your analysis, write two sentences describing which groups are harmed and how, and which fairness metric you think matters most for scholarship decisions — and why.
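If you want to check your arithmetic for Steps 2 and 4, a short Python sketch like the following would do it. The figures are the hypothetical test-run results from Step 1. The dictionary layout and the function name `summarize` are illustrative assumptions. Step 3 is left out on purpose, because the results do not say how many applicants in each group needed the scholarship in total.

```python
# Hypothetical test-run results copied from Step 1.
results = {
    "Group A": {"applicants": 500, "accepted": 200, "accepted_needed": 80},
    "Group B": {"applicants": 500, "accepted": 100, "accepted_needed": 60},
}


def summarize(group):
    """Return the Step 2 and Step 4 quantities for one group."""
    return {
        # Step 2: share of all applicants who were accepted.
        "acceptance_rate": group["accepted"] / group["applicants"],
        # Step 4: share of accepted applicants who actually needed it.
        "precision": group["accepted_needed"] / group["accepted"],
    }


for name, group in results.items():
    s = summarize(group)
    print(
        f"{name}: acceptance rate {s['acceptance_rate']:.0%}, "
        f"precision {s['precision']:.0%}"
    )
```

Comparing the two acceptance rates answers the demographic parity question, and comparing the two precision values answers the predictive parity question.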