Real Examples of AI Bias
It would be easy to think of AI bias as a theoretical problem, something that might happen someday. But bias has already been documented in AI systems deployed to real people. Concrete examples matter because they turn an abstract concept into something specific and actionable. In this lesson, you will examine four well-documented cases, understand what went wrong in each, and consider what should have been done differently.
Case 1 — Amazon's Hiring Tool
In the mid-2010s, Amazon built an AI tool to help screen job applicants. The tool was trained on a decade of past hiring decisions. The problem: the tech industry had historically hired far more men than women, and the training data reflected that history, so resumes that led to hires skewed male. The AI learned this pattern and penalized resumes that included the word 'women's,' as in 'women's chess club captain.' It also downgraded graduates of certain all-women's colleges. Amazon reportedly edited the tool to ignore those specific terms, but could not guarantee that it would not find other ways to discriminate, and eventually scrapped it entirely. The lesson: an AI trained on historically biased hiring decisions will reproduce those biased decisions. The data recorded discrimination as if it were a success signal.
Amazon's resume-screening AI was trained on ten years of male-dominated hiring decisions. It learned to penalize female applicants, for example by marking down resumes that contained the word 'women's', and had to be abandoned.
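The mechanism can be illustrated with a deliberately tiny sketch. It is not Amazon's system, whose internals were never published. The records, the tokens, and the helper names `token_hire_rates` and `score` are all invented for illustration. The "model" simply scores each resume word by how often past resumes containing it led to a hire, which is enough to show how skewed outcomes turn a word into a penalty.

```python
from collections import Counter

# Hypothetical historical records: (set of resume tokens, was the applicant hired?).
# The data is made up and deliberately skewed against resumes containing "women's".
history = [
    ({"python", "chess", "club"}, True),
    ({"java", "captain"}, True),
    ({"python", "women's", "chess", "club"}, False),
    ({"java", "women's", "college"}, False),
    ({"python", "captain"}, True),
    ({"java", "chess"}, False),
]


def token_hire_rates(records):
    """For each token, return the fraction of resumes containing it that were hired."""
    seen = Counter()
    hired = Counter()
    for tokens, was_hired in records:
        for token in tokens:
            seen[token] += 1
            if was_hired:
                hired[token] += 1
    return {token: hired[token] / seen[token] for token in seen}


def score(tokens, rates):
    """Score a resume as the average hire rate of the tokens the model has seen."""
    known = [rates[token] for token in tokens if token in rates]
    return sum(known) / len(known) if known else 0.0


rates = token_hire_rates(history)

# Two resumes that are identical except for one word.
without_term = score({"python", "chess", "club"}, rates)
with_term = score({"python", "women's", "chess", "club"}, rates)

print(f"hire rate learned for \"women's\": {rates[chr(34).join(['']) + 'women' + chr(39) + 's']}")
print(f"score without the term: {without_term:.3f}")
print(f"score with the term:    {with_term:.3f}")
```

The two resumes differ only in one word, yet the second scores lower. Gender never appears as an explicit feature; the penalty comes entirely from the historical outcomes the model was given.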
Case 2 — COMPAS Recidivism Scores
COMPAS is a risk-assessment tool used in some US courts to estimate how likely a defendant is to re-offend after release. A 2016 investigation by the news outlet ProPublica found that COMPAS assigned significantly higher risk scores to Black defendants than to white defendants with similar criminal histories. White defendants who went on to re-offend were more often incorrectly labeled low risk. Black defendants who did not go on to re-offend were more often incorrectly labeled high risk. The tool's creators disputed some of the analysis, and the debate over how to measure fairness in this context continues. But the core finding, that different groups experienced different error rates, illustrates a fundamental challenge: when a model is wrong, it is not always wrong equally for everyone.
A model can be 'accurate on average' while producing very different error rates for different groups. A high false-positive rate for one group alongside a high false-negative rate for another is a form of bias, even if overall accuracy looks fine.
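A short sketch makes the point concrete. The data below is invented, not COMPAS output, and the function name `error_rates_by_group` and the group labels "A" and "B" are assumptions made for illustration. It computes accuracy, false-positive rate, and false-negative rate separately for each group.

```python
from collections import defaultdict


def error_rates_by_group(records):
    """records: iterable of (group, predicted_high_risk, actually_reoffended)."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for group, predicted, actual in records:
        if predicted and actual:
            counts[group]["tp"] += 1   # correctly labeled high risk
        elif predicted and not actual:
            counts[group]["fp"] += 1   # labeled high risk, did not re-offend
        elif not predicted and actual:
            counts[group]["fn"] += 1   # labeled low risk, did re-offend
        else:
            counts[group]["tn"] += 1   # correctly labeled low risk

    rates = {}
    for group, c in counts.items():
        total = sum(c.values())
        negatives = c["fp"] + c["tn"]  # people who did not re-offend
        positives = c["tp"] + c["fn"]  # people who did re-offend
        rates[group] = {
            "accuracy": (c["tp"] + c["tn"]) / total,
            "false_positive_rate": c["fp"] / negatives if negatives else None,
            "false_negative_rate": c["fn"] / positives if positives else None,
        }
    return rates


# Invented data: ten people per group, same accuracy, opposite kinds of mistakes.
records = (
    [("A", True, True)] * 3 + [("A", True, False)] * 3
    + [("A", False, True)] * 1 + [("A", False, False)] * 3
    + [("B", True, True)] * 3 + [("B", True, False)] * 1
    + [("B", False, True)] * 3 + [("B", False, False)] * 3
)

rates = error_rates_by_group(records)
for group, r in sorted(rates.items()):
    print(group, r)
```

In this toy data both groups have the same accuracy, but group A is wrongly flagged as high risk twice as often as group B, while group B is wrongly cleared twice as often as group A. An accuracy figure alone would hide that difference.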
Case 3 — Pulse Oximeters and Skin Tone
Pulse oximeters are devices clipped to a finger to measure blood oxygen levels. They work by shining light through the skin. Research published in 2020 and 2021 found that pulse oximeters were significantly less accurate for patients with darker skin tones, sometimes overestimating oxygen levels and missing dangerously low oxygen levels. This is not an AI example in the traditional sense, but the cause mirrors the AI cases: the devices were designed and validated using data primarily from lighter-skinned patients. Those who were underrepresented in the validation data received a worse-calibrated product. During the COVID-19 pandemic, when blood oxygen monitoring was critical, this gap in device performance had serious medical consequences for some patients.
When AI tools use pulse oximeter data as an input feature — as some patient risk algorithms do — they inherit the measurement bias baked into those device readings. Bias compounds: a biased measuring instrument feeds biased data to an AI, which produces a biased prediction.
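The compounding effect can be sketched in a few lines. Everything here is invented for illustration: the cutoff `ALERT_THRESHOLD`, the size of the device overestimate, the patient values, and the function names. The simple rule stands in for a downstream risk algorithm; it treats every reading identically, yet its output depends on how accurate the reading was to begin with.

```python
ALERT_THRESHOLD = 92  # hypothetical cutoff (percent SpO2) below which a patient is flagged


def flag_low_oxygen(reading):
    """Downstream rule: flag the patient when the reading is under the cutoff."""
    return reading < ALERT_THRESHOLD


def missed_alerts(true_values, device_overestimate):
    """Count patients who truly need a flag but are not flagged from the device reading."""
    missed = 0
    for true_spo2 in true_values:
        reading = true_spo2 + device_overestimate  # what the device reports
        if flag_low_oxygen(true_spo2) and not flag_low_oxygen(reading):
            missed += 1
    return missed


true_values = [88, 90, 91, 95]  # invented true oxygen saturations

missed_accurate = missed_alerts(true_values, device_overestimate=0)
missed_biased = missed_alerts(true_values, device_overestimate=3)

print("missed with an accurate device:", missed_accurate)
print("missed with a device reading 3 points high:", missed_biased)
```

The rule is the same in both runs. Only the input quality changes, and that alone determines which patients are missed.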
Case 4 — Image Recognition and Facial Analysis
In 2018, researchers Joy Buolamwini and Timnit Gebru published a landmark study called Gender Shades. They tested three commercial AI systems that classify gender from face photos. The systems worked well overall, but they were far more accurate for light-skinned men than for dark-skinned women. In the worst case, one system was 99 percent accurate for light-skinned men but only 65 percent accurate for dark-skinned women. The most likely root cause was training data: widely used facial image datasets were heavily skewed toward lighter-skinned faces. The AI learned what it was shown. The researchers' findings prompted major technology companies to audit and retrain their systems.
The Gender Shades study showed that widely used commercial AI systems had error rates up to 34 percentage points higher for dark-skinned women than for light-skinned men — a stark illustration of representation gaps in training data.
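A little arithmetic shows how a system can work "well overall" while failing one group. In the sketch below, the 0.99 and 0.65 accuracies are the worst-case figures quoted above. The other two accuracies and the entire test-set composition in `share` are hypothetical placeholders, chosen only to mimic a test set skewed toward lighter-skinned faces.

```python
# Accuracy per subgroup. Only the first and last values come from the lesson text;
# the middle two are invented placeholders.
accuracy = {
    "lighter-skinned men": 0.99,
    "lighter-skinned women": 0.93,  # hypothetical
    "darker-skinned men": 0.88,     # hypothetical
    "darker-skinned women": 0.65,
}

# Hypothetical share of each subgroup in the test set (sums to 1.0).
share = {
    "lighter-skinned men": 0.50,
    "lighter-skinned women": 0.30,
    "darker-skinned men": 0.15,
    "darker-skinned women": 0.05,
}


def overall_accuracy(accuracy, share):
    """Overall accuracy is the average of subgroup accuracies weighted by share."""
    return sum(accuracy[group] * share[group] for group in accuracy)


overall = overall_accuracy(accuracy, share)
gap_points = (max(accuracy.values()) - min(accuracy.values())) * 100

print(f"overall accuracy: {overall:.3f}")
print(f"largest subgroup gap: {gap_points:.0f} percentage points")
```

Under these assumed shares the headline accuracy comes out at roughly 94 percent, because the worst-served group contributes only a twentieth of the test cases. Reporting results per subgroup, as Gender Shades did, is what exposes the gap.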
Flashcards
What was the root cause of the bias in Amazon's resume-screening AI?
The Gender Shades study found that commercial face-analysis AIs were most accurate for which group and least accurate for which group?
Case Study Report
- Step 1: Choose one of the four cases from this lesson: Amazon, COMPAS, pulse oximeters, or Gender Shades.
- Step 2: Write a brief report with four sections:
  - a) What the system was supposed to do.
  - b) What bias was found and which groups were harmed.
  - c) The most likely cause of the bias (data, design, or both).
  - d) One specific change that could have reduced the bias.
- Step 3: Compare with a classmate who chose a different case. What do your cases have in common? What is different?
- Step 4: In two sentences, explain why you think it matters to study real examples rather than just abstract definitions.