Biased Data
You have now learned that data has sources, and that those sources determine who and what gets represented. In this lesson you will look at what happens when the representation is unfair — when a dataset does not accurately reflect the diversity of the real world. That imbalance is called bias, and it does not stay in the data. It travels directly into the AI trained on it, sometimes making the system work well for some people and badly for others, or even doing active harm. This is one of the most important and debated topics in all of AI.
What Bias Means in a Dataset
In everyday language, 'bias' means an unfair preference or prejudice. In data science, it has a more precise meaning: a dataset is biased when its distribution does not fairly represent the population or reality it is supposed to describe. 'Fairly represent' is the key phrase. If 50% of real-world examples belong to a certain group, but only 5% of the training dataset does, that is a representation bias — the dataset is skewed away from reality. Bias can also show up in labels (recall Lesson 4): if the people who assigned labels consistently judged one group more harshly than another, that judgment is baked into the training signal. And bias can show up in features: if certain information was only collected for some groups and not others, or if a feature works differently across groups (a question that means different things to different cultures), the data is structurally uneven.
A dataset is biased when its distribution does not fairly represent the real-world population or phenomenon it is meant to describe. Biased training data produces a biased AI: the system learns whatever pattern the data reflects, including its skews and omissions.
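The 50% versus 5% example above can be checked with a few lines of code. The following is a minimal sketch, not a standard tool: the function name `representation_gaps`, the group names "A" and "B", and the toy numbers are all hypothetical, chosen only to mirror the example in the text.

```python
from collections import Counter

def representation_gaps(dataset_groups, population_shares):
    """Return (dataset share - population share) for each group.

    A negative gap means the group is underrepresented in the dataset.
    """
    counts = Counter(dataset_groups)
    total = len(dataset_groups)
    return {
        group: counts.get(group, 0) / total - pop_share
        for group, pop_share in population_shares.items()
    }

# Hypothetical toy data: group B is 50% of the real world
# but only 5% of the training dataset.
dataset_groups = ["A"] * 95 + ["B"] * 5
population_shares = {"A": 0.50, "B": 0.50}

gaps = representation_gaps(dataset_groups, population_shares)
for group, gap in gaps.items():
    print(f"group {group}: gap = {gap:+.2f}")
# group A: gap = +0.45
# group B: gap = -0.45
```

A check like this only works when a trustworthy estimate of the real-world shares is available, which is itself an assumption worth questioning.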
Here are four concrete examples of how data bias enters AI systems:
- Skin tone and facial analysis. Early face analysis systems were trained largely on datasets that overrepresented lighter-skinned people. A landmark 2018 study by Joy Buolamwini and Timnit Gebru found that commercial gender classification systems, which infer gender from a face image, misclassified darker-skinned women at error rates of up to 34.7%, while the error rate for lighter-skinned men was at most 0.8%. The AI was not malicious; it reflected what its training data contained.
- Language and speech recognition. Speech recognition AI trained primarily on American English accents performs worse for Scottish, Indian, or Nigerian English speakers. The training data was biased toward particular regional accents, so the AI learned those accents well and others poorly.
- Medical imaging. Skin condition detection AI trained mostly on images of lighter skin may fail to recognize the same conditions on darker skin, where they can appear differently. The training data did not include enough examples across the full range of human skin tones.
- Search and autocomplete. Word embedding models, which represent words as mathematical vectors and are used in many language tools, have been shown to associate certain professions with one gender ('doctor' closer to 'he', 'nurse' closer to 'she'). Those associations dominated the text the models were trained on, reflecting historical labor patterns that have been changing.
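The first example above reports error rates separately for each group, which is how such gaps become visible: a single overall accuracy number can hide them. Here is a minimal sketch of that kind of per-group evaluation. The function name `error_rate_by_group`, the groups "X" and "Y", and the evaluation records are hypothetical toy data, not results from any real system.

```python
from collections import defaultdict

def error_rate_by_group(records):
    """records: iterable of (group, true_label, predicted_label) tuples.

    Returns the fraction of wrong predictions within each group.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, truth, pred in records:
        totals[group] += 1
        if truth != pred:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}

# Hypothetical evaluation set: 90 examples from group X (1 error)
# and 10 examples from group Y (3 errors).
records = (
    [("X", 1, 1)] * 89 + [("X", 1, 0)] * 1
    + [("Y", 1, 1)] * 7 + [("Y", 1, 0)] * 3
)

overall = sum(truth != pred for _, truth, pred in records) / len(records)
per_group = error_rate_by_group(records)

print(f"overall error: {overall:.1%}")   # overall error: 4.0%
for group, rate in per_group.items():
    print(f"group {group}: {rate:.1%}")
# group X: 1.1%
# group Y: 30.0%
```

In this toy data the overall error looks low because group X dominates the evaluation set, while group Y's error rate is many times higher.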
An AI does not just reflect bias — it can amplify it. If a hiring AI trained on historical data rates women lower for engineering jobs because fewer women held those roles historically, and that AI is used for the next ten years of hiring, the pattern is locked in and potentially made worse. The original bias is reproduced at scale, automatically, without any human making an individual prejudiced decision.
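The difference between "locked in" and "made worse" can be illustrated with a toy simulation. This is a sketch under strong assumptions, not a model of any real hiring system: the function `simulate_hiring` and its `sharpness` parameter are hypothetical, the applicant pool is assumed to be equally qualified across groups, and the AI is assumed to prefer each group according to its current share of staff raised to the power `sharpness`.

```python
def simulate_hiring(initial_share, hires_per_year, years, sharpness,
                    initial_staff=100):
    """Track one group's share of staff over time (toy model).

    sharpness == 1.0: each year's hires copy the current staff share.
    sharpness  > 1.0: the model favours the majority more than
                      proportionally, so the minority share shrinks.
    """
    staff_group = initial_share * initial_staff
    staff_total = float(initial_staff)
    history = [staff_group / staff_total]
    for _ in range(years):
        share = staff_group / staff_total
        # Assumed preference rule: weight each group by share ** sharpness.
        weight_group = share ** sharpness
        weight_other = (1 - share) ** sharpness
        hire_share = weight_group / (weight_group + weight_other)
        staff_group += hire_share * hires_per_year
        staff_total += hires_per_year
        history.append(staff_group / staff_total)
    return history

locked_in = simulate_hiring(0.20, hires_per_year=10, years=10, sharpness=1.0)
amplified = simulate_hiring(0.20, hires_per_year=10, years=10, sharpness=2.0)

print(f"start: {locked_in[0]:.2f}")
print(f"after 10 years, proportional model: {locked_in[-1]:.2f}")
print(f"after 10 years, sharper model:      {amplified[-1]:.2f}")
```

Under these assumptions, the proportional model keeps the group at its starting share indefinitely, while the sharper model pushes the share down every year, even though no individual decision in the simulation is labeled as prejudiced.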
Where Bias Enters the Pipeline
Bias can enter a dataset at several different points in its creation:
- Sampling bias occurs during data collection: who gets included in the dataset. If a clinical trial recruits only adults between 25 and 45, its results may not apply to teenagers or elderly patients.
- Historical bias occurs when the data accurately reflects historical reality, but that historical reality was itself unfair. Past hiring decisions, past loan approvals, past criminal sentences: all are accurate records of decisions that were made, but many of those decisions were themselves biased. An AI that learns from them learns the bias.
- Measurement bias occurs when the measurement instrument works differently for different groups. A pain assessment questionnaire designed for one cultural context may not translate well to another. A standardized test that assumes familiarity with certain cultural references is easier for students from that culture.
- Label bias occurs when the people doing annotation apply different standards to different groups, consciously or not. If annotators label the same sentence as 'aggressive' or 'assertive' depending on who wrote it, that inconsistency enters the training signal.
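Label bias of the kind described above can sometimes be detected by showing annotators the same messages with different attributed authors and comparing how often each label is applied. The following is a minimal sketch of that comparison. The function name `label_rate_by_group`, the groups "P" and "Q", and the annotations are hypothetical toy data.

```python
from collections import defaultdict

def label_rate_by_group(annotations, target_label):
    """annotations: iterable of (message_id, attributed_group, label).

    Returns, for each attributed group, the fraction of annotations
    that received target_label.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [target hits, total]
    for _message_id, group, label in annotations:
        counts[group][1] += 1
        if label == target_label:
            counts[group][0] += 1
    return {group: hits / total for group, (hits, total) in counts.items()}

# Hypothetical audit: the same four messages, each labeled once when
# attributed to group P and once when attributed to group Q.
annotations = [
    ("m1", "P", "assertive"),  ("m1", "Q", "aggressive"),
    ("m2", "P", "assertive"),  ("m2", "Q", "aggressive"),
    ("m3", "P", "aggressive"), ("m3", "Q", "aggressive"),
    ("m4", "P", "assertive"),  ("m4", "Q", "assertive"),
]

rates = label_rate_by_group(annotations, "aggressive")
for group, rate in rates.items():
    print(f"group {group}: labeled 'aggressive' {rate:.0%} of the time")
# group P: labeled 'aggressive' 25% of the time
# group Q: labeled 'aggressive' 75% of the time
```

Because the message text is identical across the two groups in this toy audit, a gap in label rates points at the annotators' standards rather than at the messages themselves.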
Flashcards
A facial recognition system was trained on a dataset that was 85% lighter-skinned people. What is the most likely result?
Why is historical bias particularly difficult to fix?
Spot the Bias
- Read each scenario below. For each one, identify the type of bias present (representation, sampling, historical, measurement, or label) and explain in one or two sentences who is likely harmed by the resulting AI.
- Scenario 1: A company trains a resume-screening AI on 10 years of its own hiring records. Historically, the company hired mostly men for technical roles.
- Scenario 2: A hospital trains a pain-prediction AI using data collected only from patients who filled out a detailed English-language intake form. Many non-English speakers skipped the form.
- Scenario 3: A sentiment analysis AI is trained using labels assigned by a team of annotators who rated the same message as 'threatening' more often when it was attributed to one demographic group.
- Scenario 4: An image classification AI for identifying wildlife is trained on photos submitted by amateur photographers, who tend to photograph animals in temperate climates and ignore species in remote tropical habitats.
- For each scenario, propose one specific change to the data collection or labeling process that might reduce the bias.