Who Gets Left Out
Every AI system is built by some people, using data produced by some people, for some intended users. The question of who appears in that data, and who does not, is one of the central fairness questions in AI. When a group is missing from or scarce in the training data, the AI tends to perform worse for that group. And the groups most often left out tend to be the groups with the least power to complain about it.
What Is a Representation Gap?
A representation gap is a disparity between how often a group appears in the training data and how often that group appears in the real world (or in the population the AI will be used on). If a group makes up 40 percent of the population but only 5 percent of the training data, the model has far fewer examples of that group to learn from than the group's share of the population would suggest, so its predictions for that group will tend to be less reliable. This does not automatically mean the model is wrong about everything for that group. It does mean the model's weaknesses are likely to be concentrated there, and its mistakes are more likely to affect people in the underrepresented group.
A representation gap occurs when a group is much less present in training data than in the real population. Models learn less about underrepresented groups and perform worse for them.
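The 40 percent versus 5 percent example can be turned into a small calculation. The sketch below is only an illustration: the function name `representation_gaps`, the group labels, and the shares are hypothetical, and the two measures it reports (a difference and a ratio of shares) are one simple way to quantify a gap, not a standard definition.

```python
def representation_gaps(population_share, training_share):
    """Compare each group's share of the training data with its share
    of the population. Both arguments map group name -> share (0 to 1)."""
    gaps = {}
    for group, pop in population_share.items():
        # A group absent from the training data counts as a share of 0.
        train = training_share.get(group, 0.0)
        gaps[group] = {
            # Negative difference: the group is underrepresented.
            "difference": train - pop,
            # Ratio below 1: the group is underrepresented.
            "ratio": train / pop if pop > 0 else float("nan"),
        }
    return gaps


# Hypothetical shares that mirror the example in the text.
population_share = {"group_a": 0.40, "group_b": 0.60}
training_share = {"group_a": 0.05, "group_b": 0.95}

gaps = representation_gaps(population_share, training_share)
for group, g in gaps.items():
    print(f"{group}: difference={g['difference']:+.2f}, ratio={g['ratio']:.3f}")
```

With these made-up numbers, `group_a` appears in the training data at about one eighth of its population share. A ratio well below 1 is a signal to look more closely at how the model performs for that group; it does not by itself prove that the model performs badly.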
Representation gaps show up in almost every domain where AI is built:
- Images: many facial recognition and image classification datasets have been heavily skewed toward people from North America and Europe, under-representing people from Africa, Asia, and Latin America.
- Language: most large language models are trained on text from the internet, which skews toward English and toward the demographics with high internet access and writing activity.
- Medical data: clinical trial data has historically underrepresented women, older patients, and people of color, because trials were often run on easier-to-recruit populations.
- Speech recognition: voice interfaces trained on standard American English accents perform significantly worse for speakers with other accents, dialects, or speech patterns.
Who Typically Gets Left Out?
Across many AI systems, the groups most likely to be underrepresented in training data share something in common: they have had less access to the platforms and institutions that generate digital data. Wealthy, educated, English-speaking users in high-income countries generate enormous amounts of digital data. Rural communities, elderly populations, people with disabilities, people in low-income countries, and speakers of minority languages generate far less — not because they do not exist, but because the data-collection infrastructure reached them less. This means that AI improvements tend to disproportionately help already-advantaged groups. The gap between AI performance for the well-represented and the underrepresented can widen over time rather than narrow.
AI systems tend to work best for groups already well-represented in data — often those with more power and resources. This can widen existing inequalities rather than reduce them.
The Compounding Problem
Representation gaps are especially serious when they compound. Consider a patient from a minority ethnic background who speaks a non-dominant language, lives in a rural area, and has a rare health condition. She may be underrepresented in the training data on every one of those dimensions simultaneously. The more axes on which someone is underrepresented, the worse the AI tends to perform for that person. This intersectional underrepresentation is often invisible to developers who look only at average performance metrics. A system can reach 95 percent accuracy overall while reaching only 60 percent accuracy for a specific underrepresented subgroup, because that subgroup may be too small to move the overall number noticeably.
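To see how an overall score can hide a weak subgroup, it helps to break accuracy down by group. The following is a minimal sketch on synthetic data: the function `accuracy_by_group`, the group labels, and the counts are all invented for illustration and were chosen so that the totals match the 95 percent and 60 percent figures above.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, is_correct) pairs.
    Returns (overall accuracy, dict of accuracy per group)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    overall = sum(correct.values()) / sum(totals.values())
    per_group = {g: correct[g] / totals[g] for g in totals}
    return overall, per_group


# Synthetic evaluation results for 1,000 people (made-up numbers):
# 950 in a well-represented group, 50 in a small subgroup.
records = (
    [("majority", True)] * 920 + [("majority", False)] * 30
    + [("subgroup", True)] * 30 + [("subgroup", False)] * 20
)

overall, per_group = accuracy_by_group(records)
print(f"overall: {overall:.1%}")             # overall: 95.0%
for group, acc in per_group.items():
    print(f"{group}: {acc:.1%}")             # majority: 96.8%, subgroup: 60.0%
```

Because the subgroup is only 50 of 1,000 people, its 20 errors barely move the overall figure. The group key can be any hashable value, so passing a tuple such as `("rural", "non-dominant language")` (hypothetical labels) would report accuracy for an intersection of attributes rather than for one attribute at a time.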
What is a representation gap in AI?
Why might a voice assistant perform worse for a speaker with a strong regional accent?
Who Was Left Out?
- Step 1: Choose one AI application from this list: a medical diagnosis tool, a hiring resume screener, a face recognition security system, or a language translation tool.
- Step 2: Brainstorm which groups might be underrepresented in the training data for your chosen application. Consider: age groups, geographic regions, languages or dialects, income levels, gender, disability status.
- Step 3: For each group you identified, describe one real-world harm that could result if the AI performs worse for that group.
- Step 4: Propose two specific steps the development team could take before and during data collection to reduce these representation gaps.
- Step 5: Reflect: who do you think should have a say in deciding whose data gets included? Just the engineers? The users? Government regulators? Explain your reasoning.