Bias and Fairness, in Depth
In everyday language, 'bias' means prejudice — a thumb on the scale. In machine learning, bias has a more technical meaning, and understanding the distinction is essential before you can analyze AI fairness claims seriously. This lesson examines where bias actually enters AI systems, why it is so hard to remove, and — perhaps most importantly — why there is no single mathematical definition of 'fairness' that captures everything we care about.
Where Bias Enters: A Taxonomy
Bias in a machine-learning system almost never enters at one point. It accumulates through the entire pipeline. Researchers have identified several distinct sources:
- Historical bias: The training data reflects a world shaped by past discrimination. A hiring model trained on historical resumes will learn that executives are predominantly male, not because male applicants are more qualified, but because historical hiring was biased. The model extrapolates the past, not the ideal.
- Measurement bias: The features used to represent people are measured imperfectly, and the measurement errors are not random; they correlate with demographic groups. Consider using 'ZIP code' as a feature in a credit model. ZIP codes correlate strongly with race due to decades of residential segregation. The feature is not explicitly racial, but its predictive power comes partly from race.
- Label bias: Supervised learning requires labeled examples. If the labels themselves encode human judgment (say, which loan applicants were 'creditworthy' as judged by past loan officers), then any bias in those officers' decisions is baked into the training set.
- Aggregation bias: A single model trained on data from multiple groups may perform well on average but poorly on subgroups that differ from the majority in meaningful ways. A skin-lesion classifier trained mostly on light skin may be unreliable on dark skin, not from malice, but from under-representation.
- Deployment bias: The context in which a model is used differs from the context in which it was trained. A model trained on adult clinical data deployed in a pediatric setting is almost guaranteed to behave unexpectedly.
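One rough way to probe the ZIP-code concern above is to ask how well the feature alone lets you guess group membership. The sketch below is illustrative only: the function name `proxy_strength`, the ZIP labels, and the counts are all hypothetical, and the check is in-sample rather than a rigorous statistical test.

```python
from collections import Counter, defaultdict

def proxy_strength(rows):
    """rows: list of (feature_value, sensitive_value) pairs.

    Returns (baseline_accuracy, proxy_accuracy):
    - baseline: always guess the most common sensitive value overall
    - proxy: guess the most common sensitive value within each feature value
    """
    overall = Counter(sensitive for _, sensitive in rows)
    baseline = overall.most_common(1)[0][1] / len(rows)

    by_value = defaultdict(Counter)
    for value, sensitive in rows:
        by_value[value][sensitive] += 1
    hits = sum(c.most_common(1)[0][1] for c in by_value.values())
    return baseline, hits / len(rows)

# Hypothetical toy data: two ZIP codes with skewed group composition.
rows = (
    [("zip_1", "A")] * 45 + [("zip_1", "B")] * 5
    + [("zip_2", "A")] * 10 + [("zip_2", "B")] * 40
)
baseline, proxy = proxy_strength(rows)
print(f"baseline guess: {baseline:.2f}, guess from ZIP alone: {proxy:.2f}")
```

In this made-up data, knowing the ZIP code raises the accuracy of guessing group membership from 0.55 to 0.85. A gap of that kind suggests the feature can carry group information even when the sensitive attribute is never given to the model.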
Calling an AI system 'biased' without specifying the source conflates very different problems with very different solutions. Historical bias may require different training data. Measurement bias may require different features. Aggregation bias requires disaggregated evaluation. Precise diagnosis is a prerequisite for effective remedy.
Why is bias so difficult to remove even when engineers are actively trying? Three reasons stand out. First, the world provides biased training data, and there is no obvious 'unbiased' dataset to substitute. If you want to train a hiring model and you want to avoid historical bias, you cannot simply erase the past; you have to actively intervene on the data, which introduces its own choices. Second, proxy features — features that are not sensitive attributes but correlate with them — make it very hard to prevent a model from learning group-correlated patterns. Even if you remove gender from a resume model, features like maternity-leave gaps, certain university names, and participation in gender-associated activities remain and can act as proxies. Third, optimizing for aggregate accuracy hides group-level disparities. A model that is 95% accurate overall may be only 70% accurate for a particular demographic subgroup. If that subgroup is small, aggregate metrics mask the harm.
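The third point can be made concrete with a small disaggregated evaluation. This is a minimal sketch on hypothetical counts chosen to reproduce the 95% and 70% figures; the function name `accuracy_by_group` and the group labels are illustrative assumptions, not part of any particular library.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, y_true, y_pred).

    Returns a dict mapping each group, plus the key 'overall', to accuracy.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        for key in (group, "overall"):
            total[key] += 1
            if y_true == y_pred:
                correct[key] += 1
    return {key: correct[key] / total[key] for key in total}

# Hypothetical data: 900 majority-group cases and 100 minority-group cases.
records = (
    [("majority", 1, 1)] * 880 + [("majority", 1, 0)] * 20
    + [("minority", 1, 1)] * 70 + [("minority", 1, 0)] * 30
)
result = accuracy_by_group(records)
for key, acc in result.items():
    print(f"{key}: {acc:.3f}")
```

Because the minority group contributes only 100 of the 1,000 cases, its 30 errors barely move the overall figure. Reporting the per-group numbers next to the aggregate is what exposes the gap.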
Competing Definitions of Fairness
Here is the deep problem: there are multiple mathematically precise definitions of 'fairness,' and it is provably impossible to satisfy all of them simultaneously when base rates differ between groups. This is not an engineering failure; it is a mathematical theorem. Three of the most important fairness criteria are:
- Demographic parity (also called statistical parity): the proportion of positive outcomes should be equal across groups. If 30% of Group A applicants receive loans, 30% of Group B applicants should too.
- Equalized odds: both the true positive rate (correctly approving qualified applicants) and the false positive rate (incorrectly approving unqualified applicants) should be equal across groups.
- Calibration (closely related to predictive parity): a score of 0.7 should mean a 70% probability of the outcome, equally for all groups.
In 2016, Chouldechova showed that when base rates differ (that is, when the actual prevalence of an outcome genuinely differs between groups) an imperfect classifier cannot have equal predictive values and equal false positive and false negative rates across groups. Kleinberg, Mullainathan, and Raghavan proved a closely related result for calibrated risk scores. Together these results are often called the impossibility theorem of fairness: outside of special cases such as a perfect predictor, you cannot satisfy equalized odds and calibration simultaneously. It means that choosing a fairness criterion is itself an ethical and political choice, not a purely technical one. Different criteria protect different values: demographic parity emphasizes equal outcomes; equalized odds emphasizes equal error rates; calibration emphasizes predictive accuracy. Which matters most depends on the context and on contested moral intuitions.
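The criteria above can be read off per-group confusion counts. The following is a minimal sketch on hypothetical data: `group_rates` is an illustrative helper, not a library function, and positive predictive value (PPV) at a single threshold is used here as a simple stand-in for calibration, which strictly concerns the full range of scores.

```python
def group_rates(records):
    """records: iterable of (group, y_true, y_pred) with 0/1 labels.

    Returns {group: {"selection_rate", "tpr", "fpr", "ppv"}}.
    A rate is None when its denominator is zero.
    """
    counts = {}
    for group, y_true, y_pred in records:
        c = counts.setdefault(group, {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
        key = ("t" if y_true == y_pred else "f") + ("p" if y_pred == 1 else "n")
        c[key] += 1

    def ratio(num, den):
        return num / den if den else None

    rates = {}
    for group, c in counts.items():
        n = sum(c.values())
        rates[group] = {
            "selection_rate": ratio(c["tp"] + c["fp"], n),  # demographic parity
            "tpr": ratio(c["tp"], c["tp"] + c["fn"]),       # equalized odds
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),       # equalized odds
            "ppv": ratio(c["tp"], c["tp"] + c["fp"]),       # predictive parity
        }
    return rates

# Hypothetical data: 100 people per group, base rates 40% (A) and 20% (B).
records = (
    [("A", 1, 1)] * 30 + [("A", 1, 0)] * 10 + [("A", 0, 1)] * 12 + [("A", 0, 0)] * 48
    + [("B", 1, 1)] * 15 + [("B", 1, 0)] * 5 + [("B", 0, 1)] * 16 + [("B", 0, 0)] * 64
)
rates = group_rates(records)
for group, r in rates.items():
    print(group, {k: round(v, 3) for k, v in r.items()})
```

In this toy data the true and false positive rates match across groups (0.75 and 0.20), so equalized odds holds, yet the selection rates (0.42 versus 0.31) and the PPVs (about 0.71 versus 0.48) do not match.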
When base rates differ between groups, it is mathematically impossible to satisfy demographic parity, equalized odds, and calibration all at once. Any deployed system must choose — explicitly or implicitly — which fairness criterion to prioritize. Claiming a system is simply 'fair' without specifying the criterion is a red flag.
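The conflict follows from the definitions of the rates themselves. Writing p for the base rate, the confusion-matrix definitions imply FPR = p / (1 - p) * (1 - PPV) / PPV * TPR. The sketch below plugs hypothetical numbers into that identity; the function name `implied_fpr` and all values are illustrative assumptions.

```python
def implied_fpr(base_rate, ppv, tpr):
    """False positive rate forced by a given base rate, PPV, and TPR.

    Derived from PPV = p*TPR / (p*TPR + (1-p)*FPR), solved for FPR.
    Assumes 0 < base_rate < 1 and ppv > 0.
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * tpr

# Hypothetical setting: hold PPV and TPR equal across two groups
# whose base rates differ (40% versus 20%).
fpr_a = implied_fpr(base_rate=0.4, ppv=0.6, tpr=0.75)
fpr_b = implied_fpr(base_rate=0.2, ppv=0.6, tpr=0.75)
print(f"group A FPR: {fpr_a:.3f}")  # about 0.333
print(f"group B FPR: {fpr_b:.3f}")  # 0.125
```

With PPV and TPR held equal, the false positive rates are forced apart because the base rates differ. Equalizing the false positive rates instead would require giving up equality in PPV or TPR.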
A recidivism prediction model has equal calibration across racial groups — a score of 0.6 means 60% reoffend for all groups — but Black defendants receive higher scores on average. Which statement BEST characterizes this situation?
Why can't removing a sensitive attribute like race from training data fully prevent a model from discriminating?
Fairness Criterion Debate
Consider a hypothetical bail-risk scoring tool used by judges.
- Group A (lower-income defendants): 40% actually reoffend if released.
- Group B (higher-income defendants): 20% actually reoffend if released.
Three engineers propose three designs:
- Engineer 1 uses demographic parity: equal percentages from each group are recommended for detention.
- Engineer 2 uses equalized odds: the false positive rate (wrongly detaining someone who would not reoffend) and the true positive rate (correctly detaining someone who would reoffend) are each equal across groups.
- Engineer 3 uses calibration: a given risk score corresponds to the same reoffense probability in every group.
For each design:
- Describe whose interests it prioritizes.
- Describe what cost it imposes on which group.
Then decide which design you would recommend and write a one-paragraph defense of your choice.
There is no single correct answer. The goal is to reason carefully about trade-offs and defend a position with evidence.
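As a starting point for the analysis, it can help to put numbers on one design. The sketch below takes the base rates from the scenario and works out the best case for demographic parity: a tool with perfect foresight that must still detain the same share of each group. The 30% detention rate and the function name are assumptions made purely for illustration.

```python
def best_case_under_equal_detention(base_rate, detention_rate):
    """Best achievable (TPR, FPR) when exactly `detention_rate` of a group
    must be detained and `base_rate` of that group would reoffend.

    Assumes perfect foresight, so would-be reoffenders are detained first.
    Assumes 0 < base_rate < 1.
    """
    true_pos = min(base_rate, detention_rate)          # reoffenders detained
    false_pos = max(0.0, detention_rate - base_rate)   # non-reoffenders detained
    return true_pos / base_rate, false_pos / (1 - base_rate)

detention_rate = 0.30  # hypothetical shared detention rate
tpr_a, fpr_a = best_case_under_equal_detention(0.40, detention_rate)
tpr_b, fpr_b = best_case_under_equal_detention(0.20, detention_rate)
print(f"Group A: TPR {tpr_a:.3f}, FPR {fpr_a:.3f}")
print(f"Group B: TPR {tpr_b:.3f}, FPR {fpr_b:.3f}")
```

Under these assumptions, even a perfect tool leaves a quarter of Group A's would-be reoffenders released, while in Group B it must detain one in eight people who would not have reoffended. Real tools are imperfect, so the actual costs would be larger; the sketch only bounds the best case for one of the three designs.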