AI Foundations


Data Collection and Sourcing

'More data beats a better algorithm' is a phrase you will hear often in ML circles, and it contains real truth. Models learn by generalizing from examples, and examples are data. But the phrase is also incomplete and sometimes dangerously misleading. The quantity of data matters enormously — and so does its quality, its representativeness, and the process by which it was collected. A large dataset that is biased, stale, or mislabeled will train a model that is biased, stale, or wrong. This lesson examines where training data comes from and how to think critically about what it can and cannot represent.

Sources of Training Data

Training data comes from many sources, and each source carries its own characteristics and risks.

Organically collected data is gathered as a byproduct of normal operations. A bank accumulates years of loan records with outcomes. An e-commerce site logs every click and purchase. This data is often large, cheap, and highly relevant, but it reflects decisions made under the old system, which can encode historical biases into the new model.

Manually labeled data is created specifically for training. Workers, sometimes professional annotators and sometimes crowd workers on platforms like Amazon Mechanical Turk, are shown inputs and asked to provide labels. This is expensive but gives you control over label quality. Medical imaging datasets, for example, typically require expert radiologists to label them, which makes each labeled case far more expensive than a crowd-sourced label.

Synthetic data is generated algorithmically rather than collected from the world. A self-driving car team might simulate thousands of rare accident scenarios that real-world collection would take years to encounter. Synthetic data can reduce privacy concerns and can cover edge cases deliberately, but it only helps if the simulation is realistic enough.

The Label Distribution Problem

A dataset where 98% of examples belong to one class and 2% to another is called class-imbalanced. A model trained on it can achieve 98% accuracy by predicting the majority class every time, while completely failing at the minority class. Class imbalance is not a data-cleaning problem — it is a data-collection and framing problem. Address it by collecting more minority examples, using resampling techniques, or choosing a metric that does not reward blind majority-class prediction.
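The arithmetic behind the 98% figure can be checked directly. The sketch below uses made-up labels (980 majority, 20 minority) and two small helper functions written for this illustration. It compares accuracy with recall on the minority class for a "model" that always predicts the majority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true `positive` examples that were predicted as `positive`."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predictions_for_positives:
        return 0.0
    hits = sum(p == positive for p in predictions_for_positives)
    return hits / len(predictions_for_positives)


# Illustrative data: 980 majority-class (0) and 20 minority-class (1) examples.
y_true = [0] * 980 + [1] * 20

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

majority_accuracy = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)

print(f"accuracy: {majority_accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

Accuracy looks excellent while minority recall is zero, which is why a metric that looks at the minority class separately is a better guide here.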

Consider building a hate-speech classifier for a social platform. The data source is user posts. The problem: only a small fraction of posts are hate speech. If that fraction is 2% and you sample randomly, your dataset is 98% benign and 2% harmful. To build a useful classifier, you need to deliberately oversample the minority class, that is, collect more harmful examples than random sampling would give you. One way to do this is stratified sampling: you split the data by class and sample each class at its own rate, chosen so that the model gets enough examples of each to learn the distinction.

Another example: a facial recognition system trained on a dataset that is 80% photographs of lighter-skinned adults will likely perform worse on darker-skinned faces and children. The cause is not that the algorithm is inherently biased, but that the data was not representative of all the people the system will encounter in production. The sampling strategy determines whom the model learns about.
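Sampling each class at its own rate, as in the hate-speech example above, might look like the following minimal sketch. The function name `stratified_sample`, the per-class quotas, and the toy posts are all assumptions made for illustration, not part of any particular library.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(examples, label_of, per_class, seed=0):
    """Draw up to per_class[label] examples from each class, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group the examples by class (one "stratum" per label).
    by_class = defaultdict(list)
    for example in examples:
        by_class[label_of(example)].append(example)

    # Sample each stratum separately, capped by how many examples it has.
    sample = []
    for label, pool in by_class.items():
        quota = min(per_class.get(label, 0), len(pool))
        sample.extend(rng.sample(pool, quota))

    rng.shuffle(sample)
    return sample


# Toy corpus: 9,800 benign posts and 200 harmful ones (2% harmful).
posts = [(f"post-{i}", "benign") for i in range(9800)]
posts += [(f"post-{i}", "harmful") for i in range(9800, 10000)]

sample = stratified_sample(
    posts,
    label_of=lambda post: post[1],
    per_class={"benign": 600, "harmful": 200},  # hypothetical quotas
)
counts = Counter(label for _, label in sample)
print(counts["benign"], counts["harmful"])
```

The resulting sample is 25% harmful instead of 2%. One design consequence to keep in mind: the training set no longer matches the class mix seen in production, so evaluation is usually done on data that keeps the real-world proportions.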

Quantity Versus Quality

The tension between data quantity and quality is real and important. More data, holding quality constant, almost always helps. But quality cannot be held constant: every expansion of a dataset introduces new risks.

Label noise means some training labels are simply wrong. A crowd worker may mislabel a photograph. A sensor may malfunction. A historical record may have been entered incorrectly. Studies have shown that state-of-the-art models can tolerate surprisingly high label noise rates (sometimes up to 20-30%) and still perform well, because the correct labels in the majority outweigh the incorrect ones during training. But above a threshold, and for certain task types, label noise degrades performance severely.

Data staleness occurs when the world changes after the data was collected. A model trained on user behavior from 2019 may say little about user behavior in 2024. The more your domain changes over time (social media trends, market conditions, language use), the more quickly your training data expires.

The practical guidance: treat data quality issues as debts. Small debts are manageable; large ones compound. Document every known quality problem in your dataset, estimate its likely impact, and build monitoring to catch new problems after deployment.
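One simple form of monitoring for staleness is to compare how a categorical feature is distributed in the training data versus recent production data. The sketch below uses total variation distance (half the sum of absolute differences between category shares). This is one possible measure among several, and the feature values and the alert threshold are made up for illustration.

```python
from collections import Counter


def category_shares(values):
    """Map each category to its share of the given values."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    return {category: n / len(values) for category, n in counts.items()}


def total_variation_distance(reference, recent):
    """Return a drift score in [0, 1]: 0 means identical shares, 1 means disjoint."""
    p = category_shares(reference)
    q = category_shares(recent)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Illustrative feature: device type at training time vs. in recent traffic.
training_devices = ["desktop"] * 70 + ["mobile"] * 30
recent_devices = ["desktop"] * 40 + ["mobile"] * 60

ALERT_THRESHOLD = 0.1  # hypothetical value; tune per feature and domain

drift = total_variation_distance(training_devices, recent_devices)
needs_review = drift > ALERT_THRESHOLD
print(f"drift: {drift:.2f}, needs review: {needs_review}")
```

A check like this only says that the inputs have shifted, not whether the model has become worse, so it works best as a trigger for a closer look at recent labeled examples.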

Document Your Data Provenance

Data provenance means the complete history of where data came from, how it was collected, and who labeled it. Documenting provenance is not bureaucratic overhead — it is the only way to diagnose mysterious model failures months later. A structured document called a datasheet for datasets (proposed by Gebru et al., 2018) records this information in a standard format.
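A provenance record can start as a small structured object stored next to the dataset. The sketch below is a simplified illustration only: the class `ProvenanceRecord` and its field names are hypothetical and far shorter than the question list in the datasheet proposal, and the example values are invented.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProvenanceRecord:
    """A minimal, hypothetical provenance record (not the full datasheet format)."""
    dataset_name: str
    source: str               # where the data came from
    collection_method: str    # how it was collected
    collection_period: str    # when it was collected
    labeled_by: str           # who or what produced the labels
    known_issues: list = field(default_factory=list)

    def missing_fields(self):
        """Return the names of text fields that were left blank."""
        return [
            name
            for name, value in asdict(self).items()
            if isinstance(value, str) and not value.strip()
        ]


# Invented example values; the collection period is deliberately left blank.
record = ProvenanceRecord(
    dataset_name="loan-outcomes-v1",
    source="internal loan records",
    collection_method="byproduct of normal operations",
    collection_period="",
    labeled_by="repayment outcome recorded by the bank",
    known_issues=["approvals reflect past lending decisions"],
)

missing = record.missing_fields()
record_json = json.dumps(asdict(record), indent=2)
print("missing fields:", missing)
```

Serializing the record to JSON makes it easy to version alongside the data, and the blank-field check gives a cheap reminder when part of the history was never written down.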

Fill in the blanks about data collection concepts.

When a dataset has far more examples of one class than another, it is called ____-imbalanced. Collecting data in a way that deliberately represents each class is called ____ sampling.

A team trains a loan-default model using 10 years of their bank's historical loan decisions and outcomes. The bank historically approved loans primarily to applicants from certain zip codes. What risk does this introduce?

A researcher argues that synthetic data is always inferior to real collected data. What is the strongest counter-argument?

Data Audit Design

You are designing the data collection plan for a model that predicts whether a student will pass a standardized exam based on information available at the start of the school year.

  1. List at least five potential data sources. For each, write one sentence on its strengths and one on its risks.
  2. Identify any groups of students that might be underrepresented if you simply pull historical records from one school district. List at least three such groups.
  3. Write a brief sampling strategy (3-5 sentences) that addresses the underrepresentation you identified.
  4. Name one type of label noise specific to this scenario and propose how you would detect it.