Where Data Comes From
You have learned what data is, how much of it exists, how it is organized into datasets, and what makes it good or bad. Now ask the question that connects all of those lessons: where does training data actually come from? The answer matters enormously, because the source of data determines its character — what it includes, what it misses, who it represents, and what assumptions are baked into it before anyone even looks at the numbers.
Five Major Data Sources
Training data flows from five broad categories of sources. Real AI systems often combine several.
Sensors and physical instruments are devices that measure the physical world continuously. Weather stations measure temperature, humidity, and wind. Satellites photograph the Earth's surface. Hospital equipment measures heart rate, blood pressure, and oxygen levels. Smartphones contain accelerometers, GPS chips, microphones, and cameras. The key property of sensor data is that it is usually collected automatically and passively: no human decides to generate it, a machine just records what is happening. Sensor data is often very high-volume and very consistent within a single device, but it can be thrown off by calibration errors or device differences.
User activity is generated whenever people interact with digital systems: searches, clicks, purchases, streams, likes, location check-ins, form submissions. This data is enormously valuable because it captures revealed preference: not what people say they do, but what they actually do. A person might say they prefer documentaries but spend 80% of their streaming time on reality TV. User activity captures the reality. The challenge is that user activity data reflects only the people who use a particular platform, in the contexts where they use it.
Surveys and human-generated responses collect information that sensors cannot capture: opinions, self-reported experiences, preferences, beliefs. A healthcare survey might ask patients to rate their pain level. An educational study might ask students how they feel about math. Surveys give access to inner states, but they come with well-known limitations: people misremember, give answers they think are expected, or interpret questions differently.
The open web is one of the richest and most chaotic data sources. Crawling publicly accessible web pages produces massive text corpora used to train language models. Common Crawl, one of the most widely used sources, is a public archive of billions of web pages, with new crawls released regularly. Web data is vast and diverse, but also noisy, repetitive, biased toward certain languages and cultures, and full of low-quality or misleading content.
Public and curated datasets are collections assembled or adapted for research or AI development. ImageNet (labeled images), LibriSpeech (audiobook recordings with transcriptions), MNIST (handwritten digits), and many others were built specifically for training and evaluating AI systems, while Wikipedia (structured text) was written for human readers but is widely reused as a clean, well-organized corpus. These datasets have usually been cleaned, labeled, and quality-checked, but they also reflect the choices of whoever designed them.
Every data source has a distinctive character: the type of data it produces, the population it covers, the errors it introduces, the biases it reflects. Understanding the source is the first step in understanding what an AI trained on that data can and cannot do.
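The streaming example from the user-activity discussion can be sketched in a few lines of Python. This is a minimal illustration, not a real recommender pipeline: the viewing log, the `stated_favorite` value, and the `watch_time_shares` helper are all invented for this sketch.

```python
from collections import defaultdict

# Hypothetical viewing log: (genre, minutes watched). Invented for illustration.
viewing_log = [
    ("reality", 50),
    ("documentary", 20),
    ("reality", 70),
    ("reality", 40),
    ("documentary", 20),
]

# What the person says they prefer, e.g. in a survey (assumed value).
stated_favorite = "documentary"


def watch_time_shares(log):
    """Return each genre's fraction of the total minutes watched."""
    totals = defaultdict(int)
    for genre, minutes in log:
        totals[genre] += minutes
    grand_total = sum(totals.values())
    return {genre: minutes / grand_total for genre, minutes in totals.items()}


shares = watch_time_shares(viewing_log)

# The revealed preference is the genre that actually received the most time.
revealed_favorite = max(shares, key=shares.get)

print(f"Stated favorite:   {stated_favorite}")
print(f"Revealed favorite: {revealed_favorite} ({shares[revealed_favorite]:.0%} of watch time)")
```

A survey would record only the stated favorite, while the activity log records the behavior. Neither source is wrong; each captures a different thing, which is why the source shapes what a model trained on it can learn.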
How Collection Choices Shape the Data
The way data is collected is never neutral. Every collection decision is also a decision about inclusion and exclusion.
Who was included? A user activity dataset from a social media platform includes only people who use that platform, skipping people who do not have internet access, people who cannot afford smartphones, elderly people who avoid social media, and many others. An AI trained only on that data will work better for the people represented than for those left out.
When was it collected? Data collected during a pandemic might look very different from data collected in a normal year. Data from 2008 might not reflect 2026 behaviors. Temporal scope matters.
What was measured? As you learned in Lesson 3, the choice of features is a choice about what to see and what to ignore. A hiring dataset that records job title and test scores but not hours worked gives the AI an incomplete picture.
How was it collected? A survey administered online, in person, or by phone tends to attract different populations and produce different response patterns, even when it asks the same questions.
Every dataset is a sample of some larger population, and samples are almost never perfectly representative. The people and situations that were easiest to collect data from tend to be overrepresented. Those who were hardest to reach, whether because of geography, language, technology access, or social exclusion, tend to be underrepresented or absent. This is not an accident; it is a structural feature of how data gets collected.
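One simple way to make over- and underrepresentation concrete is to compare each group's share of a dataset with its share of the population the AI is meant to serve. The sketch below assumes hypothetical age groups and made-up percentages; `representation_ratio` and `describe` are illustrative helpers, not part of any library.

```python
# Hypothetical shares of each age group in the full population and in a
# platform's user data. All numbers are invented for illustration.
population_share = {"18-29": 0.20, "30-49": 0.35, "50-64": 0.25, "65+": 0.20}
dataset_share = {"18-29": 0.40, "30-49": 0.40, "50-64": 0.15, "65+": 0.05}


def representation_ratio(dataset, population):
    """Divide each group's dataset share by its population share.

    A ratio above 1 means the group is overrepresented in the dataset;
    a ratio below 1 means it is underrepresented.
    """
    return {group: dataset[group] / population[group] for group in population}


def describe(ratio):
    """Turn a ratio into a short human-readable label."""
    if ratio < 1:
        return "underrepresented"
    if ratio > 1:
        return "overrepresented"
    return "proportionally represented"


ratios = representation_ratio(dataset_share, population_share)

# Print groups from most underrepresented to most overrepresented.
for group, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{group}: {ratio:.2f}x ({describe(ratio)})")
```

In this invented example the oldest group appears in the data at a quarter of its population share, so a model trained on it would see comparatively few examples of that group's behavior.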
Match each data source to its most distinctive property.
A company builds a recommendation AI by training on data from its 50 million users. Which population is most likely to be underrepresented in that AI's training data?
What is 'revealed preference,' and why is user activity data valuable for capturing it?
Trace a Dataset's Source
- Choose one AI system you have heard of or used: a music recommender, a spell-checker, a speech-to-text system, a navigation app, or any other.
- Write down your best guess about where its training data came from. Which of the five sources (sensors, user activity, surveys, web crawls, curated datasets) do you think contributed?
- For each source you identified, ask: who is likely overrepresented in that source? Who is likely underrepresented or absent?
- Predict one real-world situation where the AI might work poorly, based on your analysis of its data sources.
- Share your prediction with the class. After discussion, revise your prediction if needed.