From Data to a Model
You now know that ML searches for a function inside a hypothesis space. But the search is not random — it is guided by evidence. That evidence is the training dataset. In this lesson we examine precisely how labeled data narrows the infinite hypothesis space down to one specific learned model, and why the nature of that data determines the quality of what is learned.
The Training Dataset as Evidence
A training dataset D consists of n input-output pairs: D = {(x1, y1), (x2, y2), ..., (xn, yn)}. Each pair (xi, yi) is a labeled example: an input xi paired with the correct output yi (its label). In supervised learning, a human or a reliable process has already determined the correct answer for each training example.
Each example acts as a constraint. Suppose you observe that an apartment with 900 sq ft, 2 bedrooms, and 0.4 miles from transit rents for $2,400. That observation constrains the plausible parameter values: a hypothesis that predicts f(900, 2, 0.4) = 1,000 is now much less plausible, while one that predicts f(900, 2, 0.4) = 2,450 is more plausible. With many examples, many hypotheses become implausible. Mechanically, training searches for parameter values that make the model's predictions consistent with the training evidence, which amounts to locating the region of the hypothesis space compatible with the data.
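The apartment observation above can be turned into a small scoring exercise. The sketch below assumes a linear hypothesis (the lesson does not fix the model form for this example), and the two parameter settings are hypothetical values chosen only so that their predictions match the 1,000 and 2,450 figures in the text.

```python
# One labeled example from the text: (sq ft, bedrooms, miles to transit) -> monthly rent.
example_x = (900, 2, 0.4)
example_y = 2400

def predict(params, x):
    """Assumed linear hypothesis: weighted sum of the features plus a bias term."""
    w_sqft, w_bed, w_dist, bias = params
    sqft, bedrooms, miles = x
    return w_sqft * sqft + w_bed * bedrooms + w_dist * miles + bias

# Two hypothetical parameter settings (illustrative values, not from the lesson).
hypothesis_a = (1.0, 50.0, 0.0, 0.0)        # predicts roughly 1,000
hypothesis_b = (2.0, 300.0, -125.0, 100.0)  # predicts roughly 2,450

for name, params in [("A", hypothesis_a), ("B", hypothesis_b)]:
    prediction = predict(params, example_x)
    error = abs(prediction - example_y)
    print(f"hypothesis {name}: predicts {prediction:.0f}, off by {error:.0f}")
```

Hypothesis B sits much closer to the observed rent, so this single example already favors it. A real training procedure would apply the same comparison across every example in D rather than just one.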
Each training example (x, y) eliminates hypotheses from the space — those that predict something far from y for input x become untenable. Enough examples, well-chosen, should constrain the space to a small region containing good approximations to the true function.
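The elimination idea can be sketched with a deliberately tiny hypothesis space. The grid of candidate (m, b) pairs, the tolerance, and the three examples below are all assumptions made for illustration; real training does not enumerate hypotheses this way, but the shrinking count shows the same effect.

```python
from itertools import product

# Hypothetical candidate grid: slopes and intercepts from -5 to 5 in steps of 0.5.
GRID = [i / 2 for i in range(-10, 11)]
hypotheses = list(product(GRID, GRID))  # each hypothesis is a pair (m, b)

# Illustrative labeled examples, generated from f(x) = 2x + 1.
examples = [(2.0, 5.0), (4.0, 9.0), (1.0, 3.0)]

TOLERANCE = 0.5  # assumed: how far a prediction may be from y and stay tenable

def surviving(hyps, x, y, tol=TOLERANCE):
    """Keep only hypotheses whose prediction m*x + b lies within tol of y."""
    return [(m, b) for (m, b) in hyps if abs(m * x + b - y) <= tol]

counts = [len(hypotheses)]  # number of tenable hypotheses before any data
remaining = hypotheses
for x, y in examples:
    remaining = surviving(remaining, x, y)
    counts.append(len(remaining))

print("tenable hypotheses after each example:", counts)
print("still tenable:", remaining)
```

Each example can only shrink the surviving set, and the generating hypothesis (2.0, 1.0) always survives because it predicts every label exactly.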
To see this geometrically for a linear model f(x) = mx + b: suppose you have one data point (x=2, y=5). This constrains m and b by the equation 2m + b = 5. That is one equation in two unknowns, so infinitely many lines satisfy it (the constraint is a line in (m, b) space, not a point). Add a second data point (x=4, y=9). Now you have the system:
2m + b = 5
4m + b = 9
Subtracting the first equation from the second gives 2m = 4, so m = 2. Then b = 5 − 2(2) = 1. The hypothesis space has collapsed to a single point: f(x) = 2x + 1. Two data points with distinct x-values exactly determine a line. This is the geometric intuition: each data point is a constraint, and sufficient constraints pin down a unique solution. In high-dimensional spaces with millions of parameters, you need far more data, and the constraints are statistical rather than exact, but the principle is the same.
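The same elimination can be checked in a few lines of code. This is a minimal sketch; the helper name `line_through` is made up for this example.

```python
def line_through(p1, p2):
    """Return (m, b) for the unique line y = m*x + b through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        # Two points with the same x give no unique non-vertical line.
        raise ValueError("points must have distinct x-values")
    m = (y2 - y1) / (x2 - x1)  # subtracting the two equations isolates m
    b = y1 - m * x1            # substitute m back into the first equation
    return m, b

m, b = line_through((2, 5), (4, 9))
print(m, b)  # 2.0 1.0

# With only the first point, every choice of m has a matching b = 5 - 2m,
# so the single constraint leaves a whole line of hypotheses in (m, b) space.
one_point_solutions = [(slope, 5 - 2 * slope) for slope in (-1, 0, 2, 3)]
print(one_point_solutions)
```

The guard on equal x-values reflects the caveat that two points pin down a line of this form only when their inputs differ.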
Data Quality, Quantity, and Distribution
Not all data is equally informative. Three properties matter critically.
Representativeness: the training data must reflect the distribution of inputs the model will encounter in deployment. A spam filter trained only on English emails will be poorly constrained on Arabic emails, because the training data provides no evidence about that part of the input space.
Label quality: if the labels yi are wrong (spam mislabeled as not-spam, incorrect medical diagnoses, biased human annotations), the training process optimizes toward incorrect targets. Garbage in, garbage out is not just a cliche; it follows directly from what the optimizer is asked to do, which is to match the labels it is given.
Quantity relative to complexity: a linear model with 2 parameters can be well-constrained by 20 examples. A neural network with 10 million parameters typically needs many orders of magnitude more data to constrain its hypothesis space meaningfully. The ratio of data quantity to model complexity is a fundamental design consideration.
A concrete failure: in one widely publicized case, a skin-cancer classifier performed superbly on its training data but was later found to have learned an unexpected signal. Many malignant-lesion photos in the dataset included rulers (placed by dermatologists to show scale), while benign-lesion photos rarely did. The model learned 'ruler → malignant' rather than the lesion's actual texture. The data was representative of how photos were taken in clinics, not of the medical concept the team intended to capture. This is called a spurious correlation: a pattern that holds in the training data but not in the world.
A model will learn whatever pattern in the training data minimizes its loss — it cannot distinguish a genuine signal from a spurious one. Detecting and removing spurious correlations requires deep domain knowledge and careful data auditing, not just more data.
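A toy simulation makes the point concrete. Everything in this sketch is synthetic and assumed for illustration: the 'texture' cue agrees with the label 80% of the time everywhere, the 'ruler' cue agrees 95% of the time in training but only 50% in deployment, and the "model" is the simplest possible learner, one that picks whichever single feature has the lowest training error.

```python
import random

def make_data(n, ruler_agreement, rng):
    """Synthetic records with a label, a genuine 'texture' cue, and a 'ruler' cue."""
    rows = []
    for _ in range(n):
        label = rng.random() < 0.5  # True stands for malignant
        texture = label if rng.random() < 0.80 else not label
        ruler = label if rng.random() < ruler_agreement else not label
        rows.append({"texture": texture, "ruler": ruler, "label": label})
    return rows

def accuracy(rows, feature):
    """Fraction of rows where predicting label = feature value is correct."""
    return sum(r[feature] == r["label"] for r in rows) / len(rows)

rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_data(2000, ruler_agreement=0.95, rng=rng)   # rulers track the label
deploy = make_data(2000, ruler_agreement=0.50, rng=rng)  # rulers are uninformative

# The learner minimizes training error and knows nothing else.
chosen = max(["texture", "ruler"], key=lambda f: accuracy(train, f))

print("feature chosen on training data:", chosen)
print("training accuracy:  ", round(accuracy(train, chosen), 3))
print("deployment accuracy:", round(accuracy(deploy, chosen), 3))
print("texture in deployment:", round(accuracy(deploy, "texture"), 3))
```

The learner prefers the ruler cue because it fits the training labels better, then loses most of its accuracy once that cue stops tracking the label. Collecting more training data from the same process would not help, since the ruler pattern would still be present in it.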
A team trains a dog-vs-cat classifier and achieves 99% accuracy on their training set. They then test it on photos taken indoors and find accuracy drops to 72%. The most likely explanation is:
Two linear models are fit to the same 1,000-point dataset. Model A has 3 parameters; Model B has 3,000 parameters. Assuming perfect optimization, which will have lower training error and why?
Design a Training Dataset
- Your team wants to build a model that predicts whether a student will pass a mathematics exam based on input features you choose.
- Step 1: Decide on 4 input features you would collect. Write the function signature.
- Step 2: For each feature, describe one way it could be unrepresentative — a scenario where your collected data does not match the real distribution of students who would use the model.
- Step 3: Propose one label-quality problem that might arise (for example: what defines 'pass'? who assigns the label?).
- Step 4: Estimate how many examples you would want before training. Justify your estimate relative to your number of parameters.
- Step 5: Describe one spurious correlation your model might learn if your data collection is flawed.
- Share and critique each other's designs.