Testing Honestly
You have built a model, trained it for many epochs, and watched the loss fall. Now comes the moment of truth: how well will it actually perform when it meets data it has never seen before? Answering that question honestly — without accidentally cheating — turns out to require careful discipline. This lesson is about why honest evaluation is harder than it looks, and how to do it right.
The Three Splits: Train, Validation, Test
Most machine learning projects divide their labeled data into three non-overlapping parts, each with a distinct role. Training set: the data the model learns from. Weights are updated based entirely on this set. It is typically the largest portion — often 70-80% of all data. Validation set (also called the development set or dev set): a held-out slice the model never trains on, used during development to check for overfitting and to make decisions like when to stop training, or which hyperparameters to try next. Typically 10-15% of data. Test set: the final exam. The model sees this data exactly once — after all training decisions have been made and the model is declared done. Its performance here is the honest, reported result. Typically 10-15% of data. The critical rule: the test set must never influence any training decision. The moment you look at test-set performance and use it to adjust anything — you have contaminated the test set, and your reported result is no longer honest.
Training set: the model learns from it. Validation set: you (the practitioner) learn from it — to tune the model. Test set: nobody learns from it. It is the final, unbiased verdict on whether the model truly generalizes.
Here is a concrete scenario showing how cheating sneaks in. You train a model. You evaluate it on the test set — 72% accuracy. You feel it is not good enough, so you adjust a hyperparameter and retrain. You check the test set again — 75%. You keep iterating until the test set shows 91%. Here is the problem: those 91% means almost nothing now. By repeatedly checking the test set and making decisions based on it, you have effectively trained on the test set. The model has been indirectly optimized for those specific examples. When it meets truly unseen data in production, it will likely perform far worse than 91%. The solution is the validation set: make all tuning decisions based on validation performance. Save the test set for a single final evaluation. Report that number. Done.
Data Leakage: A Subtler Contamination
Data leakage is when information from the test set (or future data) sneaks into the training process in a way you did not intend. It is the most dangerous trap in applied machine learning — because the model looks great until it hits the real world. Example: you have medical data. Before splitting into train/test, you compute the average age across all patients and use it as a feature. The average includes test-set patients, so the test set has quietly influenced the training features. Result: inflated performance that disappears in production. Prevention: always split your data first. Only then compute statistics, normalize features, or perform any processing — and compute them using only the training set. Apply the same transformation to the validation and test sets, but never fit those transformations on them.
Look at the test set exactly once: after all decisions are final. Any decision made based on test-set performance — stopping training, adjusting hyperparameters, adding features — is cheating and inflates your reported results.
Complete the description of the three data splits.
A team trains five different models, each time checking the test set and picking the best one. What is the main problem with their approach?
Which of the following is an example of data leakage?
Design Your Own Split
- Step 1: Imagine you have a dataset of 1,000 labeled photos of fruits (apples, oranges, bananas).
- Step 2: Decide how many photos go into each split (training, validation, test) and write down your numbers and percentages.
- Step 3: List three decisions you would make using only the validation set (not the test set) during model development.
- Step 4: Describe what you would do with the test set, and when.
- Step 5: Invent a scenario where data leakage could occur in this fruit-classification project and describe how you would prevent it.