Overfitting and Generalization
Training accuracy is not what matters. What matters is how well a model performs on data it has never seen — data from the real world, after deployment. The gap between training performance and real-world performance is the central diagnostic challenge of machine learning. Understanding why that gap exists, and how to manage it, is essential to building models that are actually useful rather than models that merely appear useful during development.
Overfitting: Memorizing Instead of Learning
Overfitting occurs when a model learns the training data so thoroughly, including its noise, quirks, and coincidental patterns, that it fails to generalize to new examples. The model has, in effect, memorized rather than understood.
A concrete example: suppose you have 20 training points for a regression task, that is, 20 (x, y) pairs with distinct x values. You fit a polynomial of degree 19, which has 20 coefficients and can therefore pass exactly through all 20 points, achieving zero training error. Now you test the model on 5 new points. The high-degree polynomial oscillates wildly between the training points and performs terribly on the new data. A simpler model, say a straight line that does not fit the training data perfectly, might generalize far better.
The symptom is unmistakable: training performance is very high and validation performance is substantially lower (equivalently, training error is low and validation error is high). The difference between the two is the overfitting gap. Monitoring this gap throughout training is fundamental practice; it is why the validation set exists.
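The degree-19 example above can be reproduced in a few lines. This is a minimal sketch, not a definitive experiment: the ground-truth line `true_fn`, the noise level, the random seed, and the choice of test points are all assumptions made for illustration, and NumPy's `Chebyshev.fit` is used only as a numerically stable way to do an ordinary least-squares polynomial fit.

```python
import numpy as np
from numpy.polynomial import Chebyshev

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def true_fn(x):
    # Assumed ground truth (not from the text): a straight line.
    return 2.0 * x + 1.0

# 20 evenly spaced training points with Gaussian noise added to y.
x_train = np.linspace(-1.0, 1.0, 20)
y_train = true_fn(x_train) + rng.normal(scale=0.1, size=x_train.size)

# 5 new points placed halfway between neighbouring training points,
# including the two outermost gaps.
midpoints = (x_train[:-1] + x_train[1:]) / 2.0
x_new = midpoints[[0, 4, 9, 14, 18]]
y_new = true_fn(x_new) + rng.normal(scale=0.1, size=x_new.size)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

# Least-squares fits of degree 1 (a line) and degree 19 (an interpolant).
models = {
    "degree 1": Chebyshev.fit(x_train, y_train, deg=1),
    "degree 19": Chebyshev.fit(x_train, y_train, deg=19),
}

results = {}
for name, model in models.items():
    results[name] = {
        "train": mse(model, x_train, y_train),
        "new": mse(model, x_new, y_new),
    }
    print(f"{name:>9}: train MSE = {results[name]['train']:.3g}, "
          f"new-data MSE = {results[name]['new']:.3g}")
```

Under these assumptions the degree-19 fit should show a training error of essentially zero and a much larger error on the new points than the straight line, which is the overfitting gap in miniature.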
For squared-error loss, a model's expected error on unseen data decomposes into three sources: squared bias, variance, and irreducible error. Bias is the error from incorrect assumptions in the model: a linear model applied to a non-linear relationship has high bias. Variance is the error from sensitivity to fluctuations in the training data: a complex model that memorizes noise has high variance. Irreducible error is the noise inherent in the data that no model can eliminate. The trade-off: as you reduce bias by using more complex models, variance tends to increase. The goal is to find the complexity level that minimizes total error on unseen data.
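One way to make the three sources concrete is to estimate them by simulation: refit the same model on many freshly drawn training sets and look at how the predictions behave on average (bias) and how much they scatter (variance). The sketch below assumes a sine-shaped ground truth, a noise standard deviation of 0.3, and polynomial models of degree 1 and 9; none of these come from the text, and the helper `bias_variance` is a hypothetical name.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def true_fn(x):
    # Assumed non-linear ground truth, chosen only for illustration.
    return np.sin(2.0 * np.pi * x)

noise_sd = 0.3                        # assumed irreducible noise level
x_train = np.linspace(0.0, 1.0, 15)   # fixed inputs; only the noise is redrawn
x_eval = np.linspace(0.0, 1.0, 101)   # points where the error is measured
n_repeats = 300                       # number of simulated training sets

def bias_variance(degree):
    """Estimate squared bias and variance of a polynomial fit of this degree."""
    preds = np.empty((n_repeats, x_eval.size))
    for r in range(n_repeats):
        noise = rng.normal(scale=noise_sd, size=x_train.size)
        y_train = true_fn(x_train) + noise
        preds[r] = Polynomial.fit(x_train, y_train, deg=degree)(x_eval)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_fn(x_eval)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

estimates = {degree: bias_variance(degree) for degree in (1, 9)}
for degree, (bias_sq, variance) in estimates.items():
    total = bias_sq + variance + noise_sd ** 2
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}, "
          f"noise = {noise_sd ** 2:.4f}, total = {total:.4f}")
```

With this setup the straight line should show the larger squared bias and the degree-9 polynomial the larger variance, while the noise term is the same for both because no model can remove it.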
Underfitting is the opposite of overfitting. An underfit model is too simple to capture the real patterns in the data; it has high bias. Its training error and validation error are both high and close to each other. An underfit linear model applied to a clearly non-linear problem is wrong in a systematic, consistent way on all data.
The model complexity sweet spot sits between underfitting and overfitting. As you increase model complexity (more layers, more trees, higher polynomial degree), training error typically keeps decreasing. Validation error decreases at first, while the model is capturing real patterns, then reaches a minimum, then begins to increase as the model starts memorizing noise. The minimum validation error marks the target complexity level.
Practical diagnostics: if training error is high, the model is probably underfit, so try a more complex model or better features. If training error is low but validation error is high, the model is probably overfit, so try regularization, more data, or a simpler model.
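The practical diagnostics above amount to a small decision rule, sketched here as a function. The name `diagnose` and the thresholds `high_error` and `max_gap` are hypothetical; what counts as a "high" error or a "large" gap depends on the task, the metric, and the noise in the data.

```python
def diagnose(train_error, val_error, high_error=0.20, max_gap=0.05):
    """Classify a (training, validation) error pair with the rules of thumb above.

    `high_error` and `max_gap` are assumed, illustrative thresholds, not
    universal constants.
    """
    if train_error >= high_error:
        # The model cannot even fit the data it was trained on.
        return "underfit: try a more complex model or better features"
    if val_error - train_error > max_gap:
        # Good on training data, much worse on held-out data.
        return "overfit: try regularization, more data, or a simpler model"
    return "reasonable fit: both errors are low and close together"

# Three hypothetical error pairs (fractions of misclassified examples).
print(diagnose(0.30, 0.32))
print(diagnose(0.03, 0.25))
print(diagnose(0.08, 0.10))
```

The training error is checked first because a high training error already rules out overfitting as the main problem, whatever the validation error is.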
Regularization
Regularization is a family of techniques that constrain a model's complexity to reduce overfitting. The core idea: add a penalty to the loss function that grows as the model's parameters become large or numerous. This forces the optimization algorithm to trade off between fitting the training data and keeping the model simple.
L2 regularization (also called ridge regression for linear models) adds the sum of squared parameter values to the loss function, multiplied by a regularization strength hyperparameter lambda. Large parameters are penalized, so the model is pushed toward smaller weights that represent smoother, more generalizable functions.
L1 regularization (lasso) adds the sum of absolute values of parameters instead. L1 has a distinctive property: it drives some parameters exactly to zero, effectively removing those features from the model. This makes L1 a form of automatic feature selection.
Dropout is a regularization technique specific to neural networks. During each training update, a random fraction of neurons is temporarily disabled, with their outputs set to zero. This prevents neurons from co-adapting, that is, from relying too heavily on one another in ways that only work for specific training examples, and forces the network to develop more robust representations.
Early stopping is another approach: monitor validation error during training and stop when validation error begins to rise, even if training error is still decreasing. The model that is kept is the checkpoint with the lowest validation error.
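The difference between the L2 and L1 penalties is easiest to see on a small linear regression where most features are irrelevant. The sketch below is one possible illustration, assuming synthetic data with 10 features of which only the first 3 matter. Ridge is solved in closed form; lasso is solved with proximal gradient descent (ISTA), which is one of several standard solvers and is chosen here only because it is short. The function names and the lambda values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (assumed): 10 features, only the first 3 influence y.
n_samples, n_features = 100, 10
X = rng.normal(size=(n_samples, n_features))
true_w = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_w + rng.normal(scale=0.5, size=n_samples)

def ridge_fit(X, y, lam):
    """Minimize (1/2n)||Xw - y||^2 + (lam/2)||w||^2 in closed form."""
    n = X.shape[0]
    gram = X.T @ X / n + lam * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ y / n)

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t, stopping at exactly zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_fit(X, y, lam, n_iters=500):
    """Minimize (1/2n)||Xw - y||^2 + lam*||w||_1 by proximal gradient (ISTA)."""
    n = X.shape[0]
    # Step size 1/L, where L is the largest eigenvalue of X^T X / n.
    step = n / np.linalg.norm(X, ord=2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / n                       # gradient of the data term
        w = soft_threshold(w - step * grad, step * lam)    # L1 proximal step
    return w

w_ols = ridge_fit(X, y, lam=0.0)     # no penalty: ordinary least squares
w_ridge = ridge_fit(X, y, lam=1.0)   # L2 penalty
w_lasso = lasso_fit(X, y, lam=0.3)   # L1 penalty

np.set_printoptions(precision=2, suppress=True)
print("no penalty:", w_ols)
print("ridge (L2):", w_ridge)
print("lasso (L1):", w_lasso)
print("weights exactly zero, ridge vs lasso:",
      int(np.sum(w_ridge == 0.0)), "vs", int(np.sum(w_lasso == 0.0)))
```

The L2 penalty shrinks every weight but leaves them all non-zero, whereas the soft-thresholding step in the L1 solver is what sets small weights to exactly zero.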
All regularization techniques are workarounds for the fundamental problem: the model has more capacity than the data supports. The cleanest solution is simply more training data — a model cannot overfit patterns that are genuinely consistent across thousands of diverse examples. When data is limited, regularization manages the symptom; more data treats the cause.
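The effect of more data can be checked by keeping the model fixed and growing the training set. The following sketch assumes a sine-shaped ground truth with Gaussian noise and a degree-9 polynomial as the "too flexible" model; the training set sizes, seed, and helper names are illustrative choices, not part of the original discussion.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)

def sample(n):
    """Draw n noisy observations of an assumed ground truth, sin(2*pi*x)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

# A large held-out set, reused for every training set size.
x_val, y_val = sample(2000)

gaps = {}
for n_train in (12, 50, 1000):
    x_train, y_train = sample(n_train)
    # Same flexible model every time: a degree-9 polynomial (10 coefficients).
    model = Polynomial.fit(x_train, y_train, deg=9, domain=[0.0, 1.0])
    train_err = mse(model, x_train, y_train)
    val_err = mse(model, x_val, y_val)
    gaps[n_train] = val_err - train_err
    print(f"n = {n_train:>4}: train MSE = {train_err:.3f}, "
          f"val MSE = {val_err:.3f}, gap = {gaps[n_train]:.3f}")
```

With only 12 points the 10 coefficients can chase the noise, so the gap between validation and training error should be large; with 1000 points the same model has far less room to memorize and the gap should shrink toward zero.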
Answer these questions about overfitting and regularization.
A model achieves 98% accuracy on training data and 61% on validation data. What does this pattern most clearly indicate?
L1 regularization is sometimes preferred over L2 regularization when you suspect many features are irrelevant. Why?
Diagnosing a Training Run
A classmate shares the following training log from a neural network trained to classify product reviews as positive or negative.
- Epoch 1: Train loss 0.68, Val loss 0.67
- Epoch 5: Train loss 0.41, Val loss 0.40
- Epoch 10: Train loss 0.22, Val loss 0.28
- Epoch 15: Train loss 0.10, Val loss 0.38
- Epoch 20: Train loss 0.04, Val loss 0.51
- Step 1: Plot these values on paper (or describe the shape of the curves).
- Step 2: At which epoch does overfitting most clearly begin? Justify your answer using specific numbers.
- Step 3: If you were using early stopping, at which epoch would you stop training? Why?
- Step 4: Name two interventions — other than early stopping — that might have prevented this overfitting, and briefly explain how each works.
- Step 5: If at epoch 20 the training loss were also 0.51 instead of 0.04, what different problem would that suggest?