Machine Learning & Deep Learning

Overfitting and Regularization

A deep neural network can memorize its training data. Given enough parameters and enough training, it will fit every training example perfectly — including the noise and idiosyncrasies that are particular to those examples and will not appear in new data. The result is a model that performs beautifully on the training set and terribly everywhere else. This is overfitting, and avoiding it is one of the central practical challenges of deep learning. This lesson analyzes the phenomenon precisely and develops three principled remedies.

The Bias-Variance Trade-off

The expected prediction error of a model on new data decomposes into three parts:

    Expected Error = Bias^2 + Variance + Irreducible Noise

Bias measures how far the model's average prediction is from the true answer: systematic error from underfitting. A model that always predicts the mean has high bias.

Variance measures how much the model's predictions change when trained on different samples of the training data: sensitivity to the particular data seen. An overfitted model has high variance: it has memorized this training set, so small changes to the training data produce very different predictions.

Irreducible noise is randomness in the data that no model can eliminate.

A shallow model with few parameters has high bias (cannot express the true function) but low variance (does not overfit). A very deep model has low bias (can express complex functions) but potentially high variance (overfits). Regularization techniques reduce variance, often at the cost of a slight increase in bias, to improve overall generalization.
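The decomposition can be estimated empirically by refitting a model on many noisy copies of the same training inputs. The sketch below is only an illustration under stated assumptions: the sine target, the noise level, the polynomial degrees, and the helper name `bias_variance` are all hypothetical choices, and polynomial degree stands in for network capacity.

```python
import numpy as np

def bias_variance(degree, n_trials=200, n_train=20, noise_std=0.3, seed=0):
    """Estimate bias^2 and variance of a polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)
    x_test = np.linspace(0.0, 1.0, 50)
    y_true = np.sin(2 * np.pi * x_test)   # assumed ground-truth function
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        # Each trial sees the same inputs with a fresh draw of label noise.
        y_noisy = np.sin(2 * np.pi * x_train) + rng.normal(0.0, noise_std, n_train)
        coeffs = np.polyfit(x_train, y_noisy, degree)
        preds[t] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - y_true) ** 2)  # systematic error of the average fit
    variance = np.mean(preds.var(axis=0))         # spread of fits across training sets
    return bias_sq, variance

for degree in (1, 9):
    b, v = bias_variance(degree)
    print(f"degree={degree}: bias^2={b:.4f}, variance={v:.4f}")
```

Under these assumptions the straight-line fit should show the larger bias term and the degree-9 fit the larger variance term, mirroring the shallow-versus-deep contrast described above.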

Generalization Is the Goal

Training loss is not the metric that matters. What matters is test loss: performance on data the model has never seen. A model that achieves 0.01 training loss and 2.5 test loss has memorized noise, not learned patterns. Because the test set should stay untouched until the final evaluation, track loss on a held-out validation set alongside training loss throughout training. A large and widening gap between training loss and held-out loss is the signature of overfitting.

Three standard regularization techniques:

1. Weight Decay (L2 Regularization)

Add a penalty to the loss proportional to the sum of squared weights:

    L_regularized = L_original + (lambda/2) * sum_i w_i^2

The hyperparameter lambda controls the strength of regularization. The gradient of the penalty with respect to w_i is lambda * w_i, so the gradient descent update becomes:

    w_new = w - eta * (dL/dw + lambda * w)
          = (1 - eta*lambda) * w - eta * (dL/dw)

Here eta is the learning rate and L is the original, unpenalized loss. The factor (1 - eta*lambda) shrinks every weight slightly toward zero at each step, "decaying" the weights. Large weights are penalized, pushing the network toward small, distributed weights rather than heavy reliance on any single feature. This discourages memorization and promotes smoother, more generalizable functions.

2. Dropout

During each training forward pass, randomly set a fraction p (the dropout rate, typically 0.2 to 0.5) of a layer's neurons to zero. On the backward pass, gradients through dropped neurons are also zero. Dropout forces the network to learn redundant representations: it cannot rely on any single neuron always being present. The result is approximately equivalent to training an exponentially large ensemble of sub-networks that share weights, which reduces variance.

At test time, dropout is turned off and all neurons are active. In the original formulation, weights are then scaled by (1-p) to compensate for the larger number of active neurons. Most modern frameworks instead use "inverted" dropout, which scales the surviving activations by 1/(1-p) during training so that nothing needs to change at test time.

3. Early Stopping

Monitor validation loss (loss on a held-out subset not used in training) throughout training. When validation loss stops improving and begins to increase, stop training, even if training loss is still decreasing. The point of best validation loss is the moment the model has learned the signal but has not yet started memorizing noise, so the weights from that epoch are the ones to keep. Early stopping is simple, adds little cost beyond evaluating on the validation set, and is highly effective. The underlying intuition: in early training, the network learns broad patterns; later, it starts fitting idiosyncratic training details.
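The weight decay update and the dropout mask described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a framework implementation: the function names are hypothetical, and the dropout function uses the "inverted" variant (rescaling survivors by 1/(1-p) during training) rather than scaling weights by (1-p) at test time.

```python
import numpy as np

def sgd_step_l2(w, grad, lr, lam):
    """Gradient step on the L2-penalized loss: w - lr * (grad + lam * w)."""
    return w - lr * (grad + lam * w)

def sgd_step_decay(w, grad, lr, lam):
    """The same step written as shrinkage by (1 - lr*lam) plus the usual update."""
    return (1.0 - lr * lam) * w - lr * grad

def dropout(activations, p, rng, training=True):
    """Inverted dropout (assumed variant): zero a fraction p, rescale survivors."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return activations                        # test time: all units active
    keep_mask = rng.random(activations.shape) >= p  # each unit kept with prob 1-p
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
w = rng.normal(size=5)
grad = rng.normal(size=5)   # stand-in for dL/dw of the unpenalized loss

# Both forms of the weight decay update give the same new weights.
print(np.allclose(sgd_step_l2(w, grad, 0.1, 0.01),
                  sgd_step_decay(w, grad, 0.1, 0.01)))

# With rescaling, the mean activation is roughly preserved during training.
h = np.ones(10_000)
print(dropout(h, p=0.5, rng=rng).mean())
```

The rescaling keeps the expected activation the same in training and at test time, which is why the test-time branch returns its input unchanged.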

Practical Guidance

These techniques are complementary and often combined. A typical modern training setup includes:

  • Weight decay on all weight matrices (not biases): lambda = 1e-4 to 1e-2
  • Dropout after each hidden layer, rate 0.1 to 0.5 depending on dataset size
  • Early stopping based on validation loss with a patience of 5-20 epochs

When should you add more regularization? When training loss is much lower than validation loss (the gap is large).

When should you reduce regularization (or use a larger model)? When both training and validation loss are high: the model is underfitting (high bias), and regularization is only making it worse.

Data augmentation is another powerful form of regularization not covered here: artificially expanding the training set by applying label-preserving transformations (flipping, rotating, cropping images) reduces overfitting by making the training distribution less memorizable.
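The patience rule for early stopping can be sketched as a small helper that scans per-epoch validation losses. This is one reasonable reading of "patience" (stop once the best loss has not been beaten for that many consecutive epochs); the function name and the loss values are hypothetical and made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops at the first epoch where the best validation loss
    has not improved for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # new best: remember this epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch              # patience exhausted: stop here
    return len(val_losses), best_epoch            # ran to the end without stopping

# Made-up validation losses: best value at epoch 4, then no improvement.
losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.57, 0.60, 0.62, 0.65]
stop, best = early_stopping_epoch(losses, patience=3)
print(stop, best)  # 7 4
```

In a real training loop the weights would be checkpointed whenever a new best loss is seen, and the checkpoint from `best_epoch` restored after stopping.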

Prompt Challenge

Write a prompt asking an AI assistant to explain how to detect and fix overfitting in a neural network training run.

Your prompt should…

  • Ask the AI to describe two specific signals in training curves that indicate overfitting is occurring
  • Request concrete numerical thresholds or examples so the explanation is actionable
  • Instruct the AI to recommend at least two distinct regularization techniques appropriate for a high-school-level deep learning practitioner

Regularization as a Prior Belief

Weight decay encodes the prior belief that the true function should be smooth and not depend heavily on any one feature. Dropout encodes the belief that good features should be redundant, not unique. Early stopping encodes the belief that simpler, earlier-converging solutions generalize better than solutions that take longer to fit. Each technique is a form of inductive bias — an assumption baked into training that guides the model toward solutions that generalize.

A model has training loss 0.05 and validation loss 0.95 after 100 epochs. What is the most likely diagnosis, and what technique would you try first?

During training with early stopping, validation loss decreases for 30 epochs, then increases for 5 consecutive epochs. What should you do?

Spot the Overfit

  1. Imagine you are reviewing training logs for a classmate's neural network. You see these values:
       Epoch 10: train_loss=0.45, val_loss=0.47
       Epoch 20: train_loss=0.28, val_loss=0.31
       Epoch 30: train_loss=0.15, val_loss=0.29
       Epoch 40: train_loss=0.08, val_loss=0.35
       Epoch 50: train_loss=0.04, val_loss=0.48
  2. Plot these (or sketch approximately) on a graph with epoch on the x-axis and loss on the y-axis. Draw both curves.
  3. Identify the epoch at which overfitting begins. What evidence supports your answer?
  4. If your classmate had used early stopping with patience=5 (stop when val_loss has not improved for 5 epochs), at which epoch would training have stopped?
  5. Recommend one additional regularization technique they should try in the next run, and explain what it will change in the training procedure.