Loss Functions
A neural network starts with random weights and makes meaningless predictions. Training is the process of adjusting weights to make better predictions. But 'better' needs a precise, numerical definition — the network must have a single number to minimize. That number is the loss. The loss function is the mathematical lens through which the network perceives the gap between what it predicted and what actually happened. This lesson defines the two most important loss functions: mean squared error and cross-entropy.
Mean Squared Error (MSE)
Mean squared error is used for regression: problems where the target y is a real number (predicting a house price, a temperature, a stock value). For a single training example with true value y and prediction y_hat:
    L = (y - y_hat)^2
For N training examples:
    L = (1/N) * sum_{i=1}^{N} (y_i - y_hat_i)^2
Concrete example: suppose we are predicting exam scores. The true scores and predictions for 4 students are:
    y     = [85, 72, 90, 60]
    y_hat = [80, 75, 88, 65]
Squared errors:
    (85-80)^2 = 25
    (72-75)^2 = 9
    (90-88)^2 = 4
    (60-65)^2 = 25
    MSE = (25 + 9 + 4 + 25) / 4 = 63 / 4 = 15.75
Squaring serves two purposes: it makes all errors positive (so over-predictions and under-predictions do not cancel), and it penalizes large errors more than small ones. A prediction 10 units off contributes 100, while a prediction 1 unit off contributes only 1.
Squaring also shapes which kind of model the loss prefers. For the same total absolute error, a model that is slightly wrong everywhere is penalized less than one that is catastrophically wrong on a few examples. MSE therefore encodes the intuition that one big mistake is worse than several small ones.
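The exam-score arithmetic can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `mean_squared_error` and the second comparison (a made-up target of 50 for four examples) are illustrative assumptions, not part of the lesson.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Exam-score example from the text.
y = [85, 72, 90, 60]
y_hat = [80, 75, 88, 65]
print(mean_squared_error(y, y_hat))  # 15.75

# Hypothetical comparison: both models have a total absolute error of 20.
targets = [50, 50, 50, 50]
slightly_wrong_everywhere = [55, 55, 55, 55]  # four errors of 5
badly_wrong_once = [70, 50, 50, 50]           # one error of 20
print(mean_squared_error(targets, slightly_wrong_everywhere))  # 25.0
print(mean_squared_error(targets, badly_wrong_once))           # 100.0
```

The second comparison shows the disproportionate penalty: the same total error costs four times as much when it is concentrated in a single example.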
Binary Cross-Entropy
Binary cross-entropy is used for binary classification: problems where the target y is 0 or 1. The network outputs a probability p = y_hat in (0, 1) via sigmoid. For one example:
    L = - [y * log(p) + (1 - y) * log(1 - p)]
where log is the natural logarithm. This formula may look intimidating, but it has a clean interpretation. Consider two cases:
Case 1: y = 1 (true class is 'yes'). Then L = -log(p).
- If the network is confident and correct (p = 0.95): L = -log(0.95) ≈ 0.051. Very low loss.
- If the network is confident but wrong (p = 0.05): L = -log(0.05) ≈ 3.00. Very high loss.
Case 2: y = 0 (true class is 'no'). Then L = -log(1-p).
- If p = 0.05 (predicts 'no' confidently, correctly): L = -log(0.95) ≈ 0.051.
- If p = 0.95 (predicts 'yes' confidently, wrongly): L = -log(0.05) ≈ 3.00.
The logarithm punishes confident wrong predictions with a loss that grows without bound as the wrong prediction approaches certainty. This is exactly what we want: confidence should only be rewarded when it is correct.
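The four cases above can be reproduced with a short sketch. It uses the natural logarithm from Python's `math` module; the function name and the `eps` clamp (which keeps `log` away from 0 when p is exactly 0 or 1) are assumptions added for illustration.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Loss for one example: y is 0 or 1, p is the predicted probability of class 1."""
    # Clamp p into [eps, 1 - eps] so log() never receives 0 (assumed safeguard).
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


# (true label, predicted probability) pairs from the two cases in the text.
cases = [(1, 0.95), (1, 0.05), (0, 0.05), (0, 0.95)]
for y, p in cases:
    print(f"y={y}, p={p:.2f} -> loss={binary_cross_entropy(y, p):.3f}")
```

The confident-correct cases print a loss near 0.051 and the confident-wrong cases a loss near 3.0, matching the hand calculation.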
Cross-Entropy for Multiple Classes
For K-class classification (the output is a probability vector p of length K, summing to 1, produced by softmax), the categorical cross-entropy loss for one example is:
    L = - sum_{k=1}^{K} y_k * log(p_k)
where y is a one-hot vector (all zeros except a 1 at the true class index). This simplifies to:
    L = - log(p_{true class})
Example: a 3-class problem (cat, dog, bird). True class is dog (index 1).
- Network output p = [0.1, 0.7, 0.2]: L = -log(0.7) ≈ 0.357.
- If instead p = [0.05, 0.9, 0.05]: L = -log(0.9) ≈ 0.105. Less loss: more confident and correct.
- If instead p = [0.05, 0.1, 0.85]: L = -log(0.1) ≈ 2.30. Much more loss: confident and wrong.
The loss the network minimizes in training is typically the average of L over all training examples. This average is called the training loss or empirical risk.
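The cat/dog/bird example can be checked the same way. This is a sketch under the assumption that the probability vectors are already softmax outputs; the function name is illustrative, and treating the three outputs as a tiny "training set" is only meant to show how the average is formed.

```python
import math


def categorical_cross_entropy(y_one_hot, p):
    """Loss for one example: -sum_k y_k * log(p_k) with a one-hot target."""
    # Terms with y_k == 0 contribute nothing, so they are skipped; this also
    # avoids evaluating log(0) for classes that are not the true class.
    return -sum(y_k * math.log(p_k) for y_k, p_k in zip(y_one_hot, p) if y_k > 0)


y = [0, 1, 0]  # one-hot target: dog (index 1)
outputs = [
    [0.1, 0.7, 0.2],     # moderately confident, correct
    [0.05, 0.9, 0.05],   # confident, correct
    [0.05, 0.1, 0.85],   # confident, wrong
]

losses = [categorical_cross_entropy(y, p) for p in outputs]
for p, loss in zip(outputs, losses):
    print(f"p={p} -> loss={loss:.3f}")

# Averaging per-example losses gives the training loss (empirical risk).
training_loss = sum(losses) / len(losses)
print(f"average loss = {training_loss:.3f}")
```

With a one-hot target, only the probability assigned to the true class enters the loss, which is why the formula collapses to -log(p_{true class}).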
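Classification accuracy (fraction of correct predictions) is what we care about, but accuracy is not usefully differentiable: it is piecewise constant in the weights, so a small weight change usually leaves it unchanged and its gradient is zero almost everywhere. Cross-entropy is a smooth, differentiable proxy that we can minimize with gradient descent. A model minimizing cross-entropy indirectly optimizes accuracy, but loss and accuracy can move differently. For example, loss can keep falling as the model grows more confident on examples it already classifies correctly, while accuracy stays flat. Track both during training.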
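A small sketch makes the difference between loss and accuracy concrete. The four labels, the two sets of predicted probabilities, and the 0.5 decision threshold are all made-up assumptions chosen for illustration.

```python
import math


def bce(y, p):
    """Binary cross-entropy for one example (p must lie strictly between 0 and 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def mean_loss(labels, probs):
    return sum(bce(y, p) for y, p in zip(labels, probs)) / len(labels)


def accuracy(labels, probs, threshold=0.5):
    # Predict class 1 when p >= threshold (an assumed decision rule).
    predictions = [1 if p >= threshold else 0 for p in probs]
    return sum(pred == y for pred, y in zip(predictions, labels)) / len(labels)


labels = [1, 1, 0, 0]
cautious = [0.6, 0.6, 0.4, 0.4]        # all correct, but barely
confident = [0.99, 0.99, 0.01, 0.55]   # three very confident hits, one narrow miss

for name, probs in [("cautious", cautious), ("confident", confident)]:
    print(f"{name}: accuracy={accuracy(labels, probs):.2f}, "
          f"loss={mean_loss(labels, probs):.3f}")
```

In this toy setup the cautious model has the higher accuracy while the confident model has the lower loss, so ranking models by one metric does not guarantee the same ranking by the other.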
A binary classifier outputs p = 0.9 for an example with true label y = 0. What is the cross-entropy loss, and what does this tell you?
For a regression problem with true value y = 100 and two models with predictions 95 and 110 respectively, which model has higher MSE loss, and why?
Design a Loss Scenario
- Step 1: Imagine a medical diagnostic model that predicts whether a patient has a serious disease (y=1) or not (y=0).
- Step 2: Describe two error types: a false negative (predicts 'no disease' when disease is present) and a false positive (predicts 'disease' when none is present).
- Step 3: For each error type, compute the cross-entropy loss when the model assigns probability 0.05 to the correct class (a highly confident wrong prediction). For the false negative, p = 0.05 and L = -log(p); for the false positive, p = 0.95 and L = -log(1-p).
- Step 4: Discuss: in the medical context, which type of error is more dangerous? Does the symmetric cross-entropy loss capture this asymmetry, or would a different loss function be needed?
- Step 5: Research or brainstorm: what is a 'weighted loss' and how could it address your concern?
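After working through the steps, one possible starting point for Step 5 is a class-weighted variant of binary cross-entropy. This is only a sketch: the function name, the parameters `w_pos` and `w_neg`, and the weight of 5 are hypothetical choices, not values recommended by the lesson.

```python
import math


def weighted_binary_cross_entropy(y, p, w_pos=1.0, w_neg=1.0):
    """Binary cross-entropy with separate weights for the y=1 and y=0 terms.

    w_pos scales the loss on examples whose true label is 1 (here, disease
    present); w_neg scales the loss on examples whose true label is 0.
    With w_pos == w_neg == 1 this is the ordinary, symmetric loss.
    """
    return -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))


# Hypothetical weighting: a missed disease counts five times as much.
false_negative = weighted_binary_cross_entropy(y=1, p=0.05, w_pos=5.0)
false_positive = weighted_binary_cross_entropy(y=0, p=0.95, w_pos=5.0)
print(f"false negative loss: {false_negative:.2f}")
print(f"false positive loss: {false_positive:.2f}")
```

With equal weights the two errors cost the same. Raising `w_pos` makes confident false negatives more expensive, which is one way to encode the asymmetry discussed in Step 4.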