Machine Learning & Deep Learning

The Objective: Minimizing Loss

Training data constrains the hypothesis space, but we still need a precise, computable criterion for which hypothesis is best. We need to turn 'fits the data well' into a number that a computer can minimize. That number is the loss — and the function that computes it is the loss function. The choice of loss function encodes your values: what kinds of errors you care about and how much.

What Is a Loss Function?

A loss function L(y_hat, y) takes two arguments, the model's prediction y_hat and the true label y, and returns a non-negative number measuring how wrong the prediction is. A perfect prediction yields loss 0; wrong predictions yield positive loss; the worse the prediction, the higher the loss.

The total training loss is the average (or sum) of the per-example losses over the entire training dataset:

L_total = (1/n) * sum over i of L(f(x_i), y_i)

Training is the process of searching for parameter values that minimize L_total. This is why the loss function is called the objective: it is the thing the algorithm is literally trying to optimize. Two loss functions are foundational and worth knowing precisely.

Mean Squared Error (MSE): used for regression (predicting a continuous number).

MSE = (1/n) * sum over i of (y_hat_i - y_i)^2

Example: true prices are [200k, 350k, 500k]; predictions are [210k, 340k, 480k]. Errors: [10k, -10k, -20k]. Squared errors: [100M, 100M, 400M]. MSE = 600M / 3 = 200,000,000 (dollars squared). Root MSE ≈ $14,142.

Cross-Entropy Loss: used for classification (predicting a probability over categories). For binary classification:

L = -[y * log(y_hat) + (1 - y) * log(1 - y_hat)]

If the true label is 1 (spam) and the model outputs probability 0.9, loss = -log(0.9) ≈ 0.105. Low. If the true label is 1 and the model outputs 0.1, loss = -log(0.1) ≈ 2.303. High: cross-entropy severely penalizes confident wrong predictions.
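The two worked examples above can be reproduced in a few lines. This is a minimal sketch in plain Python; the function names `mse` and `binary_cross_entropy` are chosen here for illustration and do not come from any particular library.

```python
import math

def mse(predictions, targets):
    """Mean squared error: the average of the squared differences."""
    n = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n

def binary_cross_entropy(y_hat, y):
    """Per-example binary cross-entropy.

    y_hat is the predicted probability of the positive class
    (strictly between 0 and 1); y is the true label (0 or 1).
    Uses the natural logarithm, as in the worked example.
    """
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Housing-price example from the text.
true_prices = [200_000, 350_000, 500_000]
predicted_prices = [210_000, 340_000, 480_000]

price_mse = mse(predicted_prices, true_prices)
price_rmse = math.sqrt(price_mse)
print(price_mse)          # 200000000.0 (dollars squared)
print(round(price_rmse))  # 14142 (dollars)

# Spam example from the text: true label is 1 in both cases.
print(round(binary_cross_entropy(0.9, 1), 3))  # 0.105
print(round(binary_cross_entropy(0.1, 1), 3))  # 2.303
```

Taking the square root of the MSE brings the number back into dollars, which is why the root MSE is often easier to interpret than the MSE itself.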

Loss Functions Encode Values

MSE penalizes large errors quadratically: a prediction 20 units off contributes 4× the loss of a prediction 10 units off. That is not always the penalty you want. MSE also treats overestimates and underestimates identically, whereas in medical diagnosis a false negative (missing a disease) and a false positive (a false alarm) may have very different real-world costs. Choosing a loss function is a value judgment, not just a technical convenience.
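One way to encode asymmetric costs, sketched here as an illustration rather than a standard recipe, is to weight the two terms of binary cross-entropy differently. The function name `weighted_bce`, the parameter names `fn_weight` and `fp_weight`, and the weight of 5.0 are all assumptions made for this example; in practice the weights would come from the real-world costs of each kind of mistake.

```python
import math

def weighted_bce(y_hat, y, fn_weight=5.0, fp_weight=1.0):
    """Binary cross-entropy with separate weights on its two terms.

    fn_weight scales the loss when the true label is 1 (missed positives);
    fp_weight scales the loss when the true label is 0 (false alarms).
    The default weights are illustrative assumptions, not recommendations.
    """
    return -(fn_weight * y * math.log(y_hat)
             + fp_weight * (1 - y) * math.log(1 - y_hat))

# Two equally confident mistakes in opposite directions.
missed_disease = weighted_bce(0.1, 1)  # true label 1, model says 0.1
false_alarm = weighted_bce(0.9, 0)     # true label 0, model says 0.9

# The missed positive costs five times as much as the false alarm.
print(round(missed_disease / false_alarm, 6))  # 5.0
```

With both weights set to 1.0 the function reduces to ordinary binary cross-entropy, and the two mistakes above would be penalized equally.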

A subtlety: minimizing training loss does not mean finding the true function. It means finding the hypothesis in your hypothesis space that best explains the training data according to your loss function. If your hypothesis space does not contain the true function, the minimum-loss hypothesis is still an approximation. If your training data is unrepresentative, minimizing training loss can move you away from the true function on real-world inputs. This separation between the objective (minimize training loss) and the goal (perform well on future data) is one of the deepest tensions in machine learning. We will examine it closely in Lesson 5.

A useful mental model: the loss function is a landscape. Each point in the landscape corresponds to one hypothesis (one set of parameter values). The height of the landscape at any point is the training loss for that hypothesis. Training is the problem of finding the lowest point in the landscape. Lesson 7 will show how gradient descent navigates this landscape.
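The landscape picture can be made concrete with a toy model that has a single parameter. The sketch below assumes a hypothetical model f(x) = w * x and a tiny made-up dataset, then measures the training loss at a handful of candidate values of w. It scans a fixed grid purely to show the shape of the landscape; it is not gradient descent.

```python
def training_loss(w, xs, ys):
    """MSE of the one-parameter model f(x) = w * x on the training set."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up training data generated by y = 2x, so the lowest point is at w = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# Each candidate w is one hypothesis; its loss is the landscape's height there.
candidates = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
landscape = {w: training_loss(w, xs, ys) for w in candidates}

for w, loss in landscape.items():
    print(f"w = {w:.1f}  loss = {loss:.3f}")

best_w = min(landscape, key=landscape.get)
print(best_w, landscape[best_w])  # 2.0 0.0
```

With one parameter the landscape is a curve; with two it is a surface; with millions it can no longer be drawn, but the idea of "height equals training loss" is unchanged.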

Match each loss scenario to the correct loss function or concept.

Terms

Predicting apartment price in dollars
Predicting spam vs. not-spam probability
The number training is designed to minimize
Loss when prediction exactly equals true label
Penalizes confident wrong predictions most severely

Definitions

Total training loss
Zero
Cross-Entropy Loss property
Cross-Entropy Loss
Mean Squared Error

The Loss Landscape and Multiple Minima

For a linear model with MSE loss, the loss landscape has a pleasing property: it is convex, shaped like a bowl. Every local minimum is therefore a global minimum (and the minimizer is unique when the input features are not redundant), so any competent optimizer can reach it.

For deep neural networks with millions of parameters, the loss landscape is high-dimensional and complex. It has many local minima (valleys that are not the lowest point globally), saddle points (points where the gradient is zero but the surface curves upward in some directions and downward in others), and plateaus (vast flat regions where the gradient gives little information about which direction to move). Navigating this landscape efficiently is the problem of optimization, to be explored in Lesson 7.

A remarkable empirical finding: in modern large neural networks, many local minima have nearly the same loss value and generalize comparably well. This suggests the landscape, while complex, may be more navigable than early theory feared. But this is an active area of research, and our theoretical understanding of why large neural networks train successfully is still incomplete.
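The difference between a bowl and a landscape with several valleys can be checked numerically. A convex function never rises above the straight line (chord) joining two of its points. The sketch below tests that property at a single midpoint for a one-parameter linear model with MSE, and for a toy two-valley function. The function `two_valley` is a made-up stand-in, not the loss of any real neural network, and passing one midpoint check does not prove convexity; failing one does disprove it.

```python
def linear_mse(w, xs=(1.0, 2.0, 3.0), ys=(2.0, 4.0, 6.0)):
    """MSE of the linear model f(x) = w * x on a small made-up dataset."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def two_valley(w):
    """Toy non-convex function with minima at w = -1 and w = +1."""
    return (w ** 2 - 1) ** 2

def midpoint_is_below_chord(f, a, b):
    """True if f at the midpoint of [a, b] does not exceed the chord's height."""
    return f((a + b) / 2) <= (f(a) + f(b)) / 2

print(midpoint_is_below_chord(linear_mse, -1.0, 1.0))  # True
print(midpoint_is_below_chord(two_valley, -1.0, 1.0))  # False
```

In the second case the midpoint w = 0 sits on the ridge between the two valleys, higher than either endpoint, which is exactly the kind of barrier a bowl-shaped landscape cannot have.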

Honest Uncertainty

Researchers still lack a complete theoretical account of why stochastic gradient descent finds good solutions in the complex loss landscapes of large neural networks. If someone claims to fully understand this, read the claim critically: the field as a whole does not yet have that understanding.

A regression model predicts housing prices. For one house, the true price is $400,000 and the model predicts $390,000. For another, the true price is $400,000 and the model predicts $380,000. Under MSE, what is the ratio of the second error's contribution to the first error's?

A binary classifier outputs probability 0.51 for the positive class when the true label is positive. A second classifier outputs probability 0.99. Under cross-entropy loss, which has lower loss and what does this tell us about the loss function?

Compute and Compare Losses

  1. You are comparing two regression models predicting weekly rainfall in millimeters. True values for five weeks: [10, 25, 5, 40, 15].
  2. Model A predictions: [12, 23, 6, 38, 18]
  3. Model B predictions: [11, 20, 10, 50, 12]
  4. Step 1: Compute the error (prediction minus true) for each week for both models.
  5. Step 2: Compute the squared error for each week.
  6. Step 3: Compute the MSE for both models.
  7. Step 4: Which model has lower MSE? Does that match your intuition from looking at the errors?
  8. Step 5: Model B has a large error in week 4 (predicted 50, true 40). How much does that single error inflate Model B's MSE? What does this tell you about MSE and outlier predictions?
  9. Bonus: Propose a loss function in which an error's penalty grows in proportion to its absolute size, so that an error of 10 mm costs exactly twice as much as an error of 5 mm rather than four times as much. (Hint: consider not squaring.)
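After working through the steps by hand, you can check your arithmetic with a short script. This is a minimal sketch: the helper names `mse` and `mae` are chosen here for illustration, and mean absolute error is included as one possible answer to the bonus question, not the only one. Expected outputs are deliberately left out so the exercise is not spoiled.

```python
def mse(predictions, targets):
    """Mean squared error: the average of the squared differences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def mae(predictions, targets):
    """Mean absolute error: the average of the absolute differences."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

# Weekly rainfall in millimeters, copied from the exercise.
true_rain = [10, 25, 5, 40, 15]
model_a = [12, 23, 6, 38, 18]
model_b = [11, 20, 10, 50, 12]

for name, preds in [("Model A", model_a), ("Model B", model_b)]:
    errors = [p - t for p, t in zip(preds, true_rain)]  # prediction minus true
    squared = [e ** 2 for e in errors]
    print(name)
    print("  errors:        ", errors)
    print("  squared errors:", squared)
    print("  MSE:", mse(preds, true_rain))
    print("  MAE:", mae(preds, true_rain))
```

Comparing how far apart the two models are under MSE versus MAE is a quick way to see how much a single large error is driving the result.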