AI Foundations


Loss Functions

The forward pass produces a prediction. But a prediction alone is not learning — you also need to know how wrong the prediction was. The loss function (also called the cost function or objective function) is a mathematical formula that takes the network's prediction and the true answer and returns a single number representing the error. Smaller is better; zero would mean perfect. The entire training procedure exists to minimize this number. Choosing the right loss function is one of the most consequential design decisions you make when building a neural network.

Mean Squared Error: Loss for Regression

When the task is to predict a continuous quantity (a house price, tomorrow's temperature, the velocity of a particle), the most common choice is Mean Squared Error (MSE). For a single training example, if the network predicts ŷ and the true value is y, the squared error is:

L = (ŷ - y)²

For a batch of N examples with predictions ŷ₁, ŷ₂, …, ŷₙ and true values y₁, y₂, …, yₙ:

MSE = (1/N) × Σᵢ (ŷᵢ - yᵢ)²

Concrete example, using three training examples:

Example 1: predicted 4.2, true 4.0 → squared error (0.2)² = 0.04
Example 2: predicted 7.5, true 6.0 → squared error (1.5)² = 2.25
Example 3: predicted 3.1, true 3.5 → squared error (-0.4)² = 0.16

MSE = (0.04 + 2.25 + 0.16) / 3 = 2.45 / 3 ≈ 0.817

The squaring serves two purposes: it makes all errors positive (so overestimates and underestimates do not cancel), and it penalizes large errors disproportionately. A prediction that is off by 3 units contributes 9 to the loss, four times as much as the 2.25 contributed by a prediction off by 1.5, even though the error itself is only twice as large. This means training will be pulled especially hard toward reducing large errors.
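The MSE arithmetic can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `mean_squared_error` is chosen here for illustration and is not tied to any particular framework.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and true values."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(y_hat - y) ** 2 for y_hat, y in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# The three examples from the text: (predicted, true)
predictions = [4.2, 7.5, 3.1]
targets = [4.0, 6.0, 3.5]

mse = mean_squared_error(predictions, targets)
print(round(mse, 3))  # 0.817
```

Almost all of the total comes from the second example (2.25 out of 2.45), which is the disproportionate penalty on large errors in action.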

The Loss as a Landscape

Think of the loss as a hilly landscape in the space of all possible parameter values. Every combination of weights and biases corresponds to a point on this landscape, and the height at that point is the loss the network achieves with those parameters. Training is the process of navigating this landscape toward a low point — ideally a global minimum, but in practice a 'good enough' local minimum. The geometry of this loss landscape determines how easy or hard training is.
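To make the landscape picture concrete, here is a small sketch with a one-parameter model ŷ = w · x, so the "landscape" is just a curve over w. The data points and the grid of candidate weights are made up for illustration. Scanning a grid is only a way to look at the landscape; real training follows gradients rather than trying every value.

```python
def mse_for_weight(w, xs, ys):
    """Height of the loss landscape at parameter value w for the model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data generated by y = 2x, so the lowest point should be at w = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

candidate_weights = [i * 0.5 for i in range(9)]  # 0.0, 0.5, ..., 4.0
landscape = [(w, mse_for_weight(w, xs, ys)) for w in candidate_weights]

for w, loss in landscape:
    print(f"w = {w:.1f}  loss = {loss:.3f}")

best_w, best_loss = min(landscape, key=lambda point: point[1])
print(best_w, best_loss)  # 2.0 0.0
```

With a single weight and MSE the curve is a simple bowl with one minimum. With many weights and nonlinear layers the surface generally has many valleys, which is why the text distinguishes global from local minima.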

Binary Cross-Entropy: Loss for Classification

When the task is binary classification (is this email spam or not? is this tumor malignant?), the sigmoid output of the network is interpreted as a probability p of the positive class. The binary cross-entropy loss for one example is:

L = -[y · log(p) + (1 - y) · log(1 - p)]

where y is 1 for the positive class and 0 for the negative class, and log is the natural logarithm.

Concrete example:

Case 1: true class y = 1, predicted probability p = 0.9
L = -[1 · log(0.9) + 0 · log(0.1)] = -log(0.9) ≈ 0.105 (low loss, good prediction)

Case 2: true class y = 1, predicted probability p = 0.1
L = -[1 · log(0.1) + 0 · log(0.9)] = -log(0.1) ≈ 2.303 (high loss, bad prediction)

The logarithm is the key. When p is close to 1 and y = 1, log(p) is close to 0, so the loss is small. When p is close to 0 and y = 1, log(p) is a large negative number, so -log(p) is a large positive loss. The cross-entropy loss thus heavily penalizes confident wrong predictions, the most dangerous kind of error.
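The two cases can be reproduced with a short Python sketch. The function name and the small clipping constant `eps` are assumptions added here: the clipping is not part of the formula, it only keeps log(0) from raising an error if p is exactly 0 or 1.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy for one example with true label y (0 or 1) and predicted probability p."""
    # Clip p away from 0 and 1 so math.log never receives 0 (an added safeguard, not part of the formula).
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


print(round(binary_cross_entropy(1, 0.9), 3))  # 0.105
print(round(binary_cross_entropy(1, 0.1), 3))  # 2.303
```

Moving the predicted probability from 0.9 to 0.1 multiplies the loss by more than twenty, which is the heavy penalty on confident wrong predictions.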

Why Loss Function Choice Matters

The loss function defines what the network is optimizing. Optimize the wrong thing and the network may achieve a low number on your loss while failing at your actual goal. An often-cited example from practice: in some early recommendation systems, the loss was based on predicting ratings. But minimizing rating-prediction error did not maximize user satisfaction or engagement; it merely minimized a number that was only loosely connected to the real objective. The network became good at the proxy task, not the actual one. This is an instance of Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.

There is also a mathematical reason: the loss function must be differentiable (or mostly differentiable) with respect to the weights, because the training algorithm requires the gradient of the loss. This is why hard 0-or-1 classification accuracy, which sounds like a natural choice, is not used as the training loss. Accuracy is a step function of the weights: a tiny change in a weight usually produces no change in accuracy at all, so its gradient is zero almost everywhere and it is useless as a training signal. Cross-entropy and MSE, by contrast, change smoothly with the prediction, so even a small improvement in the weights shows up as a lower loss and a usable gradient.
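The difference in training signal can be seen numerically. The sketch below assumes a toy one-weight classifier p = sigmoid(w · x) and four made-up data points; it nudges the weight by 0.01 and compares how accuracy and mean cross-entropy respond.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def accuracy(w, xs, ys):
    """Fraction of examples classified correctly at a 0.5 threshold."""
    correct = sum(1 for x, y in zip(xs, ys) if (sigmoid(w * x) >= 0.5) == (y == 1))
    return correct / len(xs)


def mean_cross_entropy(w, xs, ys):
    """Average binary cross-entropy of the toy model p = sigmoid(w * x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(xs)


# Hypothetical data: negative inputs belong to class 0, positive inputs to class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w_before, w_after = 1.0, 1.01  # a tiny nudge to the single weight

acc_before, acc_after = accuracy(w_before, xs, ys), accuracy(w_after, xs, ys)
loss_before, loss_after = mean_cross_entropy(w_before, xs, ys), mean_cross_entropy(w_after, xs, ys)

print(acc_before, acc_after)     # identical: accuracy gives no signal
print(loss_before > loss_after)  # True: cross-entropy registers the improvement
```

Accuracy is the same before and after the nudge, so it says nothing about which direction to move the weight. Cross-entropy drops slightly, and that slope is what gradient-based training follows.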

Training Loss vs. Evaluation Metric

The loss function used during training and the metric reported to stakeholders are often different quantities. You might train with cross-entropy loss but report accuracy, F1 score, or AUC-ROC. This is normal and correct — use the differentiable loss for training, use the business-meaningful metric for evaluation. Confusing the two leads to poor model selection and misleading performance claims.
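A small sketch shows that the two quantities can rank models differently. The labels and the two sets of predicted probabilities below are invented for illustration, and the 0.5 decision threshold is an assumption.

```python
import math


def mean_cross_entropy(labels, probs):
    """Average binary cross-entropy over a set of predictions."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1.0 - p)) for y, p in zip(labels, probs)]
    return sum(losses) / len(losses)


def accuracy(labels, probs, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    correct = sum(1 for y, p in zip(labels, probs) if (p >= threshold) == (y == 1))
    return correct / len(labels)


labels = [1, 1, 0, 0]

# Hypothetical model A: very confident, but wrong on the last example.
probs_a = [0.95, 0.95, 0.05, 0.55]
# Hypothetical model B: barely confident, but on the right side every time.
probs_b = [0.55, 0.55, 0.45, 0.45]

loss_a, acc_a = mean_cross_entropy(labels, probs_a), accuracy(labels, probs_a)
loss_b, acc_b = mean_cross_entropy(labels, probs_b), accuracy(labels, probs_b)

print(f"A: loss = {loss_a:.3f}, accuracy = {acc_a}")  # accuracy = 0.75
print(f"B: loss = {loss_b:.3f}, accuracy = {acc_b}")  # accuracy = 1.0
```

Model A has the lower cross-entropy while model B has the higher accuracy. Selecting a model by the training loss alone would pick A, even if the metric stakeholders care about is accuracy.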

Complete the statement about loss functions.

In binary cross-entropy, the loss is high when the model outputs a ____ probability for the correct class, and the function is chosen over raw accuracy because accuracy is not ____.

Why does Mean Squared Error penalize large errors more heavily than small ones?

A network trained on binary cross-entropy outputs p = 0.05 for an example with true label y = 1. Which description best characterizes the loss for this example?

Design a Loss Function

  1. You are building a system to predict whether a loan applicant will default. The prediction is a probability p between 0 and 1.
  2. List two different ways the prediction could be wrong and describe which type of error is more costly in the real world.
  3. Explain, in your own words, why binary cross-entropy is better suited for this task than Mean Squared Error.
  4. Now consider: the bank wants to catch 99% of likely defaulters, even if it means rejecting some safe loans. Discuss whether training on cross-entropy loss alone will necessarily optimize for this goal. What other tools (beyond the loss function) might help align the model with this business objective?
  5. Write a one-paragraph recommendation to the engineering team.