Backpropagation
Gradient descent needs the gradient of the loss with respect to every parameter in the network. For a network with millions of weights across many layers, computing those gradients efficiently is not obvious. The output layer's weights are straightforward — the loss depends directly on the output. But what about the weights in the first hidden layer, which connect only to the input? The loss depends on those weights only indirectly, through every subsequent layer. Backpropagation — short for backward propagation of errors — is the algorithm that solves this problem exactly and efficiently, using the chain rule from calculus. It is arguably the most important algorithm in the history of modern machine learning.
The Chain Rule: The Mathematical Core
Recall from calculus: if y = f(g(x)), then dy/dx = (dy/dg) × (dg/dx). This is the chain rule for composition of functions. The derivative of a composed function is the product of the derivatives along the chain. A neural network is a deeply composed function. The loss L is a function of the output, which is a function of the last hidden layer, which is a function of the second-to-last hidden layer, and so on back to the inputs. The chain rule lets us propagate the derivative of L back through this chain layer by layer. Concrete example (two-layer network). Let the forward pass be: z₁ = w₁ · x + b₁ (hidden layer pre-activation) a₁ = ReLU(z₁) (hidden layer output) z₂ = w₂ · a₁ + b₂ (output layer pre-activation) L = (z₂ – y)² (MSE loss, single example) To find dL/dw₁ (how the loss changes with respect to the first-layer weight): dL/dz₂ = 2(z₂ – y) (derivative of loss w.r.t. output) dz₂/da₁ = w₂ (derivative of output w.r.t. hidden activation) da₁/dz₁ = 1 if z₁ > 0, else 0 (derivative of ReLU) dz₁/dw₁ = x (derivative of pre-activation w.r.t. weight) By the chain rule: dL/dw₁ = (dL/dz₂) × (dz₂/da₁) × (da₁/dz₁) × (dz₁/dw₁) = 2(z₂ – y) × w₂ × [1 if z₁ > 0 else 0] × x This chain of multiplications is backpropagation for one weight in a simple network. For a network with millions of weights, the same principle applies — the chain extends deeper, but no new ideas are required.
Naive computation of all N gradients would require N separate passes through the network — one per parameter. Backpropagation's power is that it computes ALL gradients in a SINGLE backward pass by reusing intermediate results. The gradient of the loss with respect to any layer's activations is computed once and then used by all previous layers. This makes computing the gradient as cheap as a single forward pass — a factor-of-N speedup that makes training practical at scale.
What Backpropagation Actually Does, Step by Step Step 1 (Forward pass): run the input through the network, recording the value of every intermediate activation and pre-activation. These are saved because they are needed during the backward pass. Step 2 (Compute loss): compare the output to the true label using the loss function. Compute the gradient of the loss with respect to the network's raw output. Step 3 (Backward pass): starting from the output layer, apply the chain rule to propagate the gradient backward, one layer at a time. For each layer: a. Multiply the incoming gradient by the derivative of the activation function (element-wise). b. Use the result to compute the gradient with respect to that layer's weights and biases. c. Pass the gradient further back to the previous layer by multiplying by the weight matrix (transposed). Step 4 (Update): use the collected gradients to update every weight and bias using gradient descent. This four-step loop — forward, loss, backward, update — is one training step. Training consists of repeating it for many minibatches across many epochs (passes through the dataset).
The Vanishing and Exploding Gradient Problems
Backpropagation multiplies derivatives through every layer. If each derivative is slightly less than 1 (as it tends to be in the flat regions of sigmoid), the product shrinks exponentially with depth. With 20 layers, each multiplying by 0.8, the overall gradient is 0.8²⁰ ≈ 0.012 — tiny. The earliest layers receive near-zero gradients and barely learn. This is the vanishing gradient problem. Conversely, if derivatives are consistently greater than 1, the product grows exponentially — gradients explode, causing unstable updates and NaN (not-a-number) values. This is the exploding gradient problem. The field has developed several effective solutions: ReLU activations: slope is exactly 1 in the positive region, preventing shrinkage for positive activations. Batch normalization: normalizes activations within each layer to keep them in a healthy range. Residual connections (skip connections): used in ResNets and Transformers, they create direct paths for gradients to flow back without passing through every activation. Careful weight initialization: schemes like Xavier (Glorot) and He initialization set starting weights to keep activations and gradients in a stable range from the first step. These solutions make training networks dozens or hundreds of layers deep feasible — and they were developed over decades of careful empirical and theoretical work.
Backpropagation relies on computing derivatives at every layer. This requires every component of the network to be differentiable (or at least subdifferentiable, as with ReLU at zero). Components that are not differentiable — like argmax, sorting, or hard thresholds — cannot be used directly inside a network trained with backprop. Researchers use differentiable approximations or specialized gradient estimators to handle these cases. This constraint silently shapes which operations appear in neural network architectures.
Match each backpropagation step to what it computes.
Terms
Definitions
Drag terms onto their definitions, or click a term then click a definition to match.
Why is backpropagation more efficient than computing each gradient separately?
What happens to gradients in a very deep network using sigmoid activations, and why?
Trace a Gradient Backward
- Use the simple two-layer example from this lesson. Set concrete values:
- x = 2.0, w₁ = 0.5, b₁ = 0.0, w₂ = 1.5, b₂ = 0.0, true label y = 3.0
- Step 1 (Forward pass): compute z₁, a₁ = ReLU(z₁), z₂, and L = (z₂ – y)².
- Step 2 (Backward pass): compute each derivative in the chain:
- dL/dz₂ = 2(z₂ – y)
- dz₂/da₁ = w₂
- da₁/dz₁ = 1 (since z₁ > 0)
- dz₁/dw₁ = x
- Then dL/dw₁ = product of all four.
- Step 3 (Update): with learning rate α = 0.1, compute w₁_new = w₁ – α × dL/dw₁.
- Step 4: redo the forward pass with w₁_new. Did the loss decrease?
- Write two sentences explaining why the gradient has the sign it does, and whether your update moved the weight in the expected direction.