Gradient Descent
You have a network, a forward pass, and a loss function that measures how wrong the predictions are. Now comes the central question of training: how do you adjust the millions of weights and biases so the loss goes down? The answer is gradient descent — one of the most important algorithms in all of modern computing. The idea is conceptually elegant: at each step, compute the direction in which the loss is increasing fastest (the gradient) and take a small step in the opposite direction. Repeat this until the loss is acceptably low. That is it. The power lies in how this simple idea, combined with backpropagation (the next lesson), scales to billions of parameters.
The Geometry of the Gradient
Imagine a single weight w and a loss L(w) that depends on it. If you plot L against w, you get a curve, perhaps a valley shape. The gradient dL/dw at a given value of w is the slope of this curve: positive when the curve is rising to the right, negative when it is falling to the right, and zero at the bottom of the valley.
Gradient descent update rule for a single weight:
w_new = w_old – α × (dL/dw)
α (alpha) is the learning rate: a small positive constant you choose, typically in the range 0.0001 to 0.1.
Concrete example. Suppose w = 3.0, the gradient dL/dw = 2.5, and the learning rate α = 0.1.
w_new = 3.0 – 0.1 × 2.5 = 3.0 – 0.25 = 2.75
The weight moved to the left (decreased) because the gradient was positive: the loss was rising in the positive direction, so stepping left decreases it. After many such steps, if the loss surface is shaped like a bowl, w will settle near the bottom.
With N parameters (weights and biases), the gradient is a vector of N partial derivatives, one for each parameter. Each parameter is updated simultaneously using its own gradient component. The update rule is:
θ_new = θ_old – α × ∇L(θ)
where θ is the full parameter vector and ∇L is the gradient vector.
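The two update rules above can be sketched in a few lines of Python. This is a minimal illustration, not a training implementation: the function names update_weight and update_parameters are hypothetical, and the three-parameter values are made up for the example.

```python
def update_weight(w, grad, alpha):
    """Single-weight rule: w_new = w_old - alpha * dL/dw."""
    return w - alpha * grad


def update_parameters(theta, grad, alpha):
    """Vector rule: every parameter moves against its own gradient component."""
    return [t - alpha * g for t, g in zip(theta, grad)]


# The worked example from the text: w = 3.0, dL/dw = 2.5, alpha = 0.1.
w_new = update_weight(3.0, 2.5, 0.1)
print(w_new)  # approximately 2.75

# A made-up three-parameter case (values are illustrative only).
theta_new = update_parameters([3.0, -1.0, 0.5], [2.5, -4.0, 0.0], 0.1)
print(theta_new)  # approximately [2.75, -0.6, 0.5]
```

Note that the third parameter does not move: its gradient component is zero, so the loss is locally flat in that direction.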
The gradient ∇L points in the direction of steepest increase of the loss. Gradient descent steps in the negative gradient direction — the direction of steepest decrease. This is sometimes described as 'rolling downhill' on the loss landscape. The metaphor is useful but imperfect: the loss landscape in a real neural network has millions of dimensions, and its geometry is far more complex than a simple valley.
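To make the direction claim concrete, the sketch below compares a small step against the gradient with a small step along it. The two-parameter bowl-shaped loss is an assumption chosen for illustration; it is not taken from any real network.

```python
def loss(theta):
    """Illustrative bowl-shaped loss with its minimum at (1, -3)."""
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 3.0) ** 2


def gradient(theta):
    """Partial derivatives of the loss above, one per parameter."""
    return [2.0 * (theta[0] - 1.0), 4.0 * (theta[1] + 3.0)]


def step(theta, direction, alpha):
    """Move theta by alpha times the given direction vector."""
    return [t + alpha * d for t, d in zip(theta, direction)]


theta = [4.0, 1.0]
g = gradient(theta)
alpha = 0.05

downhill = step(theta, [-gi for gi in g], alpha)  # negative gradient direction
uphill = step(theta, g, alpha)                    # gradient direction

print("start:   ", loss(theta))
print("downhill:", loss(downhill))  # lower than the starting loss
print("uphill:  ", loss(uphill))    # higher than the starting loss
```

With a step size this small, moving against the gradient lowers the loss and moving along it raises the loss, which is the behavior the rolling-downhill picture describes.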
The Learning Rate: A Critical Hyperparameter
The learning rate α controls how large each step is. It is not learned from data; you set it before training begins. It is a hyperparameter.
Too large: if α is too large, each step overshoots the minimum. The loss bounces around or even increases. Imagine a ball bouncing back and forth across a valley, never settling.
Too small: if α is too small, each step makes almost no progress. Training converges eventually, but requires an impractical number of steps. Imagine edging down a mountain one centimeter at a time.
Just right: a well-chosen learning rate takes steps large enough to make progress but small enough to converge stably. Common practice is to start with a moderate value and use a learning rate schedule, reducing α over time so the network takes large steps early (fast progress) and small steps late (fine-tuning).
Why not just use a very small learning rate and be safe? Because training on a large dataset with billions of parameters requires millions of update steps. Each step costs computation. A learning rate that is 10× too small means training takes 10× as long, often the difference between feasible and infeasible.
Minibatch Gradient Descent
Computing the true gradient requires the loss across the entire dataset, which is expensive for datasets with millions of examples. Stochastic Gradient Descent (SGD) uses one random example at a time, which is noisy but fast. In practice, almost all modern training uses minibatch gradient descent: compute the gradient on a random subset (batch) of examples, typically 32 to 512, and update weights after each batch. This balances computation cost and gradient accuracy.
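The minibatch procedure can be sketched on a toy problem. Everything here is an assumption made for illustration: a one-parameter model y ≈ w × x, a mean squared error loss, noise-free data generated from y = 3x, and a batch size of 4 (far below the typical 32 to 512, because the toy dataset has only 20 examples). The function name minibatch_gradient_descent is hypothetical.

```python
import random


def minibatch_gradient_descent(xs, ys, alpha=0.05, batch_size=4, epochs=50, seed=0):
    """Fit y ~ w * x by minibatch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # draw new random batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w*x - y)^2) over this batch with respect to w.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= alpha * grad  # update after every batch, not every epoch
    return w


# Toy data from y = 3x (noise-free, purely for illustration).
xs = [i / 10.0 for i in range(1, 21)]  # 0.1, 0.2, ..., 2.0
ys = [3.0 * x for x in xs]

w_fit = minibatch_gradient_descent(xs, ys)
print(w_fit)  # close to 3.0
```

Setting batch_size to 1 gives the one-example-at-a-time variant described above, and setting it to len(xs) gives the full-dataset gradient.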
Local Minima, Saddle Points, and Why They Matter Less Than You Think
A natural worry: what if gradient descent finds a local minimum — a valley that is not the deepest valley — and gets stuck there? For networks with only a few parameters, local minima are a real concern. For very high-dimensional networks with millions of parameters, research has found that true local minima are surprisingly rare. Most 'stuck' points are saddle points — positions where the loss is at a minimum in some dimensions but a maximum in others. Gradient descent with some randomness (from minibatch sampling) tends to escape saddle points. More importantly: in practice, most local minima in deep networks achieve similar loss values, and many of them generalize well to new data. The question 'did we find the global minimum?' turns out to be less important than 'did we find parameters that work well on unseen data?' The latter is what validation sets measure.
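A saddle point is easy to see in two dimensions. The sketch below uses the textbook function f(x, y) = x² – y², which is an assumption chosen for illustration and not a neural network loss: it is a minimum along x and a maximum along y at the origin. A tiny fixed offset in y stands in for the randomness that minibatch sampling would provide.

```python
def saddle_loss(x, y):
    """Toy surface: curves up along x, down along y, saddle at (0, 0)."""
    return x * x - y * y


def saddle_grad(x, y):
    """Gradient of the toy surface; it is exactly zero at the origin."""
    return 2.0 * x, -2.0 * y


def descend(x, y, alpha=0.1, steps=100):
    """Plain gradient descent on the toy surface."""
    for _ in range(steps):
        gx, gy = saddle_grad(x, y)
        x, y = x - alpha * gx, y - alpha * gy
    return x, y


# Starting exactly on the x-axis, descent slides into the saddle and stalls.
stuck = descend(1.0, 0.0)

# A tiny offset in y (a stand-in for minibatch noise) grows every step.
escaped = descend(1.0, 1e-6)

print("no offset:  ", stuck, saddle_loss(*stuck))
print("with offset:", escaped, saddle_loss(*escaped))
```

This toy surface has no lowest point, so the escaped run keeps descending indefinitely. The point is only that the saddle traps a perfectly aligned start and releases a slightly perturbed one.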
Gradient descent minimizes the loss function you specify — nothing more. If the loss does not capture the full nuance of the task (fairness, robustness, safety), the optimized model will not capture it either. This is a fundamental limitation: the algorithm is a mathematically rigorous optimizer, but it has no concept of what 'good' means beyond the loss signal you give it.
Flashcards
A weight currently has value w = 5.0. The gradient dL/dw = –3.0 and the learning rate is 0.2. What is the updated weight?
What most likely happens when the learning rate is set too high?
Gradient Descent by Hand on a Parabola
- Consider the loss function L(w) = (w – 2)², which has its minimum at w = 2.
- Start at w = 6.0 and use learning rate α = 0.3.
- Compute the gradient: dL/dw = 2(w – 2). At w = 6.0: dL/dw = 2(6.0 – 2) = 8.0.
- Update: w_new = 6.0 – 0.3 × 8.0 = 6.0 – 2.4 = 3.6.
- Repeat from the new w. Do five complete steps, recording w and L(w) = (w–2)² after each step.
- Draw a rough graph of L(w) vs. step number. Does it decrease monotonically?
- Now redo the simulation with α = 1.2 (too large). Record what happens to w and L across five steps. Describe in one sentence what you observe.
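Once you have done the steps by hand, a short script can check your numbers. This is a minimal sketch of the exercise above; the function name run_gradient_descent is hypothetical, and the script prints the values instead of plotting them.

```python
def loss(w):
    """L(w) = (w - 2)^2, with its minimum at w = 2."""
    return (w - 2.0) ** 2


def grad(w):
    """dL/dw = 2(w - 2)."""
    return 2.0 * (w - 2.0)


def run_gradient_descent(w_start, alpha, num_steps):
    """Return a list of (step, w, L(w)) tuples, beginning with step 0."""
    w = w_start
    history = [(0, w, loss(w))]
    for step in range(1, num_steps + 1):
        w = w - alpha * grad(w)
        history.append((step, w, loss(w)))
    return history


for alpha in (0.3, 1.2):
    print(f"alpha = {alpha}")
    for step, w, current_loss in run_gradient_descent(6.0, alpha, 5):
        print(f"  step {step}: w = {w:.5f}, L = {current_loss:.5f}")
```

Each update multiplies the distance from the minimum, w – 2, by (1 – 2α). With α = 0.3 that factor is 0.4, so the distance shrinks at every step. With α = 1.2 it is –1.4, so w jumps to the other side of the minimum and ends up farther away each time.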