Optimization as Hill-Descending
You know what a loss function is and why minimizing it is the objective of training. What remains is the how: given a loss landscape that might have millions of dimensions, how does the training algorithm actually find a low point? The answer is gradient descent — a deceptively simple algorithm that, with variations, underlies virtually all modern machine learning. Understanding it precisely gives you insight into every training run you will ever observe.
The Gradient: Direction of Steepest Ascent
The gradient of the loss function with respect to the parameters is a vector that points in the direction of steepest increase in loss. Formally, for a function L of parameters (w1, w2, ..., wk), the gradient is: grad L = (dL/dw1, dL/dw2, ..., dL/dwk) Each component dL/dwi is the partial derivative — the rate at which the loss changes as wi is nudged slightly while all other parameters are held fixed. If the gradient at the current parameters is (3.2, -1.1, 0.7), it means: increasing w1 slightly will increase loss by approximately 3.2 units per unit change; increasing w2 will decrease loss by 1.1 units per unit change; increasing w3 will increase loss by 0.7 units per unit change. To go downhill — to reduce loss — move in the direction opposite to the gradient. This is gradient descent. The gradient descent update rule for each parameter wi: wi_new = wi_old - learning_rate * dL/dwi The learning rate (often written as eta or alpha) is a small positive scalar — typically between 0.0001 and 0.1 — that controls how large a step is taken in the downhill direction. It is not learned from data; it is set by the practitioner and is therefore called a hyperparameter.
wi_new = wi_old minus (learning_rate times dL/dwi). Move each parameter in the direction that decreases loss, by an amount proportional to how steeply the loss rises in that parameter's direction. Repeat until loss stops decreasing meaningfully.
A one-dimensional worked example: Suppose the loss is L(w) = (w - 3)^2 — a parabola with minimum at w = 3. Starting at w = 0: dL/dw = 2(w - 3). At w = 0: gradient = 2(0-3) = -6. Update: w = 0 - 0.1 * (-6) = 0 + 0.6 = 0.6. At w = 0.6: gradient = 2(0.6-3) = 2(-2.4) = -4.8. Update: w = 0.6 - 0.1 * (-4.8) = 0.6 + 0.48 = 1.08. Each step moves w toward 3, converging on the minimum. With learning rate 0.1 and this loss function, the algorithm converges smoothly. If learning rate were 1.0, the update at w=0 would be w = 0 - 1.0 * (-6) = 6 — overshooting the minimum. If learning rate were 2.0, the algorithm would diverge: w = 0 - 2.0 * (-6) = 12, then 12 - 2.0 * 2(12-3) = 12 - 36 = -24, oscillating away from the minimum. Learning rate selection is therefore not trivial. Too small: training is extremely slow. Too large: training diverges or oscillates. A common heuristic is to start with a moderate learning rate and reduce it if training is unstable.
Complete the gradient descent update rule.
Stochastic and Mini-Batch Gradient Descent
In the formulation above, the gradient is computed over the entire training dataset — every single example — before one update step is taken. This is called batch gradient descent. For datasets with millions of examples, computing the full gradient before every step is impractically slow. The solution is stochastic gradient descent (SGD): estimate the gradient using a single randomly selected example, or more practically, a mini-batch of examples (typically 32, 64, or 256 examples). The mini-batch gradient is a noisy but unbiased estimate of the true gradient over the full dataset. SGD has an important side effect: the noise in mini-batch gradient estimates causes the parameter trajectory to fluctuate. This is not purely a problem — the noise helps the optimizer escape shallow local minima and saddle points, which would trap a noiseless optimizer. In complex loss landscapes, SGD's noise is often beneficial. Modern optimizers (Adam, AdamW, RMSProp) refine SGD by adapting the learning rate separately for each parameter based on the history of its gradients. Adam, the most common, maintains running estimates of the first and second moments of each parameter's gradient and uses these to normalize updates. The practical effect is that Adam often converges faster and more reliably than vanilla SGD, especially in early training. Backpropagation is the algorithm that makes gradient computation tractable for neural networks with millions of parameters. It applies the chain rule of calculus layer by layer, efficiently computing the gradient of the loss with respect to every parameter in O(number of parameters) time — the same order as a single forward pass. Without backpropagation, modern deep learning would be computationally impossible.
Mini-batch gradient descent adds noise to the optimization trajectory. In practice, this noise often helps find better solutions by preventing the optimizer from getting stuck in sharp minima that do not generalize well. The interaction between SGD noise and generalization is an active research topic with practical implications for batch size selection.
A training run shows that loss decreases steadily for 10 steps, then suddenly explodes to a very large number and stays there. The most likely cause is:
Why does computing gradients on mini-batches of 64 examples rather than all 1,000,000 training examples work well in practice?
Hand-Simulate Gradient Descent
- You will manually run gradient descent on a simple loss function to build visceral intuition.
- Loss function: L(w) = w^2 - 4w + 5 (minimum at w = 2, where L = 1)
- Gradient: dL/dw = 2w - 4
- Starting point: w = 0
- Learning rate: 0.3
- Step 1: Compute L(0) and dL/dw at w=0.
- Step 2: Apply the update: w_new = w - 0.3 * gradient.
- Step 3: Record (step, w, L(w)) in a table.
- Step 4: Repeat for 8 total steps.
- Step 5: Plot w vs. step number and L(w) vs. step number on a graph.
- Extension A: Repeat with learning rate 1.5. What happens? Why?
- Extension B: Repeat with learning rate 0.05. How many steps to get within 0.01 of the minimum?
- Discuss: what does this tell you about the relationship between learning rate, convergence speed, and stability?