Learning From Mistakes
You now know that every model starts by making predictions, and that we can measure exactly how wrong those predictions are. But measuring error is only half the story. The other half, and the part that makes machine learning remarkable, is what happens next: the model adjusts itself to do better. That adjustment process is called learning, and it is the engine inside every neural network, every gradient-boosted tree model, and nearly every other model you will use.
Weights and the Direction of Change
Recall from Lesson 1 that a model's predictions come from weights: numbers that scale the influence of each input feature. When the model makes an error, the question is which weights caused the error, and in which direction each weight should shift to reduce it.
Imagine a dial on a stereo. If the music is too quiet, you turn the dial up. If it is too loud, you turn it down. Training a model is like adjusting hundreds or millions of dials simultaneously, each one nudged in the direction that makes the total error go down.
The algorithm that figures out which direction to nudge each weight is called gradient descent. 'Gradient' means slope: the algorithm computes how steeply the loss changes as each weight changes. A positive gradient means that increasing the weight would raise the loss, so the algorithm decreases that weight. A negative gradient means that increasing the weight would lower the loss, so the algorithm increases it. The steeper the slope, the larger the nudge.
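The sign rule above can be checked numerically. The following is a minimal Python sketch, not part of the lesson's own material: the one-weight loss (w - 3)^2 and the helper names `estimate_slope` and `nudge_direction` are assumptions chosen for illustration. It estimates the slope by moving the weight a tiny amount each way and seeing how the loss responds.

```python
def loss(w):
    # Hypothetical one-weight loss: smallest when w equals 3.
    return (w - 3.0) ** 2


def estimate_slope(f, w, eps=1e-6):
    # Central difference: how much f changes per unit change in w.
    return (f(w + eps) - f(w - eps)) / (2 * eps)


def nudge_direction(gradient):
    # Move against the slope: positive slope -> decrease, negative -> increase.
    if gradient > 0:
        return "decrease"
    if gradient < 0:
        return "increase"
    return "leave unchanged"


for w in (5.0, 1.0):
    g = estimate_slope(loss, w)
    print(f"w={w}: slope is about {g:.2f}, so {nudge_direction(g)} the weight")
```

At w = 5 the weight sits above the best value, the slope is positive, and the rule says to decrease it. At w = 1 the slope is negative and the rule says to increase it.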
Gradient descent is the algorithm that adjusts model weights to reduce loss. It works by computing the gradient (slope) of the loss with respect to each weight, then nudging each weight in the opposite direction of the slope. Repeated many times, this walks the model toward better and better predictions.
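As a rough illustration of that loop, here is a small Python sketch. It assumes a model with a single weight (prediction = w * x), mean squared error as the loss, and made-up data in which the true relationship is y = 3x. The function names and the default learning rate of 0.05 are illustrative choices, not values from this course.

```python
def loss(w, xs, ys):
    # Mean squared error of the one-weight model: prediction = w * x.
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w, xs, ys):
    # Slope of the loss with respect to w (derivative of the formula above).
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, w=0.0, learning_rate=0.05, steps=200):
    history = []
    for _ in range(steps):
        history.append(loss(w, xs, ys))
        # Nudge the weight in the opposite direction of the slope.
        w = w - learning_rate * gradient(w, xs, ys)
    return w, history


# Made-up data where the best weight is 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w_final, history = gradient_descent(xs, ys)
print("learned weight:", round(w_final, 4))
print("first loss:", history[0], "last loss:", history[-1])
```

Starting from w = 0, each pass through the loop moves the weight closer to 3 and the recorded loss shrinks, which is the "walking toward better predictions" described above.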
Here is a concrete analogy. Imagine you are blindfolded on a hilly landscape and your goal is to reach the lowest valley. You cannot see the whole map, but you can feel which direction the ground slopes downward under your feet. You take a small step downhill. Then you feel again and take another step downhill. Repeat, and eventually you reach a low point, though not necessarily the lowest one on the whole map. Gradient descent works the same way, except the 'landscape' is the loss function and your 'position' is the set of all weights. Each step is one update to the weights, and the weights you hold when you settle at a low point, where the loss is small, are what we call the trained model.
The size of each step is controlled by a setting called the learning rate. Too large a step and you overshoot the valley and bounce around. Too small a step and training takes far longer than it needs to. Choosing a good learning rate is one of the tuning challenges you will meet in Lesson 8.
Backpropagation: Spreading the Credit
For simple models with only a few weights, computing gradients is straightforward. For deep neural networks with millions of weights, working out how much each weight contributed to the error is a bigger challenge. The algorithm that solves this is called backpropagation (often shortened to backprop). Backpropagation works backward through the network, from the output (where the error is known) toward the input, applying the chain rule of calculus layer by layer. Each weight receives a gradient: a number that says 'this is how strongly, and in which direction, you affected the error.' Gradient descent then combines that number with the learning rate to decide how far to move the weight. You do not need to work through the calculus here. What matters is that backprop is how neural networks assign blame for an error across every single weight, so each one can be adjusted.
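To make the backward pass concrete, here is a deliberately tiny Python sketch. It assumes a two-layer "network" with one weight per layer and no activation function (hidden = w1 * x, prediction = w2 * hidden) and squared error as the loss. Real networks have many weights per layer and nonlinear activations, and all names and numbers here are hypothetical, but the backward flow of blame has the same shape.

```python
def forward(x, w1, w2):
    hidden = w1 * x        # layer 1
    y_hat = w2 * hidden    # layer 2 (the prediction)
    return hidden, y_hat


def squared_error(x, y, w1, w2):
    _, y_hat = forward(x, w1, w2)
    return (y_hat - y) ** 2


def backward(x, y, w1, w2):
    hidden, y_hat = forward(x, w1, w2)
    # Start at the output, where the error is known.
    d_loss_d_yhat = 2 * (y_hat - y)
    # Chain rule through layer 2: blame for w2, and blame passed back to hidden.
    d_loss_d_w2 = d_loss_d_yhat * hidden
    d_loss_d_hidden = d_loss_d_yhat * w2
    # Chain rule through layer 1: blame for w1.
    d_loss_d_w1 = d_loss_d_hidden * x
    return d_loss_d_w1, d_loss_d_w2


grad_w1, grad_w2 = backward(x=2.0, y=10.0, w1=1.5, w2=2.0)
print("gradient for w1:", grad_w1)
print("gradient for w2:", grad_w2)
```

With these numbers the prediction (6) is below the target (10), so both gradients come out negative, and gradient descent would increase both weights.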
If training loss decreases smoothly over time, the learning rate is probably good. If loss jumps up and down wildly or keeps growing, the learning rate is probably too large. If loss barely moves after many rounds, the learning rate may be too small. You will tune this in Lesson 8.
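These three patterns can be reproduced on a toy problem. The Python sketch below assumes a one-weight loss, (w - 3)^2, and three learning rates picked only to show each behavior; none of the numbers come from this course. On a loss this simple, a step that is too large shows up as the weight overshooting the minimum by more each time, so the loss keeps growing instead of producing the noisy curve you would see with real data.

```python
def train(learning_rate, steps=20, w=0.0):
    # Record the loss before training and after every update.
    losses = [(w - 3.0) ** 2]
    for _ in range(steps):
        gradient = 2 * (w - 3.0)           # slope of (w - 3)^2
        w = w - learning_rate * gradient   # one gradient descent step
        losses.append((w - 3.0) ** 2)
    return losses


# Hypothetical learning rates chosen to show each pattern.
settings = [("reasonable", 0.3), ("too large", 1.1), ("too small", 0.001)]

for label, lr in settings:
    losses = train(lr)
    print(f"{label:>10} (lr={lr}): start={losses[0]:.3f}, end={losses[-1]:.3g}")
```

Comparing the start and end losses for each setting gives a quick version of the diagnosis above: a sharp drop, a blow-up, or almost no movement.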
Fill in the blanks to complete the gradient descent description.
What is the learning rate in gradient descent?
What problem does backpropagation solve?
Dial Tuning Simulation
- Step 1: On paper, draw a simple table with three columns: Weight Name, Current Value, and Gradient Direction (Up or Down).
- Step 2: Invent a tiny model with three weights (e.g., w1=2, w2=5, w3=1) that predicts whether a student passes a test based on hours studied, sleep hours, and practice problems.
- Step 3: Invent a prediction error: the model over-predicted (said 'pass' too confidently). Decide which weights likely made the prediction too high.
- Step 4: Assign a gradient direction (Up = weight should increase, Down = weight should decrease) to each of the three weights.
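- Step 5: Apply a 'learning rate' of 0.1: shift each weight by 0.1 in the chosen direction, and write down the new weight values. (This exercise treats every gradient as having size 1. Real gradient descent shifts each weight by the learning rate times its gradient, so steeper slopes produce bigger shifts.)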
- Step 6: Reflect: if you repeated this process 1,000 times with real data, what do you think would happen to the loss?
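If you want to check your paper table, the steps above can be sketched in Python. The starting weights come from Step 2. The directions are an assumption: this sketch marks all three weights 'Down' on the guess that each one pushed the over-confident prediction higher, and your own answer to Step 3 may differ. The function name `apply_step` is made up for this example.

```python
# Weights from Step 2 of the activity.
weights = {"w1": 2.0, "w2": 5.0, "w3": 1.0}

# Assumed answer to Steps 3 and 4: every weight pushed the prediction too high.
directions = {"w1": "Down", "w2": "Down", "w3": "Down"}


def apply_step(weights, directions, learning_rate=0.1):
    # Up means add the learning rate, Down means subtract it.
    signs = {"Up": 1.0, "Down": -1.0}
    return {
        name: value + signs[directions[name]] * learning_rate
        for name, value in weights.items()
    }


new_weights = apply_step(weights, directions)

for name in weights:
    print(f"{name}: {weights[name]} -> {round(new_weights[name], 2)} ({directions[name]})")
```

Change any entry in `directions` to 'Up' to match your own table. Calling `apply_step` repeatedly on its own output mimics the repetition that Step 6 asks you to think about, although real training would recompute the directions from the data after every step.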