The Forward Pass, Formally
You now know what a single neuron computes. A full neural network organizes many neurons into layers, with each layer passing its outputs as inputs to the next. The forward pass is the complete computation from network input to network output: a structured cascade of matrix multiplications and activation functions. In this lesson we write down the general matrix equations and trace a full numerical example through a small network.
Network Architecture and Notation
Define a fully connected (dense) network with:
- Input layer: 3 features (x is a 3-dimensional vector)
- Hidden layer 1: 2 neurons
- Hidden layer 2: 2 neurons
- Output layer: 1 neuron (binary classification)
For layer l, define:
- W^(l) = weight matrix (rows = neurons in layer l, columns = neurons in layer l-1)
- b^(l) = bias vector (one entry per neuron in layer l)
- h^(l) = activation vector (output of layer l)
- h^(0) = x = input vector
The forward pass for layer l:
z^(l) = W^(l) h^(l-1) + b^(l)
h^(l) = sigma(z^(l)) [applied element-wise]
Here sigma stands for the activation function of layer l. In the worked example that follows, the two hidden layers use ReLU and the output layer uses the sigmoid.
Note: W^(1) has shape 2x3 (2 hidden neurons, 3 inputs). W^(2) has shape 2x2. W^(3) has shape 1x2.
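The two layer equations above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the names dense_layer and relu are hypothetical helpers, the numeric values are made up for the demo rather than taken from the lesson, and lists are used in place of an array library to keep the sketch dependency-free.

```python
def relu(z):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in z]


def dense_layer(W, b, h_prev, activation):
    """Compute z = W h_prev + b and h = activation(z) for one layer.

    W is a list of rows (one row per neuron in this layer),
    b has one entry per neuron, and h_prev is the previous layer's output.
    """
    z = [sum(w * h for w, h in zip(row, h_prev)) + bias
         for row, bias in zip(W, b)]
    return z, activation(z)


# Illustrative values only (not from the lesson): 3 inputs, 2 neurons.
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.0, -1.0]
x = [2.0, 1.0, 3.0]

z, h = dense_layer(W, b, x, relu)
print(z)  # [-1.0, 2.0]
print(h)  # [0.0, 2.0]
```

Each row of W produces one entry of z, so the number of rows sets the layer's output size and the number of columns must match the length of h_prev.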
Writing one layer as z^(l) = W^(l) h^(l-1) + b^(l) computes all neurons in that layer at once as a single matrix-vector product: row i of W^(l) holds the weights of neuron i, so entry i of z^(l) is that neuron's weighted sum plus its bias. This is why neural networks can be implemented efficiently on GPUs: the forward pass is largely a sequence of dense matrix operations, which modern hardware is optimized to perform in parallel.
Concrete example. Set x = [1.0, 0.5, -1.0]^T.
Layer 1 weights and biases:
W^(1) = [[0.2, 0.8, -0.5], [0.6, -0.3, 1.0]]
b^(1) = [0.1, -0.2]^T
Compute z^(1):
z^(1)[1] = (0.2)(1.0) + (0.8)(0.5) + (-0.5)(-1.0) + 0.1 = 0.2 + 0.4 + 0.5 + 0.1 = 1.2
z^(1)[2] = (0.6)(1.0) + (-0.3)(0.5) + (1.0)(-1.0) + (-0.2) = 0.6 - 0.15 - 1.0 - 0.2 = -0.75
Apply ReLU: h^(1) = [max(0, 1.2), max(0, -0.75)] = [1.2, 0.0]
Layer 2 weights and biases:
W^(2) = [[0.5, -1.0], [-0.3, 0.7]]
b^(2) = [0.0, 0.5]^T
Compute z^(2):
z^(2)[1] = (0.5)(1.2) + (-1.0)(0.0) + 0.0 = 0.6
z^(2)[2] = (-0.3)(1.2) + (0.7)(0.0) + 0.5 = -0.36 + 0.5 = 0.14
Apply ReLU: h^(2) = [0.6, 0.14]
Output layer weights and biases:
W^(3) = [[1.2, -0.8]]
b^(3) = [0.1]
Compute z^(3):
z^(3) = (1.2)(0.6) + (-0.8)(0.14) + 0.1 = 0.72 - 0.112 + 0.1 = 0.708
Apply sigmoid: h^(3) = 1/(1+e^(-0.708)) ≈ 1/(1+0.493) ≈ 0.670
The network predicts probability 0.670 that this example belongs to the positive class.
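To double-check the hand arithmetic, the same example can be run as a short script. The weights, biases, and input are the ones given above; the helper names affine, relu, and sigmoid are assumptions introduced for this sketch, and plain lists stand in for an array library.

```python
import math


def affine(W, b, h):
    """Return z = W h + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, h)) + bias
            for row, bias in zip(W, b)]


def relu(z):
    return [max(0.0, v) for v in z]


def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


x = [1.0, 0.5, -1.0]
W1, b1 = [[0.2, 0.8, -0.5], [0.6, -0.3, 1.0]], [0.1, -0.2]
W2, b2 = [[0.5, -1.0], [-0.3, 0.7]], [0.0, 0.5]
W3, b3 = [[1.2, -0.8]], [0.1]

z1 = affine(W1, b1, x)
h1 = relu(z1)      # hidden layer 1
z2 = affine(W2, b2, h1)
h2 = relu(z2)      # hidden layer 2
z3 = affine(W3, b3, h2)
h3 = sigmoid(z3)   # output probability

# Floating-point results are formatted to three decimals for comparison.
print(" ".join(f"{v:.3f}" for v in z1))  # 1.200 -0.750
print(" ".join(f"{v:.3f}" for v in h2))  # 0.600 0.140
print(f"{z3[0]:.3f}")                    # 0.708
print(f"{h3[0]:.3f}")                    # 0.670
```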
Parameters and Computation
How many total parameters does our small network have?
- Layer 1: W^(1) is 2x3 = 6 weights; b^(1) is 2 values. Total: 8.
- Layer 2: W^(2) is 2x2 = 4 weights; b^(2) is 2 values. Total: 6.
- Layer 3: W^(3) is 1x2 = 2 weights; b^(3) is 1 value. Total: 3.
Grand total: 17 parameters.
A real network like GPT-2 small has about 124 million parameters (originally reported as 117 million). A modern large language model may have hundreds of billions. Such models add other components around their dense layers, but each dense layer still computes W^(l) h^(l-1) + b^(l), only with far larger layer sizes. The conceptual machinery you learned here scales directly.
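The count above follows a simple rule: a layer with n neurons fed by m inputs contributes n x m weights plus n biases. A minimal sketch of that rule, assuming a hypothetical helper count_parameters that takes the layer widths with the input first:

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of each layer, input first, e.g. [3, 2, 2, 1].
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_out * n_in + n_out  # weights plus biases for this layer
    return total


print(count_parameters([3, 2, 2, 1]))  # 17
```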
Match each symbol to what it represents in a layer's forward pass.
Dimension errors are among the most common bugs when implementing a forward pass. Before writing any code, work out the shapes: if layer l has n neurons and layer l-1 has m neurons, then W^(l) is (n x m), h^(l-1) is (m x 1), b^(l) is (n x 1), and z^(l) = W^(l) h^(l-1) + b^(l) is (n x 1). Track shapes at every step.
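One way to apply this advice is to validate every shape before doing any arithmetic. The sketch below assumes weights are stored as lists of rows; check_shapes and zeros are hypothetical helpers written for this illustration, not part of any library.

```python
def zeros(rows, cols):
    """Build a rows x cols matrix of zeros as a list of rows."""
    return [[0.0] * cols for _ in range(rows)]


def check_shapes(layer_sizes, weights, biases):
    """Raise ValueError if any W^(l) or b^(l) does not match layer_sizes."""
    n_layers = len(layer_sizes) - 1
    if len(weights) != n_layers or len(biases) != n_layers:
        raise ValueError(f"expected {n_layers} weight matrices and bias vectors")
    for l, (m, n) in enumerate(zip(layer_sizes, layer_sizes[1:]), start=1):
        W, b = weights[l - 1], biases[l - 1]
        # W^(l) must have n rows (neurons) and m columns (inputs).
        if len(W) != n or any(len(row) != m for row in W):
            raise ValueError(f"W^({l}) should be {n} x {m}")
        if len(b) != n:
            raise ValueError(f"b^({l}) should have {n} entries")


layer_sizes = [3, 2, 2, 1]
weights = [zeros(2, 3), zeros(2, 2), zeros(1, 2)]
biases = [[0.0, 0.0], [0.0, 0.0], [0.0]]
check_shapes(layer_sizes, weights, biases)  # passes silently

# A transposed W^(1) (3 x 2 instead of 2 x 3) is caught immediately.
bad_weights = [zeros(3, 2), zeros(2, 2), zeros(1, 2)]
try:
    check_shapes(layer_sizes, bad_weights, biases)
except ValueError as err:
    print(err)  # W^(1) should be 2 x 3
```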
In the worked example, h^(1)[2] = 0.0 after ReLU. What caused this and what does it imply for layer 2's computation?
If you doubled the number of neurons in hidden layer 1 from 2 to 4, what would the new shapes of W^(1) and W^(2) be?
Trace Your Own Forward Pass
- Step 1: Design a tiny network: 2 inputs, 3 neurons in one hidden layer, 1 output neuron.
- Step 2: Choose small weight values and biases (use numbers between -1 and 1 for ease).
- Step 3: Choose an input vector x = [x1, x2].
- Step 4: Compute z^(1) (a 3-dimensional vector) by hand, showing each dot product.
- Step 5: Apply ReLU element-wise to get h^(1).
- Step 6: Compute z^(2) and apply sigmoid to get the output probability.
- Step 7: Interpret your result: what probability does the network assign to this input? What would make it predict 'yes' (probability > 0.5)?
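Once you have worked through the steps by hand, a short script can confirm your intermediate values. The numbers below are placeholders chosen only so the script runs; replace them with the weights, biases, and input you picked in Steps 2 and 3. The helper name affine is an assumption of this sketch.

```python
import math


def affine(W, b, h):
    """Return z = W h + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, h)) + bias
            for row, bias in zip(W, b)]


# Placeholder values: swap in your own choices from Steps 2 and 3.
x = [1.0, -1.0]                                   # 2 inputs
W1 = [[0.5, -0.5], [0.25, 0.75], [-1.0, 0.5]]     # 3 x 2
b1 = [0.0, 0.5, 0.25]
W2 = [[1.0, -0.5, 0.5]]                           # 1 x 3
b2 = [-0.25]

z1 = affine(W1, b1, x)               # Step 4
h1 = [max(0.0, v) for v in z1]       # Step 5: ReLU
z2 = affine(W2, b2, h1)[0]           # Step 6: pre-activation of the output
p = 1.0 / (1.0 + math.exp(-z2))      # Step 6: sigmoid

print("z1 =", z1)
print("h1 =", h1)
print("z2 =", z2)
print(f"p  = {p:.3f}")
```

Comparing each printed line with your handwritten value shows exactly which step an arithmetic slip occurred in.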