AI Foundations

From Perceptron to Network

In the mid-twentieth century, researchers asked a deceptively simple question: could a machine learn to classify things the way a neuron fires in response to input? One influential answer was the perceptron, a mathematical object so simple you can simulate it with pencil and paper, yet so important that it helped launch the field of artificial neural networks. The perceptron is more than a historical footnote; it is the foundation on which modern neural networks are built. This lesson traces the line from that single unit to the deep, layered networks that recognize your face in a photo and translate between languages in real time.

The Perceptron: A Single Trainable Unit

A perceptron takes several numerical inputs (call them x₁, x₂, …, xₙ), multiplies each by a corresponding weight w₁, w₂, …, wₙ, sums the results, and then produces an output by comparing that sum to a threshold θ. If the weighted sum exceeds the threshold, the perceptron outputs 1 (fires); otherwise it outputs 0 (stays silent).

Concrete example. Suppose a perceptron has two inputs and the following values:

x₁ = 0.8, w₁ = 0.5
x₂ = 0.3, w₂ = –1.0
threshold θ = 0.2

Weighted sum: (0.8 × 0.5) + (0.3 × –1.0) = 0.40 – 0.30 = 0.10

Because 0.10 < 0.20, the output is 0. Change x₁ to 1.0 and the sum becomes 0.50 – 0.30 = 0.20, which equals the threshold but does not exceed it, so the output is still 0. (Some formulations fire when the sum is greater than or equal to θ; under that convention this input would produce 1.)

The weights and the threshold are the parameters. Training a perceptron means adjusting these parameters until the output matches the expected label for each training example. The learning rule is straightforward: when the perceptron makes an error, nudge each weight in the direction that would have produced the correct answer, by an amount proportional to the corresponding input value.
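The forward computation and the learning rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `predict` and `train`, the learning rate `lr`, the epoch cap, and the AND dataset are all assumptions introduced for the example. It also assumes the threshold is updated alongside the weights, moving in the opposite direction to the error.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]


def predict(x: Sequence[float], w: Sequence[float], theta: float) -> int:
    """Return 1 if the weighted sum strictly exceeds theta, otherwise 0."""
    total = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if total > theta else 0


def train(data: List[Example], lr: float = 0.1, epochs: int = 100):
    """Perceptron learning rule: adjust parameters only when a prediction is wrong."""
    n_inputs = len(data[0][0])
    w = [0.0] * n_inputs
    theta = 0.0
    for _ in range(epochs):
        errors = 0
        for x, label in data:
            err = label - predict(x, w, theta)  # +1, 0, or -1
            if err != 0:
                errors += 1
                # Nudge each weight in proportion to its input value.
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                # Assumption: lower the threshold when the unit should have fired,
                # raise it when the unit fired by mistake.
                theta -= lr * err
        if errors == 0:  # every example classified correctly
            break
    return w, theta


# The worked example from the text: the sum is 0.10, below theta = 0.2, so the output is 0.
print(predict((0.8, 0.3), (0.5, -1.0), 0.2))

# Hypothetical training set: logical AND, which is linearly separable.
AND_DATA: List[Example] = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, theta = train(AND_DATA)
print([predict(x, w, theta) for x, _ in AND_DATA])
```

Updating the threshold together with the weights is one common convention; an equivalent formulation folds θ into a bias term attached to a constant input of 1.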

What the Perceptron Computes

A perceptron draws a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions) through the input space and classifies each input as being on one side or the other. This geometric insight is crucial: it means a single perceptron can only separate classes that are linearly separable.
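For two inputs, the boundary can be written out explicitly. The short derivation below uses only the symbols already defined (w₁, w₂, θ) and assumes w₂ ≠ 0 so the line can be solved for x₂; the numeric case reuses the weights and threshold from the worked example.

```latex
% Decision boundary of a two-input perceptron: the set of points where
% the weighted sum equals the threshold.
\[
  w_1 x_1 + w_2 x_2 = \theta
  \quad\Longleftrightarrow\quad
  x_2 = \frac{\theta - w_1 x_1}{w_2}
  \qquad (w_2 \neq 0)
\]

% With w_1 = 0.5, w_2 = -1.0, \theta = 0.2 from the worked example:
\[
  x_2 = \frac{0.2 - 0.5\,x_1}{-1.0} = 0.5\,x_1 - 0.2
\]
```

The unit fires when 0.5x₁ – x₂ > 0.2, that is, for points below this line. At x₁ = 0.8 the line sits at x₂ = 0.2, and the example input (0.8, 0.3) lies above it, which matches the output of 0 computed earlier.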

Rosenblatt's perceptron convergence theorem (1962) proved that if a dataset is linearly separable, the perceptron learning rule is guaranteed to find a correct set of weights in a finite number of steps. This was a remarkable early result. But Minsky and Papert (1969) dealt a critical blow: they showed that a single perceptron cannot learn the XOR function — the pattern where an output is 1 only when its two binary inputs differ (1,0 → 1; 0,1 → 1; 0,0 → 0; 1,1 → 0). If you plot the four XOR inputs on a grid, no single straight line separates the 1-outputs from the 0-outputs. This is not a limitation of the learning rule; it is a fundamental geometric constraint. The perceptron is powerful within its domain and powerless outside it.
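The geometric argument can also be checked algebraically. Assuming the strict rule used in this lesson (output 1 only when the weighted sum exceeds θ), the four XOR cases impose the following constraints on a two-input perceptron:

```latex
\begin{align*}
(0,0) \mapsto 0 &: \quad 0 \le \theta \\
(1,0) \mapsto 1 &: \quad w_1 > \theta \\
(0,1) \mapsto 1 &: \quad w_2 > \theta \\
(1,1) \mapsto 0 &: \quad w_1 + w_2 \le \theta
\end{align*}

% Adding the second and third constraints and using \theta \ge 0:
\[
  w_1 + w_2 > 2\theta \ge \theta
\]
```

This contradicts the fourth constraint, so no choice of w₁, w₂, and θ satisfies all four cases at once.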

Why Connecting Units Solves the Problem

The solution to XOR is straightforward once you think geometrically: if one line cannot separate the data, use two lines and combine them. A second perceptron can draw a different boundary; a third perceptron (connected to the first two) can then combine their outputs to draw a more complex decision region.

This is the key intuition behind the multilayer perceptron (MLP): stack units into layers, and connect the output of each unit to the input of the units in the next layer. Now the network as a whole can represent functions that no single unit could express. The first layer detects simple features; subsequent layers combine those features into increasingly abstract patterns. By the final layer, the network can represent highly complex, nonlinear decision boundaries.

The word 'deep' in deep learning refers simply to having many such layers. A network with two hidden layers is deeper than one with one; modern large models can have dozens to hundreds. Depth is not an end in itself. It is a structural choice that lets a network learn hierarchical representations, which happen to be extraordinarily effective for the kinds of data humans care about: images, text, and speech.
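The two-lines-then-combine idea for XOR can be sketched with three threshold units whose weights are set by hand rather than learned. The specific weights and thresholds below, and the names `step` and `xor_network`, are illustrative assumptions; many other choices work equally well.

```python
def step(total: float, theta: float) -> int:
    """Threshold unit: fire (1) only if the weighted sum exceeds theta."""
    return 1 if total > theta else 0


def xor_network(a: int, b: int) -> int:
    """Two hidden units draw two lines; one output unit combines them."""
    # Hidden unit 1: the line a + b = 0.5 (fires for everything except (0,0)).
    h_or = step(1.0 * a + 1.0 * b, 0.5)
    # Hidden unit 2: the line a + b = 1.5 (fires only for (1,1)).
    h_and = step(1.0 * a + 1.0 * b, 1.5)
    # Output unit: fires when the first line is crossed but the second is not.
    return step(1.0 * h_or - 1.0 * h_and, 0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_network(a, b))
# 0 0 -> 0
# 0 1 -> 1
# 1 0 -> 1
# 1 1 -> 0
```

The two hidden units carve out the band between the lines a + b = 0.5 and a + b = 1.5, which contains exactly the two label-1 points. The output unit only has to separate the hidden-layer outputs, and those are linearly separable.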

Depth vs. Width

Wider layers (more units per layer) and deeper networks (more layers) both increase capacity, but they do it differently. Width adds parallel feature detectors; depth composes features into higher-level abstractions. In practice, most successful architectures use both. The interaction between depth, width, and the training procedure is an active research area — there is no simple rule that determines the right shape for a given problem.
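One concrete, if crude, way to compare shapes is to count trainable parameters. The sketch below assumes fully connected layers with one threshold (bias) per unit; the helper name `count_parameters` and the two example shapes are hypothetical. Parameter count is only a rough proxy for capacity and says nothing about which shape trains better.

```python
from typing import Sequence


def count_parameters(layer_sizes: Sequence[int]) -> int:
    """Weights plus one threshold per unit, for fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


wide = [2, 64, 1]       # one wide hidden layer
deep = [2, 8, 8, 8, 1]  # three narrow hidden layers

print(count_parameters(wide))  # 257
print(count_parameters(deep))  # 177
```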

Match each term to its correct description.

Terms

Perceptron
Weight
Threshold
Linear separability
Hidden layer

Definitions

The property of a dataset whose classes can be divided by a single hyperplane
An intermediate layer between inputs and outputs that learns internal representations
A learned numerical coefficient that scales one input's contribution
A single unit that computes a weighted sum of inputs and applies a threshold
The value the weighted sum must exceed for the unit to output 1

A single perceptron cannot learn XOR because:

What does the term 'deep' mean when someone says 'deep learning'?

XOR by Hand — Why One Line Fails

  1. Draw a 2-by-2 grid. Label the x-axis 'Input A' and the y-axis 'Input B', both ranging from 0 to 1.
  2. Plot the four XOR inputs as points: (0,0) → label 0, (0,1) → label 1, (1,0) → label 1, (1,1) → label 0. Use different symbols for label-0 and label-1 points.
  3. Try to draw a single straight line that separates all label-1 points from all label-0 points. Show why no such line exists.
  4. Now, draw TWO lines. Show how the intersection of the two half-planes they define correctly isolates the label-1 points.
  5. Reflect in one paragraph: what structural change to a perceptron-based system would let it implement both lines simultaneously? What does this tell you about why we need multiple layers?