Machine Learning & Deep Learning

Activation Functions

You established in the previous lesson that activation functions are necessary — without them, a deep network collapses into a single linear transformation. But which nonlinearity should you use, and does the choice matter? The answer is yes, substantially. Different activation functions have different behaviors, computational costs, and failure modes. This lesson studies three functions in depth: sigmoid, tanh, and ReLU, and explains why the field moved from the first two to the third.

Sigmoid and Tanh

The sigmoid function is defined as:

sigma(z) = 1 / (1 + e^(-z))

It maps any real number to the interval (0, 1). When z is very large and positive, sigma approaches 1. When z is very large and negative, sigma approaches 0. At z = 0, sigma(0) = 0.5.

Computed example: sigma(2) = 1 / (1 + e^(-2)) = 1 / (1 + 0.135) ≈ 0.881.

The hyperbolic tangent (tanh) is:

tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))

It maps any real number to (-1, 1). At z = 0, tanh(0) = 0.

Computed example: tanh(1) = (e - e^(-1)) / (e + e^(-1)) ≈ (2.718 - 0.368) / (2.718 + 0.368) ≈ 2.350 / 3.086 ≈ 0.762.

Both functions are smooth (infinitely differentiable) and squeeze their input into a bounded range. Historically they were the default choices. But both suffer from a critical failure mode called the vanishing gradient problem.
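
The two definitions above can be checked numerically. The following is a minimal sketch using only Python's standard `math` module; the function names `sigmoid` and `tanh` are chosen here for illustration, and the formulas are written out naively, which is fine for moderate z but can overflow for inputs of very large magnitude.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z: float) -> float:
    """Hyperbolic tangent written out from its definition; maps z into (-1, 1)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))


print(round(sigmoid(0.0), 3))  # 0.5
print(round(sigmoid(2.0), 3))  # 0.881
print(round(tanh(0.0), 3))     # 0.0
print(round(tanh(1.0), 3))     # 0.762
```

In practice one would likely call the built-in `math.tanh` rather than the hand-written version; it is spelled out here only to mirror the formula.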

The Vanishing Gradient Problem

The derivative of sigmoid is sigma'(z) = sigma(z) * (1 - sigma(z)). It peaks at 0.25 when z = 0 and shrinks toward zero as |z| grows: at z = 5 it is approximately 0.007, nearly zero. The derivative of tanh, 1 - tanh(z)^2, behaves the same way for large |z|. During backpropagation (covered in Lesson 7), the gradient reaching an early layer is a product of one such factor per layer, and a product of many numbers less than 1 approaches zero exponentially fast. Earlier layers receive almost no gradient signal and barely learn. This is why very deep networks trained with sigmoid or tanh often fail to converge.
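
To make the shrinking concrete, here is a small sketch that evaluates the sigmoid derivative using the standard identity sigma'(z) = sigma(z) * (1 - sigma(z)) and then multiplies 20 such factors together. It is a simplification: real backpropagation also multiplies by the layer weights, which this toy calculation ignores, and the choice of 20 layers is only an illustrative assumption.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    """Derivative of sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


print(round(sigmoid_grad(5.0), 4))  # 0.0066
print(sigmoid_grad(0.0))            # 0.25, the largest value the derivative takes

# Hypothetical best case: each of 20 layers contributes the maximum factor 0.25.
best_case = sigmoid_grad(0.0) ** 20
print(best_case < 1e-12)            # True
```

Even in this best case the product is below one trillionth, so saturated neurons (where the factor is closer to 0.007) make the situation far worse.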

ReLU (Rectified Linear Unit) is defined as:

ReLU(z) = max(0, z)

It is disarmingly simple: if z is positive, output z unchanged; if z is negative or zero, output 0.

Computed examples:

ReLU(3.5) = 3.5
ReLU(-1.2) = 0
ReLU(0) = 0

The gradient of ReLU is also simple:

dReLU/dz = 1 if z > 0, 0 if z <= 0

Strictly speaking, ReLU is not differentiable at z = 0; assigning a gradient of 0 there is a common convention. When z > 0, the gradient is exactly 1, so it does not shrink. This directly solves the vanishing gradient problem for positive activations: gradients flow backward through active ReLU neurons without being scaled down. This is the primary reason ReLU became the default activation function for hidden layers in modern deep networks around 2012 and has remained so since.
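
The definition and gradient above translate almost directly into code. This is a minimal sketch with illustrative function names (`relu`, `relu_grad`); the gradient at z = 0 is set to 0 by convention, as in the piecewise rule above.

```python
def relu(z: float) -> float:
    """ReLU(z) = max(0, z)."""
    return max(0.0, z)


def relu_grad(z: float) -> float:
    """1 for z > 0, otherwise 0 (the value at z = 0 is a convention)."""
    return 1.0 if z > 0 else 0.0


for z in (3.5, -1.2, 0.0):
    print(z, relu(z), relu_grad(z))
# 3.5 3.5 1.0
# -1.2 0.0 0.0
# 0.0 0.0 0.0

# Twenty active ReLU units in a row leave the gradient factor unchanged.
print(relu_grad(3.5) ** 20)  # 1.0
```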

Choosing and Using Activation Functions

In practice, hidden layers almost always use ReLU or a variant (Leaky ReLU, ELU, GELU). The output layer uses a different function chosen to match the prediction target:

- Binary classification (yes/no): sigmoid output, producing a probability in (0, 1)
- Multi-class classification (one of K classes): softmax, which produces K probabilities summing to 1
- Regression (predicting a real number): often no activation (linear output), so predictions are unbounded

Leaky ReLU addresses ReLU's own failure mode: the dying ReLU problem. If a neuron's pre-activation is always negative (because weights drift negative during training), its gradient is always zero and it never updates. Leaky ReLU solves this by using a small slope alpha (e.g., 0.01) for negative inputs:

LeakyReLU(z) = z if z > 0, alpha*z if z <= 0
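
As a sketch of two of the functions mentioned above, the snippet below implements Leaky ReLU with the example slope alpha = 0.01 and a softmax over a list of scores. The function names and the sample scores `[2.0, 1.0, 0.1]` are made up for illustration. Subtracting the maximum score before exponentiating is a common numerical-stability trick and is an addition here, not something the lesson prescribes; it does not change the result.

```python
import math
from typing import List


def leaky_relu(z: float, alpha: float = 0.01) -> float:
    """z for positive inputs, alpha * z otherwise."""
    return z if z > 0 else alpha * z


def leaky_relu_grad(z: float, alpha: float = 0.01) -> float:
    """Gradient is alpha (not 0) for non-positive inputs, so the neuron can still update."""
    return 1.0 if z > 0 else alpha


def softmax(scores: List[float]) -> List[float]:
    """Turn K real-valued scores into K probabilities that sum to 1."""
    m = max(scores)  # shift by the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(leaky_relu(3.0), leaky_relu(-2.0))  # 3.0 -0.02
print(leaky_relu_grad(-2.0))              # 0.01

probs = softmax([2.0, 1.0, 0.1])          # hypothetical scores for 3 classes
print([round(p, 3) for p in probs])       # [0.659, 0.242, 0.099]
print(round(sum(probs), 6))               # 1.0
```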

Complete the sentences using the correct technical terms.

The sigmoid function maps any real input to the interval ____, while tanh maps to ____. ReLU avoids the vanishing gradient problem because its gradient for positive inputs is exactly ____.

Why Nonlinearity Is the Point

The activation function is not a convenience — it is the source of a neural network's expressive power. Without it, the network is a linear model regardless of depth. With it, the network can approximate arbitrarily complex functions. The specific nonlinearity chosen shapes how well gradients flow and therefore how well the network learns.
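
A toy one-dimensional example can illustrate the collapse described above. The weights and biases below are arbitrary values picked for illustration. Two stacked linear layers give exactly the same outputs as a single linear layer, while inserting ReLU between them produces a function that is no longer a straight line.

```python
def linear(w: float, b: float, x: float) -> float:
    return w * x + b


def relu(z: float) -> float:
    return max(0.0, z)


# Arbitrary illustrative parameters for two one-unit layers.
w1, b1 = 2.0, -1.0
w2, b2 = -3.0, 0.5


def two_linear_layers(x: float) -> float:
    return linear(w2, b2, linear(w1, b1, x))


def collapsed(x: float) -> float:
    # The same map as ONE linear layer: weight w2*w1, bias w2*b1 + b2.
    return linear(w2 * w1, w2 * b1 + b2, x)


def with_relu(x: float) -> float:
    return linear(w2, b2, relu(linear(w1, b1, x)))


xs = [0.0, 1.0, 2.0]
print([two_linear_layers(x) for x in xs])  # [3.5, -2.5, -8.5]
print([collapsed(x) for x in xs])          # [3.5, -2.5, -8.5]
print([with_relu(x) for x in xs])          # [0.5, -2.5, -8.5]
```

The first two lists are identical and evenly spaced (a straight line). The third has a bend: the step from x = 0 to 1 is -3, while the step from x = 1 to 2 is -6.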

A neuron uses sigmoid activation and receives z = -10. Approximately what is its output, and what does this mean for learning?

Why is ReLU preferred over sigmoid for hidden layers in a 20-layer network?

Activation Function Graph

  1. On graph paper (or a blank page with hand-drawn axes), draw the x-axis from -4 to 4 and the y-axis from -1.5 to 1.5.
  2. Plot ReLU by computing and marking the points (-4,0), (-2,0), (0,0), (1,1), (2,2), (4,4). Note that you will need to extend the y-axis upward to 4.
  3. On the same axes, sketch sigmoid by marking (-4,≈0.02), (-2,≈0.12), (0,0.5), (2,≈0.88), (4,≈0.98).
  4. Mark the regions where sigmoid's slope is very shallow (near -4 and 4 on the x-axis). Label these 'saturation zones.'
  5. Explain in one sentence why the saturation zones are a problem during training. What does the flat slope correspond to mathematically?