Machine Learning & Deep Learning

⏱ About 20 min · 20 XP

Ensembles — The Wisdom of Many Models

A single expert's opinion can be wrong. A panel of independent experts, even if each is individually imperfect, tends to produce better decisions in aggregate. This insight, that a combination of diverse, imperfect models often outperforms any one of them, is the foundation of ensemble learning. Ensemble methods are a staple of winning machine learning competition entries and are among the most reliable approaches in production systems. Understanding why they work requires thinking carefully about what kinds of errors models make and how combining models cancels those errors.

Bagging: Reducing Variance by Averaging

Bagging (Bootstrap Aggregating, introduced by Leo Breiman in 1994) addresses the high-variance problem of unstable learners like decision trees. The key idea: if a model has high variance, training it on slightly different data produces a different model. If we train many such models and average their predictions, the average is more stable than any individual model, because the variance of an average of B independent variables with equal variance is 1/B of the variance of each variable.

The bagging procedure:

  1. Create B bootstrap samples. Each bootstrap sample is created by drawing N examples with replacement (allowing duplicates) from a training set of N examples. Each sample has the same size as the original training set but on average contains only about 63% of the distinct training examples (1 - 1/e ≈ 0.632 for large N), with the rest being duplicates.
  2. Train one model (e.g., a decision tree) on each bootstrap sample, independently.
  3. At prediction time: for classification, take a majority vote among the B models. For regression, take the mean of the B models' outputs.

A concrete example: Suppose you train 3 trees on 3 bootstrap samples. A new example gets predictions: Tree1=Spam, Tree2=Spam, Tree3=Not Spam. Majority vote: Spam. The ensemble predicts Spam because two of the three trees agree; a single tree's error is outvoted as long as most of the trees are right.

Out-of-bag error: Each bootstrap sample leaves out roughly 37% of the training data. These out-of-bag examples can be used to estimate generalization error without a separate validation set: each example is evaluated only by the models that did not train on it.
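The sampling and voting steps of bagging can be sketched in a few lines of plain Python. This is a minimal illustration, not a full implementation: no models are trained, the sizes `n = 1000` and `B = 25` are arbitrary choices, and the helper names `bootstrap_indices` and `majority_vote` are made up for this example. It checks the roughly 63% / 37% split empirically and reproduces the three-tree vote from the text.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n indices from range(n) with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]

def majority_vote(predictions):
    """Return the most common label among the models' predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
n = 1000   # training-set size (illustrative assumption)
B = 25     # number of bootstrap samples (illustrative assumption)

samples = [bootstrap_indices(n, rng) for _ in range(B)]

# Fraction of distinct training examples that appear in each sample.
unique_fractions = [len(set(s)) / n for s in samples]
mean_unique = sum(unique_fractions) / B

# Out-of-bag sets: the examples each model would never have seen in training.
oob_sets = [set(range(n)) - set(s) for s in samples]
mean_oob = sum(len(o) for o in oob_sets) / (B * n)

# Classification at prediction time, using the three-tree example from the text.
vote = majority_vote(["Spam", "Spam", "Not Spam"])

print(f"mean fraction of distinct examples per sample: {mean_unique:.3f}")
print(f"mean out-of-bag fraction: {mean_oob:.3f}")
print(f"majority vote: {vote}")
```

In a real bagging run, each index list would select the rows used to train one model, and each out-of-bag set would be scored only by the model that did not see it.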

Random Forests: Bagging Plus Feature Randomness

A Random Forest is bagging applied to decision trees, with one additional twist: at each split, only a random subset of features (typically √D for classification or D/3 for regression, where D is the total number of features) is considered. This extra randomness decorrelates the trees — they make different mistakes — making the ensemble more powerful than plain bagging. Random Forests are one of the most robust off-the-shelf algorithms in machine learning.
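The per-split feature sampling can be illustrated on its own. The sketch below only shows how a candidate feature subset might be drawn at each split; it does not build trees. The function names are hypothetical, and rounding √D to the nearest integer is an assumption (implementations differ in how they round).

```python
import math
import random

def n_split_features(D, task):
    """Number of candidate features per split, using the common defaults from the text."""
    if task == "classification":
        return max(1, round(math.sqrt(D)))   # about sqrt(D); rounding rule is an assumption
    if task == "regression":
        return max(1, D // 3)                # about D/3
    raise ValueError(f"unknown task: {task}")

def candidate_features(D, task, rng):
    """Pick the random subset of feature indices considered at one split."""
    k = n_split_features(D, task)
    return sorted(rng.sample(range(D), k))

rng = random.Random(42)
D = 16  # total number of features (illustrative)

# Each split draws a fresh subset, so different trees (and different nodes
# within one tree) are pushed toward different features.
for split in range(3):
    print(f"split {split}: candidates {candidate_features(D, 'classification', rng)}")
```

Because a strong feature is unavailable at many splits, trees cannot all lean on it in the same way, which is where the decorrelation comes from.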

Why exactly does averaging help? Consider B models, each with variance σ² and no correlation between their errors. The variance of their average is σ²/B. With B=100 trees, variance drops to 1% of a single tree's variance. In practice, tree errors are correlated (all trees are trained on similar data): with pairwise correlation ρ, the variance of the average is ρσ² + (1 - ρ)σ²/B, so it cannot fall below ρσ² no matter how many trees are added. The reduction is still substantial, and it grows as the trees become less correlated, which is what the feature randomness of a Random Forest buys. Bias is unchanged by averaging: if every model is biased in the same direction, the average will be too. Bagging helps with variance but not bias. This is why bagging works well for low-bias, high-variance models like deep decision trees, and does not help much for high-bias, low-variance models like linear regression or decision stumps.
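A small simulation can make the variance argument concrete. The sketch below treats each model's error as a Gaussian draw with variance σ² and compares the empirical variance of the average against the standard formula ρσ² + (1 - ρ)σ²/B, which reduces to σ²/B when ρ = 0. The shared-component construction used to produce correlated errors, and the values B = 100 and ρ = 0.3, are assumptions made for illustration.

```python
import random
import statistics

def variance_of_average(B, sigma, rho, trials, rng):
    """Empirical variance of the mean of B errors, each with variance sigma^2
    and pairwise correlation rho."""
    averages = []
    for _ in range(trials):
        shared = rng.gauss(0.0, sigma)  # component common to all B models
        errors = [
            (rho ** 0.5) * shared + ((1.0 - rho) ** 0.5) * rng.gauss(0.0, sigma)
            for _ in range(B)
        ]
        averages.append(sum(errors) / B)
    return statistics.pvariance(averages)

def theoretical_variance(B, sigma, rho):
    """rho * sigma^2 + (1 - rho) * sigma^2 / B."""
    return rho * sigma**2 + (1.0 - rho) * sigma**2 / B

rng = random.Random(1)
B, sigma = 100, 1.0
for rho in (0.0, 0.3):
    empirical = variance_of_average(B, sigma, rho, trials=2000, rng=rng)
    theory = theoretical_variance(B, sigma, rho)
    print(f"rho={rho}: empirical={empirical:.4f}, theory={theory:.4f}")
```

With uncorrelated errors the variance shrinks toward zero as B grows; with correlated errors it levels off near ρσ², which is why decorrelating the trees matters.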

Boosting: Converting Weak Learners into Strong Ones

Boosting takes a fundamentally different approach: instead of training models independently in parallel, it trains models sequentially, each one focusing on the mistakes of the previous ones.

The AdaBoost algorithm (Adaptive Boosting):

  1. Assign equal weight to every training example: wᵢ = 1/N.
  2. Train a weak learner (e.g., a decision tree with depth 1, called a 'stump') on the weighted training set.
  3. Compute the weighted error rate of this learner. Examples it misclassified get higher weights for the next round.
  4. Compute the learner's contribution weight αₜ = 0.5 × ln((1 - error) / error). A more accurate learner gets a larger αₜ.
  5. Repeat for T rounds, adding each weighted learner to the ensemble.
  6. Final prediction: weighted majority vote of all T learners.

Gradient Boosting generalizes this idea: each new model is trained to predict the residual errors (the mistakes) of the combined ensemble so far. Residuals are the right target for squared-error loss; for other losses, the target is the negative gradient of the loss with respect to the current predictions. This is equivalent to gradient descent in the space of functions rather than parameters. XGBoost, LightGBM, and CatBoost are highly optimized gradient boosting libraries. They are frequent winners of Kaggle competitions, especially on tabular data, and are workhorses of industrial ML pipelines.

Key differences from bagging:

  - Sequential, not parallel training
  - Each model is specialized on the previous models' mistakes
  - Boosting can reduce both bias and variance
  - More sensitive to noisy labels (misclassified outliers get progressively up-weighted)
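The AdaBoost steps above can be sketched in plain Python on a tiny one-dimensional dataset. This is a minimal sketch under stated assumptions: labels are encoded as +1 and -1, the stump search only tries thresholds at observed x values, and the reweighting rule wᵢ ← wᵢ × exp(-αₜ × yᵢ × hₜ(xᵢ)) followed by renormalization is the standard AdaBoost update, which the text describes only informally. The toy data and all function names are made up for illustration.

```python
import math

def stump_predict(x, threshold, polarity):
    """Predict `polarity` for x <= threshold and the opposite label otherwise."""
    return polarity if x <= threshold else -polarity

def fit_stump(xs, ys, w):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, T):
    n = len(xs)
    w = [1.0 / n] * n                                    # equal initial weights
    ensemble = []
    for _ in range(T):
        err, threshold, polarity = fit_stump(xs, ys, w)  # weak learner + weighted error
        err = min(max(err, 1e-10), 1 - 1e-10)            # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)          # contribution weight
        ensemble.append((alpha, threshold, polarity))
        # Misclassified examples (y * h = -1) are scaled up, correct ones down.
        w = [wi * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote: the sign of the alpha-weighted sum of stump outputs."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data: no single threshold separates the classes.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost(xs, ys, T=3)
accuracy = sum(predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)
for alpha, threshold, polarity in ensemble:
    print(f"alpha={alpha:.4f} threshold={threshold} polarity={polarity}")
print(f"training accuracy: {accuracy:.2f}")
```

No single stump can classify this data correctly, but the weighted vote of several stumps can, which is the sense in which boosting turns weak learners into a strong one.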

Boosting Can Overfit if Poorly Tuned

Boosting iteratively minimizes training error. With enough rounds, it can keep driving training error down by fitting noise in the training data rather than real structure. The number of boosting rounds, the learning rate (which scales each model's contribution), and the tree depth are all hyperparameters that must be tuned carefully. The number of rounds is typically chosen by early stopping on a validation set: stop adding models once validation error stops improving.
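Early stopping itself is a simple rule and can be sketched independently of any boosting library. The function below assumes a list of per-round validation errors is already available; the `patience` parameter (how many non-improving rounds to tolerate) is a common convention rather than something the text specifies, and the error values are invented for illustration.

```python
def early_stopping_round(val_errors, patience):
    """Return (best_round, stop_round), both 1-based.

    Training halts at the first round where `patience` consecutive rounds
    have failed to improve on the best validation error seen so far.
    """
    best_error = float("inf")
    best_round = 0
    for round_number, error in enumerate(val_errors, start=1):
        if error < best_error:
            best_error, best_round = error, round_number
        elif round_number - best_round >= patience:
            return best_round, round_number
    return best_round, len(val_errors)

# Hypothetical validation errors: improving at first, then rising as the
# ensemble starts to overfit.
val_errors = [0.30, 0.22, 0.18, 0.16, 0.15, 0.155, 0.16, 0.17, 0.19]
best_round, stop_round = early_stopping_round(val_errors, patience=3)
print(f"stop at round {stop_round}, keep the first {best_round} models")
```

The ensemble kept for prediction is the one from the best round, not the one from the round where training stopped.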

Match each ensemble concept to its defining characteristic.

Terms

Bagging
Boosting
Random Forest
Out-of-bag error
Weak learner

Definitions

Generalization estimate using examples not included in each bootstrap sample
Bagging over decision trees with random feature subsets at each split
Trains models sequentially, each correcting the previous model's errors
Trains models on bootstrap samples in parallel and averages predictions
A model only slightly better than random chance, used as a building block in boosting


A bagging ensemble trains 200 trees. Each tree has variance σ² = 0.04 and all tree errors are uncorrelated. What is the variance of the ensemble's average prediction?

A boosting algorithm has been running for 500 rounds. Training error is 0.001 but validation error has been increasing for the last 100 rounds. What should you do?

Simulate a Bagging Ensemble

  1. Work in groups of four.
  2. You have 8 training examples, written as (index, label): (1,A), (2,A), (3,B), (4,B), (5,A), (6,B), (7,A), (8,B)
  3. Each student creates their own bootstrap sample of 8 examples by drawing 8 times with replacement from the 8 examples (roll an eight-sided die or use a random number table to pick which examples).
  4. Each student trains a 'decision stump' (a rule based on just one threshold) on their bootstrap sample. Find the threshold on the index (1-8) that best separates A from B in your sample.
  5. Each student applies their stump to these new points: index 2.5 and index 5.5. Record the prediction (A or B) for each.
  6. Take a majority vote among the four stumps for each new point.
  7. Compare: did any individual stump disagree with the ensemble majority? Which do you trust more, and why?
  8. Discuss what would happen if all four students happened to draw identical bootstrap samples. Would the ensemble still help?