AI Foundations

Choosing and Training a Model

With clean data and engineered features in hand, the next step in the pipeline is choosing a model and training it. For many beginners, this feels like the moment machine learning truly begins: the mysterious part where the computer learns. In reality, the steps before this one (framing, data, features) typically determine the ceiling of what is achievable. This lesson explains what training actually is, mechanically, and how the choice of model family shapes what a model can and cannot represent.

Families of Models

A model is a mathematical function that maps inputs (features) to outputs (predictions). Different model families use different functional forms, and each has distinct strengths, weaknesses, and assumptions.

Linear models assume the relationship between inputs and output is approximately linear, meaning you can write the prediction as a weighted sum of the features. Linear regression predicts a continuous output; logistic regression (despite its name) predicts a probability for classification. Linear models are fast, interpretable, and work well when the true relationship is roughly linear or when data is limited.

Decision trees split the feature space into rectangular regions using a sequence of if-then rules. A tree predicts the majority class or average outcome of the training examples in each region. Trees are interpretable and handle non-linear relationships naturally, but a single deep tree overfits severely.

Ensemble methods combine many models to reduce overfitting and improve accuracy. Random forests train hundreds of decision trees on random subsets of data and features, then average their predictions. Gradient boosting (implemented in libraries like XGBoost and LightGBM) trains trees sequentially, each correcting the errors of the previous one. Gradient boosting is among the most powerful methods for structured tabular data.

Neural networks are composed of layers of interconnected numerical units called neurons. Each layer applies a weighted sum followed by a nonlinear function. Neural networks can represent extraordinarily complex functions, which makes them dominant for images, audio, and text, but they require large datasets, substantial compute, and careful tuning.
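To make the functional forms above concrete, here is a minimal sketch in plain Python. The function names (linear_predict, logistic_predict, stump, ensemble_predict) and all numbers are illustrative assumptions, not part of any library; real projects would normally use a library implementation instead.

```python
import math

def linear_predict(weights, bias, features):
    """Linear regression: a weighted sum of the features plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def logistic_predict(weights, bias, features):
    """Logistic regression: the same weighted sum, squashed into (0, 1)."""
    z = linear_predict(weights, bias, features)
    return 1.0 / (1.0 + math.exp(-z))

def stump(feature_index, threshold, left_value, right_value):
    """Build a one-split decision tree: a single if-then rule on one feature."""
    def predict(features):
        if features[feature_index] <= threshold:
            return left_value
        return right_value
    return predict

def ensemble_predict(trees, features):
    """Average the outputs of several trees, as a random forest does for regression."""
    return sum(tree(features) for tree in trees) / len(trees)

# Illustrative inputs (made-up values, chosen only to exercise the functions).
example = [1.0, 3.0]
score = linear_predict([2.0, -1.0], 0.5, example)
probability = logistic_predict([2.0, -1.0], 0.5, example)
trees = [stump(0, 1.5, 10.0, 20.0), stump(1, 2.0, 0.0, 30.0)]
averaged = ensemble_predict(trees, example)
```

The linear and logistic models share one weighted sum and differ only in the final squashing step, while each stump carves the feature space with a threshold. The ensemble here simply averages hand-written stumps; it does not show how a random forest or gradient boosting would learn them from data.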

No Free Lunch

The No Free Lunch theorem, proven by David Wolpert in 1996, states that when performance is averaged over all possible problems, no learning algorithm outperforms any other. An algorithm that does better than average on one class of problems must do worse on another. In practice: there is no universally best model. Choosing a model requires understanding your data, your constraints, and the assumptions each model family makes.

What Training Actually Does

Training is an optimization problem. The model has parameters: numbers that define its specific function. For a linear model with two features, there are three parameters: a weight for each feature and a bias term. For a neural network, there can be millions of parameters or more.

At the start of training, parameters are initialized, often randomly. The model makes predictions on training examples, and those predictions are compared to the true labels using a loss function. The loss function produces a single number: how wrong the model's predictions are, on average. Common choices are mean squared error for regression and cross-entropy loss for classification.

The optimization algorithm then adjusts the parameters to reduce the loss. The dominant algorithm, especially for neural networks, is gradient descent: compute the gradient of the loss with respect to each parameter (which direction, and how steeply, a change in that parameter would increase or decrease the loss), then shift all parameters a small step in the direction that decreases the loss. This step is called an update. The process repeats thousands or millions of times across the training data. Practical training uses stochastic gradient descent (SGD) or variants like Adam, which apply updates using small random batches of training examples rather than the entire dataset at once.

When training ends, the parameters are fixed. The trained model is simply a function with those specific parameter values baked in.
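The training loop described here can be sketched for the two-feature linear model with its three parameters. This is a minimal illustration under stated assumptions: the data is synthetic and noise-free (generated from y = 2*x1 - 3*x2 + 1), the names predict, mse_loss, mse_gradient, and train are hypothetical, and the learning rate, batch size, and epoch count are arbitrary choices that happen to work on this data.

```python
import random

def predict(params, x):
    """Linear model with two features: three parameters (w1, w2, b)."""
    w1, w2, b = params
    return w1 * x[0] + w2 * x[1] + b

def mse_loss(params, data):
    """Mean squared error: one number summarizing how wrong the model is."""
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)

def mse_gradient(params, batch):
    """Gradient of the MSE with respect to (w1, w2, b), averaged over a batch."""
    g1 = g2 = gb = 0.0
    for x, y in batch:
        err = predict(params, x) - y
        g1 += 2 * err * x[0]
        g2 += 2 * err * x[1]
        gb += 2 * err
    n = len(batch)
    return (g1 / n, g2 / n, gb / n)

def train(data, learning_rate=0.05, epochs=200, batch_size=8, seed=0):
    """Mini-batch stochastic gradient descent starting from random parameters."""
    rng = random.Random(seed)
    params = tuple(rng.uniform(-1, 1) for _ in range(3))  # random initialization
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            grad = mse_gradient(params, batch)
            # One update: move each parameter a small step against its gradient.
            params = tuple(p - learning_rate * g for p, g in zip(params, grad))
    return params

# Synthetic, noise-free data (an assumption made purely for illustration).
data_rng = random.Random(42)
data = []
for _ in range(64):
    x = (data_rng.uniform(-1, 1), data_rng.uniform(-1, 1))
    data.append((x, 2 * x[0] - 3 * x[1] + 1))

learned = train(data)
final_loss = mse_loss(learned, data)
```

After training, learned holds three fixed numbers, and predict(learned, x) is the trained model. Because this toy data contains no noise, the learned values should land very close to the weights used to generate it; real data would not allow that.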

Gradient Descent Intuition

Imagine you are blindfolded on a hilly landscape and your goal is to reach the lowest valley. You can feel the slope of the ground under your feet. Gradient descent says: always take a step in the direction that goes downhill most steeply. With small enough steps and a cooperative landscape, you will reach a valley. The loss function is the landscape; the valley is a good set of parameters.
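The blindfolded walk can be sketched in one dimension. The landscape below is an arbitrary function chosen for illustration because it has two valleys of different depth; the names loss, slope, and descend, along with the step size and step count, are assumptions rather than anything prescribed by the lesson.

```python
def loss(x):
    """A one-dimensional landscape with two valleys of different depth."""
    return x**4 - 3 * x**2 + x

def slope(x):
    """The derivative of loss: the slope felt underfoot at position x."""
    return 4 * x**3 - 6 * x + 1

def descend(start, step_size=0.01, steps=500):
    """Repeatedly take a small step in the downhill direction."""
    x = start
    for _ in range(steps):
        x -= step_size * slope(x)
    return x

left_valley = descend(-2.0)   # walker starting on the left side
right_valley = descend(2.0)   # walker starting on the right side
```

Both walkers stop where the slope is approximately zero, but they end up in different valleys, and the right-hand one is shallower. This is what an uncooperative landscape looks like: following the slope finds a valley, not necessarily the lowest one.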

Hyperparameters

Parameters are the internal values a model learns from data. Hyperparameters are settings of the model or the training process that the practitioner must choose before training begins; they are not learned from data. Examples: the learning rate in gradient descent (how large each step is), the number of trees in a random forest, the depth limit of a decision tree, the number of layers and neurons in a neural network, and the regularization strength (covered in Lesson 8).

The learning rate illustrates the stakes. If it is too large, the optimization takes steps so large that it overshoots good parameter values and may diverge entirely. If it is too small, training takes vastly longer and may settle in a poor local minimum. A learning rate of 0.1 might diverge; a rate of 0.00001 might never converge in a reasonable time; something around 0.001 or 0.0003 often works for Adam on many neural networks. The right value depends entirely on the problem.

Hyperparameters are tuned using the validation set. You train a model with one set of hyperparameters, evaluate its performance on the validation set, then try another set, and repeat. Common search strategies include grid search (try every combination from a specified grid of values), random search (sample hyperparameter combinations randomly), and Bayesian optimization (model which combinations are likely to be good and search intelligently).
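A small grid search over the learning rate shows both ideas at once: the effect of step size and tuning against a validation set. This is a sketch under stated assumptions: the data is synthetic (y = 3x + 2 plus small noise), the helper names mse, train, and grid_search are hypothetical, and the candidate learning rates are chosen to mirror the values mentioned above. The thresholds at which training diverges depend on this particular data and would differ on another problem.

```python
import math
import random

def mse(w, b, data):
    """Mean squared error of the line y = w*x + b on the given data."""
    total = 0.0
    for x, y in data:
        err = w * x + b - y
        total += err * err  # multiplication overflows to inf instead of raising
    return total / len(data)

def train(data, learning_rate, steps=2000):
    """Full-batch gradient descent on (w, b), starting from zero."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        if not (math.isfinite(w) and math.isfinite(b)):
            break  # the run has diverged; stop early
    return w, b

def grid_search(train_set, val_set, learning_rates):
    """Train once per candidate and score each on the validation set."""
    results = {}
    for lr in learning_rates:
        w, b = train(train_set, lr)
        loss = mse(w, b, val_set)
        results[lr] = loss if math.isfinite(loss) else math.inf
    best = min(results, key=results.get)
    return best, results

# Synthetic data (an assumption for illustration): y = 3x + 2 plus small noise.
rng = random.Random(0)

def make_point(x):
    return (x, 3.0 * x + 2.0 + rng.gauss(0.0, 0.1))

train_set = [make_point(0.5 * i) for i in range(20)]
val_set = [make_point(0.25 + i) for i in range(10)]

best_lr, results = grid_search(train_set, val_set, [0.1, 0.01, 0.001, 0.00001])
```

On this data the run with 0.1 diverges and is recorded as infinite loss, the run with 0.00001 barely moves from its starting point, and the validation loss selects a value in between. The learned parameters w and b never appear in the search grid; only the learning rate, a hyperparameter, does.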

Complete these statements about model training.

The function that measures how wrong a model's predictions are is called the ____ function. The internal values a model learns from data are called ____, while settings chosen before training are called ____.

Why is gradient descent applied in small steps (small learning rate) rather than computing the exact best parameters in one step?

A practitioner trains a random forest with 1,000 trees and achieves excellent validation performance. She then trains the same model with 10 trees and finds worse performance. Which type of setting is 'number of trees'?

Model Selection Reasoning

You are advising a team on model choice for three different problems. For each problem, recommend a model family and write a paragraph justifying your choice based on the properties discussed in this lesson. Do not just pick 'neural network' for everything; be precise.

  1. Problem A: A bank wants to predict whether a loan will default. They have 500,000 historical loan records with 30 features each. They need the model's decisions to be explainable to regulators.
  2. Problem B: A startup wants to classify whether a photograph contains a cat. They have 50,000 labeled images.
  3. Problem C: A retailer wants to forecast weekly sales for each of 200 products using the past three years of sales data plus features like price, promotions, and seasonality.

For each, address: dataset size, whether interpretability is required, whether the relationship is likely linear, and any computational constraints you infer from the scenario.