Linear and Logistic Regression
Linear regression and logistic regression underpin a large share of production machine learning systems. Despite their age (both were formalized in statistics long before the term 'machine learning' existed), they remain indispensable. They are fast, interpretable, and frequently good enough. More importantly, understanding them precisely gives you the conceptual foundation for nearly every more complex model that follows.
Linear Regression: Fitting a Line to Data
Linear regression assumes the relationship between the input features and the target is approximately a weighted sum. The prediction function is:
ŷ = w₀ + w₁x₁ + w₂x₂ + ... + wₙxₙ
where x₁ through xₙ are the input features, w₁ through wₙ are the learned weights (also called coefficients), and w₀ is the bias term (also called the intercept). Together the weights define which direction and how steeply the predicted value changes as each feature changes.
A concrete example: predict house sale price from two features, square footage (x₁) and number of bedrooms (x₂). After training on historical sales, the model might learn:
ŷ = 50,000 + 120 × x₁ + 8,000 × x₂
Interpretation: the base price is $50,000. Each additional square foot adds $120. Each additional bedroom adds $8,000. For a 1,500 sq ft, 3-bedroom house:
ŷ = 50,000 + 120(1500) + 8,000(3) = 50,000 + 180,000 + 24,000 = $254,000
Training finds the weights that minimize the Mean Squared Error over the training set:
MSE = (1/N) Σ (ŷᵢ - yᵢ)²
For linear regression, this minimization has a closed-form solution (the normal equations), but in practice, especially with many features, gradient descent is used: the weights are updated iteratively in the direction that reduces the MSE.
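The prediction function, the MSE, and a single gradient descent update can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function names (`predict`, `mse`, `gradient_step`), the toy dataset, and the learning rate of 0.05 are all assumptions made for the example.

```python
def predict(weights, bias, features):
    """Weighted sum: bias + w1*x1 + ... + wn*xn."""
    return bias + sum(w * x for w, x in zip(weights, features))


def mse(weights, bias, X, y):
    """Mean squared error of the model over a dataset."""
    n = len(X)
    return sum((predict(weights, bias, xi) - yi) ** 2 for xi, yi in zip(X, y)) / n


def gradient_step(weights, bias, X, y, lr):
    """One gradient descent update of the weights and bias on the MSE."""
    n = len(X)
    errors = [predict(weights, bias, xi) - yi for xi, yi in zip(X, y)]
    grad_bias = 2 / n * sum(errors)
    grad_weights = [
        2 / n * sum(e * xi[j] for e, xi in zip(errors, X))
        for j in range(len(weights))
    ]
    new_weights = [w - lr * g for w, g in zip(weights, grad_weights)]
    return new_weights, bias - lr * grad_bias


# House example from the text: 1,500 sq ft, 3 bedrooms.
price = predict([120.0, 8000.0], 50000.0, [1500, 3])
print(price)  # 254000.0

# Toy dataset (hypothetical): y = 2x, starting from all-zero parameters.
X_toy = [[1.0], [2.0], [3.0]]
y_toy = [2.0, 4.0, 6.0]
mse_before = mse([0.0], 0.0, X_toy, y_toy)
w_new, b_new = gradient_step([0.0], 0.0, X_toy, y_toy, lr=0.05)
mse_after = mse(w_new, b_new, X_toy, y_toy)
print(mse_before, "->", mse_after)  # the error drops after one step
```

The toy data uses small feature values on purpose. With raw features such as square footage in the thousands, plain gradient descent usually needs feature scaling or a much smaller learning rate to stay stable.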
In linear regression, each weight wⱼ tells you exactly how much the prediction changes when feature xⱼ increases by one unit, holding all other features constant. This interpretability is a major practical advantage: a business can understand and audit the model's reasoning. Complex models such as deep neural networks offer no comparably direct reading of their parameters.
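A quick way to see this "one unit, all else constant" reading is to change a single feature and compare predictions. The sketch below reuses the house-price weights from the example above; the function name `predict_price` is an assumption for illustration.

```python
def predict_price(sqft, bedrooms):
    # Weights taken from the house-price example in the text.
    return 50_000 + 120 * sqft + 8_000 * bedrooms


base = predict_price(1500, 3)
one_more_sqft = predict_price(1501, 3)     # only square footage changes
one_more_bedroom = predict_price(1500, 4)  # only bedroom count changes

print(one_more_sqft - base)     # 120
print(one_more_bedroom - base)  # 8000
```

Each difference equals the corresponding weight, whatever starting house you pick.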
Regularization helps prevent linear regression from overfitting when features are many or correlated.
Ridge regression (L2 regularization) adds a penalty term λ × Σwⱼ² to the MSE. This shrinks weights toward zero, reducing model complexity without eliminating features entirely.
Lasso regression (L1 regularization) adds a penalty λ × Σ|wⱼ|. This can drive some weights exactly to zero, effectively performing feature selection by automatically ignoring features that are not useful.
The hyperparameter λ controls the strength of regularization. λ = 0 is plain linear regression; larger λ imposes stronger shrinkage. Choose λ by cross-validation, not by inspecting the test set.
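The two penalty terms are simple to compute directly. The sketch below is illustrative only: the weight values and λ = 0.1 are made up, and it assumes the common convention that the bias w₀ is left out of the penalty, which the text above does not specify.

```python
def ridge_penalty(weights, lam):
    """L2 penalty: lam * sum of squared weights (bias excluded by assumption)."""
    return lam * sum(w ** 2 for w in weights)


def lasso_penalty(weights, lam):
    """L1 penalty: lam * sum of absolute weights (bias excluded by assumption)."""
    return lam * sum(abs(w) for w in weights)


weights = [3.0, -0.5, 0.0]  # hypothetical learned weights w1..w3
lam = 0.1                   # hypothetical regularization strength

ridge = ridge_penalty(weights, lam)  # 0.1 * (9 + 0.25 + 0), about 0.925
lasso = lasso_penalty(weights, lam)  # 0.1 * (3 + 0.5 + 0), about 0.35
print(ridge, lasso)

# The regularized training objective is the MSE plus one of these penalties.
```

Note how the large weight (3.0) dominates the ridge penalty because it is squared, while under lasso every weight contributes in proportion to its size.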
Logistic Regression: Turning a Line into a Probability
Despite its name, logistic regression is a classification algorithm. It predicts the probability that an example belongs to the positive class. The core idea: start with the same weighted sum as linear regression, but pass the result through the sigmoid function σ to map it into the range (0, 1):
z = w₀ + w₁x₁ + ... + wₙxₙ
P(y=1 | x) = σ(z) = 1 / (1 + e^(-z))
The sigmoid function has a distinctive S-curve shape. When z is large and positive, σ(z) approaches 1 (high probability of positive class). When z is large and negative, σ(z) approaches 0. When z = 0, σ(z) = 0.5, so the model is exactly indifferent.
A worked example: spam detection with two features. Trained weights: w₀ = -3, w₁ = 0.4 (exclamation count), w₂ = 2.1 (contains 'free'). For an email with x₁=8 exclamation marks and x₂=1 (contains 'free'):
z = -3 + 0.4(8) + 2.1(1) = -3 + 3.2 + 2.1 = 2.3
P(spam) = 1 / (1 + e^(-2.3)) ≈ 1 / (1 + 0.100) ≈ 0.909
The model assigns a probability of about 90.9% that this email is spam. Using a threshold of 0.5, we predict spam.
The decision boundary is where P(y=1|x) = 0.5, which occurs when z = 0, i.e., w₀ + w₁x₁ + w₂x₂ = 0. This is a linear boundary.
Training minimizes cross-entropy loss (not MSE), which is appropriate for probability outputs:
L = -(1/N) Σ [yᵢ log(ŷᵢ) + (1-yᵢ) log(1-ŷᵢ)]
where ŷᵢ is the predicted probability P(y=1 | xᵢ) for example i and yᵢ is its true label (0 or 1).
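The spam calculation and the cross-entropy loss can be reproduced with the standard library alone. This is a minimal sketch: the function names are assumptions for illustration, and the plain `sigmoid` shown here can overflow for very large negative z, which a real implementation would guard against.

```python
import math


def sigmoid(z):
    """Map any real z into (0, 1). Naive form; overflows for z below about -709."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(bias, weights, features):
    """P(y=1 | x): the weighted sum passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def cross_entropy(probs, labels):
    """Average cross-entropy loss for predicted probabilities and 0/1 labels."""
    n = len(labels)
    total = sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for p, y in zip(probs, labels)
    )
    return -total / n


# Spam example from the text: 8 exclamation marks, contains 'free'.
p_spam = predict_proba(-3.0, [0.4, 2.1], [8, 1])
print(round(p_spam, 3))           # 0.909
print("spam" if p_spam >= 0.5 else "not spam")

# The loss is small if the email really is spam and large if it is not.
loss_if_spam = cross_entropy([p_spam], [1])
loss_if_ham = cross_entropy([p_spam], [0])
print(loss_if_spam, loss_if_ham)
```

The asymmetry between the two loss values is the point of cross-entropy: a confident prediction costs little when it is right and a lot when it is wrong.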
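A linear regression model can output values outside [0, 1]: it might predict -0.3 or 1.7 for a binary label. These are not valid probabilities. More critically, squared error penalizes predictions that overshoot the label. For a true positive (y = 1), an output of 1.7 incurs the same squared error as an output of 0.3, even though 1.7 lies firmly on the correct side. Fitting then pulls the line toward such points, which distorts the weights. Logistic regression with cross-entropy loss avoids both problems: its outputs always lie in (0, 1), and the loss keeps shrinking as a correct prediction becomes more confident. In practice its probabilities also tend to be reasonably well calibrated.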
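To make the squared-error problem concrete, compare the penalty for a prediction that overshoots a positive label with one that falls well short of it. The helper name `squared_error` and the two example outputs are assumptions chosen for illustration.

```python
def squared_error(prediction, label):
    """Per-example squared error, as used inside the MSE."""
    return (prediction - label) ** 2


label = 1  # a true positive

overshoot = squared_error(1.7, label)   # on the correct side, but "too far"
undershoot = squared_error(0.3, label)  # on the wrong side of 0.5

# Both are approximately 0.49: squared error cannot tell them apart.
print(overshoot, undershoot)
```

Under squared error the two cases look equally bad, so a linear model fit to 0/1 labels is pushed to pull back predictions that are already clearly correct.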
Prompt Challenge
Write a prompt asking an AI assistant to explain logistic regression to a high-school student who understands basic algebra but has never seen calculus.
Your prompt should…
- Begin with the specific audience and their knowledge level
- Request a concrete numerical worked example using realistic numbers
- Ask for an explanation of what the output probability actually means in a real decision
A linear regression model for house prices learns w₁ = -500 for the feature 'distance from city center in km.' What does this weight tell you?
A logistic regression model outputs z = -1.5 for a new patient. What is the predicted probability of the positive class (disease present), and what would the model predict at a 0.5 threshold?
Train a Logistic Regression by Hand
- You will compute one gradient descent step of logistic regression by hand.
- Setup: two features, two training examples.
- Example 1: x₁=2, x₂=0, y=0 (negative class)
- Example 2: x₁=0, x₂=3, y=1 (positive class)
- Initial weights: w₀=0, w₁=0.5, w₂=-0.5
- Step 1: Compute z for each example using z = w₀ + w₁x₁ + w₂x₂.
- Step 2: Compute P(y=1) = σ(z) = 1/(1+e^(-z)) for each example. Use e^(-1) ≈ 0.37 and e^(1.5) ≈ 4.48 as needed.
- Step 3: Compute the error for each example: error = P(y=1) - y.
- Step 4: Using learning rate η = 0.1, update each weight using the rule: wⱼ ← wⱼ - η × (1/N) × Σ(errorᵢ × xⱼᵢ), where N = 2 is the number of examples. For the bias w₀, treat x₀ as 1 for every example. Update w₀, w₁, w₂.
- Step 5: With the updated weights, recompute z for example 1. Has the model moved closer to predicting y=0 for it?
- Discuss: how many steps like this would a real training loop run?
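Once you have worked the steps by hand, a short script can check your numbers. The sketch below follows the update rule from Step 4 with the bias handled by prepending a constant feature x₀ = 1; the function name `logistic_step` and the overall structure are assumptions for illustration, not a reference implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_step(weights, X, y, lr):
    """One gradient descent step. weights[0] is the bias w0."""
    n = len(X)
    rows = [[1.0] + list(row) for row in X]  # prepend x0 = 1 for the bias
    probs = [sigmoid(sum(w * x for w, x in zip(weights, row))) for row in rows]
    errors = [p - label for p, label in zip(probs, y)]
    new_weights = [
        w - lr * sum(e * row[j] for e, row in zip(errors, rows)) / n
        for j, w in enumerate(weights)
    ]
    return new_weights, probs


# Setup from the exercise.
X = [[2, 0], [0, 3]]
y = [0, 1]
weights = [0.0, 0.5, -0.5]  # w0, w1, w2

new_weights, probs = logistic_step(weights, X, y, lr=0.1)

# Step 5: z for example 1 (x1=2, x2=0) before and after the update.
z1_before = weights[0] + weights[1] * 2 + weights[2] * 0
z1_after = new_weights[0] + new_weights[1] * 2 + new_weights[2] * 0

print("probabilities:", [round(p, 3) for p in probs])
print("updated weights:", [round(w, 4) for w in new_weights])
print("z for example 1:", z1_before, "->", round(z1_after, 4))
```

A smaller z for example 1 after the update means a lower predicted probability of the positive class, which is the direction you want for an example labeled y=0.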