Choosing the Right Algorithm
One of the most common mistakes in applied machine learning is treating algorithm selection as a competition to run rather than a design decision to make carefully. There is no universally best algorithm — the No Free Lunch theorem proves this formally: averaged over all possible problems, every algorithm performs identically. What matters is matching an algorithm's strengths to your specific problem's structure, your data's properties, and your operational constraints. This lesson gives you a principled framework for making that choice.
Dimensions of the Problem
Before selecting an algorithm, characterize your problem across five dimensions:

1. Task type: Is this classification or regression? Binary or multiclass? Answering this eliminates many candidates immediately.

2. Dataset size: How many training examples (N) and how many features (D)?
   - Small N (< 1,000): simple models (logistic regression, k-NN, small decision trees). Complex models overfit.
   - Medium N (1,000–100,000): most algorithms apply. Tree ensembles often excel.
   - Large N (> 100,000): linear models and gradient boosting scale well. k-NN becomes slow without approximate indexing.
   - High D (many features): linear models with regularization, tree ensembles. k-NN degrades (curse of dimensionality).

3. Data quality: Are there many missing values? Noisy labels? Class imbalance?
   - Decision trees and ensembles handle missing values more gracefully than logistic regression.
   - Boosting is sensitive to noisy labels; bagging is more robust.
   - Imbalanced classes require oversampling, undersampling, or class-weighted loss functions regardless of algorithm.

4. Interpretability requirement: Does the model need to be explainable to non-technical stakeholders, regulators, or patients?
   - High interpretability required: logistic regression, shallow decision trees.
   - Moderate: Random Forest with feature importance scores.
   - Low: gradient boosting, complex ensembles.

5. Training and inference time: How long can you spend training? How fast must predictions be?
   - Fast training and inference: logistic regression, shallow trees.
   - Slow inference: k-NN (must scan all training data).
   - Slow training, fast inference: gradient boosting, neural networks.
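To make the class-weighted loss mentioned under data quality concrete, one common heuristic weights each class inversely to its frequency. The sketch below is an illustration only: the function name `balanced_class_weights` and the toy labels are assumptions, and the formula (number of examples divided by number of classes times class count) is the heuristic scikit-learn documents for `class_weight="balanced"`.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels (hypothetical): 9 majority-class examples (0), 1 minority-class example (1).
labels = [0] * 9 + [1]
weights = balanced_class_weights(labels)
print(weights)
```

With these toy labels the minority class receives weight 5.0 and the majority class a weight below 1, so each minority-class error counts for more in a weighted loss.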
A near-universal best practice in applied ML: always train a simple baseline first (logistic regression for classification, linear regression for regression). Measure its performance. Then try more complex models. If a complex model does not meaningfully outperform the simple baseline, the simple baseline is usually the right choice — it is faster, cheaper, and more interpretable. Complexity is a cost; pay it only when performance justifies it.
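The baseline rule can be written as a small comparison. This is a minimal sketch under stated assumptions: the function name `pick_model`, the per-fold accuracies, and the 0.01 threshold for a "meaningful" gain are all hypothetical, and a suitable threshold depends on your problem.

```python
from statistics import mean

def pick_model(baseline_scores, complex_scores, min_gain=0.01):
    """Choose the complex model only if its mean score beats the baseline by min_gain."""
    gain = mean(complex_scores) - mean(baseline_scores)
    return ("complex" if gain >= min_gain else "baseline"), gain

# Hypothetical per-fold accuracies from 5-fold cross-validation.
baseline_folds = [0.81, 0.79, 0.80, 0.82, 0.78]
complex_folds = [0.81, 0.80, 0.80, 0.82, 0.79]

choice, gain = pick_model(baseline_folds, complex_folds)
print(choice, round(gain, 3))
```

Here the complex model gains about 0.004 in mean accuracy, which falls below the assumed threshold, so the sketch keeps the baseline.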
A structured decision guide:

1. Is N < 500?
   - Yes → Use logistic/linear regression first. If it fails, try k-NN (k=5-10) or a shallow decision tree. Avoid ensembles: there is too little data to benefit.
   - No → Continue.

2. Do you need the model to explain its reasoning per prediction?
   - Yes → Use logistic regression (global coefficients) or a shallow decision tree (explicit if-then rules). Stop.
   - No → Continue.

3. Is D > 500?
   - Yes → Use logistic/linear regression with L1 (lasso) regularization for feature selection, or gradient boosting. Avoid k-NN.
   - No → Continue.

4. Do you have 10,000+ training examples and acceptable training time?
   - Yes → Gradient boosting (XGBoost/LightGBM) is the default choice for tabular data. It is robust, powerful, and well-supported.
   - No → Random Forest is an excellent choice: robust, minimal tuning required, and effective across a wide range of datasets.

Cross-cutting advice: whatever algorithm you choose, use cross-validation to estimate performance, not a single train/test split. And always inspect your errors: the patterns in what the model gets wrong are often more informative than the overall accuracy number.
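The decision guide above can be sketched as a function that checks each question in order and returns at the first "Yes". The function name `suggest_algorithm` and its parameter names are hypothetical; the thresholds are the ones stated in the guide.

```python
def suggest_algorithm(n_examples, n_features, needs_explanations=False,
                      training_time_ok=True):
    """Walk the decision guide top to bottom and return the first match."""
    if n_examples < 500:
        return "logistic/linear regression first; then k-NN or a shallow decision tree"
    if needs_explanations:
        return "logistic regression or a shallow decision tree"
    if n_features > 500:
        return "L1-regularized logistic/linear regression or gradient boosting"
    if n_examples >= 10_000 and training_time_ok:
        return "gradient boosting (XGBoost/LightGBM)"
    return "Random Forest"

print(suggest_algorithm(50_000, 40))   # large tabular dataset, no explanation requirement
print(suggest_algorithm(2_000, 30))    # medium dataset, below the 10,000-example mark
```

The order of the checks matters: a small dataset or an explanation requirement ends the walk before dataset width or size is ever considered, exactly as in the guide.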
Most algorithms have hyperparameters: settings not learned from data that control model complexity (e.g., k in k-NN, max_depth in decision trees, n_estimators and learning_rate in gradient boosting). The performance gap between a poorly tuned and a well-tuned model can be dramatic. Use grid search or random search, scoring each candidate on a validation set or with cross-validation, to find good hyperparameter values. Never tune hyperparameters using the test set: doing so gives an overly optimistic performance estimate.
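As an illustration of tuning without touching the test set, here is a minimal standard-library sketch of grid search over k for a one-feature k-NN classifier, scored by 5-fold cross-validation. Everything in it is an assumption made for illustration: the toy data, the candidate values of k, and the helper names `knn_predict`, `cv_accuracy`, and `accuracy`. In practice a library routine such as scikit-learn's `GridSearchCV` would normally do this work.

```python
import random
from statistics import mean

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    positives = sum(label for _, label in nearest)
    return 1 if 2 * positives > len(nearest) else 0

def cv_accuracy(data, k, n_folds=5):
    """Mean accuracy of k-NN across n_folds interleaved validation folds."""
    scores = []
    for fold in range(n_folds):
        validation = data[fold::n_folds]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        hits = sum(knn_predict(train, x, k) == y for x, y in validation)
        scores.append(hits / len(validation))
    return mean(scores)

def accuracy(train, test, k):
    """Accuracy of k-NN fitted on train and evaluated on test."""
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)

# Toy data (hypothetical): label is 1 when x > 0.5, with about 10% of labels flipped.
rng = random.Random(0)
points = []
for _ in range(120):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.1:
        label = 1 - label
    points.append((x, label))

# Hold the test set out before any tuning happens.
dev, test = points[:100], points[100:]

# Grid search: score every candidate k with cross-validation on dev only.
cv_scores = {k: cv_accuracy(dev, k) for k in (1, 3, 5, 9, 15)}
best_k = max(cv_scores, key=cv_scores.get)

# Use the test set once, after best_k is fixed.
test_accuracy = accuracy(dev, test, best_k)
print(best_k, cv_scores[best_k], test_accuracy)
```

The key design choice is the order of operations: the test points are split off first, every candidate is compared on the development data alone, and the test accuracy is computed a single time for the chosen k.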
Practical Pitfalls
Several practical decisions shape outcomes more than algorithm choice does:

- Feature engineering: For most tabular data problems, investing time in creating informative features (e.g., ratios, interaction terms, domain-specific transformations) yields bigger gains than switching algorithms. The algorithm is only as good as the features it receives.
- Data leakage: If any feature in your training set contains information that would not be available at prediction time, your test performance will be optimistically inflated. Example: using a patient's final diagnosis as a feature when predicting whether they need further testing. The diagnosis is the answer, not an input. Leakage is one of the most common mistakes in competitive ML and in industry.
- Distribution shift: Your model is trained on historical data, but the world changes. A model trained on loan applications from 2019 may perform poorly on 2024 applicants whose financial behaviors shifted due to economic events. Monitoring model performance in production over time is as important as the initial evaluation.
- Evaluation metric alignment: Always choose the metric your business actually cares about. If missing a fraudulent transaction costs 100 times more than a false alarm, optimize recall, not accuracy. If predicting house prices to within $10,000 is acceptable, RMSE at that scale is your target, not theoretical minimum MSE.
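The point about evaluation metric alignment can be checked with a little arithmetic. The sketch below uses the 100-to-1 cost ratio from the fraud example; the confusion-matrix counts, the two model names, and the helper functions are hypothetical and chosen only to show how accuracy and business cost can disagree.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)

def recall(tp, fn):
    """Fraction of actual positives (fraud cases) that were caught."""
    return tp / (tp + fn)

def total_cost(fp, fn, fp_cost=1, fn_cost=100):
    """Business cost when a missed fraud costs 100x a false alarm."""
    return fp * fp_cost + fn * fn_cost

# Hypothetical confusion counts on 1,000 transactions, 10 of them fraudulent.
always_legit = dict(tp=0, fp=0, tn=990, fn=10)   # never flags anything
flagger = dict(tp=9, fp=40, tn=950, fn=1)        # flags aggressively

for name, m in [("always_legit", always_legit), ("flagger", flagger)]:
    print(name, accuracy(**m), recall(m["tp"], m["fn"]), total_cost(m["fp"], m["fn"]))
```

With these counts, the model that never flags fraud has the higher accuracy (0.99 versus 0.959) but a recall of 0 and a cost of 1,000, while the aggressive flagger reaches a recall of 0.9 at a cost of 140.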
You have 300 training examples, 8 features, need to explain every prediction to a judge in a legal case, and predictions must be made in under 10 milliseconds. Which algorithm family is most appropriate?
A data scientist evaluates a new model and finds it achieves 96% accuracy on training data, 94% on validation data, and 94% on test data. She then realizes that one feature is 'case_resolution_date', which was populated only after the outcome she is predicting was known. What has occurred?
Algorithm Selection Case Files
- Work in groups of three. For each case file below, apply the five-dimension framework from this lesson to select an algorithm. Document your reasoning for each dimension.
- Case 1: A hospital wants to predict which ICU patients will need ventilator support in the next 6 hours. Dataset: 50,000 patients, 200 features (vital signs, lab values). Doctors need to understand why the model flagged a patient. False negatives (missed predictions) are catastrophic.
- Case 2: A startup wants to predict the resale price of used electronics listed on their platform. Dataset: 800 listings, 12 features (brand, age, condition, storage). Speed is important — predictions needed in under 50ms. Interpretability is nice to have, not required.
- Case 3: A bank needs to classify transactions as fraudulent or legitimate. Dataset: 2 million transactions, 45 features, only 0.03% fraudulent. Performance must be evaluated monthly. False positives (blocking legitimate transactions) anger customers; false negatives cost money.
- For each case: name your algorithm choice, name the most important dimension that drove the choice, and name one risk you would monitor in production.