The ML Pipeline, End to End
Every machine-learning system you have ever encountered — a spam filter, a recommendation engine, a medical image classifier — was built by following a sequence of interdependent steps. Those steps together are called the machine-learning pipeline. This lesson maps the entire pipeline from start to finish. You will return to each stage in depth over the next nine lessons; here the goal is to see how the stages connect, understand why the order matters, and recognize that the pipeline is not a straight road but a loop you must revisit continuously.
Stage 1: Frame the Problem
A pipeline begins not with data or code but with a question. What decision should the model make? What output is expected, and what counts as a correct one? Suppose a hospital wants to reduce missed diagnoses of diabetic retinopathy — a leading cause of blindness — by screening retinal photographs automatically. Before anyone touches a dataset, the team must answer: Is this a binary classification (disease present or absent) or a multi-class task (grade the severity)? What is the acceptable false-negative rate? Who is harmed if the model errs in each direction? How will the output be used — as a final decision or as a flag for a human specialist to review? These are not technical questions; they are design questions. Bad answers here corrupt every stage that follows.
Machine-learning problems generally fall into three families. Classification assigns an input to one of a discrete set of categories (spam or not spam; a digit from 0 to 9). Regression predicts a continuous numerical value (a house price, tomorrow's temperature). Clustering groups unlabeled inputs by similarity, without predefined categories. Choosing the right task type is the first design decision of the pipeline.
Stages 2-4: Data — Collection, Cleaning, and Splitting
Once the problem is framed, you need data that represents it. Data collection means gathering examples: the inputs the model will learn from and, for supervised learning, the correct output labels for each input. For the retinopathy example, this means retinal photographs paired with diagnoses from expert ophthalmologists.
Real data is almost never clean. Images may be mislabeled, duplicated, or taken under inconsistent lighting. Tabular records may have missing values, impossible entries, or inconsistent units. Data cleaning addresses these problems, not by pretending they do not exist but by making explicit decisions: impute a missing value or discard that row? Cap an outlier or keep it? Every decision must be documented.
After cleaning, the dataset is split into three non-overlapping subsets. The training set is what the model learns from. The validation set is used to tune the model's settings during development, without contaminating the final evaluation. The test set is held out until the very end and used exactly once to report honest performance. Mixing these splits is one of the most common and damaging errors in ML practice.
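The three-way split described above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the function name `split_dataset`, the 70/15/15 proportions, and the fixed seed are all assumptions chosen for the example.

```python
import random

def split_dataset(examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then cut the data into three non-overlapping subsets.

    The fractions and the seed are illustrative defaults, not recommendations.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)

    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Stand-in dataset: 100 example IDs instead of real records.
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Because each example lands in exactly one slice of the shuffled list, no example can appear in two subsets. A real project might also need to split by patient or by time rather than by row, which this sketch does not attempt.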
Data leakage occurs when information the model should not have access to, especially information from the test set, influences training or tuning. A model trained on leaked data appears to perform well during development but fails in production. Always split the data before any preprocessing step that learns from it, including normalization: fit the step on the training set only, then apply it unchanged to the validation and test sets.
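As a small sketch of the leakage rule for normalization, the example below computes the mean and standard deviation from training values only and reuses them for the test values. The helper names `fit_scaler` and `apply_scaler` and the numbers are hypothetical, made up for illustration.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn normalization statistics from the training split only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # avoid dividing by zero on constant data
    return mu, sigma

def apply_scaler(values, scaler):
    """Apply previously learned statistics without recomputing them."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

# Made-up feature values, already split.
train_values = [2.0, 4.0, 6.0, 8.0]
test_values = [10.0, 12.0]

scaler = fit_scaler(train_values)                 # sees training data only
train_scaled = apply_scaler(train_values, scaler)
test_scaled = apply_scaler(test_values, scaler)   # test data never influences mu or sigma
```

Had the statistics been computed over the training and test values together, the test set would have shaped the model's inputs before evaluation, which is the kind of leak the paragraph above warns about.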
Stages 5-6: Features and Training
Raw data is rarely the right input for a model. Feature engineering is the process of selecting, transforming, and constructing the numerical representations the model will actually use. For a tabular dataset, this might mean encoding a categorical variable like city name as a set of binary columns, or combining two columns into a ratio that better captures the underlying signal. The inputs a model receives are called features.
With features in hand, training begins. Training is an optimization process: the model starts with internal settings called parameters, typically initialized at random, makes predictions on the training data, measures how wrong those predictions are using a loss function, and then adjusts the parameters to reduce that error. This adjustment process is repeated thousands or millions of times. The result is a model whose parameters have been tuned to the patterns in the training data.
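The training loop described above can be sketched with a deliberately tiny model. The example assumes a one-feature linear model, mean squared error as the loss function, and plain gradient descent as the adjustment rule; the lesson does not specify any of these, and the learning rate, step count, and data are invented for illustration.

```python
import random

def mse_loss(w, b, xs, ys):
    """Mean squared error of the predictions w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, learning_rate=0.05, steps=2000, seed=0):
    """Fit w and b by repeatedly nudging them to reduce the loss."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # random starting parameters
    n = len(xs)
    for _ in range(steps):
        # How wrong is each prediction right now?
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = train(xs, ys)
print(round(w, 3), round(b, 3))
```

On this toy data the learned parameters end up close to the slope 2 and intercept 1 that generated it. Real models have far more parameters and usually rely on a library to compute the gradients, but the predict, measure, adjust cycle is the same.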
Stages 7-9: Evaluate, Deploy, Monitor
After training, evaluation measures how well the model actually generalizes, that is, how well it performs on data it has never seen. This is where the held-out test set is used. Different metrics are appropriate for different problems; a rare-disease detector that reaches 99 percent accuracy may still be useless if it simply predicts 'healthy' for everyone.
Deployment moves the model from a development environment into production, where real users or systems send it real inputs and expect real outputs. This can mean a web API, an embedded system, a batch scoring job, or many other architectures.
Monitoring is the stage most often skipped by beginners and most often regretted by professionals. The world changes: the distribution of inputs shifts, labels that were reliable become stale, and model performance degrades. Monitoring detects this drift and triggers retraining or redesign. This is why the pipeline is drawn as a loop: deployment is not the finish line.
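The accuracy caveat raised under evaluation can be checked with a little arithmetic. The sketch below uses made-up labels (10 sick patients out of 1,000) and a stand-in "model" that always predicts healthy; recall is used here as one example of a metric that exposes the problem, not as the only appropriate choice.

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive cases that the model caught."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

# Illustrative data: 1 = disease present, 0 = healthy.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a "model" that predicts healthy for everyone

print(accuracy(y_true, y_pred))  # 0.99
print(recall(y_true, y_pred))    # 0.0
```

The accuracy looks excellent while the recall shows that every sick patient was missed, which is why the metric has to be chosen with the problem framing in mind.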
In production ML, the pipeline cycles continuously. Monitoring surfaces problems, which trigger new data collection, retraining, and re-evaluation. The diagram of the pipeline should always be drawn as a circle, never a straight arrow.
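One very rough way to picture a monitoring check is to compare a feature's average in recent production inputs against its average in the training data. The sketch below is a crude heuristic assumed for illustration, not a standard drift test: the function names, the threshold of 0.5 standard deviations, and the data are all hypothetical.

```python
from statistics import mean, pstdev

def drift_score(train_values, live_values):
    """Shift of the live mean from the training mean, in training standard deviations."""
    sigma = pstdev(train_values) or 1.0  # avoid dividing by zero on constant data
    return abs(mean(live_values) - mean(train_values)) / sigma

def needs_review(train_values, live_values, threshold=0.5):
    """Flag a batch whose mean has moved further than the (hypothetical) threshold."""
    return drift_score(train_values, live_values) > threshold

# Made-up values of a single input feature.
train_values = [50, 52, 48, 51, 49, 50]
stable_batch = [49, 51, 50, 52, 48, 50]
shifted_batch = [58, 60, 57, 61, 59, 60]

print(needs_review(train_values, stable_batch))   # False
print(needs_review(train_values, shifted_batch))  # True
```

A flagged batch would feed back into the loop described above: investigate, collect new data if needed, retrain, and re-evaluate. Production systems typically track many features and the model's outputs as well, and use more careful statistical tests than this one.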
A team trains a model on hospital records, achieves 97% accuracy, and ships it to a new hospital. Performance drops to 70%. Which pipeline failure most likely explains this?
Why must the test set be used exactly once, at the very end of development?
Pipeline Audit
- Choose any AI-powered product you use regularly — a music recommender, an autocorrect system, a photo tagger, or anything else.
- For each of the nine pipeline stages (frame, collect, clean, split, engineer, train, evaluate, deploy, monitor), write one or two sentences describing what you think happens at that stage for your chosen product. You may need to make educated inferences.
- Then identify the stage where you think a mistake would be hardest to fix after the fact, and write a paragraph explaining your reasoning.
- Share your analysis with a partner and discuss where your inferences differ.