Cleaning Messy Data
Real data is never perfect. Surveys get skipped. Sensors malfunction. People type their age as 999 by accident. The same person gets entered twice. If you feed a model dirty data, you get a dirty model, and a dirty model gives wrong answers in the real world. Data cleaning is not glamorous, but experienced ML engineers often say it takes up more than half of their time.
Three Major Problems in Raw Data
Missing values occur when a cell has no entry at all. In a health survey, some participants might have skipped the blood pressure question, so that cell is empty. Models cannot calculate with emptiness, so you must decide what to do.
Errors are values that exist but are wrong. A student's recorded age of 217 is an error. A temperature reading of minus 500 Celsius is impossible, because nothing can be colder than about minus 273 Celsius. A zip code entered as 'Zippy' is a text error in a numeric column. Errors can fool a model into learning nonsense patterns.
Duplicates are rows that represent the same example twice: a customer whose order was entered twice, or a patient who appears twice under slightly different name spellings. Duplicates make the model treat one example as if it were two, inflating its importance.
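The three problems above can each be detected with a short check. The following is a minimal sketch using pandas. The table, its column names, and the 0 to 120 age rule are invented for illustration, not taken from a real survey.

```python
import pandas as pd

# Hypothetical survey data, invented for illustration only.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cy", "Dee"],
    "age": [34, 217, 217, 51, 29],
    "blood_pressure": [120.0, 135.0, 135.0, None, 118.0],
})

# Missing values: count the empty cells in each column.
missing_per_column = df.isna().sum()

# Errors: flag values that break a known rule (assumed here: age is 0 to 120).
bad_age = ~df["age"].between(0, 120)

# Duplicates: flag rows that exactly repeat an earlier row.
is_duplicate = df.duplicated()

print(missing_per_column)
print(df[bad_age])
print(df[is_duplicate])
```

Note that `duplicated()` only catches exact repeats. A patient entered under two slightly different spellings would need fuzzier matching, which this sketch does not attempt.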
An old rule in data science says that a model is only as good as the data it trains on. Errors and missing values do not disappear when you run a training algorithm. They get baked in. A model trained on dirty data will make predictions as dirty as its training set.
How do data professionals handle these problems?
For missing values, the main strategies are:
- Deletion: remove the entire row. This is simple, but you lose data, so it is only a good choice when few rows are affected.
- Imputation: fill in the missing value with something reasonable. For a numerical column, the column average is a common choice. For a categorical column, the most frequent category is the usual choice. More advanced methods predict the missing value from other columns.
For errors, validate each value against known rules (age must be between 0 and 120; zip code must be five digits), then either correct the value if you can or treat it as missing if you cannot.
For duplicates, identify rows that are identical or nearly identical and keep only one copy.
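As a rough illustration of these strategies, here is a minimal pandas sketch. The data and column names are hypothetical, and the choice to turn an out-of-range age into a missing value assumes the true age cannot be recovered.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "age": [30.0, 217.0, np.nan, 50.0, 50.0],
    "city": ["Austin", "Denver", None, "Austin", "Austin"],
})

# Duplicates: keep only the first copy of each identical row.
deduped = df.drop_duplicates()

# Errors: an age outside 0 to 120 cannot be corrected here, so treat it as missing.
validated = deduped.copy()
validated["age"] = validated["age"].where(validated["age"].between(0, 120))

# Deletion: drop every row that has any missing value.
deleted = validated.dropna()

# Imputation: fill numbers with the column mean, categories with the most frequent value.
imputed = validated.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(deleted)
print(imputed)
```

Deletion leaves only two of the four unique rows here, which shows how quickly it can shrink a small dataset. Imputation keeps all four but fills two ages with the same average value.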
Why Cleaning Order Matters
Do most of your cleaning before you split the data (splitting is covered in the next lesson) and long before training. The cleaning steps themselves also have an order. First, remove exact duplicates, which are straightforward. Second, fix clear errors such as impossible values and wrong data types. Third, handle missing values, where the strategy depends on how many are missing and why.
A warning: imputation is the exception to the clean-before-split rule. If you impute missing values using statistics from the entire dataset before splitting into train and test sets, information from the test set 'leaks' into the training set. The right approach is to compute imputation statistics only on the training set, then apply those same statistics to the test set. You will understand why after the next lesson.
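The leak described in the warning can be seen with a few numbers. This sketch assumes the data has already been split. The two small tables and the "score" column are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data that has already been split into train and test sets.
train = pd.DataFrame({"score": [80.0, 90.0, np.nan, 70.0]})
test = pd.DataFrame({"score": [np.nan, 100.0]})

# Leaky approach: the mean is computed over all rows, including the test set.
leaky_mean = pd.concat([train, test])["score"].mean()

# Safer approach: compute the mean from the training rows only.
train_mean = train["score"].mean()

# Apply that same training statistic to both sets.
train_filled = train.fillna({"score": train_mean})
test_filled = test.fillna({"score": train_mean})

print("mean using all rows:", leaky_mean)
print("mean using training rows only:", train_mean)
```

The two means differ because the test score of 100 pulls the all-rows average upward. Using only `train_mean` keeps that test value from influencing anything the model sees during training.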
Filling in missing values with averages is a practical fix, but it hides uncertainty. A column where 40 percent of values are imputed is much less reliable than one where 2 percent are imputed. Always note how much of each column was missing and how you handled it.
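One way to keep that record is to measure the missing share of each column and to mark which cells were filled in. The sketch below uses pandas on invented data. The `age_was_missing` column name is a hypothetical choice, not a standard convention.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "age": [30.0, np.nan, 45.0, np.nan, 50.0],
    "city": ["Austin", "Denver", None, "Austin", "Chicago"],
})

# Percentage of missing values in each column, measured before any filling.
missing_pct = df.isna().mean() * 100

# Remember which ages were originally missing, then impute with the mean.
df["age_was_missing"] = df["age"].isna()
df["age"] = df["age"].fillna(df["age"].mean())

print(missing_pct)
print(df)
```

Keeping the flag column means a later reader can still tell real measurements from filled-in ones.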
Match each data problem to its correct definition.
A dataset of patient ages contains the value 312 for one row. This is best described as:
Why should you compute imputation statistics only on the training set?
Spot the Mess
- Step 1: Below is a small dataset. Examine every cell.
  ID | Name  | Age | Score | City
  1  | Alice | 14  | 88    | Austin
  2  | Bob   | -3  | 91    | Denver
  3  | Carol | 13  |       | Austin
  4  | Alice | 14  | 88    | Austin
  5  | Dave  | 12  | 105   |
  6  | Eve   | 13  | 79    | Chicago
- Step 2: List every problem you find. Name the row, column, and type of problem (missing, error, or duplicate).
- Step 3: For each problem, write what you would do to fix it.
- Step 4: After cleaning, how many rows remain? Is the cleaned dataset trustworthy? Explain.