Data Workshop
This lesson does not introduce new concepts. It gives you a messy, realistic dataset scenario and asks you to apply everything from this module: spotting problems, naming them correctly, deciding what to do, and reasoning about the downstream consequences for a model. Data scientists spend most of their working hours doing exactly this. Let us practice.
The Scenario
A school district wants to build a model to predict whether a student will need extra support in math by the end of the year. They have been collecting data for two years. Here is a sample of their raw dataset:

| Student ID | Grade | Absences | Quiz Avg | Teacher Rating | Parent Meeting | SupportNeeded |
|------------|-------|----------|----------|----------------|----------------|---------------|
| 1001       | 6     | 3        | 82.0     | Good           | Yes            | No            |
| 1002       | 7     | 11       | 61.5     | Poor           | No             | Yes           |
| 1003       | 6     | 3        | 82.0     | Good           | Yes            | No            |
| 1004       | 8     |          | 74.0     | Fair           | Yes            | No            |
| 1005       | 7     | 2        | 120.0    | Excellent      | No             | No            |
| 1006       | 6     | 5        | 68.0     | Good           | No             | ?             |
| 1007       | 8     | 19       | 55.5     | Poor           | No             | Yes           |
| 1008       | 7     | 3        | 71.0     | Good           | Yes            | No            |

The column 'SupportNeeded' is the label. All other columns except Student ID are intended as features.
Work through each problem type systematically. For every issue you find: name the row and column, identify the type of problem (missing value, error, duplicate, bias risk, leakage risk), and propose a specific action. Good data science is methodical, not random.
Let us walk through the problems together.

Problem 1: Duplicate rows. Students 1001 and 1003 match in every column except Student ID: same grade, absences, quiz average, teacher rating, parent meeting, and label. Treat one of them as a duplicate entry and remove it. Keeping both makes the model learn that particular pattern twice, artificially inflating its confidence in it.

Problem 2: Missing value. Student 1004 has no entry in the Absences column. There are two options: remove the row (but you lose a valid example), or impute, for example by substituting the average absences of the other students in the same grade. Imputing preserves the row.

Problem 3: Error. Student 1005 has a Quiz Average of 120.0. Quizzes are out of 100, so this is an impossible value. Treat it as missing, then either remove the row or impute the value.

Problem 4: Missing label. Student 1006 has a '?' in SupportNeeded. A row with an unknown label cannot be used for supervised learning. It must be excluded from the training set, though it could be a candidate for prediction once the model is trained.

Problem 5: Bias risk in Teacher Rating. The feature Teacher Rating is a categorical assessment made by the classroom teacher. Could teacher judgments vary by student background in ways unrelated to math skill? This is a label-bias-adjacent concern. It does not mean the feature must be removed, but it warrants monitoring: does model accuracy differ across student demographic groups?
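The first four checks above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the district's actual tooling: the rows are typed in by hand from the sample table, `None` stands for the blank Absences cell, the column names and the helper `audit` are made up for this example, and the 0 to 100 quiz range comes from the statement that quizzes are out of 100.

```python
# Sample rows transcribed from the table; None marks the blank Absences cell.
COLUMNS = ["student_id", "grade", "absences", "quiz_avg",
           "teacher_rating", "parent_meeting", "support_needed"]
RAW = [
    (1001, 6, 3, 82.0, "Good", "Yes", "No"),
    (1002, 7, 11, 61.5, "Poor", "No", "Yes"),
    (1003, 6, 3, 82.0, "Good", "Yes", "No"),
    (1004, 8, None, 74.0, "Fair", "Yes", "No"),
    (1005, 7, 2, 120.0, "Excellent", "No", "No"),
    (1006, 6, 5, 68.0, "Good", "No", "?"),
    (1007, 8, 19, 55.5, "Poor", "No", "Yes"),
    (1008, 7, 3, 71.0, "Good", "Yes", "No"),
]
rows = [dict(zip(COLUMNS, r)) for r in RAW]


def audit(rows):
    """Return a list of (student_id, column, problem) findings (hypothetical helper)."""
    findings = []
    seen = {}  # values of every column except the ID -> first student_id seen
    for row in rows:
        sid = row["student_id"]
        key = tuple(v for k, v in row.items() if k != "student_id")
        if key in seen:
            findings.append((sid, "all", f"duplicate of {seen[key]}"))
        else:
            seen[key] = sid
        if row["absences"] is None:
            findings.append((sid, "absences", "missing value"))
        # Assumption: quiz averages live on a 0-100 scale.
        if row["quiz_avg"] is not None and not 0 <= row["quiz_avg"] <= 100:
            findings.append((sid, "quiz_avg", "error: out of range"))
        if row["support_needed"] not in ("Yes", "No"):
            findings.append((sid, "support_needed", "missing label"))
    return findings


for finding in audit(rows):
    print(finding)
```

A script like this only names the row, the column, and the problem type. Deciding what to do about each finding, and spotting a bias risk such as the one in Teacher Rating, still takes human judgment.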
Thinking About the Split
After cleaning (removing the duplicate and keeping one copy, imputing or dropping rows with errors or missing values, and excluding the unlabeled row), your cleaned dataset is smaller. Before splitting, reconsider three questions.

Is the dataset still large enough to split? With very small datasets, practitioners sometimes use a technique called cross-validation instead of a single train-test split. The idea: divide the data into five equal parts, train on four parts and test on the remaining one, repeat five times with a different test part each time, and average the results. This gives a more reliable estimate when data is scarce.

Is the label distribution balanced after cleaning? Count SupportNeeded = Yes and SupportNeeded = No. If the counts are heavily skewed, plan for it.

Does the split preserve the time structure? If older student records are all in training and newer ones are all in test, the model is being tested on the future, which is closer to real-world use. This is often more honest than a random shuffle for time-ordered data.
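The cross-validation idea can be sketched with nothing more than row indices. The function names `k_fold_indices` and `cross_validate` are hypothetical, and `score_fold` stands in for whatever "train on these rows, test on those rows" procedure you choose. When the row count does not divide evenly, this sketch makes the parts as equal as possible rather than exactly equal. It also keeps rows in their existing order; in practice rows are often shuffled first, unless the data is time-ordered.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, test_idx) pairs; every row is in the test part exactly once."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    # Spread the remainder so part sizes differ by at most one row.
    base, extra = divmod(n_rows, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def cross_validate(n_rows, score_fold, k=5):
    """Average the per-part scores returned by score_fold(train_idx, test_idx)."""
    scores = [score_fold(train, test) for train, test in k_fold_indices(n_rows, k)]
    return sum(scores) / len(scores)


# Example: 6 rows split into 5 parts.
for train_idx, test_idx in k_fold_indices(6, k=5):
    print("train:", train_idx, "test:", test_idx)
```

With only a handful of rows, each test part holds one or two students, so any single part's score is very noisy. Averaging across parts helps, but it does not replace collecting more data.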
In the workshop dataset, what type of problem do the duplicate rows 1001 and 1003 represent?
Student 1006 has '?' as their label. What is the correct action?
Full Pipeline Walkthrough
- Step 1: Copy the 8-row dataset from above onto paper or a spreadsheet.
- Step 2: Remove the duplicate. Which row did you keep? Why?
- Step 3: Handle the missing Absences value for student 1004. Calculate the average absences across the remaining non-duplicate rows in grade 8 and impute it.
- Step 4: Handle the impossible Quiz Average of 120.0 for student 1005. What do you replace it with?
- Step 5: Remove student 1006 from the training set.
- Step 6: With your cleaned dataset, count how many rows have SupportNeeded = Yes and how many have No. Is it balanced?
- Step 7: If you were to split this cleaned dataset 80-20, how many rows go to training and how many to testing? (Round to whole rows.)
- Step 8: Write two sentences about the biggest risk remaining in this dataset even after cleaning.
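Once you have worked through the steps by hand, you can check your arithmetic against a short script. This is one possible sketch, not the only valid answer: it keeps the first copy of the duplicate, imputes with the mean of the same grade, and, as an assumption for Step 4, treats the impossible quiz average as missing and imputes it the same way. The column names and helper functions are made up for this example.

```python
from collections import Counter
from statistics import mean

# Sample rows transcribed from the table; None marks the blank Absences cell.
COLUMNS = ["student_id", "grade", "absences", "quiz_avg",
           "teacher_rating", "parent_meeting", "support_needed"]
RAW = [
    (1001, 6, 3, 82.0, "Good", "Yes", "No"),
    (1002, 7, 11, 61.5, "Poor", "No", "Yes"),
    (1003, 6, 3, 82.0, "Good", "Yes", "No"),
    (1004, 8, None, 74.0, "Fair", "Yes", "No"),
    (1005, 7, 2, 120.0, "Excellent", "No", "No"),
    (1006, 6, 5, 68.0, "Good", "No", "?"),
    (1007, 8, 19, 55.5, "Poor", "No", "Yes"),
    (1008, 7, 3, 71.0, "Good", "Yes", "No"),
]
rows = [dict(zip(COLUMNS, r)) for r in RAW]


def drop_duplicates(rows):
    """Step 2: keep the first row for each combination of non-ID values."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(v for k, v in row.items() if k != "student_id")
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def impute_grade_mean(rows, column):
    """Fill None in `column` with the mean of that column within the same grade."""
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if row[column] is None:
            peers = [r[column] for r in rows
                     if r["grade"] == row["grade"] and r[column] is not None]
            row[column] = mean(peers)  # raises StatisticsError if no peer has a value
        filled.append(row)
    return filled


cleaned = drop_duplicates(rows)
# Step 4 (assumption): an average above 100 is impossible, so mark it missing.
cleaned = [dict(r, quiz_avg=None) if r["quiz_avg"] > 100 else r for r in cleaned]
cleaned = impute_grade_mean(cleaned, "absences")  # Step 3
cleaned = impute_grade_mean(cleaned, "quiz_avg")  # Step 4
# Step 5: rows without a usable label cannot be used for training.
cleaned = [r for r in cleaned if r["support_needed"] in ("Yes", "No")]

label_counts = Counter(r["support_needed"] for r in cleaned)  # Step 6
n_train = round(0.8 * len(cleaned))                           # Step 7
n_test = len(cleaned) - n_train

print(len(cleaned), "rows:", dict(label_counts))
print("train rows:", n_train, "test rows:", n_test)
```

Notice how thin the evidence behind each imputed value is. Student 1007 is the only other grade 8 row, so student 1004 simply inherits 1007's absences, and the replacement quiz average for student 1005 rests on just two grade 7 students. That fragility is worth keeping in mind for Step 8.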