Module Check
You have covered the full data pipeline: organizing examples into datasets, choosing features, attaching labels, cleaning errors, splitting for honest evaluation, handling class imbalance, identifying where bias hides, and designing a dataset from scratch. This lesson ties those threads together. Work through it carefully: the questions test understanding, not memorization.
Key Terms Review
Multi-Topic Quiz
A row in a dataset represents a single house listing. The columns are: square footage, number of bedrooms, zip code, year built, and sale price. You want to predict sale price. Which columns are features?
You discover that 30% of rows in your dataset have a missing value in the 'income' column. You decide to fill each missing value with the average income from the entire dataset, including test rows. What mistake are you making?
A fraud detection model achieves 99.2% accuracy on a dataset where 99% of transactions are legitimate. What should you do before celebrating?
A hiring algorithm is trained on five years of past hiring decisions that favored candidates from elite universities. The data is accurate — those hires really happened. What type of bias is present?
You train a model, check its accuracy on the test set, adjust some features, retrain, and check the test set again — ten times. What problem does this create?
Which of the following best describes a labeled dataset?
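For the fraud detection question, it can help to see the arithmetic behind a high accuracy score on skewed data. The sketch below uses made-up numbers (1,000 transactions, 10 of them fraudulent, matching the 99% legitimate rate in the question) and a hypothetical "model" that never flags anything. The helper names `accuracy` and `recall` are illustrative and do not come from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positive-class rows that the predictions caught."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predictions_for_positives:
        return 0.0
    hits = sum(p == positive for p in predictions_for_positives)
    return hits / len(predictions_for_positives)


# Assumed toy data: 1 = fraud, 0 = legitimate.
y_true = [1] * 10 + [0] * 990

# A "model" that labels every transaction as legitimate.
always_legit = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, always_legit)  # 0.99
baseline_recall = recall(y_true, always_legit)      # 0.0

print(f"accuracy={baseline_accuracy:.3f} fraud recall={baseline_recall:.3f}")
```

Under these assumptions, predicting "legitimate" every time already reaches 99% accuracy while catching no fraud, so a reported 99.2% is best compared against that baseline and paired with a per-class measure such as recall.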
Every ML project moves through the same data pipeline:
1. Define the prediction question and label.
2. Identify features that are measurable, available, and relevant.
3. Collect examples with a representative sampling strategy.
4. Clean: remove duplicates, fix errors, handle missing values.
5. Split: training set to learn from, test set for final honest evaluation.
6. Check class balance; address skew if needed.
7. Audit for bias: who is represented, who is not, what the labels actually measure.
Each step protects the step after it. Skip one, and a flaw quietly flows forward.
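The cleaning, splitting, and balance-check steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a recommended implementation: the rows, the `customer_id`/`income`/`label` layout, the 80/20 ratio, and the seed are all assumptions made for the example. One detail is deliberate: the fill value for missing incomes is computed from training rows only, after the split, so nothing about the test rows leaks into training (the mistake described in the quiz question on missing values).

```python
import random
from collections import Counter
from statistics import mean

# Hypothetical rows: (customer_id, income_in_thousands or None, label).
rows = [
    (1, 40.0, 0), (2, 52.0, 0), (3, None, 1), (4, 61.0, 0), (5, 45.0, 0),
    (6, None, 0), (7, 70.0, 1), (8, 38.0, 0), (9, 55.0, 0), (10, 49.0, 0),
    (4, 61.0, 0),  # exact duplicate of an earlier row
]

# Clean: drop exact duplicates while keeping the original order.
unique_rows = list(dict.fromkeys(rows))

# Split: shuffle with a fixed seed, then hold out 20% as the test set.
rng = random.Random(0)
shuffled = unique_rows[:]
rng.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
train, test = shuffled[:cut], shuffled[cut:]

# Handle missing values: the fill value comes from TRAINING rows only.
train_mean = mean(income for _, income, _ in train if income is not None)


def fill_missing(data, value):
    """Replace a missing income with the given fill value."""
    return [(cid, value if income is None else income, label)
            for cid, income, label in data]


train_filled = fill_missing(train, train_mean)
test_filled = fill_missing(test, train_mean)

# Check class balance on the training set.
label_counts = Counter(label for _, _, label in train_filled)
print(len(unique_rows), len(train_filled), len(test_filled), dict(label_counts))
```

The same training-only rule applies to any statistic learned from data, such as scaling ranges or category frequencies.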
Capstone: Dataset Design Review
- Step 1: Read this scenario. A city wants to predict which traffic intersections are likely to have an accident in the next month, so they can send inspectors to check signal timing and road markings.
- Step 2: Define the prediction task precisely: write the question, define one example (one row), and define the label.
- Step 3: Propose five features. For each, justify why it is measurable, available before the prediction period, and likely related to accident risk.
- Step 4: Identify one potential sampling bias in how the city might collect this data. How would you reduce it?
- Step 5: Identify one potential historical bias risk. What past pattern might the data encode that should not be reinforced?
- Step 6: Describe your train-test split strategy. Would you split randomly or by time? Why?
- Step 7: The city asks: 'Is this model fair?' Write two sentences on what you would check to answer that honestly.
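For Step 6, one of the two options is a split by time. The sketch below shows what that could look like, under the assumption that one row is one intersection in one month and the label records whether an accident occurred in the following month. The field names, the toy data, and the cutoff date are all hypothetical, and `split_by_time` is an illustrative helper rather than a library function.

```python
from datetime import date


def split_by_time(rows, cutoff):
    """Rows dated before the cutoff go to training; the rest go to test."""
    train = [r for r in rows if r["month"] < cutoff]
    test = [r for r in rows if r["month"] >= cutoff]
    return train, test


# Assumed toy data: two intersections observed from January to June 2023.
accidents = {("A", 2), ("B", 5)}  # (intersection, month number) pairs, made up
rows = [
    {
        "intersection": name,
        "month": date(2023, m, 1),
        "accident_next_month": int((name, m) in accidents),
    }
    for m in range(1, 7)
    for name in ("A", "B")
]

train, test = split_by_time(rows, cutoff=date(2023, 5, 1))

# Every training row is strictly earlier than every test row.
print(len(train), len(test))
```

With this arrangement the model is always evaluated on months later than anything it trained on, which mirrors how the city would use it. Whether that is preferable to a random split is the judgment Step 6 asks you to make and justify.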