Module Check
You have traveled through ten lessons building a serious, grounded understanding of data — what it is, where it comes from, how it is organized, what makes it good or bad, and why it matters so deeply for AI. This final lesson is your checkpoint. It is not a test you can cram for, because everything here is something you genuinely worked through. It is a moment to synthesize, to see the whole shape of the module, and to confirm that the ideas are solid enough to build on in the modules ahead.
Core Vocabulary Review
Module Quiz
A fitness app records your heart rate every 5 seconds. What type of data is this, and is it collected actively or passively?
A dataset has 10,000 rows and 3 columns. A researcher adds 2 new columns by measuring two additional features for every existing row. What changed in the dataset?
A language model is trained on web text that overwhelmingly associates the word 'scientist' with male pronouns because that was the historical norm in most published text. Which type of bias best describes this?
A medical AI was trained on patient records from hospitals in high-income urban areas. It is now deployed at rural clinics. Which quality dimension of the training data is most relevant to its likely poor performance?
You discover that 30% of the 'age' column in a dataset contains the value 999. What is the most likely explanation, and what data quality problem does it point to?
A company wants to use your past social media posts to train its AI sentiment classifier. You never explicitly agreed to this use. Which concept from this module is most relevant?
Every lesson in this module connects to one central idea: data is not neutral. It is collected by someone, from some people, at some time, using some measurement choices. Every one of those decisions shapes what the AI can learn — and who it will serve well or poorly. Asking 'where did this data come from, and who is in it?' is one of the most important critical thinking habits you can bring to AI. You now have the vocabulary and the framework to ask it rigorously.
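The question "where did this data come from, and who is in it?" can be asked of a dataset directly. The sketch below is one minimal way to do that, assuming each record carries a `source` field and a `region` field; both field names and all of the sample rows are hypothetical and invented for illustration.

```python
from collections import Counter

# Hypothetical records: each row notes where it came from and who it describes.
# The field names and values are made up for this example.
records = [
    {"source": "survey", "region": "urban"},
    {"source": "survey", "region": "urban"},
    {"source": "app_activity", "region": "urban"},
    {"source": "app_activity", "region": "urban"},
    {"source": "app_activity", "region": "rural"},
]


def share_by(rows, field):
    """Return each value's share of the rows for the given field."""
    counts = Counter(row[field] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


source_shares = share_by(records, "source")  # where the data came from
region_shares = share_by(records, "region")  # who is in it

print(source_shares)
print(region_shares)
```

A tally like this does not say whether a dataset is fair. It only makes the makeup visible, so that a lopsided split (here, four urban rows to one rural row) becomes something you can question.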
Capstone: Design a Fair Dataset
You are designing a training dataset for an AI that will recommend tutors to students on a nationwide learning platform. The platform serves students ages 10 to 18 in every US state, across many languages and income levels.
Your capstone task has five parts:
1. Purpose and labels: State precisely what the AI will predict. Define the label you will use. Is this a classification or regression task?
2. Features: List at least six features you will collect. For each one, explain why it is relevant and whether it could introduce any bias.
3. Data sources: Identify which of the five source types (sensors, user activity, surveys, open web, curated datasets) you will use and why. Explain how you will collect data from students who do not have internet access at home.
4. Quality plan: For each of the four quality dimensions, name one specific risk for your dataset and one step you will take to address it.
5. Fairness review: Identify two groups who are at risk of being underrepresented in your dataset. Describe specific steps you will take to include them more fairly.
Present your design to the class (or write it up as a one-page brief). Be ready to defend your choices: your classmates will ask hard questions about the tradeoffs you made.
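For the fairness review, it can help to turn "underrepresented" into a number. The sketch below is one possible check, not a required method: it compares each group's share of the dataset with its share of the population the platform is meant to serve, and flags groups that fall well short. The function name `underrepresented`, the group names, the counts, the reference shares, and the `min_ratio` threshold of 0.8 are all assumptions chosen for illustration.

```python
def underrepresented(dataset_counts, reference_shares, min_ratio=0.8):
    """List groups whose dataset share is below min_ratio times their reference share.

    min_ratio is an arbitrary illustrative threshold, not an established standard.
    """
    total = sum(dataset_counts.values())
    if total == 0:
        raise ValueError("dataset_counts must contain at least one record")

    flagged = []
    for group, ref_share in reference_shares.items():
        # A group missing from the dataset entirely counts as a share of 0.
        data_share = dataset_counts.get(group, 0) / total
        if data_share < min_ratio * ref_share:
            flagged.append(group)
    return flagged


# Hypothetical numbers, invented for this example only.
dataset_counts = {"home_internet": 900, "no_home_internet": 100}
reference_shares = {"home_internet": 0.75, "no_home_internet": 0.25}

flagged = underrepresented(dataset_counts, reference_shares)
print(flagged)
```

With these made-up numbers, students without home internet are 10% of the dataset but 25% of the reference population, so that group is flagged. In your own design, where the reference shares come from is itself a choice you should be ready to defend.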