Machine Learning & Deep Learning

⏱ About 15 min · 15 XP

Datasets and Rows

Every time you have used a recommendation, seen a translated sentence, or heard a voice assistant answer a question, a dataset was involved. A model cannot learn from nothing — it needs organized examples, and a dataset is exactly that: a structured collection of examples. Understanding how datasets work is the first step in understanding machine learning.

What a Dataset Is

A dataset is a collection of examples about the same subject, arranged so that each example has the same set of information recorded about it. Think of a school nurse's logbook. Each visit is one example. For every visit the nurse records the same things: date, student name, grade, complaint, and action taken. That logbook is a dataset. In machine learning we almost always store datasets as tables — rows and columns. One row is one example. The columns are the different pieces of information recorded for every example.
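One way to picture the rows-and-columns idea in code is a list of records, one record per row. The sketch below uses Python and the nurse's logbook. The column names are written in a code-friendly style, and every visit in it is invented purely for illustration.

```python
# A tiny version of the nurse's logbook as a table.
# NOTE: all values below are made up for illustration.
COLUMNS = ["date", "student_name", "grade", "complaint", "action_taken"]

logbook = [
    {"date": "2024-03-04", "student_name": "Student A", "grade": 5,
     "complaint": "headache", "action_taken": "rested 15 minutes"},
    {"date": "2024-03-04", "student_name": "Student B", "grade": 3,
     "complaint": "scraped knee", "action_taken": "cleaned and bandaged"},
    {"date": "2024-03-05", "student_name": "Student C", "grade": 6,
     "complaint": "stomach ache", "action_taken": "called parent"},
]

# One row is one example, so counting rows counts examples.
num_rows = len(logbook)
num_columns = len(COLUMNS)
print(f"{num_rows} rows (examples), {num_columns} columns")

# Print the table one row at a time, in a fixed column order.
for visit in logbook:
    print(" | ".join(str(visit[col]) for col in COLUMNS))
```

Each dictionary is one row, and the keys play the role of column names. Real projects usually load tables from files, but the shape is the same.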

Row = One Example

In a dataset, each row represents a single example: one observation, one record, one data point. Every row has exactly the same columns as every other row. A row with a missing value is still a row. What makes it a row is that it describes one example, not that every cell is filled in.
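A short Python sketch can make this concrete. It assumes a missing value is stored as `None`, and the weather readings are hypothetical, not taken from a real source.

```python
# Hypothetical weather readings; None marks a value nobody recorded.
readings = [
    {"city": "City A", "temperature_c": 21.5, "rained": False},
    {"city": "City B", "temperature_c": None, "rained": True},
    {"city": "City C", "temperature_c": 18.0, "rained": False},
]

# Every row has the same columns, even the row with a gap.
column_sets = {tuple(row.keys()) for row in readings}
same_columns = len(column_sets) == 1

# The incomplete row still counts as an example.
total_rows = len(readings)
incomplete_rows = [row for row in readings if None in row.values()]

print("Same columns in every row:", same_columns)
print("Total rows:", total_rows)
print("Rows with a missing value:", len(incomplete_rows))
```

The row for City B is incomplete, but it is still counted among the three examples.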

Here is a tiny dataset about houseplants:

Plant Name | Pot Size (cm) | Sunlight (hrs/day) | Watered Daily? | Healthy?
Peacock Fern | 12 | 2 | No | Yes
Basil | 8 | 6 | Yes | Yes
Cactus | 6 | 8 | No | Yes
Mint | 10 | 3 | Yes | No

Each row is one plant. Each column records the same attribute for every plant. The last column (Healthy?) is the outcome we might want a model to predict. The other columns are the information we give the model to predict it.
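The split between the information given to a model and the outcome it should predict can be sketched in Python. The values come from the houseplant table. The code-style column names and the choice to store Yes/No as `True`/`False` are assumptions made for this example.

```python
# The houseplant table, one dictionary per row.
plants = [
    {"plant_name": "Peacock Fern", "pot_size_cm": 12, "sunlight_hrs": 2,
     "watered_daily": False, "healthy": True},
    {"plant_name": "Basil", "pot_size_cm": 8, "sunlight_hrs": 6,
     "watered_daily": True, "healthy": True},
    {"plant_name": "Cactus", "pot_size_cm": 6, "sunlight_hrs": 8,
     "watered_daily": False, "healthy": True},
    {"plant_name": "Mint", "pot_size_cm": 10, "sunlight_hrs": 3,
     "watered_daily": True, "healthy": False},
]

# The column we would like a model to predict.
LABEL = "healthy"

# Inputs: every column except the outcome column.
features = [{k: v for k, v in row.items() if k != LABEL} for row in plants]
# Outcomes: one value per row, in the same order as the inputs.
labels = [row[LABEL] for row in plants]

for inputs, outcome in zip(features, labels):
    print(inputs, "->", outcome)
```

Keeping inputs and outcomes in the same order matters: the first outcome must belong to the first row, the second to the second, and so on.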

Columns: The Structure of Information

Columns define what gets measured. Every column has a name and a data type. Data types matter a great deal in machine learning.

Numerical columns hold numbers that can be compared or added: height in centimeters, temperature in degrees, price in dollars. Machines handle these naturally.

Categorical columns hold labels from a fixed set of options: color (red, blue, green), country, species. You cannot average them the way you average numbers.

Boolean columns hold exactly two values: yes/no, true/false, 0/1.

Text columns hold free-form written language: product reviews, news headlines, student essays. These are the hardest for models to work with directly, because the same meaning can be expressed in countless ways.
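A small Python sketch shows why the data type changes what you can do with a column. The sunlight and watering values come from the houseplant table. The pot colors are a hypothetical categorical column added only for this example.

```python
from collections import Counter
from statistics import mean

# Numerical column: sunlight hours per day from the houseplant table.
sunlight_hrs = [2, 6, 8, 3]
# Categorical column: hypothetical pot colors from a fixed set of options.
pot_color = ["red", "blue", "green", "red"]
# Boolean column: "Watered Daily?" from the houseplant table.
watered_daily = [False, True, False, True]

# Numbers can be averaged.
average_sunlight = mean(sunlight_hrs)

# Labels cannot be averaged, but they can be counted.
color_counts = Counter(pot_color)
most_common_color = color_counts.most_common(1)[0][0]

# Booleans can be summarized as the share of rows that are True.
share_watered = sum(watered_daily) / len(watered_daily)

print("Average sunlight:", average_sunlight)
print("Most common color:", most_common_color)
print("Share watered daily:", share_watered)
```

Text columns are left out on purpose. Summarizing free-form text takes extra processing steps that go beyond this lesson.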

Size Is Not Everything

A dataset with 500 carefully collected, accurate rows often trains a better model than a dataset with 50,000 sloppy, error-filled rows. Quality and quantity both matter — but quality comes first.

Match each term to its definition.

Terms

Row
Column
Numerical column
Categorical column
Dataset

Definitions

Holds labels from a fixed set of options
One example in a dataset
An attribute recorded for every example
A structured collection of examples about the same subject
Holds values that can be compared or averaged


A table of weather records has 3,000 rows and 7 columns. How many examples does the dataset contain?

Which of the following is a categorical column?

Design Your Own Dataset

  1. Choose a subject you know well: your classmates, your music library, the plants in your home, or anything else.
  2. Decide what makes a single example. Write it down: 'Each row is one _____.'
  3. List five columns you would record. For each column, write the data type: Numerical, Categorical, Boolean, or Text.
  4. Collect or invent five rows of data.
  5. Look at your table. Is every row complete? Are any values missing? Make a note of any gaps; the next lessons explain why they matter.
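If you would like to try the activity in code, here is one possible sketch in Python. The music-library subject, the column names, and all five rows are hypothetical, and `find_missing` is a helper written for this example rather than part of any library. Swap in your own subject and columns.

```python
# Each row is one song. Column names map to the data types from this lesson.
SCHEMA = {
    "title": "Text",
    "genre": "Categorical",
    "length_sec": "Numerical",
    "year": "Numerical",
    "favorite": "Boolean",
}

# Five invented rows; two of them have a gap on purpose.
rows = [
    {"title": "Song A", "genre": "rock", "length_sec": 215, "year": 2019, "favorite": True},
    {"title": "Song B", "genre": "jazz", "length_sec": None, "year": 2021, "favorite": False},
    {"title": "Song C", "genre": "pop", "length_sec": 188, "year": 2020, "favorite": False},
    {"title": "Song D", "genre": "rock", "length_sec": 240, "year": 2018, "favorite": True},
    {"title": "Song E", "length_sec": 201, "year": 2022, "favorite": False},
]


def find_missing(rows, schema):
    """Return (row_number, column) pairs where a value is absent or None."""
    missing = []
    for number, row in enumerate(rows, start=1):
        for column in schema:
            if row.get(column) is None:
                missing.append((number, column))
    return missing


gaps = find_missing(rows, SCHEMA)
for number, column in gaps:
    print(f"Row {number} is missing a value for '{column}'")
```

The check treats a column that was left out of a row the same way as one explicitly set to `None`, since both mean that nothing was recorded.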