Machine Learning & Deep Learning


Collecting Examples

A machine needs examples to learn. But where do all those examples come from? Someone has to go find them, save them, and organize them. That takes real work — and it matters a lot how it is done. Today we will follow the journey of data from the real world into a machine's learning pile.

Where Examples Come From

Examples can come from many different places:
Sensors: Cameras, microphones, and thermometers all collect data automatically. A weather station records the temperature every hour. A security camera records video all day.
People: Sometimes people create data on purpose. A scientist writes down how many birds she sees. Students fill out a survey. Someone types a message.
Old records: Hospitals have years of health records. Libraries have millions of digitized books. The internet has billions of web pages. All of that already-saved information can become training data for a machine.
Labeling: Often, after data is collected, humans have to add labels. A label tells the machine what an example IS. For instance, a person might look at a photo and type 'this is a dog.' The photo plus the label together become one useful example.
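To see how "the photo plus the label" becomes one example, here is a small Python sketch. The `LabeledExample` class, the file names, and the source names are all made up for illustration; they do not come from a real dataset.

```python
from dataclasses import dataclass


@dataclass
class LabeledExample:
    """One training example: the data plus the label a person gave it."""
    source: str  # where the example came from (made-up category names)
    data: str    # stand-in for the real data, here just an image file name
    label: str   # what a human says the example IS


# Three pretend examples, one from each kind of source described above.
examples = [
    LabeledExample(source="sensor", data="camera_photo_001.jpg", label="dog"),
    LabeledExample(source="people", data="survey_drawing_17.png", label="cat"),
    LabeledExample(source="old records", data="library_scan_0420.png", label="bird"),
]

for ex in examples:
    print(f"{ex.data} (from {ex.source}) -> {ex.label}")
```

Each line that gets printed shows one piece of data together with its label. Without the label, the machine would only have a file and no idea what it shows.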

The Big Idea

Examples come from sensors, from people, and from old records. Often people must also add labels — tags that tell the machine what each example means — before the machine can learn from it.

Imagine a team wants to teach a machine to recognize handwritten numbers. Here is how they might collect examples:

  1. Ask thousands of volunteers to write the numbers 0 through 9 on paper.
  2. Scan all those papers to turn the handwriting into image files.
  3. Label each image: this one says '3,' this one says '7.'
  4. Put all the labeled images together into a big collection, called a dataset.
  5. Let the machine study every example in the dataset.

A famous dataset called MNIST has 70,000 examples of handwritten digits, all collected and labeled by real people.
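The collecting steps above can be sketched in a few lines of Python. This is only a toy version: the file names and labels are invented, and a real project would load actual scanned images instead of names.

```python
from collections import Counter

# Steps 1-2: pretend these are scanned image files from volunteers (made-up names).
scanned_files = ["page1_a.png", "page1_b.png", "page2_a.png", "page2_b.png"]

# Step 3: a person looks at each image and records the digit it shows.
human_labels = {
    "page1_a.png": 3,
    "page1_b.png": 7,
    "page2_a.png": 3,
    "page2_b.png": 0,
}

# Step 4: pair each image with its label to form the dataset.
dataset = [(name, human_labels[name]) for name in scanned_files]

# Step 5 would hand `dataset` to a learning program.
# Here we only count how many examples there are for each digit.
label_counts = Counter(label for _, label in dataset)
print(len(dataset), "examples")
print(label_counts)
```

Counting the labels is a quick way to check that the dataset has examples of every digit and not, say, hundreds of 3s and no 8s.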

Complete this sentence about how data gets ready for a machine.

After images are collected, humans add ______ to tell the machine what each example means.

Collecting examples is not always easy. It costs time. It takes effort. And not every example that gets collected is a good one — blurry photos, muffled recordings, and misspelled words can all sneak in. That is why people who build machine learning programs spend a huge amount of time thinking carefully about how to collect and organize their data.

Labels Must Be Correct

If a label is wrong — like calling a photo of a cat 'dog' — the machine learns something false. Checking labels carefully is one of the most important jobs in building a learning machine.
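One possible way to check labels is to have two people label the same photos and then look closely at any photo where they disagree. The Python sketch below shows that idea. The names, photos, and the `find_disagreements` helper are all invented for this example, and this is just one checking method among many.

```python
# Two people label the same photos on their own (all names here are made up).
labels_from_ana = {"photo1.jpg": "cat", "photo2.jpg": "dog", "photo3.jpg": "cat"}
labels_from_ben = {"photo1.jpg": "cat", "photo2.jpg": "dog", "photo3.jpg": "dog"}


def find_disagreements(first, second):
    """Return the photos whose two labels do not match, in sorted order."""
    return sorted(name for name in first if first[name] != second.get(name))


needs_review = find_disagreements(labels_from_ana, labels_from_ben)
print(needs_review)  # ['photo3.jpg']
```

Here the two labelers agree on the first two photos, so only `photo3.jpg` needs a second look before it goes into the dataset.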

What is a label in a dataset?

Which of these is a way that data examples can be collected?

Label Your Own Dataset

  1. Look through a magazine, newspaper, or printed photos you are allowed to cut or draw on.
  2. Collect five pictures of different things (animals, food, vehicles, etc.).
  3. Write a label under each picture — one or two words that say exactly what it shows.
  4. Now shuffle the pictures face-down and swap with a friend.
  5. Have your friend look at only the labels and guess what the pictures show.
  6. Did the labels give enough information? This is what it feels like to create labeled data!