Collecting Examples
A machine needs examples to learn. But where do all those examples come from? Someone has to go find them, save them, and organize them. That takes real work — and it matters a lot how it is done. Today we will follow the journey of data from the real world into a machine's learning pile.
Where Examples Come From
Examples can come from many different places:
- Sensors: Cameras, microphones, and thermometers all collect data automatically. A weather station records the temperature every hour. A security camera records video all day.
- People: Sometimes people create data on purpose. A scientist writes down how many birds she sees. Students fill out a survey. Someone types a message.
- Old records: Hospitals have years of health records. Libraries have millions of digitized books. The internet has billions of web pages. All of that already-saved information can become training data for a machine.
Labeling: Often, after data is collected, humans have to add labels. A label tells the machine what an example IS. For instance, a person might look at a photo and type 'this is a dog.' The photo plus the label together become one useful example.
Examples come from sensors, from people, and from old records. Often people must also add labels — tags that tell the machine what each example means — before the machine can learn from it.
Imagine a team wants to teach a machine to recognize handwritten numbers. Here is how they might collect examples:
- Step 1: Ask thousands of volunteers to write the numbers 0 through 9 on paper.
- Step 2: Scan all those papers to turn the handwriting into image files.
- Step 3: Label each image: this one says '3,' this one says '7.'
- Step 4: Put all the labeled images together into a big collection called a dataset.
- Step 5: Let the machine study every example in the dataset.
A famous dataset called MNIST has 70,000 examples of handwritten digits, all collected and labeled by real people.
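For readers who want to see the idea in code, here is a small Python sketch of the labeling and collecting steps. Everything in it is made up for illustration: the tiny 3-by-3 grids stand in for scanned handwriting, and the names make_example and count_by_label are hypothetical helpers, not part of any real tool.

```python
# Made-up "scans": each image is a 3x3 grid of 0s and 1s
# standing in for a scanned piece of handwriting.
scanned_images = [
    [[0, 1, 0], [0, 1, 0], [0, 1, 0]],  # a person wrote a 1
    [[1, 1, 1], [0, 0, 1], [0, 0, 1]],  # a person wrote a 7
    [[1, 1, 1], [1, 0, 1], [1, 1, 1]],  # a person wrote a 0
]

# Labels typed in by a human, in the same order as the scans.
human_labels = [1, 7, 0]


def make_example(image, label):
    """Pair one image with its label to form one example."""
    return {"image": image, "label": label}


def count_by_label(examples):
    """Count how many examples the dataset holds for each label."""
    counts = {}
    for example in examples:
        counts[example["label"]] = counts.get(example["label"], 0) + 1
    return counts


# The dataset is simply all the labeled examples put together.
dataset = [
    make_example(image, label)
    for image, label in zip(scanned_images, human_labels)
]

print(len(dataset))
print(count_by_label(dataset))
```

A real dataset works the same way, only with far more examples and much bigger images. Counting the examples per label is a quick way to see whether some digits are missing from the collection.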
Complete this sentence about how data gets ready for a machine.
Collecting examples is not always easy. It costs time. It takes effort. And not every example that gets collected is a good one — blurry photos, muffled recordings, and misspelled words can all sneak in. That is why people who build machine learning programs spend a huge amount of time thinking carefully about how to collect and organize their data.
If a label is wrong — like calling a photo of a cat 'dog' — the machine learns something false. Checking labels carefully is one of the most important jobs in building a learning machine.
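One possible way to check labels, shown here only as a sketch, is to have two people label the same photos and then look for the ones where they disagree. The two-labeler setup, the file names, and the helper find_disagreements are all assumptions made for this example.

```python
def find_disagreements(labels_a, labels_b):
    """Return the names of examples where two labelers gave different labels."""
    return sorted(
        name for name in labels_a
        if labels_a[name] != labels_b.get(name)
    )


# Hypothetical labels from two different people for the same photos.
first_labeler = {"photo1.jpg": "dog", "photo2.jpg": "cat", "photo3.jpg": "dog"}
second_labeler = {"photo1.jpg": "dog", "photo2.jpg": "cat", "photo3.jpg": "cat"}

# Any photo listed here needs a third look before the machine studies it.
needs_review = find_disagreements(first_labeler, second_labeler)
print(needs_review)  # ['photo3.jpg']
```

When both people agree, the label is more likely to be right. A disagreement does not say who is wrong, only that someone should look at that photo again.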
What is a label in a dataset?
Which of these is a way that data examples can be collected?
Label Your Own Dataset
- Look through a magazine, newspaper, or printed photos you are allowed to cut or draw on.
- Collect five pictures of different things (animals, food, vehicles, etc.).
- Write a label under each picture — one or two words that say exactly what it shows.
- Now cover each picture so that only its label shows, shuffle them, and swap with a friend.
- Have your friend look at only the labels and guess what the pictures show.
- Did the labels give enough information? This is what it feels like to create labeled data!