Features: What the Model Looks At
In the last lesson you saw that a dataset is a table of rows and columns. But when a model learns, it does not look at the entire row the way a human might. It looks at specific columns — the ones you chose to feed it. Those chosen columns are called features, and picking the right ones can be the difference between a model that works and one that fails completely.
Defining a Feature
A feature is a measurable input variable that describes one aspect of an example. Every column in your dataset is a candidate feature, but not every column should become one. Return to the houseplant dataset from lesson one. The columns were: plant name, pot size, sunlight hours, watered daily, and healthy. If you want to predict whether a plant is healthy, the column 'healthy' is the answer you're trying to predict — not a feature. The columns pot size, sunlight hours, and watered daily are the features: they describe the plant in ways that might explain whether it is healthy.
A feature is a measurable input attribute you provide to a model. Features are the information the model uses to make its prediction. The thing you want to predict is not a feature — it is called the label (covered in the next lesson).
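The split between feature columns and the answer column can be sketched in a few lines of Python. This is only an illustration. The rows below are invented values, not the real data from lesson one, and the centimetre unit for pot size and the helper name `split_row` are assumptions.

```python
# Hypothetical rows: the values are made up for illustration only.
plants = [
    {"plant_name": "Fern", "pot_size_cm": 12, "sunlight_hours": 3,
     "watered_daily": True, "healthy": True},
    {"plant_name": "Cactus", "pot_size_cm": 8, "sunlight_hours": 8,
     "watered_daily": True, "healthy": False},
]

# The columns we chose to feed the model, and the column we want to predict.
FEATURE_COLUMNS = ["pot_size_cm", "sunlight_hours", "watered_daily"]
ANSWER_COLUMN = "healthy"


def split_row(row):
    """Return (features, answer) for one row of the table."""
    features = {name: row[name] for name in FEATURE_COLUMNS}
    answer = row[ANSWER_COLUMN]
    return features, answer


features, answer = split_row(plants[0])
print("Features:", features)
print("Answer:", answer)
```

Note that the answer column never appears inside `features`: the model sees only the columns listed in `FEATURE_COLUMNS`.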
Here is a worked example. Suppose you want to predict whether an email is spam. Useful features might include:
- Number of exclamation marks in the subject line: easy to count, correlates with spam.
- Whether the word 'free' appears: a classic spam signal.
- Length of the email in words: spammers often write short messages.
- Whether the sender is in your contacts: a powerful signal.
Notice each feature is concrete and measurable. 'Feels suspicious' is not a feature a machine can use; 'contains the phrase free money' is.
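One way to see why these features count as "concrete and measurable" is to write each one as a small calculation. The sketch below is a minimal illustration, not a real spam filter. The function name `extract_features`, the choice to look for 'free' in both subject and body, and the sample email are all assumptions made for this example.

```python
import re


def extract_features(subject, body, sender, contacts):
    """Turn one email into the four measurable features from the example."""
    # Lowercase words from the subject and body, with punctuation stripped.
    words = re.findall(r"[a-z]+", (subject + " " + body).lower())
    return {
        "exclamation_marks_in_subject": subject.count("!"),
        "contains_free": "free" in words,
        "length_in_words": len(body.split()),
        "sender_in_contacts": sender in contacts,
    }


# Hypothetical email and contact list, invented for illustration.
email_features = extract_features(
    subject="You won!!!",
    body="Claim your FREE prize now",
    sender="promo@example.com",
    contacts={"friend@example.com"},
)
print(email_features)
```

Every value in the result is a number or a true/false answer, which is exactly what a model can work with. There is no way to write a similar line of code for 'feels suspicious'.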
Choosing Features Well
Feature selection, deciding which columns to include, is one of the most important skills in machine learning. Choosing poorly hurts in two ways.
- Irrelevant features add noise. If you include the font size of an email when predicting spam, the model may try to use it even though font size has nothing to do with spam. That wastes the model's learning capacity and can introduce random errors.
- Missing features leave the model blind. If you try to predict whether a student will pass a test but you forget to include how many hours they studied, you have left out a very important signal.
A good feature is: measurable, available when you need to make the prediction, and actually related to what you are predicting.
Never include as a feature any information that you would not have available at the time you make the real prediction. Example: if you want to predict tomorrow's weather, you cannot use tomorrow's humidity as a feature, because you do not know it yet. Using future information makes accuracy look unrealistically high during training and testing, and then collapse in real use. This mistake has a name: data leakage.
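A simple habit that helps avoid leakage is to write down, for every candidate column, whether its value is known at the moment you make the prediction. The sketch below assumes a weather task. The column names are hypothetical, and only tomorrow's humidity comes from the example above.

```python
# Hypothetical candidate columns for predicting tomorrow's weather.
# The True/False value records: is this known when the prediction is made?
candidates = {
    "today_humidity": True,
    "today_temperature": True,
    "tomorrow_humidity": False,  # not known yet: using it would be data leakage
}

safe_features = [name for name, known in candidates.items() if known]
leaky_features = [name for name, known in candidates.items() if not known]

print("Safe to use:", safe_features)
print("Leave out:", leaky_features)
```

The code cannot decide for you whether a value is known in time. That judgment is yours. It only makes the decision explicit so a leaky column is harder to include by accident.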
You are building a model to predict if a library book will be returned late. What would be a good feature to include, and why?
What is feature selection?
Feature Hunt
- Step 1: Pick a prediction task — will it rain tomorrow, will a player win a video game match, will a student enjoy a book?
- Step 2: Brainstorm ten possible features. Write them all down.
- Step 3: For each feature, answer two questions: Is it measurable in a concrete number or category? Would I have this information at the time I need to make the prediction?
- Step 4: Cross out any feature that fails either test.
- Step 5: Rank your remaining features from most to least likely to be useful. Justify your top three in one sentence each.
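Steps 3 and 4 of the activity can also be written as a tiny filter. This is a sketch under assumptions: the task is "will a student enjoy a book?", and the brainstormed features and their yes/no answers are invented examples, not a recommended list.

```python
# Hypothetical brainstorm for the task "will a student enjoy a book?".
# Each entry: (feature, is it measurable?, is it known at prediction time?)
brainstorm = [
    ("number of pages", True, True),
    ("genre", True, True),
    ("how exciting the book feels", False, True),
    ("rating the student gives after reading", True, False),
]

# Step 4: keep only features that pass both tests from Step 3.
kept = [name for name, measurable, known in brainstorm if measurable and known]
crossed_out = [name for name, measurable, known in brainstorm
               if not (measurable and known)]

print("Keep:", kept)
print("Cross out:", crossed_out)
```

Step 5, ranking the survivors and justifying the top three, is left to you: it depends on what you know about the task, not on anything the code can check.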