Machine Learning & Deep Learning

Accuracy and Other Scores

Imagine a hospital builds a model to detect a rare disease that affects 1% of patients. The model learns a clever trick: it always predicts 'no disease.' Its accuracy is 99% — because 99% of patients really do not have it. But the model has never correctly detected even one sick patient. Accuracy, the most natural measure of success, just completely fooled us. This lesson is about why one number is almost never enough — and what better numbers to use.

What Accuracy Actually Measures

Accuracy is the fraction of predictions the model got right:

Accuracy = (Number of correct predictions) ÷ (Total predictions)

It is simple and intuitive, which is why it is reported everywhere. For balanced problems, where each class appears about equally often, accuracy is a reasonable first measure.

But as soon as classes are imbalanced (one class is much more common than another), accuracy becomes deceptive. The disease-detection example above is a classic case. A model that does nothing scores 99% accuracy. That is not a good model; it is a useless model that got lucky because the problem was imbalanced.
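To make the 99% figure concrete, here is a minimal sketch in plain Python. The dataset is hypothetical (1,000 patients, 10 of them sick), and the `accuracy` helper is written here for illustration only; it is not taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical data: 1 = disease, 0 = no disease; 1% of patients are sick.
y_true = [1] * 10 + [0] * 990

# The "always predict no disease" model from the example.
y_pred = [0] * 1000

baseline_accuracy = accuracy(y_true, y_pred)
print(baseline_accuracy)  # 0.99
```

The model never looks at any patient data, yet it matches the true label for all 990 healthy patients, which is enough to reach 99%.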

Accuracy Is Misleading on Imbalanced Data

If one class dominates the dataset, a model that always predicts that class gets high accuracy for free. Always check whether your dataset is balanced before trusting accuracy as your main metric.

To go beyond accuracy, we need to look at four outcomes separately. These four categories come from comparing what the model predicted against what was actually true:

True Positive (TP): model predicted positive, truth is positive. Correct catch.
True Negative (TN): model predicted negative, truth is negative. Correct clear.
False Positive (FP): model predicted positive, truth is negative. False alarm.
False Negative (FN): model predicted negative, truth is positive. Missed case.

Arranged into a grid called a confusion matrix, these four numbers reveal exactly where a model is succeeding and failing.
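The four outcomes can be counted directly from paired labels. The sketch below assumes binary labels where 1 means positive; the function name `confusion_counts` and the tiny example dataset are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN by comparing each prediction with the truth."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # correct catch
        elif pred == positive and truth != positive:
            fp += 1  # false alarm
        elif pred != positive and truth == positive:
            fn += 1  # missed case
        else:
            tn += 1  # correct clear
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical example: 3 actual positives, 5 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 4}
```

The four counts always add up to the total number of predictions, which is a quick sanity check when filling in the grid by hand.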

Precision, Recall, and F1

From those four outcomes we can compute metrics that care about specific failure modes.

Precision answers: of all the times the model said 'positive,' how often was it right?

Precision = TP ÷ (TP + FP)

High precision means few false alarms. Useful when false alarms are costly, like spam filters that accidentally block important emails.

Recall (also called sensitivity) answers: of all the actual positives, how many did the model catch?

Recall = TP ÷ (TP + FN)

High recall means few missed cases. Crucial when missing a case is dangerous, like failing to detect a disease or a security threat.

F1 Score is the harmonic mean of precision and recall, so it balances both:

F1 = 2 × (Precision × Recall) ÷ (Precision + Recall)

F1 is useful when you need a single number that penalizes models that sacrifice one for the other.

Back to the disease model: recall = 0 (it catches none of the actual positives). Precision is technically 0 ÷ 0, because the model never predicts positive and so TP + FP = 0; by convention it is reported as 0. F1 is then reported as 0 as well. Accuracy said 99%. Precision, recall, and F1 all say 0. The real picture is clear now.
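The three formulas translate into a few lines of Python. This is a sketch, not a reference implementation: the function names are illustrative, and returning 0.0 when a denominator is zero is an assumed convention (a common one) for cases such as a model that never predicts positive.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 if the model made no positive predictions."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 if there are no actual positives."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# The "always predict no disease" model on a hypothetical 1,000 patients:
# 10 sick patients all missed (FN), 990 healthy patients all cleared (TN).
tp, fp, fn, tn = 0, 0, 10, 990

p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(p, r)
print(p, r, f1)  # 0.0 0.0 0.0
```

Note that TN does not appear in any of the three formulas, which is why these metrics are not inflated by the large number of healthy patients.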

Pick Your Metric Based on the Cost of Errors

False positive costly (e.g., wrongly flagging safe emails as spam)? Optimize precision.
False negative costly (e.g., missing a disease diagnosis)? Optimize recall.
Need balance? Use F1.
No single metric fits every problem.

Match each metric to what it measures.

Terms

Accuracy
Precision
Recall
F1 Score
False Negative

Definitions

Fraction of all predictions that were correct
Of all actual positives, the fraction the model correctly caught
Of all predicted positives, the fraction that were truly positive
The harmonic mean of precision and recall
An actual positive the model incorrectly predicted as negative

A spam filter marks 10 emails as spam. 8 were actually spam; 2 were important emails it flagged incorrectly. What is the precision?

A disease-screening model has recall of 0.40. What does that mean?

Build a Confusion Matrix

  1. Imagine a model that screens 100 patients for an infection. The truth is: 20 patients are infected, 80 are healthy.
  2. The model predicts: 15 of the 20 infected patients are correctly flagged (TP=15), 5 infected patients are missed (FN=5), 10 healthy patients are incorrectly flagged (FP=10), 70 healthy patients are correctly cleared (TN=70).
  3. Draw a 2×2 confusion matrix grid, labeling the four cells TP, FP, FN, TN and filling in the numbers.
  4. Calculate accuracy, precision, and recall using the formulas from this lesson.
  5. Write one sentence about which metric matters most for this medical scenario and why.
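
Once you have worked through the calculation by hand, a short script can check your arithmetic. This is a minimal sketch that plugs the four counts from the exercise into the lesson's formulas; the variable names are illustrative, and the printed values are deliberately not shown here so you can compare them with your own answers.

```python
# Counts taken from the exercise: 100 patients, 20 infected, 80 healthy.
tp, fn, fp, tn = 15, 5, 10, 70

total = tp + fn + fp + tn

# Accuracy = correct predictions / total predictions
exercise_accuracy = (tp + tn) / total

# Precision = TP / (TP + FP)
exercise_precision = tp / (tp + fp)

# Recall = TP / (TP + FN)
exercise_recall = tp / (tp + fn)

print("accuracy: ", exercise_accuracy)
print("precision:", exercise_precision)
print("recall:   ", exercise_recall)
```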