Balanced vs. Skewed Data
Imagine you are building a model to detect a rare disease. You gather 1,000 medical records. Only 10 of those patients actually have the disease — the other 990 are healthy. You train a model and it achieves 99 percent accuracy. Impressive? Not really. A model that simply guesses 'healthy' for every single patient, without looking at a single feature, would also score 99 percent. Your model may have learned nothing.
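The scenario above can be checked with a few lines of Python. This is a minimal sketch, assuming the 1,000 records are represented as plain string labels; the variable names are illustrative and not part of any library.

```python
# Rebuild the scenario: 10 patients with the disease, 990 healthy.
labels = ["disease"] * 10 + ["healthy"] * 990

# A "model" that ignores every feature and always answers 'healthy'.
predictions = ["healthy"] * len(labels)

# Accuracy: fraction of predictions that match the true label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# How many of the sick patients did it actually find?
found = sum(p == "disease" and y == "disease" for p, y in zip(predictions, labels))

print(f"accuracy: {accuracy:.0%}")
print(f"disease cases found: {found} of {labels.count('disease')}")
```

The accuracy figure looks strong while the count of detected cases shows the model found none of the patients who matter.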
Class Balance and Skew
In a classification problem, each label value is called a class. A binary classifier has two classes — spam or not spam, fraud or legitimate, disease or healthy. A balanced dataset has roughly equal numbers of examples from each class. If 500 emails are spam and 500 are not spam, the dataset is balanced. A skewed dataset — also called imbalanced — has many more examples of one class than another. If 950 emails are not spam and only 50 are spam, the dataset is skewed toward the majority class (not spam). Skew is extremely common in the real world. Fraud is rare. Disease is rare. Equipment failures are rare. Most of the time, the 'interesting' event that a model needs to catch is the minority class.
On a skewed dataset, raw accuracy is a misleading metric. A model that always predicts the majority class will be highly accurate but completely useless — it will never detect the rare event you care about. To measure a model honestly on skewed data, practitioners look at recall (did it find the rare cases?) and precision (when it flagged something, was it right?).
Here is a concrete demonstration. Dataset: 990 legitimate transactions, 10 fraudulent ones.
- Model A (lazy model): predicts 'legitimate' for every transaction. Accuracy: 990/1000 = 99%. Fraud detected: 0 out of 10. Useless.
- Model B (useful model): misclassifies 20 legitimate transactions as fraud, but catches 8 of the 10 actual frauds. It gets 970 legitimate transactions and 8 frauds right, so its accuracy is 978/1000 = 97.8%. Lower accuracy, but dramatically more useful.
This is why the metric you choose matters as much as the data you collect.
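The comparison between the two models can be sketched in Python by working from the four confusion-matrix counts. The `summarize` helper is hypothetical, written only for this illustration, and it assumes that precision is reported as 0 when a model flags nothing (strictly, it is undefined in that case).

```python
def summarize(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts.

    tp: frauds correctly flagged      fp: legitimate flagged as fraud
    fn: frauds missed                 tn: legitimate correctly passed
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Assumption: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Model A: predicts 'legitimate' for all 1,000 transactions.
model_a = summarize(tp=0, fp=0, fn=10, tn=990)

# Model B: 20 false alarms, catches 8 of the 10 frauds.
model_b = summarize(tp=8, fp=20, fn=2, tn=970)

for name, (acc, prec, rec) in [("Model A", model_a), ("Model B", model_b)]:
    print(f"{name}: accuracy={acc:.1%}  precision={prec:.1%}  recall={rec:.1%}")
```

Recall separates the two models far more clearly than accuracy does. Model B's precision is also worth noting: most of its fraud flags are false alarms, which is a cost the team has to weigh against the frauds it catches.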
What Can You Do About Skew?
Data scientists use several strategies when facing imbalanced data.
- Oversampling: duplicate or synthesize more examples of the minority class. You artificially increase the representation of fraud cases, for instance, so the model sees more of them during training.
- Undersampling: remove some examples of the majority class. Fewer legitimate transactions means the ratio improves, though you are throwing away data.
- Adjusting class weights: tell the model that a mistake on the minority class costs more than a mistake on the majority class. Most ML frameworks support this directly.
- Gathering more data: the best solution when possible, especially more examples of the minority class from the real world.
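The first three strategies can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not a production recipe: the records are made-up `(label, id)` pairs, oversampling here means simple random duplication (not synthesis), and the weight formula is the common inverse-frequency heuristic `n_samples / (n_classes * class_count)`.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical dataset: 990 legitimate and 10 fraudulent records.
data = [("legit", i) for i in range(990)] + [("fraud", i) for i in range(10)]
majority = [row for row in data if row[0] == "legit"]
minority = [row for row in data if row[0] == "fraud"]

# Oversampling: keep every original row and add random duplicates
# of minority rows until both classes are the same size.
extra = random.choices(minority, k=len(majority) - len(minority))
oversampled = majority + minority + extra

# Undersampling: keep only a random subset of the majority class.
undersampled = random.sample(majority, k=len(minority)) + minority

# Class weights: rarer classes get proportionally larger weights.
counts = Counter(label for label, _ in data)
weights = {label: len(data) / (len(counts) * n) for label, n in counts.items()}

print(Counter(label for label, _ in oversampled))
print(Counter(label for label, _ in undersampled))
print(weights)
```

In practice, resampling is usually applied to the training split only, so that evaluation still reflects the real-world class ratio.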
Before training any classifier, count how many examples belong to each class. A single line of code — or a quick tally in a spreadsheet — can save you hours of confusion later. If the split is more extreme than about 80-20, plan to address the imbalance.
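A class tally along these lines might look like the following sketch. It assumes the labels are available as a Python list and reuses the spam counts from earlier in the lesson; the 0.80 threshold simply encodes the rough 80-20 rule of thumb and is not a hard rule.

```python
from collections import Counter

# Hypothetical labels: 950 'not spam' emails and 50 'spam' emails.
labels = ["not spam"] * 950 + ["spam"] * 50

# The tally itself is one line.
counts = Counter(labels)

# Share of the dataset taken up by the largest class.
total = sum(counts.values())
majority_label, majority_count = counts.most_common(1)[0]
majority_share = majority_count / total

print(counts)
print(f"majority class: {majority_label} ({majority_share:.0%})")

# Rule of thumb from the text: more extreme than about 80-20.
if majority_share > 0.80:
    print("Split is skewed; plan to address the imbalance.")
```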
Prompt Challenge
You are advising a team building a model to detect wildfires from satellite images. The dataset has 10,000 images: 9,700 show no fire, 300 show active fires. Ask the AI assistant for practical advice on handling this class imbalance.
Your prompt should…
- State the exact class counts and the ratio
- Ask for at least two concrete strategies to address the imbalance
- Request guidance on which evaluation metric to use instead of raw accuracy
A model classifies loan applications as 'approved' or 'denied.' The dataset has 9,500 approved and 500 denied applications. If a model always predicts 'approved,' what is its accuracy?
Which technique involves creating additional examples of the minority class to improve class balance?
Flip and Count
- Step 1: Simulate a dataset. Take a coin. Flip it 30 times. Record each result as heads (H) or tails (T). A fair coin usually lands close to an even split, so any skew you see comes from chance.
- Step 2: Count how many heads and tails you got. This is your class distribution.
- Step 3: Now pretend your 'model' always guesses whichever side appeared more. Calculate its accuracy.
- Step 4: Is that accuracy impressive? Why or why not?
- Step 5: Redesign: how would you collect a more balanced dataset if heads and tails were a real prediction problem?
- Step 6: Write one sentence connecting this exercise to fraud detection.
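If no coin is at hand, Steps 1 through 3 can be simulated. The following is a minimal sketch, assuming a fair coin modeled with Python's `random` module; the seed value is arbitrary and only there to make the run repeatable.

```python
import random
from collections import Counter

random.seed(42)  # arbitrary seed; change it to get a different run

# Step 1: flip a fair coin 30 times.
flips = [random.choice("HT") for _ in range(30)]

# Step 2: count the class distribution.
counts = Counter(flips)

# Step 3: a 'model' that always guesses the more common side.
majority_side, majority_count = counts.most_common(1)[0]
accuracy = majority_count / len(flips)

print("flips:", "".join(flips))
print("counts:", dict(counts))
print(f"always guessing {majority_side}: accuracy {accuracy:.0%}")
```

Steps 4 through 6 are left for you to answer from the numbers your own run produces.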