The Real Problem with Accuracy
Let's start with a story.
You are a doctor. 100 patients walk into your hospital. After pathology tests, the reality is confirmed:
Actually have Cancer → 10 patients
Actually Healthy → 90 patients
Now you just say "Nobody has cancer" for all 100 patients.
Your accuracy (a percentage is just Part / Whole × 100):

Accuracy = Correct predictions / Total
         = (90 / 100) × 100
         = 90% ✅
90% sounds amazing. But you just missed every single cancer patient. You are the worst doctor ever — but accuracy says you are great.
This is the problem. Accuracy lies when data is imbalanced.

Imbalanced simply means not equal, or not properly balanced. Examples:
- Work-life imbalance → too much work, no personal time
- Diet imbalance → eating too much of one type of food, not enough of others
- Data imbalance (in ML) → one category has far more examples than another
In one line:
👉 Imbalanced = something is out of balance or not evenly distributed.
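The "worst doctor" above can be sketched in a few lines of Python. The patient lists here are hypothetical, built only to match the story's counts (10 with cancer, 90 healthy):

```python
# The "nobody has cancer" doctor as code.
# 1 = Cancer, 0 = Healthy — lists made up to match the story.
y_true = [1] * 10 + [0] * 90   # ground truth from pathology
y_pred = [0] * 100             # predict Healthy for every patient

correct = sum(p == t for p, t in zip(y_pred, y_true))
accuracy = correct / len(y_true)
print(f"Accuracy: {accuracy:.0%}")  # 90% — yet...

missed = sum(1 for p, t in zip(y_pred, y_true) if t == 1 and p == 0)
print(f"Cancer patients missed: {missed}")  # 10 out of 10
```

Same 90% the story promised, and every single cancer patient missed.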
F1 Score fixes this lie.
How F1 Score Works — Step by Step
Step 1 — Understand the Setup
You build an ML Model that looks at patient data and predicts — "Cancer or No Cancer?"
The model has never seen the pathology results. It makes its own independent guesses based on symptoms, age, reports, etc.
Ground Truth (Reality) → 10 patients have cancer [FIXED, never changes]
Model's Predictions → Model thinks 10 patients have cancer [Its own guess]
These are two separate lists. Now we compare them.
Step 2 — Compare Model Predictions vs Reality
The model flagged these 10 patients as "Cancer":
Model's Cancer List → P1, P2, P3, P4, P5, P6, P7, P8, P9, P10
Reality's Cancer List (from pathology):
Actual Cancer List → P1, P2, P3, P4, P5, P6, P7, P11, P12, P13
Now compare both lists:
Both lists have → P1, P2, P3, P4, P5, P6, P7 → 7 patients ✅
(Model was RIGHT about these)
Only in Model list → P8, P9, P10 → 3 patients ❌
(Model said Cancer but they were Healthy)
Only in Real list → P11, P12, P13 → 3 patients ❌
(Actually had Cancer but Model MISSED them)
This is where 7 comes from — the overlap between both lists.
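The same comparison can be done with Python sets, using the exact patient IDs from Step 2:

```python
# Comparing the two lists from Step 2 with set operations.
model_cancer = {"P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9", "P10"}
actual_cancer = {"P1", "P2", "P3", "P4", "P5", "P6", "P7", "P11", "P12", "P13"}

overlap = model_cancer & actual_cancer       # model was RIGHT about these
false_alarms = model_cancer - actual_cancer  # said Cancer, actually Healthy
missed = actual_cancer - model_cancer        # had Cancer, model missed them

print(len(overlap))       # 7
print(len(false_alarms))  # 3  (P8, P9, P10)
print(len(missed))        # 3  (P11, P12, P13)
```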
Step 3 — Learn the 4 Terms
Every prediction falls into one of these 4 boxes:
                       MODEL SAID
                    Cancer     No Cancer
                  ┌───────────┬───────────┐
REALITY   Cancer  │     7     │     3     │ ← 10 actual cancer patients
                  ├───────────┼───────────┤
         Healthy  │     3     │    87     │ ← 90 actual healthy patients
                  └───────────┴───────────┘
| Term | Full Name | Meaning | Count |
|------|-----------|---------|-------|
| TP | True Positive | Actually Cancer + Model said Cancer ✅ | 7 |
| FP | False Positive | Actually Healthy + Model said Cancer ❌ | 3 |
| FN | False Negative | Actually Cancer + Model said Healthy ❌ | 3 |
| TN | True Negative | Actually Healthy + Model said Healthy ✅ | 87 |
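The four counts can be verified in plain Python. The lists below are hypothetical, arranged so the 100 patients line up with the story's numbers:

```python
# Counting the four boxes for the 100-patient example.
# 1 = Cancer, 0 = Healthy.
y_true = [1] * 10 + [0] * 90                 # 10 cancer, 90 healthy
y_pred = [1] * 7 + [0] * 3 + [1] * 3 + [0] * 87  # aligned with y_true

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

print(tp, fp, fn, tn)  # 7 3 3 87
```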
TP, FP, FN, TN — Decision Logic
Just look at 2 things:
- What is the Reality? (Ground Truth)
- What did the Model predict? (Prediction)
Decision Table

| Reality | Model Said | Name | How to Remember |
|---------|------------|------|-----------------|
| Cancer | Cancer | True Positive | Both same → True. Model said Positive (cancer) → Positive |
| Healthy | Cancer | False Positive | Different → False. Model said Positive → Positive |
| Cancer | Healthy | False Negative | Different → False. Model said Negative (healthy) → Negative |
| Healthy | Healthy | True Negative | Both same → True. Model said Negative → Negative |
The Naming Formula
TRUE / FALSE → Was the model correct or wrong?
POSITIVE/NEGATIVE → What did the model predict?
That's it. True/False = model right/wrong, Positive/Negative = model's prediction.
Quick Practice
Patient P8 → Reality: Healthy, Model said: Cancer
- Was the model correct? ❌ → False
- What did the model predict? Cancer → Positive
- Answer: False Positive ✅
Patient P11 → Reality: Cancer, Model said: Healthy
- Was the model correct? ❌ → False
- What did the model predict? Healthy → Negative
- Answer: False Negative ✅
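The naming formula is simple enough to write as a tiny function, which makes the practice examples above mechanical:

```python
# The naming formula as code:
# True/False  = was the model correct?
# Positive/Negative = what did the model predict?
def classify(reality: str, prediction: str) -> str:
    correctness = "True" if reality == prediction else "False"
    polarity = "Positive" if prediction == "Cancer" else "Negative"
    return f"{correctness} {polarity}"

print(classify("Healthy", "Cancer"))   # False Positive  (patient P8)
print(classify("Cancer", "Healthy"))   # False Negative  (patient P11)
print(classify("Cancer", "Cancer"))    # True Positive
print(classify("Healthy", "Healthy"))  # True Negative
```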
Why is FN the most dangerous?
FP → Told a Healthy person "You have Cancer"
→ Unnecessary stress, extra tests
→ But the patient will be monitored ✅
FN → Told a Cancer patient "You are Healthy"
→ Patient went home, no treatment
→ Disease keeps growing ❌ DANGEROUS
This is why Recall matters more in medical cases — because Recall directly tracks how many FNs slipped through.
Recall = TP / (TP + FN)
                   ↑
         More FN = Lower Recall
One Line to Remember
The name tells you two things at once — was the model right, and what did it predict.
Step 4 — Calculate Precision
Question Precision answers: "When the model said Cancer — how often was it actually right?"
Model said "Cancer" to 10 people. Out of those 10, only 7 actually had cancer.
Precision = TP / (TP + FP)
= 7 / (7 + 3)
= 7 / 10
= 0.70 → 70%
Real world meaning: If the model tells you "You have cancer" — there is a 70% chance it is correct. 30% chance it is wrong.
Step 5 — Calculate Recall
Question Recall answers: "Out of all patients who actually had cancer — how many did the model catch?"
10 people actually had cancer. Model caught only 7. It missed 3.
Recall = TP / (TP + FN)
= 7 / (7 + 3)
= 7 / 10
= 0.70 → 70%
Real world meaning: 3 cancer patients went home thinking they are healthy. They will never get treated. This is dangerous.
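Both formulas drop straight into code using the counts from Steps 4 and 5 (TP = 7, FP = 3, FN = 3):

```python
# Precision and Recall from the confusion-matrix counts.
tp, fp, fn = 7, 3, 3

precision = tp / (tp + fp)  # of everyone flagged, how many were right?
recall = tp / (tp + fn)     # of everyone sick, how many were caught?

print(f"Precision: {precision:.0%}")  # 70%
print(f"Recall:    {recall:.0%}")     # 70%
```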
Step 6 — The Tug of War Between Precision and Recall
These two always fight each other. You cannot simply maximize one.
Scenario A — Model is too aggressive (flags everyone as Cancer):
Model said "Cancer" to all 100 patients
TP = 10, FP = 90, FN = 0
Precision = 10 / (10 + 90) = 10% ← Terrible
Recall = 10 / (10 + 0) = 100% ← Perfect
Recall is perfect but Precision is destroyed. You scared 90 healthy people unnecessarily.
Scenario B — Model is too cautious (flags only 1 person as Cancer):
Model said "Cancer" to only 1 patient — and that 1 was correct
TP = 1, FP = 0, FN = 9
Precision = 1 / (1 + 0) = 100% ← Perfect
Recall = 1 / (1 + 9) = 10% ← Terrible
Precision is perfect but Recall is destroyed. 9 sick people went home undetected.
Both extremes are bad. You need balance. That is what F1 gives you.
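The two extremes above can be checked directly with the counts given in each scenario:

```python
# The tug of war from Step 6, computed for both extreme models.
def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)

# Scenario A — flag all 100 patients as Cancer
p_a, r_a = precision_recall(tp=10, fp=90, fn=0)
print(f"A: precision={p_a:.0%}, recall={r_a:.0%}")  # A: precision=10%, recall=100%

# Scenario B — flag only 1 patient (correctly) as Cancer
p_b, r_b = precision_recall(tp=1, fp=0, fn=9)
print(f"B: precision={p_b:.0%}, recall={r_b:.0%}")  # B: precision=100%, recall=10%
```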
Step 7 — F1 Score Calculation
F1 = 2 × (Precision × Recall) / (Precision + Recall)
= 2 × (0.70 × 0.70) / (0.70 + 0.70)
= 2 × 0.49 / 1.40
= 0.98 / 1.40
= 0.70 → 70%
Why not just take a normal average?
Look at Scenario A above:
Normal Average = (10% + 100%) / 2 = 55% ← Sounds okay
F1 Score = 2 × (0.10 × 1.0) / (0.10 + 1.0) = 18% ← Shows the truth
Normal average hides the problem. F1 exposes it immediately.
F1 is strict — both Precision AND Recall must be good. If either one is bad, F1 will be bad.
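A quick sketch makes the harmonic-mean punishment concrete, using Scenario A's numbers next to the balanced model from Step 7:

```python
# F1 (harmonic mean) vs a plain average.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

p, r = 0.10, 1.00  # Scenario A: the aggressive model

print(f"Normal average: {(p + r) / 2:.0%}")  # 55% — sounds okay
print(f"F1 score:       {f1(p, r):.0%}")     # 18% — shows the truth

print(f"Balanced model: {f1(0.70, 0.70):.0%}")  # 70% — F1 rewards balance
```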
Step 8 — Why Accuracy Still Fails Here
Accuracy = (TP + TN) / Total
= (7 + 87) / 100
= 94 / 100
= 94% ← Looks great!
F1 Score = 70% ← Shows the real picture
Accuracy is high because there are 87 True Negatives dragging the number up. But those 87 are just healthy people — easy to get right. The hard part is catching cancer patients — and the model only got 70% of those.
F1 ignores True Negatives completely. It only cares about how well you handle the positive cases.
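To see this, imagine (hypothetically) adding 900 more healthy patients the model classifies correctly. Accuracy climbs, F1 does not move:

```python
# F1 ignores True Negatives: piling on TNs boosts accuracy only.
def scores(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

acc_small, f1_small = scores(7, 3, 3, 87)   # the 100-patient example
acc_big, f1_big = scores(7, 3, 3, 987)      # same model, 900 extra TNs

print(f"Accuracy: {acc_small:.0%} vs {acc_big:.0%}")  # Accuracy: 94% vs 99%
print(f"F1 Score: {f1_small:.0%} vs {f1_big:.0%}")    # F1 Score: 70% vs 70%
```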
Step 9 — Python Code
from sklearn.metrics import f1_score, precision_score, recall_score

# Ground Truth — confirmed by pathology tests
# 1 = Cancer, 0 = Healthy
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0,
          0, 0, 1, 0, 0, 0, 1, 0, 1, 0]

# Model's predictions — its own independent guesses
y_pred = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0,
          0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

print(f"Accuracy  : {accuracy:.2f}")   # 0.85 — looks good but misleading
print(f"Precision : {precision:.2f}")  # 0.86 — when it said cancer, 86% correct
print(f"Recall    : {recall:.2f}")     # 0.75 — caught 6 of the 8 actual cases
print(f"F1 Score  : {f1:.2f}")         # 0.80 — the honest combined score
Step 10 — When to Use What
| Situation | Bigger Mistake | What to Focus On | Use |
|-----------|----------------|------------------|-----|
| 🏥 Cancer Detection | Missing a sick person | High Recall | F1 |
| 📧 Spam Filter | Deleting a real email | High Precision | F1 |
| 💳 Fraud Detection | Missing a fraud | High Recall | F1 |
| 🐶 Dog vs Cat | Balanced data, both classes equal | Either is fine | Accuracy |
Golden Rule: If your dataset is imbalanced (example: 95% Healthy, 5% Cancer) — always use F1. Never trust accuracy.
Complete Summary in One Place
100 Patients Total
├── 10 actually had Cancer (Ground Truth — FIXED)
└── 90 actually Healthy (Ground Truth — FIXED)
Model independently predicted:
├── Said "Cancer" → 10 patients (model's own guess)
│ ├── 7 correct → TP = 7 (overlap between both lists)
│ └── 3 wrong → FP = 3 (healthy people wrongly flagged)
└── Said "Healthy" → 90 patients
├── 87 correct → TN = 87
└── 3 wrong → FN = 3 (cancer patients model missed)
Precision = TP / (TP + FP) = 7 / 10 = 70%
Recall = TP / (TP + FN) = 7 / 10 = 70%
F1 = 2 × (0.7 × 0.7) / (0.7 + 0.7) = 70%
Accuracy = (7 + 87) / 100 = 94% ← Misleading!
F1 Score Scale
| F1 Score | Meaning |
|----------|---------|
| 1.00 | Perfect model 🏆 |
| 0.75 and above | Good — ready for production ✅ |
| 0.50 to 0.75 | Average — needs improvement ⚠️ |
| Below 0.50 | Bad — do not use ❌ |
| 0.00 | Completely useless 🗑️ |
One line to remember forever:
Accuracy tells you how often you were right overall. F1 tells you how well you handled the cases that actually mattered.