Accuracy, Recall, Precision, F-Score & Specificity, which to optimize on?
Based on your project, which performance metric to improve on?
Salma Ghoneim
Apr 2
I will use a basic example to explain each performance metric so that you can really understand the difference between them, and so that in your next ML project you can choose the performance metric that best suits your project.
Here we go
A school is running a machine learning primary diabetes scan on all of its students.
The output is either diabetic (+ve) or healthy (-ve).
There are only 4 cases any student X could end up with.
We’ll be using the following as a reference later, so don’t hesitate to re-read it if you get confused.
True positive (TP): Prediction is +ve and X is diabetic, we want that
True negative (TN): Prediction is -ve and X is healthy, we want that too
False positive (FP): Prediction is +ve and X is healthy, false alarm, bad
False negative (FN): Prediction is -ve and X is diabetic, the worst
To remember that, there are 2 tricks
- If it starts with True then the prediction was correct, whether diabetic or not, so a true positive is a diabetic person correctly predicted & a true negative is a healthy person correctly predicted. Conversely, if it starts with False then the prediction was incorrect, so a false positive is a healthy person incorrectly predicted as diabetic (+) & a false negative is a diabetic person incorrectly predicted as healthy (-).
- Positive or negative indicates the output of our program, while true or false judges whether this output is correct or incorrect.
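The two tricks above can be sketched in a few lines of Python. This is only an illustration: the function name `outcome` and the list of students are made up for the example and are not part of any library.

```python
from collections import Counter

def outcome(predicted_positive: bool, actually_diabetic: bool) -> str:
    """Name the case for one student: TP, TN, FP or FN."""
    # Trick 1: True/False says whether the prediction matched reality.
    correct = predicted_positive == actually_diabetic
    # Trick 2: Positive/Negative is simply what the program output.
    sign = "P" if predicted_positive else "N"
    return ("T" if correct else "F") + sign

# Hypothetical scan results: (prediction is +ve, student is diabetic)
students = [(True, True), (False, False), (True, False), (False, True), (False, False)]

# Count how many students fall into each of the 4 cases.
counts = Counter(outcome(p, a) for p, a in students)
print(counts)
```

Every metric discussed later is just a ratio built from these four counts.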
Before I continue, true positives & true negatives are always good; we love the news the word true brings.
That leaves false positives and false negatives.
In our example, false positives are just a false alarm.
A 2nd, more detailed scan will correct them.
But a false negative label means that students think they’re healthy when they’re not, which is, in our problem, the worst case of the 4.
Whether FP & FN are equally bad or if one of them is worse than the other depends on your problem.
This piece of information has a great impact on your choice of the performance metric, so give it a thought before you continue.
Which performance metric to choose?
Accuracy
It’s the ratio of the correctly labeled subjects to the whole pool of subjects.
Accuracy is the most intuitive one.
Accuracy answers the following question: How many students did we correctly label out of all the students?
Accuracy = (TP+TN)/(TP+FP+FN+TN)
numerator: all correctly labeled subjects (all trues)
denominator: all subjects
Precision
Precision is the ratio of the subjects correctly labeled +ve by our program to all subjects labeled +ve.
Precision answers the following question: How many of those we labeled as diabetic are actually diabetic?
Precision = TP/(TP+FP)
numerator: +ve labeled diabetic people.
denominator: all people labeled +ve by our program (whether they’re diabetic or not in reality).
Recall (aka Sensitivity)
Recall is the ratio of the subjects correctly labeled +ve by our program to all who are diabetic in reality.
Recall answers the following question: Of all the people who are diabetic, how many did we correctly predict?
Recall = TP/(TP+FN)
numerator: +ve labeled diabetic people.
denominator: all people who are diabetic (whether detected by our program or not)
F1-score
F1 Score is the harmonic mean of precision and recall, so it is high only when both are high.
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
Specificity
Specificity is the ratio of the subjects correctly labeled -ve by the program to all who are healthy in reality.
Specificity answers the following question: Of all the people who are healthy, how many did we correctly predict?
Specificity = TN/(TN+FP)
numerator: -ve labeled healthy people.
denominator: all people who are healthy in reality (whether +ve or -ve labeled)
General Notes
Yes, accuracy is a great measure, but only when you have symmetric datasets (false negative & false positive counts are close) and when false negatives & false positives have similar costs.
If the costs of false positives and false negatives are different, then F1 is your savior.
F1 is best if you have an uneven class distribution.
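To see why accuracy can mislead on an uneven class distribution, here is a small worked example. The numbers are invented for illustration (1000 students, 10 of them diabetic, and a model that labels everyone healthy), and returning 0.0 when a denominator is zero is a convention chosen for this sketch, not a universal rule.

```python
def safe_div(num: float, den: float) -> float:
    """Return num/den, or 0.0 when the denominator is zero (convention chosen here)."""
    return num / den if den else 0.0

# Hypothetical school: 1000 students, 10 diabetic.
# A useless model labels every student healthy (-ve).
tp, fn = 0, 10      # all 10 diabetic students are missed
fp, tn = 0, 990     # all 990 healthy students are labeled healthy

accuracy = safe_div(tp + tn, tp + tn + fp + fn)
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)
f1 = safe_div(2 * precision * recall, precision + recall)

print(f"accuracy={accuracy:.2f}  recall={recall:.2f}  f1={f1:.2f}")
# accuracy=0.99  recall=0.00  f1=0.00
```

Accuracy looks excellent even though not a single diabetic student was found, while recall and F1 expose the problem.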
Precision is how sure you are of your true positives whilst recall is how sure you are that you are not missing any positives.
Choose recall if false positives are far more tolerable than false negatives; in other words, if false negatives are unacceptable and you’d rather get some extra false positives (false alarms) than miss some positives, like in our diabetes example.
You’d rather get some healthy people labeled diabetic than leave a diabetic person labeled healthy.
Choose precision if you want to be more confident of your true positives.
For example, spam emails.
You’d rather have some spam emails in your inbox than some regular emails in your spam box.
So, the email company wants to be extra sure that email Y is spam before they put it in the spam box and you never get to see it.
Choose specificity if you want to cover all true negatives, meaning you don’t want any false alarms (false positives).
For example, you’re running a drug test in which all people who test positive will immediately go to jail; you don’t want anyone drug-free going to jail.
False positives here are intolerable.
Bottom Line
- An accuracy of 90% means that, on average, 1 of every 10 labels is incorrect and 9 are correct.
- A precision of 80% means that, on average, 2 of every 10 students our program labeled diabetic are healthy and 8 are diabetic.
- A recall of 70% means that, on average, 3 of every 10 people who are diabetic in reality are missed by our program and 7 are labeled diabetic.
- A specificity of 60% means that, on average, 4 of every 10 people who are healthy in reality are mislabeled as diabetic and 6 are correctly labeled healthy.
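The readings above can be checked with a short sketch that computes all five metrics from the four counts. The helper name `metrics` and the counts are hypothetical; the counts were picked only so that precision, recall and specificity come out at the 80%, 70% and 60% used above. The helper assumes no denominator is zero.

```python
def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the five metrics from the four counts (assumes no zero denominators)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts for 115 students, chosen only for illustration.
scores = metrics(tp=56, tn=21, fp=14, fn=24)
for name, value in scores.items():
    # A value of 0.80 reads as "8.0 of every 10".
    print(f"{name}: {value:.2f} (about {value * 10:.1f} of every 10)")
```

With these counts accuracy lands at roughly 67%, a reminder that the four percentages above are separate examples and not one consistent model.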
Confusion Matrix
Wikipedia will explain it better than me:
In the field of machine learning and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix).
Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa).
The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabeling one as another).
A nice & easy way of calculating a confusion matrix is scikit-learn’s confusion_matrix:
>>> from sklearn.metrics import confusion_matrix
>>> # arguments are (actual labels, predicted labels)
>>> tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()
>>> # true negatives, false positives, false negatives, true positives
>>> (tn, fp, fn, tp)
(0, 2, 1, 1)
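To make the order tn, fp, fn, tp less mysterious, here is a plain-Python sketch that builds the same 2x2 table by hand from the same toy labels. It assumes the layout scikit-learn documents for confusion_matrix (rows are actual classes, columns are predicted classes, label 0 before label 1); reading 1 as diabetic and 0 as healthy is an assumption made for this example.

```python
actual    = [0, 1, 0, 1]   # assumed meaning: 1 = diabetic, 0 = healthy
predicted = [1, 1, 1, 0]

# matrix[a][p] counts students whose actual label is a and predicted label is p
matrix = [[0, 0], [0, 0]]
for a, p in zip(actual, predicted):
    matrix[a][p] += 1

# Reading the table row by row gives the order tn, fp, fn, tp
(tn, fp), (fn, tp) = matrix
print(matrix)
print(tn, fp, fn, tp)
```

The first row holds the healthy students (correctly cleared, then false alarms) and the second row holds the diabetic students (missed, then caught), which is why flattening the table yields tn, fp, fn, tp.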