And your metric should reflect this question.

Metrics available for Classification Problems

I provide a list of the classification metrics available in the popular scikit-learn library.

Details of how they work are visible on their linked doc pages.

Best to Worst Arrangement in my opinion (For Skewed Problems)

Sensitivity Score
Specificity Score
Precision
Recall
Balanced Accuracy
Accuracy
F1 Score
AUC-ROC
Average Precision/Area Under Precision-Recall Curve

Behaviour of Classification Metrics for Imbalanced/Skewed Problems

Jupyter Notebook link

We will create some artificial data and then skew the positive and negative classes.

Next, we will use a classification algorithm and a few metrics to check how the metrics react to increasing skew.

Ideally, without fine-tuning the model or resampling the data, the suitable metrics should go from good to bad as the skew grows.

On the other hand, metrics which don’t take the skew into account will not show much change, or may even show improvement with greater skew.

First, let’s see what imbalanced data looks like. We will use scikit-learn’s make_moons API (covered in a previous blog post about data generation).

[Code: Interactive Imbalance Plot Code]

[Figure: Results as we increase imbalance (Increasing Imbalance)]

Notebook with code.
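The original plotting code is not reproduced here, so what follows is only a minimal sketch of how skewed two-moons data could be produced. The helper name `make_imbalanced_moons`, the noise level, and the sample size are assumptions for illustration, not the author's code.

```python
import numpy as np
from sklearn.datasets import make_moons


def make_imbalanced_moons(n_samples=10_000, pos_fraction=0.1, seed=0):
    """Hypothetical helper: generate balanced two-moons data, then drop
    positives until they make up roughly `pos_fraction` of the rows."""
    X, y = make_moons(n_samples=n_samples, noise=0.3, random_state=seed)
    rng = np.random.default_rng(seed)
    neg_idx = np.flatnonzero(y == 0)
    pos_idx = np.flatnonzero(y == 1)
    # Positives needed so that pos / (pos + neg) is about pos_fraction.
    n_pos = int(round(pos_fraction * len(neg_idx) / (1.0 - pos_fraction)))
    n_pos = max(1, min(n_pos, len(pos_idx)))
    keep_pos = rng.choice(pos_idx, size=n_pos, replace=False)
    keep = np.concatenate([neg_idx, keep_pos])
    rng.shuffle(keep)
    return X[keep], y[keep]


X, y = make_imbalanced_moons(pos_fraction=0.05)
print(X.shape, round(float(y.mean()), 3))
```

Lowering `pos_fraction` keeps every negative example and thins out the positive moon, which is one simple way to increase skew without touching the shape of the data.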

Next, let’s see how various metrics behave under increasing imbalance.

[Code: Generating Data]

Next, we define a function to build models continuously.

[Code: Model builder function]

Finally, we write a function that loops over various imbalance values and our provided metrics, and plots the metrics against the imbalance values.

[Code: Function to plot metrics vs imbalance]

Notice that I use np.logspace to generate the imbalance values. This gives me more values close to 0 and fewer values close to 1.

The call to imbalance at line 8 of that function skews the data; it is a custom function that can be found in lib.py in the same folder. Other custom functions are in the same file.
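Since the original functions and lib.py are not shown here, the loop can only be sketched under assumptions. In the sketch below, `subsample_positives` is a hypothetical stand-in for the custom imbalance function, logistic regression is an arbitrary choice of classifier, and the plot is replaced by a printout.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split


def subsample_positives(X, y, pos_fraction, rng):
    """Hypothetical stand-in for lib.py's imbalance(): keep all negatives
    and only enough positives to reach roughly `pos_fraction`."""
    neg_idx = np.flatnonzero(y == 0)
    pos_idx = np.flatnonzero(y == 1)
    n_pos = int(round(pos_fraction * len(neg_idx) / (1.0 - pos_fraction)))
    n_pos = max(2, min(n_pos, len(pos_idx)))
    keep = np.concatenate([neg_idx, rng.choice(pos_idx, n_pos, replace=False)])
    return X[keep], y[keep]


def metrics_vs_imbalance(pos_fractions, metrics, seed=0):
    """Fit one model per imbalance level and record every metric."""
    rng = np.random.default_rng(seed)
    X, y = make_moons(n_samples=20_000, noise=0.3, random_state=seed)
    results = {name: [] for name in metrics}
    for frac in pos_fractions:
        Xi, yi = subsample_positives(X, y, frac, rng)
        X_tr, X_te, y_tr, y_te = train_test_split(
            Xi, yi, test_size=0.3, stratify=yi, random_state=seed)
        y_pred = LogisticRegression().fit(X_tr, y_tr).predict(X_te)
        for name, fn in metrics.items():
            results[name].append(float(fn(y_te, y_pred)))
    return results


# Log-spaced positive fractions from 1% to 50%: most values sit near 0.
pos_fractions = np.logspace(-2, np.log10(0.5), 8)
metrics = {
    "accuracy": accuracy_score,
    "precision": lambda yt, yp: precision_score(yt, yp, zero_division=0),
    "recall": lambda yt, yp: recall_score(yt, yp, zero_division=0),
    "f1": lambda yt, yp: f1_score(yt, yp, zero_division=0),
}
results = metrics_vs_imbalance(pos_fractions, metrics)
for name, values in results.items():
    print(name, np.round(values, 3))
```

Passing the metrics as a dictionary of callables keeps the loop unchanged when other metrics are swapped in later.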

Finally let’s run it.

We use F1, accuracy, recall and precision as our metrics.

[Figure: Metrics vs Imbalance I]

Our observations are as follows: accuracy is not sensitive to imbalance at all, while precision, recall and F1 are.

Next, let’s try a few other metrics.

We define two new metrics: precision@recall=75 and recall@precision=75. That is, we hold recall (or precision) at a set threshold and then check how the imbalance affects the other.
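One possible way to compute these two metrics is to read them off scikit-learn's precision-recall curve. This is a sketch rather than the author's implementation; it assumes "75" means 75% (0.75), and the function names and toy scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def precision_at_recall(y_true, y_score, min_recall=0.75):
    """Best precision among thresholds whose recall is at least min_recall."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return float(precision[recall >= min_recall].max())


def recall_at_precision(y_true, y_score, min_precision=0.75):
    """Best recall among thresholds whose precision is at least min_precision."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return float(recall[precision >= min_precision].max())


# Toy labels and classifier scores, for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
print(precision_at_recall(y_true, y_score, 0.75))
print(recall_at_precision(y_true, y_score, 0.75))
```

Both functions take scores rather than hard predictions, because the threshold is chosen by the constraint and not fixed at 0.5.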

[Figure: Metrics vs Imbalance II]

Notice that all these metrics are sensitive to imbalance.

To improve them, you need to improve your modelling.

Accuracy, on the other hand, was not sensitive to imbalance and painted a falsely cosy picture of good performance.

This happens because, as skew increases, always predicting the most frequent class yields high accuracy.

In a 1:99 skew case (1 positive and 99 negative examples), if you always predict negative, you are 99% accurate.
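The 1:99 case is easy to check by hand. The snippet below is a small illustration in plain Python, with recall added alongside accuracy to show what accuracy hides.

```python
# 1 positive and 99 negatives, with a "model" that always predicts negative.
y_true = [1] + [0] * 99
y_pred = [0] * 100

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(accuracy, recall)  # 0.99 0.0
```

The classifier never finds the single positive example, yet its accuracy is 99%.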

Accuracy = Number of correctly predicted examples / Total number of examples

Finally, we will compare AUC-ROC and average precision with accuracy.

[Figure: Metrics vs Imbalance III]

Notice that AUC-ROC is not sensitive to imbalance.

As such, if you have a skewed dataset, AUC-ROC is not a good metric.

Let’s find out why.

Explanation of why AUC-ROC is not sensitive to imbalance

First, a little reference on what AUC-ROC is, and another reference.

Blatantly copy-pasting from the article above (to avoid rewriting what the other person already wrote).

[Figure: ROC Curve and Terms]

So the ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) for various thresholds (0–1).

The area under it is about 0.5 for random guessing, which is effectively the lowest value you should see.

Let’s see what it does for increasing imbalance.

What we want to show: TPR and FPR stay roughly the same as imbalance increases.

Eq 1: TPR = TP/(TP+FN)

As imbalance increases, TPR will mostly remain constant, since it depends only on how many positive examples are misclassified.

If our algorithm detects 90% of positives, then TPR = 90/(90+10) = 0.9, i.e. TPR doesn’t depend on skew but only on how well our algorithm can detect the positive class.

Eq 2: FPR = FP/(TN+FP)

Here the fun happens. As skew increases, we will have more FPs. Let’s say our algorithm classifies 1 in every 100 negative examples as positive (an FP); with high skew we have many more negative examples than positive ones, and therefore lots of FPs.

But here we are not considering FP, we are considering FPR. Notice the TN in the denominator: true negatives (TN) also increase as the negative class grows.

As a result, FPR remains the same as well.

Given that both ratios stay the same, it’s no surprise that AUC-ROC isn’t sensitive to skew.
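The argument can be checked with plain arithmetic. The sketch below assumes a classifier with a fixed 90% detection rate and a fixed 1-in-100 false alarm rate (the numbers used in the text) and uses expected counts rather than a trained model; the function name is hypothetical.

```python
def rates_under_skew(n_pos, n_neg, detection_rate=0.9, false_alarm_rate=0.01):
    """Expected TPR, FPR and precision for a classifier whose per-class
    behaviour stays fixed while the class sizes change."""
    tp = detection_rate * n_pos      # positives correctly flagged
    fn = n_pos - tp                  # positives missed
    fp = false_alarm_rate * n_neg    # negatives wrongly flagged
    tn = n_neg - fp                  # negatives correctly ignored
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return tpr, fpr, precision


# Keep 100 positives and grow the negative class.
for n_neg in (100, 10_000, 1_000_000):
    tpr, fpr, precision = rates_under_skew(100, n_neg)
    print(n_neg, round(tpr, 3), round(fpr, 3), round(precision, 4))
```

TPR and FPR stay at 0.9 and 0.01 at every skew level, while precision falls from roughly 0.99 to below 0.01, which is why precision-based metrics react to imbalance and ROC-based ones do not.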

Using Derived Metrics for Imbalanced/Skewed problems

I suggest using either of the two below, based on your business requirements.

Precision@Recall=x or FPR@Recall=x
Recall@Precision=x

Why are these useful?

Take fraud detection as an example. You want to detect 95% of frauds, so your recall = 0.95. Now you want to ensure that you don’t have too many FPs.

Precision = TP/(TP+FP), so the higher your precision, the fewer your FPs.

Your business has fixed the x in our formula.

So you know the recall; now you optimise your model at constant recall to improve precision or decrease FPR.
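FPR@Recall=x can be read off the ROC curve in much the same way. The following is a sketch under assumptions: the function name `fpr_at_recall` and the toy scores are illustrative, and recall is taken to be the TPR returned by scikit-learn's `roc_curve`.

```python
import numpy as np
from sklearn.metrics import roc_curve


def fpr_at_recall(y_true, y_score, min_recall=0.95):
    """Lowest false positive rate among thresholds that reach min_recall."""
    fpr, tpr, _ = roc_curve(y_true, y_score, drop_intermediate=False)
    return float(fpr[tpr >= min_recall].min())


# Toy labels and classifier scores, for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
print(fpr_at_recall(y_true, y_score, min_recall=0.95))
```

With the recall target fixed by the business, a lower value of this number means fewer false alarms for the same share of detected frauds.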

Similarly, consider drug administration (chemotherapy) for cancer.

You want to make sure that people who don’t have the disease are not given the drug, since it has severe adverse health effects.

Your hospital has decided that only 1 in 1000 positive diagnoses can be incorrect, so your precision = 999/(999+1) = 0.999.

Your precision is fixed, so now your model has to increase detection (recall).

As a result, Recall@Precision=0.999 is a good metric for you.

Apart from these problem-specific derived metrics, your metric can also combine multiple constraints.

For example, in the rain forecast problem we talked about, you can have a metric like Precision@Recall=0.8,Duration=7 Days, i.e. you want to detect 80% of rainy days, and you want to predict each one at least 7 days before it rains.

Under these constraints, you optimise precision.

Things to look out for when Selecting the Right Metric

Based on Mathematical Property of Problem and Metric

If your distribution is skewed, then accuracy and AUC-ROC are not good.

It’s better to use Precision/Recall or some derived metric.

Based on Business utility

Derived metrics are the winners here, since they translate best to the business use case.

We showed above how Precision@Recall=x and Recall@Precision=x can encode your business requirements very well.

A little about why not to use Area Under Curve/F1 score/AP as Metrics

FYI: This is an opinion.

F1 Formula: F1 = 2 * (precision * recall) / (precision + recall)

Notice that F1 takes the same value when precision and recall are interchanged.

Take precision=0.9, recall=0.3, then F1 = 0.45; reverse them and take recall=0.9, precision=0.3, and still F1 = 0.45.
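The symmetry is quick to verify with the standard F1 definition (the harmonic mean of precision and recall). The snippet below is just a check of the numbers in the text.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * (precision * recall) / (precision + recall)


print(round(f1(0.9, 0.3), 2), round(f1(0.3, 0.9), 2))  # 0.45 0.45
```

Two models with opposite precision and recall profiles therefore report an identical F1.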

Now if you report F1 as your model metric, what are the values of precision and recall?

Is your business ready to accept both values? Can they even understand this?

Now coming to AP/AUC-ROC: they have similar issues, where the area under the curve might be the same for two very different-looking curves that optimise different things.

Conclusion

Let me summarise our learning:

Do not use AUC-ROC, PR-curve area (average precision score) etc. for business reporting.
Do not use overly complex metrics like F1 score for reporting.
Use derived metrics, since they easily capture the essence of your business.
If your data is imbalanced, don’t use accuracy or AUC-ROC.

Notebook Link
Reddit Discussion on Imbalanced datasets

Base your Model Metric on the success of your Business!!

Thanks for Reading!!

I solve real-world problems leveraging data science, artificial intelligence, machine learning and deep learning.

Feel free to reach out to me on LinkedIn.
