Machine Learning Classifier evaluation using ROC and CAP Curves

Karan Bhanot, Mar 10

Photo by Isaac Smith on Unsplash

While there are several metrics such as Accuracy and Recall to measure the performance of a Machine Learning model, ROC Curve and CAP Curve are great for classification problems.
In this article, I’ll explore what ROC and CAP are and how we can use Python and a dummy dataset to create these curves.
Even after reading many articles on the CAP Curve, I couldn’t find one that explained in detail how to create it, which is why I am writing this article.
The complete code is uploaded to the following GitHub repository.
kb22/ML-Performance-Evaluation: In this repository, I discuss various Machine Learning model performance evaluation metrics.

Dataset

I created my own dataset.
There are two features, age and experience.
Based on these two features, the output label is 0.0 for a salary of less than $200K and 1.0 for a salary of more than or equal to $200K.
Figure: Complete dataset

The green points represent a salary of more than or equal to $200K and the red points represent a salary of less than $200K.
I also ensured that there is some overlap between the two classes so the data is a bit more realistic and not easily separable.

Classification

First, I split the data into two sets, 70% training data and 30% testing data.
I used the Support Vector Classifier with the linear kernel to train on the training data and then tested the model on the test data.
The model achieved a score of 95%.
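The split and training steps above can be sketched as follows. This is a minimal sketch, not the article's exact code: the synthetic data, the random seeds and the variable names are assumptions made for illustration. Note that `SVC` only provides `predict_proba` when it is created with `probability=True`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical stand-in for the article's dataset: 200 points with two
# features (age, experience) and two overlapping classes.
rng = np.random.default_rng(0)
low = rng.normal(loc=[30.0, 5.0], scale=[5.0, 3.0], size=(100, 2))    # label 0.0
high = rng.normal(loc=[45.0, 15.0], scale=[5.0, 3.0], size=(100, 2))  # label 1.0
X = np.vstack([low, high])
y = np.concatenate([np.zeros(100), np.ones(100)])

# 70% training data and 30% testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Linear kernel; probability=True makes predict_proba available later.
model = SVC(kernel="linear", probability=True, random_state=0)
model.fit(X_train, y_train)

# score() returns the mean accuracy on the given data.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

With 200 points, a 30% split leaves 60 test points, which matches the test set size used later in the article.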
Figure: Classification on Test data

Performance Evaluation

Receiver Operating Characteristic (ROC) Curve

The Receiver Operating Characteristic Curve, better known as the ROC Curve, is an excellent method for measuring the performance of a Classification model.
The True Positive Rate (TPR) is plotted against the False Positive Rate (FPR) at different thresholds applied to the probabilities predicted by the classifier.
Then, the area under the plot is calculated.
The larger the area under the curve, the better the model is at distinguishing between classes.

Import files and create base line

First, I import roc_curve and auc from sklearn.metrics so that I can create the ROC Curve as well as calculate the Area Under Curve.
I also define the figure size as 20×12 and create a base line from (0,0) to (1,1).
The value r-- indicates that the colour of the line is red and that it is a dashed line.

Calculate probabilities and determine TPR and FPR

Next, using predict_proba, I calculate the prediction probabilities and store them in probs.
It consists of two columns: the first holds the probabilities for the first class (Salary < $200K) and the second holds the probabilities for the second class (Salary ≥ $200K).
So, I select the probabilities of the second class using probs[:, 1].
roc_curve generates the roc curve and returns fpr, tpr and thresholds.
Finally, using fpr and tpr as inputs inside auc, I calculate the area under this model’s curve and save it in roc_auc.
roc_auc now has the area under the curve generated by our Support Vector Classifier.
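As a small illustration of these calls, the sketch below uses four hand-written labels and a made-up two-column probability array in place of the output of `predict_proba` on the real test set. The numbers are assumptions chosen so the result is easy to check by hand.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Made-up labels and a made-up stand-in for model.predict_proba(X_test).
# Column 0: probability of class 0.0, column 1: probability of class 1.0.
y_test = np.array([0.0, 0.0, 1.0, 1.0])
proba = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.65, 0.35],
    [0.20, 0.80],
])

# Keep only the probabilities of the second class (label 1.0).
probs = proba[:, 1]

# roc_curve returns the FPR and TPR values and the thresholds used.
fpr, tpr, thresholds = roc_curve(y_test, probs)

# auc integrates TPR over FPR with the trapezoidal rule.
roc_auc = auc(fpr, tpr)
print(f"AUC: {roc_auc:.2f}")  # 0.75 for this toy example
```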

Plot the ROC Curve

I plot the curve using fpr as x-values and tpr as y-values with the colour green and line width 4.
The label of this curve includes the area under the curve.
The x-axis label is set as False Positive Rate and the y-axis label is set as True Positive Rate.
The title is Receiver Operating Characteristic and the legend appears in the lower right corner of the figure.
The text size is set as 16.
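The plotting steps described above might look like the following sketch. The `fpr` and `tpr` arrays are hard-coded toy values (an assumption) so that the block runs on its own, and the `Agg` backend line is only there so it also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import auc

# Toy ROC points standing in for the output of roc_curve.
fpr = np.array([0.0, 0.0, 0.5, 0.5, 1.0])
tpr = np.array([0.0, 0.5, 0.5, 1.0, 1.0])
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(20, 12))

# Base line from (0,0) to (1,1): 'r--' means a red dashed line.
plt.plot([0, 1], [0, 1], "r--")

# The model's ROC curve in green with line width 4.
plt.plot(fpr, tpr, c="g", linewidth=4, label=f"Model (AUC = {roc_auc:.2f})")

plt.xlabel("False Positive Rate", fontsize=16)
plt.ylabel("True Positive Rate", fontsize=16)
plt.title("Receiver Operating Characteristic", fontsize=16)
plt.legend(loc="lower right", fontsize=16)
```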
Figure: ROC Curve

The area under the curve is 0.98, which is very close to the maximum of 1 and indicates that our model is performing very well.

Cumulative Accuracy Profile (CAP) Curve

The CAP Curve analyses how effectively a model can identify all data points of a given class in the minimum number of tries.
In this dataset, I’m trying to identify how quickly the Support Vector Classifier can identify all individuals with a salary of more than or equal to $200K.

Calculate count of each class

First, I find the total number of data points in the test data (60) and save it in the variable total.
The test labels are either 0.0 or 1.0, so if I add all the values, I get the count of class 1.0 (31), which I save in class_1_count.
Subtracting this number from total gives me class_0_count (29).
I also set the figure size to 20×12 to make it bigger than normal.

Random Model

First, we plot a random model, which is based on the assumption that the number of correctly detected class 1.0 points grows linearly with the number of tries.
The colour is red and the line style is dashed, defined using --.
I’ve set the label as Random Model.
Figure: Random Model

Perfect Model

Next, I plot the perfect model.
A perfect model is one which will detect all class 1.0 data points in the same number of tries as there are class 1.0 data points.
It takes exactly 31 tries for the perfect model to identify 31 class 1.0 data points.
I’ve coloured the plot grey.
The label is set as Perfect Model.
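The class counts and the two reference curves can be sketched as below. The ten test labels are made up for illustration (the article's test set has 60 points, 31 of them in class 1.0), and the line width is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up test labels: four points of class 1.0 and six of class 0.0.
y_test = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])

total = len(y_test)
class_1_count = int(np.sum(y_test))     # labels are 0.0 or 1.0, so the sum counts class 1.0
class_0_count = total - class_1_count

plt.figure(figsize=(20, 12))

# Random model: class 1.0 detections grow linearly from (0, 0) to (total, class_1_count).
plt.plot([0, total], [0, class_1_count], c="r", linestyle="--", label="Random Model")

# Perfect model: every one of the first class_1_count tries is a hit, then the curve is flat.
plt.plot(
    [0, class_1_count, total],
    [0, class_1_count, class_1_count],
    c="grey",
    linewidth=2,
    label="Perfect Model",
)
plt.legend()
```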
Figure: Perfect Model

Trained Model (Support Vector Classifier)

Finally, I plot the results from the Support Vector Classifier.
First, like in the ROC Curve, I extract the probability of class 1.0 into the variable probs.
I zip probs and y_test together.
I then sort this zip in descending order of probability, so that the maximum probability comes first and lower probabilities follow.
I extract only the y_test values into an array and store it in model_y.
np.cumsum() creates an array in which each value is the sum of the present value and all previous values in the input array.
For example, applying cumsum to the array [1, 1, 1, 1, 1] results in [1, 2, 3, 4, 5].
I use it to calculate the y-values.
Also, we need to add 0 at the front of the array for the start point (0,0).
The x-values come from np.arange(0, total + 1), so they range from 0 to total.
I add one because np.arange() does not include the end point and I want the end point to be total.
I then plot the result with the colour blue and the label Support Vector Classifier.
I’ve also included the other two models in the plot.
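The construction of the trained model's curve can be illustrated with a toy example. The labels and probabilities below are made up, and the plotting call is shown only as a comment.

```python
import numpy as np

# Made-up test labels and predicted probabilities of class 1.0.
y_test = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
probs = np.array([0.9, 0.5, 0.8, 0.7, 0.4, 0.6, 0.2, 0.3, 0.1, 0.05])
total = len(y_test)

# Pair each probability with its label and sort with the highest probability first.
model_y = [label for _, label in sorted(zip(probs, y_test), reverse=True)]

# Running count of class 1.0 points found, with 0 at the front for the point (0, 0).
y_values = np.append([0], np.cumsum(model_y))

# np.arange excludes the end point, so total + 1 makes the last x-value equal to total.
x_values = np.arange(0, total + 1)

print(y_values)
# The curve could then be drawn with, for example:
# plt.plot(x_values, y_values, c="b", label="Support Vector Classifier")
```

Here the sorted labels are [1, 1, 0, 1, 0, 0, 1, 0, 0, 0], so the curve rises quickly at first and reaches all four class 1.0 points after seven tries.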
Figure: Support Vector Classifier

CAP Analysis using Area Under Curve

The first method to analyse the CAP Curve is using Area Under Curve.
Let’s consider the area under the random model as a.
We calculate the Accuracy Rate using the following steps:
1. Calculate the area between the perfect model and the random model (aP), that is, the area under the perfect model minus a.
2. Calculate the area between the prediction model and the random model (aR), that is, the area under the prediction model minus a.
3. Calculate Accuracy Rate (AR) = aR / aP.
The closer the Accuracy Rate is to 1, the better the model.
Using auc, I calculated all areas and then calculated the Accuracy Rate using those values.
The rate is approximately 0.97, which is very close to 1 and shows that our model is really effective.
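The steps above can be sketched with `auc` from sklearn.metrics as follows. The sorted labels are toy values (an assumption), so the resulting rate is far from the article's 0.97; the point is only to show the arithmetic.

```python
import numpy as np
from sklearn.metrics import auc

# Toy CAP data: labels already sorted by descending predicted probability.
model_y = [1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
total = len(model_y)
class_1_count = int(sum(model_y))

x_values = np.arange(0, total + 1)
y_values = np.append([0], np.cumsum(model_y))

# a: area under the random model (a triangle).
a = auc([0, total], [0, class_1_count])

# aP: area under the perfect model minus the area under the random model.
aP = auc([0, class_1_count, total], [0, class_1_count, class_1_count]) - a

# aR: area under the trained model minus the area under the random model.
aR = auc(x_values, y_values) - a

accuracy_rate = aR / aP
print(f"Accuracy Rate: {accuracy_rate:.2f}")
```

For these toy values a = 20, aP = 32 - 20 = 12 and aR = 28 - 20 = 8, so the Accuracy Rate is 8 / 12, about 0.67.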

CAP Analysis using Plot

Another method to analyse the CAP Curve involves reading the plot we generated above.
The steps are:
1. Draw a vertical line at 50% from the x-axis till it crosses the Support Vector Classifier plot.
2. At the point where the vertical line cuts the trained model, draw a horizontal line such that it cuts the y-axis.
3. Calculate the percentage of class 1 identified with respect to the total count of class 1 labels.
Once we know the percentage, we can use the following brackets to analyse it:
1. Less than 60%: Rubbish Model
2. 60% to 70%: Poor Model
3. 70% to 80%: Good Model
4. 80% to 90%: Very Good Model
5. More than 90%: Too Good to be True
Note that if the value is more than 90%, it’s a good practice to test for overfitting.
First, I find the index by taking the int value of 50% of the total test data.
I use it to plot a vertical dashed line from this point on the x-axis up to the trained model.
Next, I plot a horizontal dashed line from this point of intersection to the y-axis.
I determine the percentage by dividing the number of class 1.0 values observed up to this point by the total number of class 1.0 data points and multiplying by 100.
I get the value as 93.55%.

Figure: CAP Curve Analysis

Even though the percentage is 93.55%, which is greater than 90%, the result is expected.
As we saw when looking at the dataset and the classification at the beginning, the model was really effective in splitting the data.
While I used CAP Analysis on test data, we could have also used the same for training data and analysed how well our model learned about the training data.
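The 50% reading described in this section can be sketched as follows. The sorted labels are toy values, the helper `cap_bracket` is a hypothetical name, and the way exact boundary values (for example exactly 90%) are assigned to a bracket is an assumption, since the brackets above do not specify it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Toy CAP data: labels already sorted by descending predicted probability.
model_y = [1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
total = len(model_y)
class_1_count = int(sum(model_y))
x_values = np.arange(0, total + 1)
y_values = np.append([0], np.cumsum(model_y))

# Point on the x-axis at 50% of the test data.
index = int(50 * total / 100)

# Class 1.0 points found within the first 50% of tries, as a percentage.
class_1_observed = y_values[index]
percentage = class_1_observed * 100 / class_1_count

plt.figure(figsize=(20, 12))
plt.plot(x_values, y_values, c="b", label="Trained Model")
# Vertical dashed line from the x-axis up to the trained model.
plt.plot([index, index], [0, class_1_observed], c="g", linestyle="--")
# Horizontal dashed line from the intersection to the y-axis.
plt.plot([0, index], [class_1_observed, class_1_observed], c="g", linestyle="--")


def cap_bracket(pct):
    """Hypothetical helper mapping a percentage to the brackets listed above."""
    if pct < 60:
        return "Rubbish Model"
    if pct < 70:
        return "Poor Model"
    if pct < 80:
        return "Good Model"
    if pct <= 90:
        return "Very Good Model"
    return "Too Good to be True"


print(f"{percentage:.2f}% -> {cap_bracket(percentage)}")
```

For the toy data, three of the four class 1.0 points are found in the first five tries, which gives 75% and falls in the Good Model bracket.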

Conclusion

This article is a summary of how to calculate the ROC Curve and CAP Curve in Python and how one can analyse them.
Please feel free to share your thoughts and ideas.
Working with ROC and CAP is also new to me, so please do share any information that I might have missed.
Thanks for reading!