For example, we would all be furious if our email program’s spam classifier were only able to detect 50% of unwanted emails or solicitations.
With this post I go over how to evaluate a predictive model with a classical tool that every data scientist should be familiar with: the Receiver Operating Characteristic (ROC) curve.
Illustrative Example: Predicting Coronary Disease

The data set I’m using for this post comes from “Predictive value of targeted proteomics for coronary plaque morphology in patients with suspected coronary artery disease” by Bom et al., and is available to the public through https://data.
This study examined two different outcomes, and the one I will focus on for this post is the absence of coronary artery disease (CAD).
The authors assess the predictive ability of proteomic biomarkers to detect the absence of CAD in symptomatic patients.
Identifying a panel of proteins that can distinguish between patients without CAD and those that require immediate intervention would provide a more accurate and cost-efficient non-invasive diagnostic test.
This data set is an excellent source to test several data science topics.
With a small number of observations, it’s easy to work with, but it also includes a large number of variables for some added complexity.
A viable “gold standard” for the outcomes (using coronary computed tomography angiography, or CCTA) is provided to test the predictions against.
For this post, I will mostly focus on the construction of the ROC curve from a single model, but may dive into more advanced topics with this data set in later posts.
An Overview of Predictive Accuracy Measurements

Before getting into creating the curve, it’s important to understand a few common metrics used to assess predictive accuracy.
Positive Class: I’ll define the positive class as the outcome class I am trying to detect.
In this case, it is the absence of CAD.
While I realize this may cause confusion with the terms “positive” and “negative” in the world of diagnostic testing, defining the positive class this way is more generalizable to other situations.
The default method of dealing with symptomatic patients in this case is to subject them to further testing and procedures.
By sufficiently detecting patients without CAD, we would be eliminating the need for unnecessary, more-invasive procedures.
General Accuracy: Simply, how many subjects were classified correctly?

Sensitivity: The proportion of true positives identified correctly.
In this case, the proportion of healthy patients correctly identified by the diagnostic tool.
This is sometimes referred to as “recall.”

SN = True Positives / (True Positives + False Negatives)

Its inverse (1-sensitivity) is the false negative rate: healthy patients not detected by the tool were falsely identified as having CAD.
False negatives are also known as Type II error.
Specificity: The proportion of true negatives identified correctly.
In this case, the proportion of patients with CAD correctly identified by the diagnostic tool.
SP = True Negatives / (True Negatives + False Positives)

Its inverse (1-specificity) is the false positive rate: patients with CAD were falsely identified as being CAD-free.
False positives are also known as Type I error.
Positive Predictive Value: The proportion of positives reported by the tool that, in reality, are positive.
For the group of patients where the diagnostic tool reports absence of CAD, PPV is the proportion of the patients that actually do not have the disease.
This is sometimes referred to as “precision.”

PPV = True Positives / (True Positives + False Positives)

Negative Predictive Value: The proportion of negatives reported by the tool that, in reality, are negative.

For the group of patients where the diagnostic tool reports a presence of CAD, NPV is the proportion of those patients that actually do have CAD.

NPV = True Negatives / (True Negatives + False Negatives)

Don’t worry if your brain hurts trying to sort out all these metrics.
It’s easy to get the terms mixed up, especially when first learning them.
I’ve found it easiest to visualize things within a 2×2 table.
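The four definitions above can be sanity-checked with a short sketch. The author’s analysis is in R, but a small Python function makes the bookkeeping concrete; the cell counts below are invented for illustration, not taken from the study.

```python
# Compute the four accuracy metrics from the cells of a 2x2 confusion
# matrix (positive class = absence of CAD). Counts are made up.

def confusion_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, PPV, and NPV from 2x2 cell counts."""
    return {
        "sensitivity": tp / (tp + fn),  # recall: true positives found
        "specificity": tn / (tn + fp),  # true negatives found
        "ppv": tp / (tp + fp),          # precision: reported positives that are real
        "npv": tn / (tn + fn),          # reported negatives that are real
    }

m = confusion_metrics(tp=40, fn=10, tn=80, fp=20)
print(m["sensitivity"])  # 0.8
print(m["ppv"])          # 40/60 ~= 0.667
```

Note that sensitivity and specificity condition on the true class, while PPV and NPV condition on the reported class; that distinction is what the 2×2 table makes visible.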
Confusion Matrix for a Binary Class (positive = absence of CAD, negative = CAD)

Anatomy of a Curve

The nice thing about the ROC curve is that it is an easy-to-interpret graphical tool that can be applied to any predictive model you create. Here are the basics of the curve:

The Axes: Sensitivity and False Positive Rate

First, we need to create the space for the plot.
The ROC curve is built by plotting sensitivity against 1-specificity (the false positive rate).
Predicted Probabilities

Now we need something to plot.
Recall that a predictive model will assign each observation to the most probable class (in this case, absence of CAD vs presence of CAD).
What a model is actually doing is calculating a probability of belonging to a particular class.
A cutoff value between 0 and 1 is selected, and if the calculated probability is over that threshold, the observation is assigned to the positive class. In most packages, the default cutoff is 0.5, the logic being that, with binary classes, one would assign an observation to whichever class is most probable. However, as we’ll see, it is best to choose the cutoff value only after considering the trade-offs between sensitivity and specificity.
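The thresholding step itself is trivial, which is easy to lose sight of amid the terminology. A two-line Python sketch (with invented probabilities) shows how moving the cutoff changes which observations land in the positive class:

```python
# Turn predicted probabilities into class labels with a cutoff.
# Lowering the cutoff admits more observations into the positive class,
# raising sensitivity at the cost of specificity. Probabilities invented.

def classify(y_prob, cutoff):
    return [1 if p >= cutoff else 0 for p in y_prob]

probs = [0.9, 0.55, 0.45, 0.2]
print(classify(probs, 0.5))  # [1, 1, 0, 0]
print(classify(probs, 0.3))  # [1, 1, 1, 0]
```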
The ROC curve is generated by evaluating every possible cutoff value; in practice, the distinct probabilities assigned to the observations serve as the candidate cutoffs. Each cutoff alters the sensitivity and specificity of the prediction tool, so each one can be plotted in the space of the plot using its sensitivity and 1-specificity as the coordinates. The point closest to the top-left corner (SN = 1, FPR = 0) provides the best balance between the two accuracy metrics.
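This construction can be sketched directly. The Python snippet below (labels and probabilities invented, with 1 meaning the positive class, absence of CAD) sweeps through every distinct predicted probability as a cutoff and records the resulting (false positive rate, sensitivity) pair:

```python
# Generate the ROC curve's points: each distinct predicted probability is
# tried as a cutoff, in decreasing order (so FPR increases along the list),
# and the (FPR, sensitivity) pair at that cutoff becomes one point.

def roc_points(y_true, y_prob):
    points = []
    for cut in sorted(set(y_prob), reverse=True):
        pred = [1 if p >= cut else 0 for p in y_prob]
        tp = sum(1 for t, q in zip(y_true, pred) if t == 1 and q == 1)
        fn = sum(1 for t, q in zip(y_true, pred) if t == 1 and q == 0)
        tn = sum(1 for t, q in zip(y_true, pred) if t == 0 and q == 0)
        fp = sum(1 for t, q in zip(y_true, pred) if t == 0 and q == 1)
        points.append((fp / (fp + tn), tp / (tp + fn)))  # (FPR, SN)
    return points

y = [1, 1, 0, 1, 0, 0]
probs = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
print(roc_points(y, probs))
```

Joining these points (plus the corner (0, 0)) traces the curve; production libraries like pROC in R do exactly this bookkeeping, only more efficiently.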
Area Under the Curve (AUC)

The AUC is a metric that is analogous to a binary model’s concordance, or c-statistic.
This is the probability that a randomly chosen observation from the positive class will have a greater predicted probability than a randomly chosen observation from the negative class.
If AUC = 1, the model predicts perfectly. If AUC = 0.5, the model is unable to discriminate between the classes.
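The concordance interpretation can be verified by brute force: over all (positive, negative) pairs, count how often the positive observation receives the higher predicted probability, with ties counting half. A Python sketch with invented data:

```python
# AUC as concordance: the fraction of (positive, negative) pairs in which
# the positive observation gets the higher predicted probability.

def auc_concordance(y_true, y_prob):
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# One positive (0.4) is out-ranked by one negative (0.6): 8 of 9 pairs concordant.
print(auc_concordance([1, 1, 0, 1, 0, 0], [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]))
```

This pairwise count matches the geometric area under the plotted curve, which is why the two interpretations are used interchangeably.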
The AUC also behaves similarly to R-squared in logistic regression, in that adding more predictors will tend to increase it.
Therefore, it is important to include cross-validation or validation on external data in the analysis.
The AUC can be used to assess different predictive models.
Reference Line

It’s generally a good idea to draw the reference line, where AUC = 0.5, on the ROC plot.
This provides a baseline visual to compare the curve against.
Generating the ROC Curve

While several different programs can be used to develop a predictive model and the ROC curve, I have implemented the analysis in R.
Complete code for the analysis of this data set can be found on my GitHub: dslans/predicting-cad.

To predict the outcome, I created a classification model using xgboost with the protein biomarkers as predictors and used a resubstitution method to predict outcomes in the data set.
Using the predicted probabilities, I can form the ROC curve that I showed above.
Note: The program is also set up to use k-fold cross-validation, or can be updated to use external validation, such as a 70/30 train/test split (although I would want more data for that).
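The k-fold idea mentioned in the note is just index bookkeeping: the data are divided into k disjoint folds so that each observation is predicted by a model that never saw it. This is not the author’s R code, only a minimal Python sketch of the splitting step:

```python
# Shuffle the row indices once, then deal them into k disjoint test folds.
# Each fold serves as the held-out set while the rest train the model.
import random

def kfold_indices(n, k, seed=42):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(n=100, k=5)
print([len(f) for f in folds])  # [20, 20, 20, 20, 20]
```

Pooling the held-out predicted probabilities across folds gives an honest set of probabilities from which to draw the ROC curve, avoiding the optimism of resubstitution.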
Interpreting the Curve

The AUC is 0.738, which is actually just below what the authors found with their machine learning methods (stacking may have boosted their classifier’s accuracy).
However, I am more concerned with selecting an appropriate cutoff and weighing the costs and benefits of the tool.
Overall accuracy of the model is decent, but what happens when a false positive result occurs, and a patient with CAD is sent home instead of undergoing necessary procedures? Let’s just agree that wouldn’t be a good thing.
To be safe, it’s probably a good idea to keep the false positive rate of the diagnostic tool as small as possible.
In generating the confusion matrix, it’s easy to select a probability cutoff that will maintain a low false positive rate.
This cutoff will act as the decision criterion for who gets labeled as CAD-free.
By selecting a probability cutoff that maintains 90% specificity (a 10% false positive rate), the classification tool was able to detect 42% of patients without CAD (sensitivity = 0.42). Even though a sensitivity of 42% may sound low, it’s a promising result in this scenario, because the status quo is to subject all symptomatic patients to more invasive diagnostic procedures.
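This cutoff selection can be made mechanical: scan the candidate cutoffs and keep the one that maximizes sensitivity subject to the specificity constraint. A Python sketch with invented labels and probabilities (not the study’s data):

```python
# Find the cutoff that maximizes sensitivity while keeping specificity
# at or above a floor (here 0.90, i.e. a false positive rate of at most 10%).

def cutoff_at_specificity(y_true, y_prob, min_spec=0.90):
    best = None  # (sensitivity, cutoff)
    for cut in sorted(set(y_prob)):
        pred = [1 if p >= cut else 0 for p in y_prob]
        tn = sum(1 for t, q in zip(y_true, pred) if t == 0 and q == 0)
        fp = sum(1 for t, q in zip(y_true, pred) if t == 0 and q == 1)
        tp = sum(1 for t, q in zip(y_true, pred) if t == 1 and q == 1)
        fn = sum(1 for t, q in zip(y_true, pred) if t == 1 and q == 0)
        spec = tn / (tn + fp)
        sens = tp / (tp + fn)
        if spec >= min_spec and (best is None or sens > best[0]):
            best = (sens, cut)
    return best

y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.15, 0.1]
print(cutoff_at_specificity(y, probs))  # (0.5, 0.8)
```

In this toy example, holding specificity at 100% forces the cutoff up to 0.8, where half the positive class is still detected, mirroring the trade-off made in the real analysis.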
In developing the classifier for predicting CAD status among symptomatic patients, it is most important to achieve the highest possible sensitivity while maintaining a low false positive rate.
The beauty of the ROC curve is that you can visualize all these performance metrics from a single image.
Comparing the curves against competing models is a quick and easy way to select an appropriate classification or diagnostic tool.
I only explored using the ROC curve with a binary classification tool, but it can easily be extended to a multi-class scenario.
On the same graphic, it’s possible to plot multiple curves corresponding to the probability of being in each of the specific classes.
In writing this article, I went into detail on how to construct the ROC curve, with the goal of deepening your understanding of predictive accuracy measures and of how to assess the data science classification tools you may be building.
References

Bom MJ, Levin E, Driessen RS, et al.
Predictive value of targeted proteomics for coronary plaque morphology in patients with suspected coronary artery disease.