Predicting Breast Cancer with Decision Trees

Marco Peixeiro
Jan 18

How to implement decision trees with bagging, boosting and random forest to predict breast cancer from routine blood tests

Photo by Hello I'm Nik on Unsplash

In a previous post, I introduced the theory of decision trees and how their performance can be improved using bagging, boosting or random forests.
Now, we implement these techniques to predict breast cancer from routine blood tests.
Many datasets about breast cancer contain information about the tumor.
However, I was lucky to find a dataset that contains routine blood test information for patients with and without breast cancer.
Potentially, if we can accurately predict if a patient has cancer, that patient could receive very early treatments, even before a tumor is noticeable!

Of course, the dataset and full notebook are available here.
As always, I strongly suggest you code along!

Exploratory data analysis

Before starting our work in Jupyter, we can gain information about the dataset here.
First, you notice that the dataset is very small, with only 116 instances.
This poses several challenges: the decision trees might overfit the data, and with so few observations our predictive model might not generalize well.
Yet, it is a good proof-of-concept that might demonstrate a real potential of predicting breast cancer from a simple blood test.
The dataset contains only the following ten attributes:

Age: age of the patient (years)
BMI: body mass index (kg/m²)
Glucose: glucose concentration in blood (mg/dL)
Insulin: insulin concentration in blood (microU/mL)
HOMA: Homeostatic Model Assessment of insulin resistance (computed from the product of glucose and insulin)
Leptin: concentration of leptin, the hormone of energy expenditure (ng/mL)
Adiponectin: concentration of adiponectin, a protein regulating glucose levels (micro g/mL)
Resistin: concentration of resistin, a protein secreted by adipose tissue (ng/mL)
MCP.1: concentration of MCP-1, a protein that recruits monocytes to sites of inflammation caused by tissue injury or infection (pg/dL)
Classification: healthy control (1) or patient (2)

Now that we know what we will be working with, we can start by importing our usual libraries. Then, let’s define the path to the dataset and preview it. Great!

Now, because this is a classification problem, let’s see if the classes are balanced.

[Figure: Classification profile]

As you can see, there is almost the same number of patients and healthy controls.
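The loading and class-balance steps described above can be sketched roughly as follows. This is only an illustration: the original notebook code is not reproduced here, the real file is replaced by randomly generated values, and `n_rows`, `columns` and `class_counts` are names chosen for this sketch.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: synthetic stand-in for the real dataset. The column names follow
# the attribute list above, but every value is random, NOT real blood-test data.
rng = np.random.default_rng(0)
n_rows = 116
columns = ["Age", "BMI", "Glucose", "Insulin", "HOMA",
           "Leptin", "Adiponectin", "Resistin", "MCP.1"]
data = pd.DataFrame(rng.normal(size=(n_rows, len(columns))), columns=columns)
data["Classification"] = rng.integers(1, 3, size=n_rows)  # 1 = healthy, 2 = patient

# With the real file you would instead call pd.read_csv on your own path.
print(data.head())

# Count how many rows fall in each class to check the balance.
class_counts = data["Classification"].value_counts().sort_index()
print(class_counts)
```

On the real data, a bar chart of `class_counts` would give a picture similar to the classification profile figure.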
Now, it would be interesting to see the distribution and density of each feature for healthy people and patients.
To do so, a violin plot is ideal.
It shows both the density and distribution of a feature in a single plot.
Let’s have nine violin plots, one for each feature. Take time to review all the plots and try to find some differences between healthy controls and patients.
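One possible way to draw the nine violin plots is sketched below. It uses matplotlib's `violinplot` directly, which may differ from the plotting library used in the original notebook, and it runs on random stand-in data rather than the real measurements.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# ASSUMPTION: random stand-in data with the column names from the attribute list.
rng = np.random.default_rng(0)
features = ["Age", "BMI", "Glucose", "Insulin", "HOMA",
            "Leptin", "Adiponectin", "Resistin", "MCP.1"]
data = pd.DataFrame(rng.normal(size=(116, len(features))), columns=features)
data["Classification"] = rng.integers(1, 3, size=116)  # 1 = healthy, 2 = patient

# One subplot per feature, each showing healthy controls next to patients.
fig, axes = plt.subplots(3, 3, figsize=(12, 10))
for ax, feature in zip(axes.ravel(), features):
    healthy = data.loc[data["Classification"] == 1, feature].to_numpy()
    patients = data.loc[data["Classification"] == 2, feature].to_numpy()
    ax.violinplot([healthy, patients], showmedians=True)
    ax.set_xticks([1, 2])
    ax.set_xticklabels(["Healthy", "Patient"])
    ax.set_title(feature)
fig.tight_layout()
```

In a notebook you would drop the `matplotlib.use("Agg")` line and let the figure render inline.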
Finally, let’s check if we have missing values. You should see that none of the columns have missing values! We are now ready to start modelling!

Modelling

First, we need to encode the classes to 0 and 1. Now, 0 represents a healthy control, and 1 represents a patient.
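As a rough sketch of these two steps, the snippet below counts missing values per column and remaps the labels 1 and 2 to 0 and 1. The tiny frame is made up for illustration, and using `map` is an assumption; scikit-learn's `LabelEncoder` would give the same encoding for these two classes.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: a tiny made-up frame standing in for the real dataset.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Glucose": rng.normal(size=8),
    "Classification": [1, 2, 2, 1, 2, 1, 1, 2],
})

# Number of missing values in each column (all zeros means nothing is missing).
missing_per_column = data.isnull().sum()
print(missing_per_column)

# Remap the labels: 1 (healthy control) -> 0, 2 (patient) -> 1.
data["Classification"] = data["Classification"].map({1: 0, 2: 1})
print(data["Classification"].tolist())
```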
Then, we split the dataset into a training and test set.

Before writing our models, we need to define the appropriate error metric.
In this case, since it is a classification problem, we can use a confusion matrix together with the classification error, that is, the fraction of misclassified instances.
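The split and the error metric can be illustrated together as follows. The 80/20 split, the `random_state` values and the stratification are assumptions made for this sketch (the original settings are not shown), and the data comes from `make_classification` rather than the real file.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# ASSUMPTION: synthetic data with the same shape as the real set (116 rows, 9 features).
X, y = make_classification(n_samples=116, n_features=9, random_state=42)

# ASSUMPTION: 80/20 split, stratified so both classes appear in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Placeholder predictions: copy the true labels and flip one of them,
# so that exactly one test instance is "misclassified".
y_pred = y_test.copy()
y_pred[0] = 1 - y_pred[0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

# Classification error = 1 - (correct predictions / all predictions).
classification_error = 1 - np.trace(cm) / cm.sum()
print(cm)
print(classification_error)
```

The diagonal of the matrix holds the correct predictions, so the error is simply the off-diagonal share.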
Let’s write a helper function to plot the confusion matrix. Awesome! Now, let’s implement a decision tree.
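Before moving on to the tree, here is one way the plotting helper mentioned above might look. The function name `plot_confusion_matrix`, its parameters and the heat-map styling are hypothetical choices for this sketch, not the original notebook's helper.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix


def plot_confusion_matrix(y_true, y_pred, class_names=("Healthy", "Patient")):
    """Hypothetical helper: draw the confusion matrix as an annotated heat map."""
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.imshow(cm, cmap="Blues")
    ax.set_xticks([0, 1])
    ax.set_xticklabels(class_names)
    ax.set_yticks([0, 1])
    ax.set_yticklabels(class_names)
    ax.set_xlabel("Predicted label")
    ax.set_ylabel("True label")
    # Write the count inside each cell.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, str(cm[i, j]), ha="center", va="center")
    fig.tight_layout()
    return cm, ax


# Usage with made-up labels: two of the five predictions are wrong.
cm, ax = plot_confusion_matrix([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
print(cm)
```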
Decision tree

Using scikit-learn, a decision tree is implemented very easily. You should get the following confusion matrix:

[Figure: Confusion matrix for a basic decision tree]

As you can see, it misclassified three instances.
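A basic decision tree of this kind can be fitted along the following lines. This is a sketch on synthetic data with assumed split settings, so its confusion matrix will not match the one reported above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# ASSUMPTION: synthetic stand-in data and an assumed 80/20 stratified split.
X, y = make_classification(n_samples=116, n_features=9, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit a single, fully grown tree with scikit-learn's default settings.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = tree.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
```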
Therefore, let’s see if bagging, boosting or random forest can improve the performance of the tree.
Bagging

Next, we implement a decision tree with bagging. You get the following confusion matrix:

[Figure: Confusion matrix for bagging]

Amazing! The model correctly classified all instances in the test set! For the sake of getting more practice, let’s also implement a random forest classifier and use boosting.
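For reference, bagging with decision trees might be set up as in the sketch below, using scikit-learn's `BaggingClassifier`. The number of estimators (100) and the synthetic data are assumptions. The base tree is passed positionally because the keyword for it was renamed between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# ASSUMPTION: synthetic stand-in data and an assumed 80/20 stratified split.
X, y = make_classification(n_samples=116, n_features=9, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Each of the 100 trees is trained on a bootstrap sample of the training set;
# predictions are combined by voting. ASSUMPTION: 100 is an illustrative choice.
bagging = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=100,
    random_state=42,
)
bagging.fit(X_train, y_train)

cm = confusion_matrix(y_test, bagging.predict(X_test), labels=[0, 1])
print(cm)
```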
Random forest classifier

Here, for the random forest classifier, we specify the number of trees we want. Let’s go with 100. You get this confusion matrix:

[Figure: Random forest classifier confusion matrix]

Here, although only one instance was misclassified, the error is a false negative: the model said that a patient was healthy when the person in fact had cancer! This is a very undesirable situation.
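A random forest with 100 trees, together with a way to read the false negatives off the confusion matrix, can be sketched like this. The data and split settings are assumptions, so the counts will differ from the figure above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# ASSUMPTION: synthetic stand-in data and an assumed 80/20 stratified split.
X, y = make_classification(n_samples=116, n_features=9, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Random forest with 100 trees, as in the text.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

cm = confusion_matrix(y_test, forest.predict(X_test), labels=[0, 1])

# With labels ordered [0, 1], ravel() yields tn, fp, fn, tp.
# fn counts patients (class 1) that the model predicted as healthy (class 0).
tn, fp, fn, tp = cm.ravel()
print(cm)
print("False negatives:", fn)
```

Tracking `fn` separately from the overall error makes the costly kind of mistake visible even when accuracy looks high.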
Boosting

Finally, for boosting, we get the following:

[Figure: Boosting confusion matrix]

Again, only one instance was misclassified.
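The text does not say which boosting implementation was used, so the sketch below assumes scikit-learn's `GradientBoostingClassifier` with 100 boosting stages. Both choices, like the synthetic data, are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# ASSUMPTION: synthetic stand-in data and an assumed 80/20 stratified split.
X, y = make_classification(n_samples=116, n_features=9, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ASSUMPTION: gradient boosting with 100 stages; each new shallow tree is
# fitted to correct the errors of the ensemble built so far.
boosting = GradientBoostingClassifier(n_estimators=100, random_state=42)
boosting.fit(X_train, y_train)

cm = confusion_matrix(y_test, boosting.predict(X_test), labels=[0, 1])
print(cm)
```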
We have seen how to implement a decision tree and how to improve its performance with boosting, bagging, and random forest.
It seems that bagging gave the best results, as it classified all instances correctly.
However, you must keep in mind that the dataset was very small.
Although it shows that we can potentially predict breast cancer from a blood test, the algorithm is unlikely to perform well on unseen data, because there is simply not enough data.
I hope you enjoyed implementing these algorithms! I’ll be happy to answer any questions you have!

Cheers!