(a comparison)

Delving into the nature of random forest, walking through an example, and comparing it to logistic regression.

Andrew Hershy, Jul 6

Introduction:

Random Forests are another way to extract information from a set of data.

The appeals of this type of model are:

It emphasizes feature selection: it weighs certain features as more important than others.

It does not assume a linear relationship between the features and the target, as regression models do.

It utilizes ensemble learning.

If we were to use just 1 decision tree, we wouldn’t be using ensemble learning.

A random forest takes random samples of the data, forms a decision tree on each sample, and then averages the predictions of those trees to get a clearer, more stable model.
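The sample, fit, and average steps can be sketched by hand. This is a minimal illustration rather than how `RandomForestClassifier` is implemented internally; the synthetic data from `make_classification`, the 25 trees, and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the article's protein data)
X, y = make_classification(n_samples=255, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of features
    t = DecisionTreeClassifier(max_features="sqrt",
                               random_state=int(rng.integers(1_000_000)))
    trees.append(t.fit(X[idx], y[idx]))

# Average the per-tree class probabilities, then pick the most likely class
avg_proba = np.mean([t.predict_proba(X) for t in trees], axis=0)
forest_pred = avg_proba.argmax(axis=1)
print("Training accuracy of the averaged trees:", (forest_pred == y).mean())
```

With a single tree in the list there would be nothing to average, which is the sense in which one decision tree is not ensemble learning.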

In this analysis we will classify the data with random forest, compare the results with logistic regression, and discuss the differences.

Take a look at the previous logistic regression analysis to see what we’ll be comparing it to.

Table of Contents:

1. Data Understanding (Summary)
2. Data Exploration/Visualization (Summary)
3. Building the Model
4. Testing the Model
5. Conclusions

Data Background:

We have a sample of 255 patients and would like to measure the relationship between 4 protein levels and cancer growth.

We know:

The concentration of each protein measured per patient.

Whether or not each patient has been diagnosed with cancer (0 = no cancer; 1 = cancer).

Our goal is:

To predict whether future patients have cancer by extracting information from the relationship between protein levels and cancer in our sample.

The 4 proteins we’ll be looking at:

Alpha-fetoprotein (AFP)

Carcinoembryonic antigen (CEA)

Cancer Antigen 125 (CA125)

Cancer Antigen 50 (CA50)

I received this data set to use for educational purposes from the MBA program @UAB.

Data Exploration / Visualization

Again, take a look at the logistic regression analysis to get a more in-depth understanding.

Below are the essentials:

    import numpy as np
    import pandas as pd
    from sklearn import tree
    from sklearn.ensemble import RandomForestClassifier
    import matplotlib.pyplot as plt

    inputdata = r"C:\Users\Andrew\Desktop\cea.xlsx"
    df = pd.read_excel(inputdata)
    df.describe()

Figure 1

Target Variable (Y)

    yhist = plt.hist('class (Y)', data=df, color='g')
    plt.xlabel('Diagnosis (1) vs no diagnosis (0)')
    plt.ylabel('Quantity')
    plt.title('Class (Y) Distribution')

Figure 2

Building the Model

To refresh on the logistic regression output:

CEA and CA125 were the most predictive, with their p-values below alpha at 5% and their coefficients being higher than the others.

We took out AFP and CA50 from the logistic regression due to their high p-values.

However, we will keep them in for the random forest model.

The whole purpose of this exercise is to compare the 2 models, not combine them.

We will build the decision tree and visualize what it looks like:

    #Defining variables and building the model
    features = list(df.columns[1:5])
    y = df['class (Y)']
    X = df[features]
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(X, y)

    #Visualizing the tree
    #StringIO comes from the standard library; sklearn.externals.six was removed from newer scikit-learn releases
    from io import StringIO
    from IPython.display import Image
    from sklearn.tree import export_graphviz
    import pydotplus

    dot_data = StringIO()
    export_graphviz(clf, out_file=dot_data, filled=True, rounded=True, special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    Image(graph.create_png())

Decision Tree:

Figure 3

That’s an intimidating tree for newcomers.

Let’s break down the first node.

Features Key: X0 (AFP), X1 (CEA), X2 (CA125), X3 (CA50)

Layer 1: CEA ≤ 3.25, gini 0.492, spread = 144, 111

The regression model told us CEA is the most predictive feature with the highest coefficient and the lowest p-value.

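The decision tree agrees with this by placing CEA on the root node (the most important node).

The tree made the decision to split the dataset by CEA at the point 3.25.

That is the point where CEA splits the target variable most purely into cancerous and noncancerous.

Patients with CEA at or below 3.25 have a stronger likelihood of being noncancerous; patients above 3.25 will more likely be cancerous.

The spread of 144 and 111 is consistent with the class counts at the root before the split (144 noncancerous, 111 cancerous), since 1 − (144/255)² − (111/255)² ≈ 0.492, which matches the gini shown. These are not the sizes of the two branches.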

In general, the lower the gini score, the more purely the data is split by its target variable.

The root node is selected to be the feature with the strongest split.

The rest of the tree’s decision nodes are derivative and work in the same way.
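The gini score quoted for the root node can be reproduced with a few lines of arithmetic. The helper below is a hypothetical function written for this illustration, and it assumes the spread of 144 and 111 is the pair of class counts at that node.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Root node quoted in the article: spread of 144 and 111
root_gini = gini_impurity([144, 111])
print(round(root_gini, 3))  # 0.492

# A perfectly pure node scores 0; an even two-class node scores 0.5
print(gini_impurity([10, 0]), gini_impurity([5, 5]))  # 0.0 0.5
```

A root score of 0.492 is close to the two-class maximum of 0.5, which reflects how mixed the sample is before any split is made.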

Random Forest

Instead of stopping there and basing our model off of this, we will be implementing a random forest: taking random samples, forming many decision trees and taking the average of those decisions to form a new model.

This model averages the predictions of 1,000 trees.

    #Importing
    from sklearn import metrics
    from sklearn.model_selection import train_test_split as tts

    #Dividing into training(70%) and testing(30%)
    X_train, X_test, y_train, y_test = tts(X, y, test_size=0.3, random_state=None)

    #Fitting the random forest on the training data
    treeclass = RandomForestClassifier(n_estimators=1000)
    treeclass.fit(X_train, y_train)

    #Calculating the accuracy of the trained model on the testing data
    y_pred = treeclass.predict(X_test)
    y_pred_prob = treeclass.predict_proba(X_test)
    accuracy = treeclass.score(X_test, y_test)
    print('The accuracy is: ' + str(accuracy * 100) + '%')

This 71% accuracy compares to the 74% accuracy of the logistic model.
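One appeal listed in the introduction is that a random forest weighs some features as more important than others. A minimal sketch of reading those weights from a fitted forest follows. It runs on synthetic data from `make_classification` with placeholder column names (`feature_0` to `feature_3`), both assumptions for illustration, so the ranking it prints says nothing about the four proteins.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for four numeric columns (assumption: not the real data)
names = ["feature_0", "feature_1", "feature_2", "feature_3"]
X_demo, y_demo = make_classification(n_samples=255, n_features=4, n_informative=2,
                                     n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_demo, columns=names)

# random_state is fixed here so repeated runs build the same forest
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances; they sum to 1 across the features
importances = pd.Series(forest.feature_importances_, index=names)
print(importances.sort_values(ascending=False))
```

On the article's data, the equivalent call would be `treeclass.feature_importances_`, paired with the `features` list defined earlier.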

Testing the Model

Confusion Matrix

Edit: I was talking with a friend in biostats about my analysis, and the convention in that field is that the disease is attributed as being positive.

I arbitrarily set cancer as negative because I didn’t know that at the time.

Figure 4

    #Confusion Matrix
    from sklearn.metrics import confusion_matrix
    #Stored under a different name so the imported function is not overwritten
    cm = confusion_matrix(y_test, y_pred)
    print(cm)

Match the matrix above to Figure 4 to learn what it is saying:

34 of our model’s guesses were True Positive: The model thought the patient had no cancer, and they indeed had no cancer.

21 of our model’s guesses were True Negative: The model thought the patient had cancer, and they indeed had cancer.

14 of our model’s guesses were False Negative: The model thought the patient had cancer, but they actually didn’t have cancer.

8 of our model’s guesses were False Positive: The model thought the patient had no cancer but they actually did have cancer.

30% of our total data went to the testing group, which leaves 255 × 0.3 = 76.5, rounded up to 77 instances that were tested.

The sum of the matrix is 77.

Divide the “True” numbers by the total and that will give the accuracy of our model: (34 + 21) / 77 = 55/77 ≈ 71%.
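The same arithmetic can be checked in code. The counts are the ones quoted above; placing them in a 2×2 array with rows as actual classes and columns as predicted classes is an assumption about the layout, chosen to match what scikit-learn's `confusion_matrix` returns.

```python
import numpy as np

# Rows: actual class (0 = no cancer, 1 = cancer); columns: predicted class.
# The counts come from the article; the layout is an assumption.
cm = np.array([[34, 14],
               [8, 21]])

total = cm.sum()
# Correct guesses sit on the diagonal
accuracy = np.trace(cm) / total
print(total, round(accuracy * 100, 1))  # 77 71.4
```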

Forming new DataFrame for Accuracy plot and ROC curve:

    #Formatting for ROC curve
    y_pred_prob = pd.DataFrame(y_pred_prob)
    y_1_prob = y_pred_prob[1]
    y_test_1 = y_test.reset_index()
    y_test_1 = y_test_1['class (Y)']
    X_test = X_test.reset_index()
    CEA = X_test['CEA']
    CA125 = X_test['CA125']

    #Forming new df for ROC Curve and Accuracy curve
    df = pd.DataFrame({'CEA': CEA, 'CA125': CA125, 'y_test': y_test_1, 'model_probability': y_1_prob})
    df = df.sort_values('model_probability')

    #Creating 'True Positive', 'False Positive', 'True Negative' and 'False Negative' columns
    df['tp'] = (df['y_test'] == int(0)).cumsum()
    df['fp'] = (df['y_test'] == int(1)).cumsum()
    #y_test.sum() counts the 1's, so despite its name total_0s holds the number of 1's
    #(the negative class in this write-up) and total_1s holds the number of 0's
    total_0s = df['y_test'].sum()
    total_1s = abs(total_0s - len(df))
    df['total_1s'] = total_1s
    df['total_0s'] = total_0s
    df['total_instances'] = df['total_1s'] + df['total_0s']
    df['tn'] = df['total_0s'] - df['fp']
    df['fn'] = df['total_1s'] - df['tp']
    df['fp_rate'] = df['fp'] / df['total_0s']
    df['tp_rate'] = df['tp'] / df['total_1s']

    #Calculating accuracy column
    df['accuracy'] = (df['tp'] + df['tn']) / (df['total_1s'] + df['total_0s'])

    #Deleting unnecessary columns
    df.reset_index(inplace=True)
    del df['total_1s']
    del df['total_0s']
    del df['total_instances']
    del df['index']

    #with pd.option_context('display.max_rows', None, 'display.max_columns', None): # more options can be specified also
    print(df.to_string())

    #Export the log into excel to show your friends
    export_excel = df.to_excel(r"C:\Users\Andrew\Desktop\df1.xlsx", index=False, header=True)

To understand what is going on in the dataframe below, let’s analyze it, row by row.

Index: This dataframe is sorted on the model_probability, so I reindexed for convenience.

CA125 and CEA: The original testing data protein levels.

model_probability: This column is the output of the random forest fit on our training data: its predicted probability of a patient being classified as “1” (cancerous) based on the input testing protein levels.

The first row is the least-likely instance to be classified as cancerous with its high CA125 and low CEA levels.

y_test: The actual classifications of the testing data we are checking our model’s performance with.

The rest of the columns are based solely on “y_test”, not our model’s predictions.

Think of these values as their own confusion matrices.

These will help us determine where the optimal cut off point will be later.

tp (True Positive): This column starts at 0.

If y_test is ‘0’ (benign), this value increases by 1.

It is a cumulative tracker of all the potential true positives.

The first row is an example of this.

fp (False Positive): This column starts at 0.

If y_test is ‘1’ (cancerous), this value increases by 1.

It is a cumulative tracker of all potential false positives.

The fourth row is an example of this.

tn (True Negative): This column starts at 32 (the total number of 1’s in the testing set).

If y_test is ‘1’ (cancerous), this value decreases by 1.

It is a cumulative tracker of all potential true negatives.

The fourth row is an example of this.

fn (False Negative): This column starts at 45 (the total number of 0’s in the testing set).

If y_test is ‘0’ (benign), this value decreases by 1.

It is a cumulative tracker of all potential false negatives.

The fourth row is an example of this.

fp_rate (False Positive Rate): This is calculated by taking the row’s false positive count and dividing it by the total number of negatives, which are the 1’s in this write-up (32 in our case).

It lets us know the proportion of negatives we would misclassify as positive by setting the cutoff point at that row.

We want to keep this as low as possible.

tp_rate (True Positive Rate): Also known as sensitivity, this is calculated by taking the row’s true positive count and dividing it by the total number of positives (45 in our case).

It lets us know the proportion of positives we would correctly classify by setting the cutoff point at that row.

We want to keep this as high as possible.

accuracy: the sum of true positive and true negative divided by the total instances (77 in our case).

Row by row, we are calculating the potential accuracy based on the possibilities of our confusion matrices.

Figure 5

After looking over the confusion matrices within the dataframe, try to find the highest accuracy percentage.

If you can locate that, you can match it to the corresponding model_probability to discover the optimal cut-off point for our data.

    #Plot
    plt.plot(df['model_probability'], df['accuracy'], color='c')
    plt.xlabel('Model Probability')
    plt.ylabel('Accuracy')
    plt.title('Optimal Cutoff')

Figure 6

With the random forest model, accuracy peaks at a cut-off of 60% model probability, where it reaches 75%.

It may seem counter-intuitive, but this means if we use 60% instead of 50% when classifying a patient as cancerous, it will actually be more accurate using this particular model.

For comparison, the logistic model set its optimal cut-off point at 54% probability with the same accuracy at 75%.
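The search for the most accurate cut-off can also be written as a direct sweep over candidate thresholds. This is a sketch on made-up labels and probabilities, not the article's test set, and it uses the common convention of predicting class 1 when the probability reaches the cut-off.

```python
import numpy as np

# Hypothetical test labels and predicted probabilities of class 1 (made-up values)
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
prob_1 = np.array([0.05, 0.20, 0.35, 0.40, 0.55, 0.58, 0.65, 0.45, 0.80, 0.95])

# Try each cut-off: predict class 1 when the probability is at or above it
cutoffs = np.linspace(0.0, 1.0, 101)
accuracies = np.array([((prob_1 >= c).astype(int) == y_true).mean() for c in cutoffs])

best = accuracies.argmax()
print("best cut-off:", cutoffs[best], "accuracy:", accuracies[best])
print("accuracy at 0.5:", accuracies[50])
```

On these made-up values, the 0.5 cut-off scores 0.8 while a cut-off just above 0.55 scores 0.9, the same kind of shift away from 50% described above.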

Lastly, let’s graph the ROC curve and find AUC:

    #Calculating AUC
    AUC = 1 - (np.trapz(df['fp_rate'], df['tp_rate']))

    #Plotting ROC/AUC graph
    plt.plot(df['fp_rate'], df['tp_rate'], color='k', label='ROC Curve (AUC = %0.2f)' % AUC)

    #Plotting AUC=0.5 red line
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate (Sensitivity)')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()

Figure 7

The black ROC curve is showing the trade-off between our testing data’s true positive rate and false positive rate.

The dotted red line cutting through the center of the graph provides a sense of what a model with no predictive power (random guessing) would look like as an ROC curve.

The closer the ROC line can get to the top-left side, the more predictive our model is.

The closer it resembles the dotted red line, the less predictive it is.

That’s where the area under curve (AUC) comes in.

AUC is the area of space that lies under the ROC curve.

Intuitively, the closer this is to 1, the better our classification model is.

The AUC of the dotted line is 0.5.

The AUC of a perfect model would be 1.
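To make the area calculation concrete, here is a sketch that applies the trapezoidal rule by hand to an ROC curve and compares it with scikit-learn's `roc_auc_score`. The labels and scores are made-up values, and the standard convention (class 1 as positive) is assumed.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical labels and scores (made-up values, not the article's test set)
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.75, 0.90])

# Points of the ROC curve, ordered from (0, 0) to (1, 1)
fpr, tpr, _ = roc_curve(y_true, scores)

# Trapezoidal rule written out: sum of (width * average height) per segment
auc_by_hand = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)

print(auc_by_hand, roc_auc_score(y_true, scores))
```

Both numbers come out at 0.75 here: of the 16 positive/negative pairs in the toy data, the positive example has the higher score in 12.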

Our random forest has an AUC of 0.71.

For comparison, provided is the logistic ROC curve with an AUC of 0.82:

Figure 8

Conclusion

Comparing the accuracy and AUC between the models, logistic regression wins this time.

Regardless, both models have their pros and cons.
