Predicting Cancer with Logistic Regression in PythonUnderstanding the data, logistic regression, testing data, confusion matrices, ROC curveAndrew HershyBlockedUnblockFollowFollowingJul 1SourceIntroduction:In my first logistic regression analysis, we merely scratched the surface.

Discussed were only high level concepts and a bivariate model example.

In this analysis we will look at more challenging data and learn more advanced techniques and interpretations.

Table of Contents:1.

Data Background2.

Data Exploration/Cleaning3.

Data Visualization4.

Building the Model5.

Testing the ModelData Background:Measuring certain protein levels in the body have been proven to be predictive in diagnosing cancer growth.

Doctors can perform tests to check these proteins levels.

We have a sample of 255 patients and would like to gain information with regard to 4 proteins and their potential relationships with cancer growth.

We know:The concentration of each protein measured per patient.

Whether or not each patient has been diagnosed with cancer (0 = no cancer; 1= cancer).

Our goal is:To predict whether future patients have cancer by extracting information from the relationship between protein levels and cancer in our sample.

The 4 proteins we’ll be looking at:Alpha-fetoprotein (AFP)Carcinoembryonic antigen (CEA)Cancer Antigen 125 (CA125)Cancer Antigen 50 (CA50)I received this data set to use for educational purposes from the MBA program @UAB.

Data Exploration / CleaningLet’s jump into the analysis by pulling in the data and importing necessary modules.

%matplotlib inlineimport numpy as npimport pandas as pdimport matplotlib.

pyplot as pltfrom sklearn.

model_selection import train_test_splitimport seaborn as snsdf = pd.

read_excel(r"C:UsersAndrewDesktopcea.

xlsx")df.

head()Figure 1df.

head() gives us the first 5 features of our data set.

Each row is a patient and each column contains a descriptive attribute.

Class (Y) describes if the patient has no cancer (0) or has cancer (1).

The next 4 columns are the protein levels found in that patient’s bloodstream.

df.

describe()Figure 2We can retrieve some basic information about the sample from the describe method.

There are 255 rows in this data set with varying std and means.

An insight worth mentioning is that the 4 proteins all have lower 50% values compared to their means.

This implies that the majority of protein-levels are small.

The 4 proteins additionally have high ranges, which implies there are high-value outliers.

For example, if we look at AFP, the mean is 4.

58, median is 3.

86, and the highest value is 82.

26.

df.

isnull().

sum()It’s always good to check if there are null (nonexistent) values in the data.

This can be addressed in various ways.

Luckily we don’t have to worry about nulls this time.

Data VisualizationLet’s visualize each of our variables and hypothesize what could be going on based on what we see:Target Variable (Y)yhist = plt.

hist('class (Y)', data = df, color='g')plt.

xlabel('Diagnosis (1) vs no diagnosis (0)')plt.

ylabel('Quantity')plt.

title('Class (Y) Distribution')Figure 3There are more cancer-free patients in this sample.

The mean of the “class (Y)” variable as shown in Figure 2 is 0.

44.

AFP#setting the axesaxes = plt.

axes()axes.

set_xlim([0,(df['AFP'].

max())])#making histogram with 20 binsplt.

hist('AFP', data = df, bins = 20)plt.

xlabel('AFP Level')plt.

ylabel('Quantity')plt.

title('AFP Distribution')Figure 4As concluded earlier, there are a relatively high number of patients with low levels of AFP.

#color palettepal = sns.

color_palette("Set1")#setting variable for max level of protein in datasetlim = df['AFP'].

max()#setting bin size to be 20bins = np.

linspace(0,lim,(lim/(lim*(1/20))))#creating new column in df with bin categories per featuredf['AFP_binned'] = pd.

cut(df['AFP'], bins)#creating a crosstab stacked bar chart variablechart = pd.

crosstab(df['AFP_binned'],df['class (Y)'])#normalizing chart and plotting chartchart.

div(chart.

sum(1).

astype(float), axis=0).

plot(kind='bar', color = pal,stacked=True)plt.

xlabel('Bins')plt.

ylabel('Quantity')plt.

title('Normalized Stacked Bar Chart: AFP vs Class(Y)')Figure 5Complementing the distribution histogram, the stacked bar chart above displays the proportion of 1’s to 0’s as the AFP levels increase.

This chart, like the distribution histogram, is also separated into 20 bins.

Combining our knowledge from the distribution and the proportions of our target variable above, we can intuitively determine there likely isn’t much predictive knowledge to be gained from this protein.

Let’s take it step by step.

The majority of patients have an AFP value of under 10, which is shown in the first 2 bars in Figure 4.

Because the majority of patients are in those first 2 bars, the change in Y between them in Figure 5 matters more than the changes in Y among the other bars.

The proportion of cancerous patients increases slightly from bar 1 to bar 2.

The proportion of cancerous patients decreases from bar 2 to 3.

After bar 3, there are so few patients left to analyze that it has little effect on the trend.

From what we can see here, the target variable looks mostly independent to changes in AFP.

The most significant change (bar 1 to 2) is very slight and the changes after that are not in the same direction.

Let’s see how the other proteins look.

CEAFigure 6Figure 7CEA appears to have a different story.

Figure 6 shows the distribution shape is similar to AFP; however, Figure 7 shows different changes in cancer rates.

Just like with AFP (due to the distribution shape) the most significant cancer change between bars would be among bar 1 and 2.

The change from bar 1 to bar 2 went from around 63% noncancerous to 18% noncancerous (or to put that another way, 37% cancerous to 82% cancerous).

Additionally, the change from bin 2 to bin 3 is in the same direction, more cancer.

The outliers starting on bin 5 with 100% cancer reinforces the trend that higher CEA likely indicates cancer.

CA125Figure 8Figure 9CA125 is a bit trickier.

Bar 1 to 2 indicates, like CEA, that a higher level of this protein could cause cancer.

However, it appears there could be 2 trends in this one.

The trend reverses as almost all of the latter bins turn noncancerous.

We will look at this variable in more detail later.

Figure 10Figure 11CA50 doesn’t look promising.

The first 4 bins appear to indicate a trend of higher cancer rates.

However, the trend looks to reverse in bins 7–9.

There is likely a small or negligible relationship between CA50 levels and cancer.

Building the ModelLet’s put together this model and see what the regression can tell us.

#importing moduleimport statsmodels.

api as sm#setting up X and ycols= [‘AFP’,’CEA’,’CA125',’CA50']X= df[cols]y = df[‘class (Y)’]#filling in the statsmodels Logit methodlogit_model = sm.

Logit(y,X)result = logit_model.

fit()print(result.

summary())Figure 12The highlighted values are what matter from this report: our 4 independent variables and their p values.

AFP and CA50 have high p values.

If our alpha is 0.

05, then AFP and CA50 have pvalues too high to reject our null hypothesis (Our null hypothesis is that the proteins levels have no impact on cancer rates).

CEA and CA125 pass the test, however, and are determined to be significant.

Both AFP and CA50 were hypothesized to be disregarded based on what we saw on our stacked bar charts, so it makes sense.

Let’s take those variables out and run the regression a second time:#deleted p values above the 0.

05 alpha thresholdcols= ['CEA','CA125']X= df[cols]y = df['class (Y)']logit_model = sm.

Logit(y,X)result = logit_model.

fit()print(result.

summary())Figure 13With our final coefficients, we have more insight about the relationship between each remaining protein and cancer.

CEA has a positive relationship about 3 times stronger than CA125's negative relationship.

As CEA increases, the likelihood for cancer increases.

As CA125 increases, the likelihood of cancer decreases.

Testing the ModelWe will divide our sample data into training and testing to test our regression results.

from sklearn.

linear_model import LogisticRegressionfrom sklearn import metrics#shuffling dfdf = df.

sample(frac=1).

reset_index(drop=True)#redefining columns cols= ['CEA','CA125']X= df[cols]y = df['class (Y)']#Dividing into training(70%) and testing(30%)X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.

3, random_state=0)#Running new regression on training datalogreg = LogisticRegression()logreg.

fit(X_train, y_train)#Calculating the accuracy of the training model on the testing dataaccuracy = logreg.

score(X_test, y_test)print('The accuracy is: ' + str(accuracy *100) + '%')A good way to visualize the accuracy calculated above is with the use of a confusion matrix.

Below is the conceptual framework for what a confusion matrix is:Figure 14“Confusing” is the key word for a lot of people.

Try to look at one line at a time: The top row is a good place to start.

This row is telling us how many instances were predicted to be benign.

If we look at the columns, we can see the split of the actual values within that prediction.

Just remember, rows are predictions and columns are the actual values.

from sklearn.

metrics import confusion_matrixconfusion_matrix = confusion_matrix(y_test, y_pred)print(confusion_matrix)Match the matrix above to Figure 14 to learn what it is saying:39 of our model’s guesses were True Positive: The model thought the patient had no cancer, and they indeed had no cancer).

18 of our model’s guesses were True Negative: The model thought the patient had cancer, and they indeed had cancer.

14 of our model’s guesses were False Negative: The model thought the patient had cancer, but they actually didn’t have cancer6 of our model’s guesses were False Positive: The model thought the patient had no cancer but they actually did have cancer.

30% of our total data went to testing group, that leaves 255(.

3) = 77 instances that were tested.

The sum of the matrix is 77.

Divide the “True” numbers by the total and that will give the accuracy of our model: 57/77 = 74.

03%.

Keep in mind, we randomly shuffled the data before performing this test.

I ran the regression a few times and got anywhere between 65% and 85% accuracy.

ROC CurveLastly, we are going to perform a Receiver operating characteristic (ROC) analysis as another way of testing our model.

The 2 purposes of this test are toDetermine where the best “cut off” point is.

Determine how well the model classifies through another metric called “Area under curve” (AUC).

We will be creating our ROC curve from scratch.

Below is all the code used to format a new dataframe to calculate the ROC, cutoff point, and AUC.

#Formatting y_test and y_predicted probabilities for ROC curvey_pred_prob = pd.

DataFrame(y_pred_prob)y_1_prob = y_pred_prob[1]y_test_1 = y_test.

reset_index()y_test_1 = y_test_1['class (Y)']#Forming new df for ROC Curve and Accuracy curvedf = pd.

DataFrame({ 'y_test': y_test_1, 'model_probability': y_1_prob})df = df.

sort_values('model_probability')#Creating 'True Positive', 'False Positive', 'True Negative' and 'False Negative' columns df['tp'] = (df['y_test'] == int(0)).

cumsum()df['fp'] = (df['y_test'] == int(1)).

cumsum()total_0s = df['y_test'].

sum()total_1s = abs(total_0s – len(df))df['total_1s'] = total_1sdf['total_0s']= total_0sdf['total_instances'] = df['total_1s'] + df['total_0s']df['tn'] = df['total_0s'] – df['fp']df['fn'] = df['total_1s'] – df['tp']df['fp_rate'] = df['fp'] / df['total_0s']df['tp_rate'] = df['tp'] / df['total_1s']#Calculating accuracy columndf['accuracy'] = (df['tp'] + df['tn']) / (df['total_1s'] + df['total_0s'])#Deleting unnecessary columnsdf.

reset_index(inplace = True)del df['total_1s']del df['total_0s']del df['total_instances']del df['index']dfTo understand what is going on in the dataframe below, let’s analyze it, row by row.

Index: This dataframe is sorted on the model_probability, so I reindexed for convenience.

CA125 and CEA: The original testing data protein levels.

model_probability: This column is from our training data’s logistic model outputting it’s probabilistic prediction of being classified as “1” (cancerous) based on the input testing protein levels.

The first row is the least-likely instance to be classified as cancerous with it’s high CA125 and low CEA levels.

y_test: The actual classifications of the testing data we are checking our model’s performance with.

The rest of the columns are based solely on “y_test”, not our model’s predictions.

Think of these values as their own confusion matrices.

These will help us determine where the optimal cut off point will be later.

tp (True Positive): This column starts at 0.

If y_test is ‘0’ (benign), this value increases by 1.

It is a cumulative tracker of all the potential true positives.

The first row is an example of this.

fp (False Positive): This column starts at 0.

If y_test is ‘1’(cancerous), this value increases by 1.

It is a cumulative tracker of all potential false positives.

The fourth row is an example of this.

tn (True Negative): This column starts at 32(the total number of 1’s in the testing set).

If y_test is ‘1’(cancerous), this value decreases by 1.

It is a cumulative tracker of all potential true negatives.

The fourth row is an example of this.

fn (False Negative): This column starts at 45(the total number of 0’s in the testing set).

If y_test is ‘0’(benign), this value decreases by 1.

It is a cumulative tracker of all potential false negatives.

The fourth row is an example of this.

fp_rate (False Positive Rate): This is calculated by taking the row’s false positive count and dividing it by the total number of positives (45, in our case).

It lets us know the number of false positives we could classify by setting the cutoff point at that row.

We want to keep this as low as possible.

tp_rate (True Positive Rate): Also known as sensitivity, this is calculated by taking the row’s true positive count and dividing it by the total number of positives.

It lets us know the number of true positives we could classify by setting the cutoff point at that row.

We want to keep this as high as possible.

accuracy: the sum of true positive and true negative divided by the total instances (77 in our case).

Row by row, we are calculating the potential accuracy based on the possibilities of our confusion matrices.

Figure 15I pasted the entire dataframe because it’s worthwhile to study it for a while and make sense out of all the moving pieces.

After looking it over, try to find the highest accuracy percentage.

If you can locate that, you can match it to the corresponding model_probability to discover the optimal cut-off point for our data.

#Plottingplt.

plot(df['model_probability'],df['accuracy'], color = 'c')plt.

xlabel('Model Probability')plt.

ylabel('Accuracy')plt.

title('Optimal Cutoff')#Arrow and Starplt.

plot(0.

535612, 0.

753247, 'r*')ax = plt.

axes()ax.

arrow(0.

41, 0.

625, 0.

1, 0.

1, head_width=0.

01, head_length=0.

01, fc='k', ec='k')plt.

show()Figure 16The model probability is 54% where the accuracy is highest at 75%.

It may seem counter-intuitive, but that means if we use 54% instead of 50% when classifying a patient as cancerous, it will actually be more accurate.

If we want to maximize the accuracy, we would set the threshold to 54%, however, due to the extreme nature of cancer, it is probably wise to lower our threshold to below 50% to ensure patients who may have cancer are checked out anyway.

In other words, false positives are more consequential than false negatives when it comes to cancer!Lastly, let’s graph the ROC curve and find AUC:#Calculating AUCAUC = 1-(np.

trapz(df[‘fp_rate’], df[‘tp_rate’]))#Plotting ROC/AUC graphplt.

plot(df[‘fp_rate’], df[‘tp_rate’], color = ‘k’, label=’ROC Curve (AUC = %0.

2f)’ % AUC)#Plotting AUC=0.

5 red lineplt.

plot([0, 1], [0, 1],’r — ‘)plt.

xlabel(‘False Positive Rate’)plt.

ylabel(‘True Positive Rate (Sensitivity)’)plt.

title(‘Receiver operating characteristic’)plt.

legend(loc=”lower right”)Figure 17The black ROC curve is showing the trade-off between our testing data’s true positive rate and false positive rate.

The closer this line can get to the top-left side, the more predictive our model is.

That’s where the area under curve (AUC) comes in.

AUC is, like it sounds, the area of space that lies under the ROC curve.

Intuitively, the closer this is to 1, the better our classification model is.

Our line with an AUC of 0.

82 fairs pretty well.

Please subscribe if you found this helpful.

Other articles by me if you want to learn more:Find out about polynomial regressions hereFind out about rsquared here:.