Exploratory Data Analysis: Haberman’s Cancer Survival Dataset

Deepthi A R, Jul 7

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

EDA is for seeing what the data can tell us beyond the formal modelling or hypothesis testing task.

It is always a good idea to explore a data set with multiple exploratory techniques, especially when they can be done together for comparison.

The goal of exploratory data analysis is to obtain confidence in your data to a point where you’re ready to engage a machine learning algorithm.

Another side benefit of EDA is to refine your selection of feature variables that will be used later for machine learning.

Why EDA?

In a hurry to get to the machine learning stage, some data scientists either entirely skip the exploratory process or do a very perfunctory job.

This is a mistake with many implications: it can lead to inaccurate models, to accurate models built on the wrong data, and to the wrong types of variables being created during data preparation. It also wastes resources, because problems such as skewed data, outliers, too many missing values or inconsistent values are discovered only after the models have been generated.

In this blog, we take Haberman’s Cancer Survival Dataset and apply various EDA techniques using Python.

You can easily download the dataset from Kaggle.

Haberman’s Survival Data Set: Survival of patients who had undergone surgery for breast cancer (www.kaggle.com)

EDA on Haberman’s Cancer Survival Dataset

1. Understanding the dataset

Title: Haberman’s Survival Data

Description: The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Attribute Information:

Age of patient at the time of operation (numerical)
Patient’s year of operation (year - 1900, numerical)
Number of positive axillary nodes detected (numerical)
Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years

2. Importing libraries and loading the file

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    import numpy as np

    # reading the csv file
    haber = pd.read_csv("haberman_dataset.csv")

3. Understanding the data

    # prints the first 5 entries from the csv file
    haber.head()

Output: [table of the first five rows]

    # prints the number of rows and number of columns
    haber.shape

Output: (306, 4)

Observation: The CSV file contains 306 rows and 4 columns.
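If the copy of the file you download has no header row, pandas would treat the first record as column names. The sketch below shows one way to supply the names explicitly. The three rows are made up for illustration and stand in for the real file; the column names are the ones used in this post.

```python
import io

import pandas as pd

# Hypothetical rows standing in for the CSV file; these are NOT real records.
raw = io.StringIO("34,60,0,1\n52,66,4,2\n47,63,1,1\n")

# header=None tells pandas that the first line is data, not column names.
columns = ["age", "year", "nodes", "status"]
sample = pd.read_csv(raw, header=None, names=columns)

print(sample.shape)           # (number of rows, number of columns)
print(list(sample.columns))   # the names supplied above
```

With the real file, the `io.StringIO` object would be replaced by the path to the CSV.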

    # printing the columns
    haber.columns

Output: Index(['age', 'year', 'nodes', 'status'], dtype='object')

    # brief info about the dataset
    # (info() prints its report itself, so it does not need to be wrapped in print())
    haber.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 306 entries, 0 to 305
    Data columns (total 4 columns):
    age       306 non-null int64
    year      306 non-null int64
    nodes     306 non-null int64
    status    306 non-null int64
    dtypes: int64(4)
    memory usage: 9.6 KB

Observations:

There are no missing values in this data set.

All the columns are of the integer data type.

The status column is stored as an integer, but it is a class label, so it should be converted to a categorical type. The value 1 can be mapped to 'Yes', which means the patient survived 5 years or longer, and the value 2 can be mapped to 'No', which means the patient died within 5 years.

    # mapping the values 1 and 2 to 'Yes' and 'No' respectively
    haber['status'] = haber['status'].map({1: 'Yes', 2: 'No'})

    # printing the first 5 records from the dataset
    haber.head()

Output: [table of the first five rows with the mapped status column]

    # describes the dataset
    haber.describe()

Output: [summary statistics table]

Observations:

Count: Total number of values present in the respective column.
Mean: Mean of all the values present in the respective column.
Std: Standard deviation of the values present in the respective column.
Min: The minimum value in the column.
25%: The 25th percentile value.
50%: The 50th percentile value.
75%: The 75th percentile value.
Max: The maximum value in the column.

    # gives the count of each status type
    haber["status"].value_counts()

Output:

    Yes    225
    No      81
    Name: status, dtype: int64

Observations:

The value_counts() function tells how many data points are present for each class. Here, it tells how many patients survived and how many did not.

Out of 306 patients, 225 survived and 81 did not.

The dataset is imbalanced.
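To put a number on the imbalance, the counts reported above (225 and 81) can be turned into proportions. The sketch below rebuilds the counts by hand instead of reading the file; the variable names are illustrative.

```python
import pandas as pd

# Class counts as reported by value_counts() in this post.
counts = pd.Series({"Yes": 225, "No": 81}, name="status")

# Share of each class in the data set.
proportions = counts / counts.sum()
print(proportions.round(3))

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = proportions.max()
print(round(baseline_accuracy, 3))
```

Roughly 73.5% of the patients are in the 'Yes' class, so any classifier built on this data should be compared against that majority-class baseline and not against 50%.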

    # status_yes stores all the records where status is 'Yes'
    status_yes = haber[haber['status'] == 'Yes']
    status_yes.describe()

Output: [summary statistics table for the patients who survived]

    # status_no stores all the records where status is 'No'
    status_no = haber[haber['status'] == 'No']
    status_no.describe()

Observations:

The mean age and the mean year of operation are almost the same for both classes, while the mean number of nodes differs by approximately 5.

Patients who survived had fewer nodes, on average, than patients who did not survive.
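The per-class comparison above can also be produced in a single call with groupby. This is a minimal sketch on a handful of made-up records (not taken from the dataset), assuming the same column names as in this post.

```python
import pandas as pd

# Hypothetical records, made up for illustration only.
toy = pd.DataFrame({
    "age": [34, 52, 47, 61, 45, 58],
    "year": [60, 66, 63, 59, 65, 62],
    "nodes": [0, 4, 1, 13, 0, 9],
    "status": ["Yes", "No", "Yes", "No", "Yes", "No"],
})

# One row per class, with the mean and median of every numeric feature.
per_class = toy.groupby("status")[["age", "year", "nodes"]].agg(["mean", "median"])
print(per_class)
```

Putting both classes in one table makes it easier to spot which feature differs most between them.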

4. Univariate Analysis

The main purpose of univariate analysis is to describe, summarize and find patterns in a single feature.

4.1 Probability Density Function (PDF)

The Probability Density Function (PDF) describes how likely the variable is to take values near x. The plotted curve is a smoothed version of the histogram. Here the height of a bar denotes the percentage of data points in the corresponding group.

    # note: sns.distplot is deprecated since seaborn 0.11 (histplot and kdeplot replace it)
    sns.FacetGrid(haber, hue='status', height=5) \
        .map(sns.distplot, "age") \
        .add_legend()
    plt.show()

Output: [Figure: PDF of Age]

Observations:

Major overlapping is observed, which suggests that age alone does not determine a patient’s survival chances.

Despite the overlap, we can vaguely tell that patients aged 30–40 are more likely to survive and patients aged 40–60 are less likely to survive, while patients aged 60–75 have roughly equal chances of surviving and not surviving.

Yet, this cannot be our final conclusion.

We cannot decide the survival chances of a patient by considering the age parameter alone.

    sns.FacetGrid(haber, hue='status', height=5) \
        .map(sns.distplot, "year") \
        .add_legend()
    plt.show()

Output: [Figure: PDF of Year]

Observations:

There is major overlapping observed. This graph only tells how many of the operations were successful and how many weren’t, so the year of operation alone cannot be used to decide the patient’s survival chances.

However, it can be observed that in the years 1960 and 1965 there were more unsuccessful operations.

    sns.FacetGrid(haber, hue='status', height=5) \
        .map(sns.distplot, "nodes") \
        .add_legend()
    plt.show()

Output: [Figure: PDF of Nodes]

Observations:

Patients with no nodes or 1 node are more likely to survive.

Patients with 25 or more nodes have very little chance of surviving.
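To make the meaning of the bar heights in these density plots concrete, the sketch below builds a density histogram by hand with NumPy. The node counts are made up for illustration, and the choice of 5 bins is arbitrary.

```python
import numpy as np

# Hypothetical node counts, made up for illustration only.
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 7, 13, 22])

# density=True rescales the bar heights so that the total bar area is 1.
heights, edges = np.histogram(nodes, bins=5, density=True)
widths = np.diff(edges)

area = np.sum(heights * widths)      # total area under the histogram
share_per_bin = heights * widths     # fraction of the points that fall in each bin

print(area)
print(share_per_bin)
```

A bar's height is therefore not a probability by itself; multiplying it by the bin width gives the fraction of data points in that bin.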

4.2 Cumulative Distribution Function (CDF)

The Cumulative Distribution Function (CDF) is the probability that the variable takes a value less than or equal to x.

    # PDF and CDF of nodes for the patients who survived
    counts1, bin_edges1 = np.histogram(status_yes['nodes'], bins=10, density=True)
    pdf1 = counts1 / sum(counts1)
    print(pdf1)
    print(bin_edges1)
    cdf1 = np.cumsum(pdf1)
    plt.plot(bin_edges1[1:], pdf1)
    plt.plot(bin_edges1[1:], cdf1, label='Yes')
    plt.xlabel('nodes')

    print("***********************************************************")

    # PDF and CDF of nodes for the patients who did not survive
    counts2, bin_edges2 = np.histogram(status_no['nodes'], bins=10, density=True)
    pdf2 = counts2 / sum(counts2)
    print(pdf2)
    print(bin_edges2)
    cdf2 = np.cumsum(pdf2)
    plt.plot(bin_edges2[1:], pdf2)
    plt.plot(bin_edges2[1:], cdf2, label='No')
    plt.xlabel('nodes')
    plt.legend()
    plt.show()

Output:

    [0.83555556 0.08 0.02222222 0.02666667 0.01777778 0.00444444 0.00888889 0. 0. 0.00444444]
    [ 0. 4.6 9.2 13.8 18.4 23. 27.6 32.2 36.8 41.4 46. ]
    *************************************************************
    [0.56790123 0.14814815 0.13580247 0.04938272 0.07407407 0. 0.01234568 0. 0. 0.01234568]
    [ 0. 5.2 10.4 15.6 20.8 26. 31.2 36.4 41.6 46.8 52. ]

[Figure: CDF of Nodes]

Observations:

83.55% of the patients who survived had nodes in the range 0–4.6.
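A binned CDF depends on where the bin edges fall. As a complement, the sketch below computes an empirical CDF directly from sorted values, with no binning. The node counts are made up for illustration, and `fraction_at_most` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical node counts, made up for illustration only.
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 7, 13])

# Each sorted value raises the cumulative fraction by 1/n
# (tied values stack into one larger step when plotted).
x = np.sort(nodes)
ecdf = np.arange(1, len(x) + 1) / len(x)


def fraction_at_most(threshold):
    """Fraction of observations less than or equal to threshold."""
    return float(np.mean(nodes <= threshold))


print(fraction_at_most(4))
```

Plotting `x` against `ecdf` as a step plot gives the same kind of curve as the CDF above, and `fraction_at_most` answers questions of the form "what share of patients had at most k nodes" for any k.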

4.3 Box Plots and Violin Plots

The box extends from the lower to upper quartile values of the data, with a line at the median.

The whiskers extend from the box to show the range of the data.

Outlier points are those past the end of the whiskers.

A violin plot is a combination of a box plot and a probability density function (PDF).
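The quantities a box plot draws can be computed by hand. The sketch below does so on made-up node counts, assuming the 1.5 × IQR whisker rule that matplotlib and seaborn use by default.

```python
import numpy as np

# Hypothetical node counts, made up for illustration only.
nodes = np.array([0, 0, 1, 1, 2, 3, 4, 6, 9, 30])

# Lower quartile, median and upper quartile: the box and its centre line.
q1, median, q3 = np.percentile(nodes, [25, 50, 75])
iqr = q3 - q1

# Assumed whisker rule: points beyond 1.5 * IQR from the box count as outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = nodes[(nodes < lower_limit) | (nodes > upper_limit)]

print(q1, median, q3, iqr)
print(outliers)
```

In this toy sample the single value 30 lies beyond the upper limit, so it would be drawn as an individual point past the whisker.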

    sns.boxplot(x='status', y='age', data=haber)
    plt.show()
    sns.boxplot(x='status', y='year', data=haber)
    plt.show()
    sns.boxplot(x='status', y='nodes', data=haber)
    plt.show()

Output: [Figure: Box plots]

    # the height argument is dropped here: it is not a violinplot parameter
    sns.violinplot(x="status", y="age", data=haber)
    plt.show()
    sns.violinplot(x="status", y="year", data=haber)
    plt.show()
    sns.violinplot(x="status", y="nodes", data=haber)
    plt.show()

Output: [Figure: Violin Plots]

Observations:

Patients with more than 1 node are less likely to survive.

The more nodes a patient has, the lower the survival chances.

A large percentage of patients who survived had 0 nodes.

Yet a small percentage of patients who had no positive axillary nodes died within 5 years of the operation, so the absence of positive axillary nodes does not guarantee survival.

Comparatively more of the patients who were operated on in 1965 did not survive for more than 5 years.

Comparatively more of the patients in the age group 45 to 65 did not survive.

Patient age alone is not an important parameter in determining the survival of a patient.

The box plots and violin plots for the age and year parameters give similar results, with a substantial overlap of data points.

The overlap in the box plot and the violin plot of nodes is smaller than for the other features, but it still exists, so it is difficult to set a threshold that separates the two classes of patients.

5. Bi-Variate analysis

5.1 Scatter Plots

A scatter plot is a two-dimensional data visualization that uses dots to represent the values obtained for two different variables, one plotted along the x-axis and the other plotted along the y-axis.

    sns.set_style("whitegrid")
    # age is plotted against nodes, matching the figure title and the observations
    sns.FacetGrid(haber, hue="status", height=6) \
        .map(plt.scatter, "age", "nodes") \
        .add_legend()
    plt.show()

Output: [Figure: Scatter Plot: age vs nodes]

Observations:

Patients with 0 nodes are more likely to survive irrespective of their age.

There are hardly any patients with more than 25 nodes.

Patients older than 50 with more than 10 nodes are less likely to survive.

5.2 Pair Plots

By default, sns.pairplot creates a grid of Axes such that each variable in the data is shared on the y-axis across a single row and on the x-axis across a single column.

The diagonal Axes are treated differently, drawing a plot to show the univariate distribution of the data for the variable in that column.

    sns.set_style("whitegrid")
    sns.pairplot(haber, hue="status", height=5)
    plt.show()

Output: [Figure: Pair Plot]

Observations:

The plot between year and nodes separates the two classes comparatively better than the other pairs.

6. Multivariate analysis

6.1 Contour Plot

A contour line or isoline of a function of two variables is a curve along which the function has a constant value.

It is a cross-section of the three-dimensional graph.

    sns.jointplot(x='year', y='age', data=haber, kind="kde")
    plt.show()

Output: [Figure: Contour Plot year vs age]

Observation:

From 1960 to 1964, more operations were done on patients in the age group 45 to 55.

Conclusions:

A patient’s age and year of operation alone are not deciding factors for survival.

Yet, patients younger than 35 have a better chance of survival.

Survival chances decrease as the number of positive axillary nodes increases.

We also saw that the absence of positive axillary nodes cannot always guarantee survival.

Classifying the survival status of a new patient based on the given features is a difficult task, as the data is imbalanced.