Predicting presence of Heart Diseases using Machine Learning
Karan Bhanot
Feb 12
Photo by rawpixel on Unsplash
Machine Learning is used across many spheres around the world.
The healthcare industry is no exception.
Machine Learning can play an essential role in predicting the presence or absence of locomotor disorders, heart diseases and more.
Such information, if predicted well in advance, can provide important insights to doctors, who can then adapt their diagnosis and treatment on a per-patient basis.
In this article, I’ll discuss a project where I worked on predicting potential Heart Diseases in people using Machine Learning algorithms.
The algorithms included K Neighbors Classifier, Support Vector Classifier, Decision Tree Classifier and Random Forest Classifier.
The dataset has been taken from Kaggle.
My complete project is available at Heart Disease Prediction.
Import libraries
I imported several libraries for the project:
numpy: To work with arrays
pandas: To work with csv files and dataframes
matplotlib: To create charts using pyplot, define parameters using rcParams and color them with cm.rainbow
warnings: To ignore all warnings which might show up in the notebook due to past or future deprecation of a feature
train_test_split: To split the dataset into training and testing data
StandardScaler: To scale all the features, so that the Machine Learning model better adapts to the dataset
Next, I imported all the necessary Machine Learning algorithms.
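The imports described above might look like the following. This is a sketch: the aliases (np, pd, plt) and the grouping are conventions assumed here, not statements copied from the original notebook.

```python
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams, cm  # cm.rainbow is the colormap mentioned above

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# The four algorithms compared in this article
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Silence deprecation and other warnings in the notebook output
warnings.filterwarnings("ignore")
```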
Import dataset
After downloading the dataset from Kaggle, I saved it to my working directory with the name dataset.
Next, I used read_csv() to read the dataset and save it to the dataset variable.
Before any analysis, I just wanted to take a look at the data.
So, I used the info() method.
As you can see from the output above, there are a total of 13 features and 1 target variable.
Also, there are no missing values so we don’t need to take care of any null values.
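As a rough illustration of these two steps, the sketch below reads a CSV and inspects it. To stay runnable without the Kaggle file, it reads three made-up rows from an in-memory string. The subset of columns and the values are placeholders, not the real data.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: made-up rows with only a few columns.
# With the real file, pass its path to pd.read_csv() instead.
csv_text = """age,chol,target
63,233,1
41,204,1
67,286,0
"""
dataset = pd.read_csv(io.StringIO(csv_text))

# Column names, non-null counts and dtypes
dataset.info()

# Explicit per-column count of missing values
missing = dataset.isnull().sum()
print(missing)
```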
Next, I used the describe() method.
The method revealed that the range of each variable is different.
The maximum value of age is 77 but for chol it is 564.
Thus, feature scaling must be performed on the dataset.
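A small sketch of how describe() exposes the difference in ranges. Only the two maxima (77 and 564) come from the text above. The other values are invented for illustration.

```python
import pandas as pd

# Toy frame: the maxima match the article, the remaining values are made up
toy = pd.DataFrame({"age": [29, 54, 77], "chol": [126, 246, 564]})

summary = toy.describe()  # count, mean, std, min, quartiles, max per column
ranges = summary.loc["max"] - summary.loc["min"]
print(ranges)
```

A column with a much wider range can dominate distance-based models such as K Neighbors, which is one reason to scale.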
Understanding the data
Correlation Matrix
To begin with, let’s see the correlation matrix of features and try to analyse it.
The figure size is set to 12 x 8 using rcParams.
Then, I used pyplot to show the correlation matrix.
Using xticks and yticks, I’ve added names to the correlation matrix.
colorbar() shows the colorbar for the matrix.
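The steps just described could be sketched as follows. The data here is a deterministic toy frame with three columns, standing in for the real dataset. The Agg backend is selected only so the sketch runs without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import rcParams

# Toy stand-in for the heart disease dataset
toy = pd.DataFrame({
    "age": np.arange(30, 80),
    "chol": (np.arange(50) * 37) % 200 + 150,
    "target": np.tile([0, 1], 25),
})

rcParams["figure.figsize"] = 12, 8
corr = toy.corr()

plt.matshow(corr)                                  # draw the matrix as an image
plt.xticks(np.arange(corr.shape[1]), corr.columns)  # feature names on the x-axis
plt.yticks(np.arange(corr.shape[0]), corr.index)    # feature names on the y-axis
plt.colorbar()                                      # legend for the color scale
```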
[Figure: Correlation Matrix]
It’s easy to see that there is no single feature that has a very high correlation with our target value.
Also, some of the features have a negative correlation with the target value and some have positive.
Histogram
The best part about this type of plot is that it just takes a single command to draw the plots and it provides so much information in return.
Just use dataset.hist().
Let’s take a look at the plots.
It shows how each feature and label is distributed along different ranges, which further confirms the need for scaling.
Next, wherever you see a few discrete bars instead of a spread of values, the variable is actually categorical.
We will need to handle these categorical variables before applying Machine Learning.
Our target labels have two classes, 0 for no disease and 1 for disease.
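A minimal sketch of the histogram call on toy data, plus one way to list the columns that show up as discrete bars. The nunique() check and its threshold of 5 are assumptions added here for illustration. The original project read this off the plots.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import numpy as np
import pandas as pd

# Toy stand-in: two continuous columns and one binary column
toy = pd.DataFrame({
    "age": np.arange(30, 80),
    "chol": np.arange(150, 350, 4),
    "target": np.tile([0, 1], 25),
})

axes = toy.hist(figsize=(8, 6))  # one histogram per column

# Columns with very few distinct values appear as discrete bars
unique_counts = toy.nunique()
likely_categorical = unique_counts[unique_counts <= 5].index.tolist()
print(likely_categorical)
```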
Bar Plot for Target Class
It’s essential that the dataset we are working on is approximately balanced.
An extremely imbalanced dataset can render the whole model training useless.
Let’s understand it with an example.
Let’s say we have a dataset of 100 people with 99 non-patients and 1 patient.
Without even training and learning anything, the model can always say that any new person would be a non-patient and have an accuracy of 99%.
However, as we are more interested in identifying the 1 person who is a patient, we need balanced datasets so that our model actually learns.
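The 99-to-1 example can be checked with a few lines of arithmetic. The sketch below scores a "model" that always predicts non-patient, and also counts how many actual patients it finds.

```python
import numpy as np

# 99 non-patients (0) and 1 patient (1)
y_true = np.array([0] * 99 + [1])

# A "model" that has learned nothing and always answers non-patient
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()

# Share of actual patients that were identified
patients_found = int(((y_pred == 1) & (y_true == 1)).sum())
recall = patients_found / int((y_true == 1).sum())

print(accuracy, recall)
```

The accuracy looks excellent while not a single patient is detected, which is why the class balance is worth checking first.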
For the x-axis, I used the unique() values from the target column and then set their names using xticks.
For the y-axis, I used value_counts() to get the number of samples in each class.
I colored the bars as green and red.
From the plot, we can see that the classes are almost balanced and we are good to proceed with data processing.
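A sketch of such a bar plot on a made-up target column. It takes both the x positions and the heights from the same value_counts() result (sorted by class), a small variation on the description above that keeps labels and counts aligned. Which color goes with which class is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import pandas as pd

# Made-up labels: 0 = no disease, 1 = disease
target = pd.Series([1, 0, 1, 1, 0, 0, 1, 0, 1, 1], name="target")

counts = target.value_counts().sort_index()  # index: class, value: count

bars = plt.bar(counts.index, counts.values, color=["green", "red"])
plt.xticks(counts.index, ["No disease (0)", "Disease (1)"])
plt.xlabel("Target class")
plt.ylabel("Count")
```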
Data Processing
To work with categorical variables, we should break each categorical column into dummy columns with 1s and 0s.
Let’s say we have a column Gender, with values 1 for Male and 0 for Female.
It needs to be converted into two columns, one per category, each holding 1 in the rows where that category applies and 0 elsewhere.
Take a look at the Gist below.
To get this done, we use the get_dummies() method from pandas.
Next, we need to scale the dataset for which we will use the StandardScaler.
The fit_transform() method of the scaler scales the data and we update the columns.
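Both processing steps can be sketched on a toy frame. The Gender column follows the example above. Treating age and chol as the columns to scale, and the values themselves, are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame: two numeric columns and one categorical column (values made up)
toy = pd.DataFrame({
    "age": [29, 54, 77, 60],
    "chol": [126, 246, 564, 300],
    "Gender": [1, 0, 1, 0],
})

# Replace Gender with one 0/1 column per category: Gender_0 and Gender_1
toy = pd.get_dummies(toy, columns=["Gender"], dtype=int)

# Scale the numeric columns to zero mean and unit variance
columns_to_scale = ["age", "chol"]
scaler = StandardScaler()
toy[columns_to_scale] = scaler.fit_transform(toy[columns_to_scale])

print(toy)
```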
The dataset is now ready.
We can begin with training our models.
Machine Learning
In this project, I took 4 algorithms, varied their parameters and compared the final models.
I split the dataset into 67% training data and 33% testing data.
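A sketch of the 67/33 split using train_test_split. The data is synthetic (make_classification) and random_state=0 is an arbitrary choice made here so the split is reproducible. Neither comes from the original project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 300 samples with 13 features and a binary target
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# Hold out 33% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

print(X_train.shape, X_test.shape)
```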
K Neighbors Classifier
This classifier looks at the classes of the K nearest neighbors of a given data point and assigns the majority class to that point.
However, the number of neighbors can be varied.
I varied them from 1 to 20 neighbors and calculated the test score in each case.
Then, I plotted a line graph of the number of neighbors and the test score achieved in each case.
As you can see, we achieved the maximum score of 87% when the number of neighbors was chosen to be 8.
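The loop over neighbors could look like the sketch below. It runs on synthetic data, so its scores will not match the 87% reported above. The variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the processed dataset
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

neighbors = list(range(1, 21))
knn_scores = []
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    knn_scores.append(knn.score(X_test, y_test))  # accuracy on the test split

# Number of neighbors with the highest test score (first one on ties)
best_k = neighbors[knn_scores.index(max(knn_scores))]

plt.plot(neighbors, knn_scores, marker="o")
plt.xticks(neighbors)
plt.xlabel("Number of neighbors (K)")
plt.ylabel("Test score")
```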
Support Vector Classifier
This classifier aims to find a hyperplane that separates the classes as well as possible by maximizing the distance (margin) between the hyperplane and the nearest data points of each class.
There are several kernels based on which the hyperplane is decided.
I tried four kernels namely, linear, poly, rbf, and sigmoid.
Once I had the scores for each, I used the rainbow method to select different colors for each bar and plot a bar graph of the scores achieved by each.
As can be seen from the plot above, the linear kernel performed the best for this dataset and achieved a score of 83%.
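A sketch of the kernel comparison on synthetic data, with cm.rainbow supplying one color per bar. All other SVC parameters are left at their defaults, which is an assumption, and the scores will differ from those reported above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the processed dataset
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

kernels = ["linear", "poly", "rbf", "sigmoid"]
svc_scores = {}
for kernel in kernels:
    svc = SVC(kernel=kernel)
    svc.fit(X_train, y_train)
    svc_scores[kernel] = svc.score(X_test, y_test)

# One RGBA color per kernel, evenly spaced along the rainbow colormap
colors = cm.rainbow(np.linspace(0, 1, len(kernels)))

plt.bar(kernels, [svc_scores[k] for k in kernels], color=colors)
plt.xlabel("Kernel")
plt.ylabel("Test score")
```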
Decision Tree Classifier
This classifier builds a decision tree and uses it to assign a class to each data point.
Here, we can vary the maximum number of features to be considered while creating the model.
I varied the maximum number of features from 1 to 30 (the total features in the dataset after dummy columns were added).
Once we have the scores, we can then plot a line graph and see the effect of the number of features on the model scores.
From the line graph above, we can clearly see that the maximum score is 79% and is achieved when the maximum number of features is set to 2, 4 or 18.
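The sweep over the maximum number of features can be sketched like this, using scikit-learn's max_features parameter on synthetic data with 30 features. Fixing random_state=0 is a choice made here for reproducibility, and the scores will not match those above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 30 features, matching the count after dummy columns
X, y = make_classification(n_samples=300, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

feature_counts = list(range(1, X.shape[1] + 1))
dt_scores = []
for n_features in feature_counts:
    # max_features limits how many features are considered at each split
    tree = DecisionTreeClassifier(max_features=n_features, random_state=0)
    tree.fit(X_train, y_train)
    dt_scores.append(tree.score(X_test, y_test))

plt.plot(feature_counts, dt_scores, marker="o")
plt.xlabel("Max features")
plt.ylabel("Test score")
```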
Random Forest Classifier
This classifier takes the concept of decision trees to the next level.
It creates a forest of trees where each tree is formed by a random selection of features from the total features.
Here, we can vary the number of trees that will be used to predict the class.
I calculated test scores for 10, 100, 200, 500 and 1000 trees.
Next, I plotted these scores in a bar graph to see which gave the best results.
You may notice that I did not directly set the X values as the array [10, 100, 200, 500, 1000].
Doing so would place the bars on a continuous axis from 10 to 1000, which would be impossible to decipher.
So, to solve this issue, I first used the X values as [1, 2, 3, 4, 5].
Then, I renamed them using xticks.
Taking a look at the bar graph, we can see that the maximum score of 84% was achieved for both 100 and 500 trees.
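A sketch of the tree-count comparison, including the xticks relabeling described above. It uses synthetic data and a fixed random_state, both assumptions, so the scores will differ from the reported 84%.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed dataset
X, y = make_classification(n_samples=200, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

estimators = [10, 100, 200, 500, 1000]
rf_scores = []
for n_trees in estimators:
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_train, y_train)
    rf_scores.append(forest.score(X_test, y_test))

# Evenly spaced bar positions 1..5, relabeled with the actual tree counts
positions = list(range(1, len(estimators) + 1))
plt.bar(positions, rf_scores, width=0.8)
plt.xticks(positions, [str(n) for n in estimators])
plt.xlabel("Number of trees")
plt.ylabel("Test score")
```

Placing the bars at 1 to 5 keeps them equally spaced. Plotting at the raw counts would crowd the first three bars near the left edge.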
Conclusion
The project involved analysis of the heart disease patient dataset with proper data processing.
Then, 4 models were trained and tested with maximum scores as follows:
K Neighbors Classifier: 87%
Support Vector Classifier: 83%
Decision Tree Classifier: 79%
Random Forest Classifier: 84%
K Neighbors Classifier achieved the best score of 87% with 8 neighbors.
Thank you for reading! Feel free to share your thoughts and ideas.