Data Manipulation for Machine Learning with Pandas

An introduction to some of the data tools provided by Pandas for use in a machine learning project

Rebecca Vickery
Apr 5

Photo by Michael Fenton on Unsplash

The Python pandas library is an open source project that provides a variety of easy-to-use tools for data manipulation and analysis.
A substantial amount of time in any machine learning project will have to be spent preparing the data, and analysing basic trends and patterns, before actually building any models.
In the following post I want to provide a brief introduction to the various tools available in pandas for manipulating, cleaning, transforming and analysing data before embarking on model building.
Throughout this article I will be using a dataset from drivendata.org, available here.
This training data comprises two separate csv files: one contains characteristics of a number of patients, and the second contains a binary label, “heart_disease_present”, which represents whether or not the patient has heart disease.
Importing Data

Pandas provides tools to read data from a wide variety of sources.
As the dataset I am using is a csv file I will use the read_csv function.
This function has a large number of options for parsing the data.
For most files the default options work fine — this is the case here.
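For files where the defaults do not fit, a few of the parsing options can be sketched as follows. This is only an illustration: the CSV text, the `;` delimiter and the `?` missing-value marker are assumptions made up for the example, not properties of the drivendata files.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """patient_id;age;thal
a1;45;normal
b2;?;reversible_defect
c3;61;normal
"""

# sep, index_col, na_values and dtype are standard read_csv parameters;
# the values passed here are assumptions that suit this toy input.
df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                     # non-default delimiter
    index_col="patient_id",      # use this column as the row index
    na_values=["?"],             # treat "?" as a missing value
    dtype={"thal": "category"},  # store thal as a categorical column
)

print(df)
print(df.dtypes)
```

Because one age is missing after parsing, pandas stores the age column as floats rather than integers.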
    import pandas as pd

    train_values = pd.read_csv('….csv')
    train_labels = pd.read_csv('….csv')

In order to analyse the data I will need both train_values and train_labels to be combined into one dataframe.
Pandas provides a merge function that will join dataframes on either columns or indexes.
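As a rough sketch of the two options, the toy frames below are joined once on a shared column and once on the index. The values are invented for illustration; only the patient_id and heart_disease_present names come from the dataset described above.

```python
import pandas as pd

# Toy frames with hypothetical values.
values = pd.DataFrame({"patient_id": ["a1", "b2", "c3"], "age": [45, 52, 61]})
labels = pd.DataFrame(
    {"patient_id": ["a1", "b2", "d4"], "heart_disease_present": [0, 1, 1]}
)

# Join on a shared column. on= is shorthand for identical left_on/right_on.
by_column = pd.merge(values, labels, on="patient_id", how="inner")

# Join on the index instead, after moving patient_id into it.
by_index = pd.merge(
    values.set_index("patient_id"),
    labels.set_index("patient_id"),
    left_index=True,
    right_index=True,
    how="inner",
)

print(by_column)
print(by_index)
```

With an inner join only the ids present in both frames survive, so c3 and d4 are dropped in both results.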
In the following code I am performing an inner merge using the patient_id to join the correct value with the correct labels.
    train = pd.merge(train_values, train_labels, left_on='patient_id', right_on='patient_id', how='inner')

Missing Data

Pandas provides a number of functions to deal with missing data.
To start with we can use the isna() function to understand how many missing values we have in our data.
In its basic form it checks every value in every row and column, and returns True if the value is missing and False if it is not.
We can therefore write a function that returns the fraction of missing values in each column.
    train.apply(lambda x: sum(x.isna() / len(train)))

In this dataset there are not actually any missing values present.
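Since nothing is missing in this dataset, the calculation is easier to see on a small frame that does have gaps. The sketch below uses invented values and also shows that taking the mean of the boolean mask gives the same fractions as the apply-based version.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values and deliberate gaps.
toy = pd.DataFrame(
    {
        "age": [45, np.nan, 61, 38],
        "thal": ["normal", None, None, "normal"],
    }
)

mask = toy.isna()        # same shape as toy, True where a value is missing
counts = mask.sum()      # True counts as 1, so this is the number missing
fractions = mask.mean()  # mean of booleans is the fraction missing

# The apply-based calculation, written against the toy frame.
via_apply = toy.apply(lambda x: sum(x.isna() / len(toy)))

print(counts)
print(fractions)
```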
When a dataset does contain missing values, we can either use DataFrame.fillna() to replace them with another value, or DataFrame.dropna() to drop the rows containing them.
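A minimal sketch of the difference between the two, on a toy frame with invented values. Both methods return a new dataframe and leave the original untouched unless the result is assigned back.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: one gap in each column.
toy = pd.DataFrame(
    {
        "age": [45, np.nan, 61],
        "thal": ["normal", "normal", None],
    }
)

dropped_any = toy.dropna()                # drop rows with any missing value
dropped_age = toy.dropna(subset=["age"])  # only look at the age column
filled = toy.fillna(0)                    # one static value everywhere

print(dropped_any)
print(dropped_age)
print(filled)
```

Dropping rows can discard a lot of data when gaps are spread across many columns, which is one reason to restrict the check with subset.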
When using fillna() you have a number of options.
You can replace missing values with a static value, which can be either a string or a number.
You can also replace them with a computed value such as the column mean.
It is very likely that you will have to use a different strategy for different columns depending on the data types and volume of missing values.
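One way to apply a different strategy per column is to pass fillna a dictionary mapping column names to fill values. The sketch below is an assumption about how that might look on a toy frame (mean for a numeric column, most frequent value for a categorical one), not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values.
toy = pd.DataFrame(
    {
        "age": [40.0, np.nan, 60.0],
        "thal": ["normal", None, "normal"],
    }
)

fill_values = {
    "age": toy["age"].mean(),       # numeric column: use the mean
    "thal": toy["thal"].mode()[0],  # categorical column: most frequent value
}
filled = toy.fillna(value=fill_values)

print(filled)
```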
In the code below I am demonstrating how you could use some other handy pandas functions, select_dtypes and DataFrame.columns, to fill only the numerical columns with the mean.

    train[train.select_dtypes(…).columns] = train[train.select_dtypes(…).columns].apply(lambda x: x.fillna(x.mean()))

Visualising data

Plotting in pandas isn’t exactly fancy, but if you want to quickly identify some trends in the data it can often be the most efficient way to do so.
The basic plotting function is simply to call .plot() on a series or dataframe.
Plotting in pandas is built on the matplotlib API, so matplotlib needs to be installed; importing it yourself is only necessary if you want to customise the figure further.
This function supports many different visualisation types including line, bar, histograms, boxplots and scatter plots.
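A short sketch of a few of these plot types, selected through the kind argument. It assumes matplotlib is installed, uses the non-interactive Agg backend so that it runs without a display, and works on invented values.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no window is opened

import matplotlib.pyplot as plt
import pandas as pd

# Toy frame with hypothetical values.
toy = pd.DataFrame(
    {
        "age": [45, 52, 61, 38, 57],
        "resting_blood_pressure": [128, 110, 140, 118, 132],
    }
)

# One axes per plot type, passed explicitly through ax=.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
toy["age"].plot(kind="hist", bins=3, ax=axes[0], title="hist")
toy.plot(kind="scatter", x="age", y="resting_blood_pressure", ax=axes[1], title="scatter")
toy.plot(kind="box", ax=axes[2], title="box")
fig.tight_layout()
```

In a notebook the figure is displayed inline; in a script, fig.savefig can be used to write it to a file instead.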
Where the plotting function in pandas becomes really useful is when you combine it with other data aggregation functions.
I will give a couple of examples below.
Combining value_counts() with the bar plot option gives a quick visualisation for categorical features.
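To see what actually gets plotted, here is a small sketch of value_counts() on a toy series. The category names are placeholders rather than values taken from the dataset.

```python
import pandas as pd

# Toy categorical series with placeholder category names.
thal = pd.Series(
    ["normal", "reversible_defect", "normal", "fixed_defect", "normal"],
    name="thal",
)

counts = thal.value_counts()                # most frequent category first
shares = thal.value_counts(normalize=True)  # proportions instead of counts

print(counts)
print(shares)
```

Calling .plot.bar() on either result draws one bar per category, with the bar height taken from the count or the proportion.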
In the code below I am looking at the distribution for thal (a measure of blood flow to the heart) using this method.
    import matplotlib.pyplot as plt
    %matplotlib inline

    train['thal'].value_counts().plot.bar()

Using the groupby function we can plot the mean resting_blood_pressure by slope_of_peak_exercise_st_segment.

    train.groupby('slope_of_peak_exercise_st_segment')['resting_blood_pressure'].mean().plot(kind='bar')

Pandas pivot tables can also be used to provide visualisations of aggregated data.
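To make the shape of a pivot table concrete before it is plotted, here is a minimal sketch on a toy frame. The numbers are invented; the column names follow the dataset. It passes the aggregation as the string "mean", and also shows the equivalent groupby followed by unstack.

```python
import pandas as pd

# Toy frame with hypothetical values.
toy = pd.DataFrame(
    {
        "chest_pain_type": [1, 1, 2, 2, 2],
        "heart_disease_present": [0, 1, 0, 0, 1],
        "serum_cholesterol_mg_per_dl": [200, 240, 180, 220, 260],
    }
)

# Rows come from index=, columns from columns=, cell values from aggfunc.
table = pd.pivot_table(
    toy,
    index="chest_pain_type",
    columns="heart_disease_present",
    values="serum_cholesterol_mg_per_dl",
    aggfunc="mean",
)

# The same aggregation written as groupby + unstack.
same = (
    toy.groupby(["chest_pain_type", "heart_disease_present"])[
        "serum_cholesterol_mg_per_dl"
    ]
    .mean()
    .unstack()
)

print(table)
```

Calling .plot(kind='bar') on such a table draws one group of bars per row, with one bar per column inside each group.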
Here I am comparing mean serum_cholesterol_mg_per_dl by chest_pain_type and the relationship to heart disease being present.
    import numpy as np

    pd.pivot_table(train, index='chest_pain_type', columns='heart_disease_present', values='serum_cholesterol_mg_per_dl', aggfunc=np.mean).plot(kind='bar')

Feature transformation

Pandas also has a number of functions that can be used for most feature transformations you may need to undertake.
For example, most commonly used machine learning libraries require data to be numerical.
It is therefore necessary to transform any non-numeric features, and generally speaking a good way to do this for categorical features is one-hot encoding.
Pandas has a function for this called get_dummies.
This function, when applied to a column of data, converts each unique value into a new binary column.
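A small sketch of what that looks like on a toy frame with invented values. The dtype=int argument is an optional choice made here so that the new columns show 0 and 1 rather than booleans.

```python
import pandas as pd

# Toy frame with hypothetical values.
toy = pd.DataFrame(
    {
        "age": [45, 52, 61],
        "thal": ["normal", "reversible_defect", "normal"],
    }
)

# Only the listed columns are encoded; other columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["thal"], dtype=int)

print(encoded)
```

Each new column is named after the original column and the value it represents, for example thal_normal.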
    train = train.drop('patient_id', axis=1)
    train = pd.get_dummies(train, columns=….columns)

Another way in which a feature may need to be transformed for machine learning is binning.
An example in this data set is the age feature.
It may be more meaningful to group the ages into ranges (or bins) for the model to learn.
Pandas also has a function, pd.cut, which can be used for this.
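A quick sketch of how pd.cut assigns values to bins, using invented ages. By default each bin includes its right edge, so an age of exactly 30 falls in the first bin. The labels passed in the second call are hypothetical names chosen for this example.

```python
import pandas as pd

# Toy ages chosen to sit on and around the bin edges.
ages = pd.Series([29, 30, 31, 45, 70, 77], name="age")
bins = [0, 30, 40, 50, 60, 70, 100]

# Default behaviour: intervals are closed on the right, e.g. (0, 30].
groups = pd.cut(ages, bins)

# Optional readable labels, one per bin (len(bins) - 1 of them).
labelled = pd.cut(
    ages,
    bins,
    labels=["<=30", "31-40", "41-50", "51-60", "61-70", "71+"],
)

print(groups)
print(labelled)
```

Values outside the outermost edges become missing, so the first and last edges should cover the full range of the data.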
    bins = [0, 30, 40, 50, 60, 70, 100]
    train['age_group'] = pd.cut(train['age'], bins)
    train['age_group'].value_counts().plot(kind='bar')

This is just an introduction to some of the features in pandas for use in the early stages of a machine learning project.
There is much more to data manipulation and analysis, and to the pandas library itself.
This can often be a very time-consuming stage, and I find that pandas provides a wide variety of functions and tools that can help to make the process more efficient.