Predicting Animal Shelter Outcomes
A guide to handling categorical variables in supervised machine learning
Rebecca Vickery, Feb 18
Photo by Berkay Gumustekin on Unsplash

I have been working quite a lot recently on encoding categorical variables in Python for machine learning.
I wanted to write a post covering some of the things that I have learnt along the way.
Kaggle have a data set from the Austin Animal Center in which almost all variables are categorical.
The following post covers some examples of how we might process these into features that will be useful in a predictive model.
The data set can be downloaded here; I am using the train.csv file in my workflow.
The data comprises a number of characteristics about animals that were admitted to the Austin Animal Center between 1st October 2013 and March 2016.
You are asked to predict the outcome for each animal, of which there are 5 possibilities:

Adoption
Transfer
Return to owner
Euthanasia
Died

Basic data exploration

Firstly I have read in the train.csv file and produced a simple bar plot to visualise the distribution of the outcome types.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

train = pd.read_csv('train.csv')
train['OutcomeType'].value_counts().plot.bar()

We can see that there are differences in class sizes; in particular, the classes died and euthanasia are, thankfully, much smaller than the others.
It could prove difficult to predict these outcomes without handling the class imbalance.
However, I am not going to cover that here as I would like to focus more on processing the variables.
train.dtypes

You can see from the above that all columns in the data are non-numeric.
Each one will need some pre-processing performed before I can use them in a classifier.
In order to get an idea of how I might process them I run the following to get a count of unique values in each column.
We can see from the below that some columns, such as “Breed” have a large number of unique values.
These will need some more attention in comparison to others.
columns = train.columns
for column in columns:
    print(column)
    print(train[column].nunique())

Processing categorical data

Generally there are three main ways to convert categorical data into numeric.
These are as follows:

1. One hot encoding

One hot encoding takes categorical data and makes new columns for each unique value.
A 0 or 1 denotes whether or not the row has that value.
Let's look at an example using our animal shelter data.
Take the column “AnimalType”: in this column we have two unique values, cat and dog.
When we apply a one hot encoder transformation, two new columns are produced, each containing 1s and 0s.
This technique works well for columns such as this where the data is of low cardinality.
But if a column has high cardinality, or if the order of its categories has a meaning that can be interpreted numerically, then other approaches should be used.
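As a quick illustration, here is a minimal sketch of one hot encoding with pandas get_dummies. The data frame is toy data that I made up to stand in for the “AnimalType” column; it is not read from the Kaggle file.

```python
import pandas as pd

# Toy data standing in for the "AnimalType" column (values are illustrative).
animals = pd.DataFrame({"AnimalType": ["Dog", "Cat", "Dog", "Cat", "Cat"]})

# get_dummies creates one indicator column per unique value.
# astype(int) makes the indicators 0/1 regardless of the pandas version's default dtype.
encoded = pd.get_dummies(animals["AnimalType"], prefix="AnimalType").astype(int)

print(encoded)
```

Each row has exactly one 1 across the two new columns, which is where the name "one hot" comes from.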
(Figure: In one hot encoding a new column is produced for each unique value.)

2. Label encoding

Label encoding simply maps the unique values to numbers.
For example, if we were to apply this to the “SexuponOutcome” column, then each unique value would become a number that could be mapped back to the string representation.
This approach, however, has significant drawbacks for handling categorical variables.
The problem is that with categorical data there is often no inherent order among the values, but if we simply convert them to numbers, the classifier may infer relationships that are not present.
For example, if “Neutered Male” was given 1 and “Spayed Female” 2, the classifier may infer that “Spayed Female” is somehow greater than “Neutered Male” because it has a larger number.
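To make the drawback concrete, here is a small sketch of label encoding written in plain Python. The values are made up for illustration, and the alphabetical ordering is my assumption about a typical encoder (scikit-learn's LabelEncoder also sorts its classes).

```python
# Toy values standing in for the "SexuponOutcome" column (illustrative only).
values = ["Neutered Male", "Spayed Female", "Intact Male", "Spayed Female"]

# Assign each unique value an integer, in alphabetical order.
mapping = {label: code for code, label in enumerate(sorted(set(values)))}
encoded = [mapping[v] for v in values]

# The inverse mapping recovers the original strings.
inverse = {code: label for label, code in mapping.items()}
decoded = [inverse[c] for c in encoded]

print(mapping)
print(encoded)
```

The codes are purely an artefact of sorting the strings, yet a model that treats them as numbers will see “Spayed Female” (2) as larger than “Intact Male” (0).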
Converting to ordinal or continuous variables

You may find that some categorical columns actually have a numerical meaning.
In this data set that is true for the AgeuponOutcome feature, where a higher value should be treated as being greater than others.
This isn’t really categorical data, and therefore should be transformed into the numerical representation of the age.
I will discuss the method I used for this later on in the post.
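To give a rough idea of what this kind of conversion involves, here is a simplified sketch that turns age strings into an approximate number of days. The function name age_to_days and the unit lengths (30 days per month, 365 per year) are my own assumptions for illustration; this is not the function used later in the post.

```python
def age_to_days(age_string):
    """Convert strings such as '2 years' or '3 weeks' into an approximate age in days."""
    # Approximate unit lengths; these constants are assumptions for illustration.
    days_per_unit = {"day": 1, "week": 7, "month": 30, "year": 365}
    number, unit = age_string.split(" ")
    unit = unit.lower().rstrip("s")  # 'years' -> 'year'
    return int(number) * days_per_unit[unit]

ages = ["1 year", "2 years", "3 weeks", "5 months", "4 days"]
print([age_to_days(a) for a in ages])
```

Once the values are plain numbers, the ordering between them is real, so a classifier can use it directly.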
Missing data

Before processing the variables I am going to do a little cleaning of the data.
I need to check for any missing values and handle them appropriately.
Firstly I run the following code to check for missing values.
train.apply(lambda x: sum(x.isnull()/len(train)))

This gives the following output.
You can see that we have a high proportion of missing values for two of the features, “Name” and “OutcomeSubtype”, but only a small proportion for “SexuponOutcome” and “AgeuponOutcome”.
I am therefore going to have to use a number of different methods for handling this missing data.
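For reference, the fraction of missing values per column can also be computed with isnull().mean(). The sketch below uses a toy data frame whose column names follow the data set but whose values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (values are invented for illustration).
df = pd.DataFrame({
    "Name": ["Rex", np.nan, np.nan, "Tom"],
    "AgeuponOutcome": ["1 year", "2 years", np.nan, "3 weeks"],
    "AnimalType": ["Dog", "Cat", "Cat", "Dog"],
})

# isnull() gives a boolean frame; the mean of each column is the fraction missing.
missing_fraction = df.isnull().mean()
print(missing_fraction)
```

Columns with a large fraction (like “Name” here) usually call for a different treatment than columns with only a handful of gaps.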
Looking at the data I can see that the “OutcomeSubtype” is a further categorisation of the label we are trying to predict.
Therefore in a real world use case for this predictive model it is unlikely that we would have this feature in the data.
I am therefore going to drop this from my features.
train = train.drop('OutcomeSubtype', axis=1)

Missing data in the “Name” field could well turn out to have some impact on the outcome of an animal, particularly for an outcome such as “return to owner”.
So rather than seek to fill the missing data with a value, I will instead turn it into a new feature, “has_name”.
This will have a 0 where a name is missing and a 1 where it is present.
To do this I fill all missing data with a 0 in the “Name” column.
I then use this to create a new column called “has_name”.
Finally I drop the “Name” column as this is unlikely to be useful in the final model.
train['Name'] = train['Name'].fillna(value=0)
train['has_name'] = (train['Name'] != 0).astype('int64')
train = train.drop('Name', axis=1)

Finally I will handle the missing values in “SexuponOutcome” and “AgeuponOutcome”.
As these are both categorical and have only a small amount of missing data, the simplest method will be to fill with the most commonly occurring value.
The following code does this.

train = train.apply(lambda x: x.fillna(x.value_counts().index[0]))
train.apply(lambda x: sum(x.isnull()/len(train)))

We can see from the output that we now have no missing values and we have created a new feature.
Finally I am going to drop the AnimalID column as that will not be useful in the model.
train = train.drop('AnimalID', axis=1)

High cardinality

As previously described, the most sensible method to convert categorical variables into numerical data is one hot encoding.
However, there is a problem when you have a feature that has high cardinality, or in other words has a large number of unique values.
We have two examples of this in our data set — “Breed” and “Color”.
If we were to simply use one hot encoding for these features we would end up creating 1,380 new features from “Breed” and 366 from “Color”.
This would be unlikely to produce a well-performing model.
There are a number of ways to handle this situation and I will talk through them in the next section.
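Before choosing an approach, it can help to check how many indicator columns one hot encoding would create for each feature. The sketch below does this on toy data; the threshold MAX_UNIQUE is a hypothetical value I picked for illustration, not something from the original analysis.

```python
import pandas as pd

# Toy data; the real "Breed" and "Color" columns have far more unique values.
df = pd.DataFrame({
    "AnimalType": ["Dog", "Cat", "Dog", "Cat"],
    "Breed": ["Pit Bull Mix", "Domestic Shorthair Mix", "Beagle", "Siamese"],
    "Color": ["Brown/White", "Black", "Tricolor", "Black"],
})

# nunique() per column equals the number of columns one hot encoding would create.
cardinality = df.nunique().sort_values(ascending=False)

# Hypothetical threshold for flagging columns that need special handling.
MAX_UNIQUE = 3
high_cardinality = cardinality[cardinality > MAX_UNIQUE].index.tolist()

print(cardinality)
print(high_cardinality)
```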
Most popular values

One way to handle this is to encode each of the most commonly occurring values, and place the remaining values into a label called “other”.
Let's take “Color” as an example: we can run the code below, which takes any values occurring fewer than 300 times and places them into the “other” label.
We use this to create a new feature called “top_colors”.
color_counts = train['Color'].value_counts()
color_others = set(color_counts[color_counts < 300].index)
train['top_colors'] = train['Color'].replace(list(color_others), 'other')
print(train['top_colors'].nunique())

This reduces the number of unique values in the column from 366 to 28.
This is a much more manageable number for one hot encoding.
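The same idea can be wrapped in a small reusable helper. This is only a sketch on toy data: the function name group_rare_values and its parameters are my own, and the threshold of 2 is chosen to suit the tiny example rather than the 300 used above.

```python
import pandas as pd

def group_rare_values(series, min_count, other_label="other"):
    """Replace values that occur fewer than min_count times with other_label."""
    counts = series.value_counts()
    frequent = counts[counts >= min_count].index
    # Keep frequent values as they are; everything else becomes other_label.
    return series.where(series.isin(frequent), other_label)

colors = pd.Series(["Black", "Black", "White", "Black", "Tan", "White", "Blue Merle"])
top_colors = group_rare_values(colors, min_count=2)
print(top_colors.tolist())
```

The threshold is a trade-off: a lower value keeps more detail but produces more columns after one hot encoding.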
New features

Another method is to create a meaningful categorisation of the values.
I will give an example of this here, with the “Breed” column.
We can see from the data that some of the values contain the word “Mix” whilst others don’t.
Even with limited knowledge of animals, we can reasonably assume that “Mix” indicates a cross-breed.
This could prove to be a useful feature, so we can use the code below to create it.
import re

train['breed_type'] = train.Breed.str.extract('(mix)', flags=re.IGNORECASE, expand=False).str.lower().fillna('pure')

I also did something similar with “Color”, creating a new feature called “multi_colors” that records whether the animal has more than one colour.
train['multi_colors'] = train['Color'].apply(lambda x : 1 if '/' in x else 0)

3. Numerical representation

In this data set the “AgeuponOutcome” feature is not really categorical data.
The order of the values has a meaning.
Therefore the best way to handle this is to convert it into its numerical representation.
I am going to convert this feature into the age in days.
I found this fantastic function in this Kaggle kernel which does this perfectly.
This code creates a new feature and drops the original column.
def age_converter(row):
    age_string = row['AgeuponOutcome']
    [age, unit] = age_string.split(" ")
    unit = unit.lower()
    # An age of 0 is treated as one full unit (1, 7, 30 or 365 days).
    if "day" in unit:
        if age == '0': return 1
        return int(age)
    elif "week" in unit:
        if age == '0': return 7
        return int(age) * 7
    elif "month" in unit:
        if age == '0': return 30
        return int(age) * 4*7  # a month is approximated as 4 weeks
    elif "year" in unit:
        if age == '0': return 365
        return int(age) * 4*12*7  # a year is approximated as 12 four-week months

train['age_numeric'] = train.apply(age_converter, axis=1)
train = train.drop('AgeuponOutcome', axis=1)

The remaining columns for the first iteration of this model can be directly converted using one hot encoding.
I am using pandas get_dummies to convert all categorical variables in the code below.
train = train.drop(['Breed','Color', 'DateTime'], axis=1)
numeric_features = train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = train.select_dtypes(include=['object']).drop(['OutcomeType'], axis=1).columns
dummy_columns = pd.get_dummies(train[categorical_features])
final_train = pd.concat([dummy_columns, train], axis=1)
final_train = final_train.drop(['AnimalType', 'breed_type', 'SexuponOutcome', 'top_colors'], axis=1)

We now have fully numerical data, so let's see how this performs in a classifier.
Training a classifier

In the following code I am specifying the features X and target y, and using scikit-learn's train_test_split to create the training and testing data.
X = final_train.drop('OutcomeType', axis=1)
y = final_train['OutcomeType']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

I am then using a simple RandomForestClassifier to find out roughly how the model will perform.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
rf_model = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
y_prob = rf_model.predict_proba(X_test)
print(log_loss(y_test, y_prob))

This model does not perform particularly well, so it looks like there is more work to be done on feature engineering, model selection or optimisation.
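To help interpret the log loss figure, here is a plain Python sketch of how multiclass log loss is calculated, using made-up probabilities for three of the outcome classes. The helper multiclass_log_loss is my own illustration, not scikit-learn's implementation.

```python
import math

def multiclass_log_loss(true_labels, predicted_probs):
    """Average negative log of the probability assigned to the true class."""
    eps = 1e-15  # clip probabilities to avoid log(0)
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        p = min(max(probs[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(true_labels)

true_labels = ["Adoption", "Transfer", "Died"]

# Predictions that put most of the probability on the correct class.
confident = [
    {"Adoption": 0.8, "Transfer": 0.1, "Died": 0.1},
    {"Adoption": 0.2, "Transfer": 0.7, "Died": 0.1},
    {"Adoption": 0.1, "Transfer": 0.1, "Died": 0.8},
]
# Predictions that spread probability evenly over the three classes.
uniform = [{"Adoption": 1/3, "Transfer": 1/3, "Died": 1/3}] * 3

print(multiclass_log_loss(true_labels, confident))
print(multiclass_log_loss(true_labels, uniform))
```

A model that guesses uniformly over k classes scores ln(k), so for the five outcome types here a uniform guess would give roughly 1.61. That is a useful reference point when judging whether a score is good or bad.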
I am not going to cover this in depth here, but as one final thing let's take a peek at the feature importances for this model.
This may inform any further feature transformations or engineering work.
import numpy as np

features = X.columns
importances = rf_model.feature_importances_
indices = np.argsort(importances)
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.show()

The output shows that features relating to the presence or absence of a name, age, sex and animal type are of particular importance.
There is likely to be a lot more work we can do here to improve the performance of this model.
For example there are many more possibilities for transforming the categorical variables than I have listed here.
As with a lot of data science work much of this is down to trial and error.
I hope that in this article I have given a broad introduction to the ways categorical variables can be handled in a machine learning project.