Machine Learning Explainability
Summary of the kaggle.com Micro Course
Phillip Wenig
Feb 20

Recently, I did the micro course Machine Learning Explainability on kaggle.
I can highly recommend this course as I have learned a lot of useful methods to analyse a trained ML model.
For a brief overview of the topics covered, this blog post will summarize my learnings.
The following paragraphs will explain the methods Permutation Importance, Partial Dependence Plots and SHAP Values.
I will illustrate the methods using the famous Titanic dataset.
Photo by Maximilian Weisbecker on Unsplash

Before we start

The methods I am going to describe can be used with any model and are applied after a model is fit to the dataset.
Each of the questions answered below corresponds to a section of the online course mentioned above.
The Titanic dataset can be used to train a classification model which predicts whether a passenger survived or died on the Titanic.
I used a simple decision tree for this task and neither optimized it nor balanced the data.
It was trained purely to illustrate the methods.
The code for the analyses can be found on this GitHub repository.

What variables most affect the survival?

For this question, the course suggests the Permutation Importance method.
This method takes one column of the data and shuffles its values while keeping the other columns fixed.
With that altered data, the method calculates the model’s performance.
The more the performance decreases in comparison to the original data, the more important the feature in the shuffled column is for the model.
This is done for all columns one by one in order to find the importance of all features.
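The shuffling procedure described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the eli5 implementation: the rows are made-up toy data (not the real Titanic dataset), and `model`, `accuracy` and `permutation_importance` are hypothetical helper names chosen for this example.

```python
import random

# Made-up toy rows (NOT the real Titanic data): [sex, age, fare], sex: 1 = female
X = [
    [1, 22, 70.0], [0, 35, 8.0], [1, 41, 15.0], [0, 28, 9.5],
    [1, 30, 52.0], [0, 54, 60.0], [0, 19, 7.0], [1, 8, 20.0],
]
y = [1, 0, 1, 0, 1, 0, 0, 1]  # 1 = survived

def model(row):
    """Hypothetical stand-in for a fitted classifier: it only looks at sex."""
    sex, age, fare = row
    return 1 if sex == 1 else 0

def accuracy(rows, labels):
    return sum(model(r) == t for r, t in zip(rows, labels)) / len(labels)

def permutation_importance(rows, labels, n_repeats=20, seed=0):
    rng = random.Random(seed)
    baseline = accuracy(rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle one column, keep all other columns fixed
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            # Importance = drop in performance caused by the shuffle
            drops.append(baseline - accuracy(permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

importances = permutation_importance(X, y)
for name, imp in zip(["sex", "age", "fare"], importances):
    print(f"{name}: {imp:.3f}")
```

Because the toy model ignores age and fare, shuffling those columns cannot change its accuracy, so only the sex column gets a non-zero importance.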
Using the eli5 python package, I found out the following numbers for our dataset.

Importance of each feature for the survival on the Titanic

As we can see, the most important feature for surviving the Titanic is the sex, followed by the age and the fare.
These numbers only tell us which features are important but not how they affect the predictions, e.g. we don’t see whether it is better to be female or male.
To find out how features are affecting the outcome, the course suggests using Partial Dependence Plots.

How do the variables affect the survival?

One method for finding out how variables affect the outcome is Partial Dependence Plots, as suggested by the online course.
This method takes a row of the dataset, repeatedly changes the value of one feature and records the model’s prediction for each value.
This is done for many rows, and the predictions are then averaged in order to find out how the feature influences the target over a wide range of values.
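As a rough sketch of this idea, the partial dependence of one feature can be computed by hand: set the feature to each grid value in every row, predict, and average. The rows below are made-up toy data, and `predict_proba` is a hypothetical scoring function standing in for a fitted model; neither comes from the course or from pdpbox.

```python
# Made-up toy rows (NOT the real Titanic data): [sex, age, pclass], sex: 1 = female
X = [
    [1, 22, 1], [0, 35, 3], [1, 41, 2], [0, 28, 3],
    [1, 30, 3], [0, 54, 1], [0, 19, 3], [1, 8, 2],
]

def predict_proba(row):
    """Hypothetical stand-in for a fitted model's survival probability."""
    sex, age, pclass = row
    return 0.3 + 0.4 * sex - 0.1 * (pclass - 1) + (0.2 if age < 10 else 0.0)

def partial_dependence(rows, col, grid):
    curve = []
    for value in grid:
        # Overwrite the chosen feature in every row, leave the rest untouched
        preds = [predict_proba(r[:col] + [value] + r[col + 1:]) for r in rows]
        # Average over all rows to get one point of the curve
        curve.append(sum(preds) / len(preds))
    return curve

age_grid = [5, 25, 45, 65]
pd_age = partial_dependence(X, col=1, grid=age_grid)
for age, p in zip(age_grid, pd_age):
    print(f"age={age}: mean predicted survival = {p:.3f}")
```

Plotting `pd_age` against `age_grid` would give the line of a partial dependence plot for this toy model.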
The python package pdpbox can be used to create a plot showing how the outcome behaves using different values.
Plotting the partial dependence of ‘age’ on the target looks like the following for our dataset.

Partial dependence plot for age on survival on Titanic

The plot shows nicely that the probability of surviving on the Titanic shrinks with increasing age.
Being between 20 and 30 in particular was not a good age to be on the Titanic.
The blue area around the line indicates the confidence of the line.
It shows that other factors also play a big role in whether someone survived.
Luckily the pdpbox library provides 2-dimensional plots to show the interaction of 2 features towards the outcome.
Let’s see how the age is interacting with the class the passenger traveled in.

Partial dependence plot for age-class interaction on survival on Titanic

The interaction plot shows that, for a given age, the class can really make a difference for survival, e.g. being between 20 and 30 in 3rd class gives a survival probability below 0.3, whereas a 1st class ticket results in a probability of about 0.5 at the same age.
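The two-feature variant works the same way, only over a grid of value pairs. The sketch below is an assumption-laden illustration and not pdpbox code: the data is made up, and the hypothetical `predict_proba` deliberately penalises lower classes more strongly for passengers aged 20 to 30, so that an interaction is visible.

```python
from itertools import product

# Made-up toy rows (NOT the real Titanic data): [sex, age, pclass], sex: 1 = female
X = [
    [1, 22, 1], [0, 35, 3], [1, 41, 2], [0, 28, 3],
    [1, 30, 3], [0, 54, 1], [0, 19, 3], [1, 8, 2],
]

def predict_proba(row):
    """Hypothetical model with an age-class interaction built in."""
    sex, age, pclass = row
    young_adult = 20 <= age <= 30
    # The class penalty is larger for young adults than for everyone else
    penalty = (0.15 if young_adult else 0.05) * (pclass - 1)
    return 0.4 + 0.3 * sex - penalty

def partial_dependence_2d(rows, col_a, grid_a, col_b, grid_b):
    surface = {}
    for a, b in product(grid_a, grid_b):
        preds = []
        for r in rows:
            modified = list(r)
            modified[col_a] = a  # fix the first feature
            modified[col_b] = b  # fix the second feature
            preds.append(predict_proba(modified))
        surface[(a, b)] = sum(preds) / len(preds)
    return surface

surface = partial_dependence_2d(X, 1, [10, 25, 50], 2, [1, 2, 3])
for (age, pclass), p in sorted(surface.items()):
    print(f"age={age}, class={pclass}: {p:.3f}")
```

In this toy setup the gap between 1st and 3rd class is widest at age 25, which is the kind of pattern a 2-dimensional plot makes visible.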
Let’s look at a specific 30-year-old passenger in the 3rd class and see how her conditions influenced her survival.
For analyzing specific data samples, the course suggests using SHAP values.

How did the variables affect the survival of a specific passenger?

SHAP values are used to show the effects of the features for a single sample, here a single passenger.
Here, the method also takes one feature and compares the value to a baseline value for that feature without changing the other features.
That is done for all features.
In the end, the method returns one SHAP value per feature; together they sum up to the difference between the model’s prediction for this passenger and the baseline prediction.
Some of those values are positive, pushing the prediction up, and some are negative, pushing it down.
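To make this additivity concrete, here is a brute-force sketch that computes exact Shapley values for a tiny model by enumerating all feature subsets. It is not how the shap library works internally (it uses much faster algorithms), and everything in it is an assumption for illustration: the made-up background rows, the hypothetical `predict_proba`, and the helper name `shapley_values`.

```python
from itertools import combinations
from math import factorial

# Made-up background rows (NOT the real Titanic data): [sex, age, pclass]
background = [
    [1, 22, 1], [0, 35, 3], [1, 41, 2], [0, 28, 3],
    [0, 54, 1], [0, 19, 3], [1, 8, 2], [0, 40, 2],
]

def predict_proba(row):
    """Hypothetical stand-in for a fitted model's survival probability."""
    sex, age, pclass = row
    return 0.3 + 0.4 * sex - 0.1 * (pclass - 1) + (0.2 if age < 10 else 0.0)

def shapley_values(predict, x, background):
    n = len(x)

    def value(subset):
        # Mean prediction when features in `subset` take the passenger's
        # values and all other features come from the background rows
        total = 0.0
        for b in background:
            row = [x[i] if i in subset else b[i] for i in range(n)]
            total += predict(row)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for size in range(n):
            for s in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                contrib += weight * (value(set(s) | {i}) - value(set(s)))
        phi.append(contrib)
    return phi, value(set())

passenger = [1, 30, 3]  # female, 30 years old, 3rd class
phi, base_value = shapley_values(predict_proba, passenger, background)
for name, p in zip(["sex", "age", "pclass"], phi):
    print(f"{name}: {p:+.3f}")
print(f"base value + sum of SHAP values = {base_value + sum(phi):.3f}")
print(f"model prediction               = {predict_proba(passenger):.3f}")
```

The last two printed numbers agree: the baseline plus all contributions reproduces the model’s prediction for this passenger.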
Our specific 30-year-old passenger has the following SHAP values.

Showing parameters of a specific passenger influencing the output

The biggest impact comes from the fact that she is a woman, followed by the fare price and so on.
The blue value, which represents the 3rd class she was travelling in, had a negative effect on the outcome.

Summary Plot

To show a summary of the SHAP values for all passengers, I used the summary plot, which looks like the following.

Summary plot of features importance towards the survival on the Titanic

This plot shows the feature values by color, with red being the maximum value and blue being the minimum.
Hence the categorical feature “Sex” has only 2 colors.
The horizontal position is the effect on the outcome for a specific sample; the vertical position shows the feature importance.
It can be seen that being female mostly had a positive effect and being male a mostly negative one in terms of survival.
For the “Fare” feature, by contrast, the effect is less clearly distinguishable.
Most of its values are spread across the x-axis.

Dependence Contribution Plots

To show how age contributed to the actual outcome given the sex of a passenger, dependence contribution plots come in handy.
The following plot shows this contribution.

Interaction of Age and Sex towards contribution to survivals on the Titanic

We see a small downward trend with increasing age for female passengers (red dots).
These dots, however, lie mainly above 0, i.e. they have a positive value.
The male passengers are mostly below 0, especially for ages between 20 and 30, as mentioned before.

Conclusion

I really enjoyed taking this course for multiple reasons.
It is also very important for every data analyst and scientist to analyse either the data or the model itself.
Especially when thinking about deep neural networks or other kinds of black box methods whose decisions are hard to follow, the introduced methods can definitely add value.