Interpretability and Random Forests

How and why might we derive feature importance from random forest classifiers?

Tom Grigg · Apr 8

Machine learning came about because we humans can’t always explain ourselves very well, especially to machines.

For a long time, machines were only capable of performing precise step-by-step instructions, and often simple human tasks come all too naturally for us to be able to explicitly write them down as an algorithm.

Take the example of recognising that something is a cat — I can’t explain to you or a computer how exactly I know something is a cat.

In fact, nobody ever really explained it to me: I just met a bunch of cats and eventually, being the good little neural network that I am, I got the gist.

You’ve got a lot to learn, kid.

I know a cat usually has two ears, four legs, a roundish face shape, and those distinctively feline eyes, but this is just unravelling the first layer of explanation.

To build on this, I’d have to algorithmically explain what every adjective and noun in that previous sentence means (two, ear, round, etc.), as well as expand on details such as exactly what a cat’s eye looks like.

In theory, I might actually be able to continue to unravel this term by term — it’s just it would probably take me a ridiculous amount of time.

We humans rather intelligently decided that it was probably easier to model the learning process mathematically than it was to algorithmically decompose every decision-making process ever, and that’s how we got started with machine learning.

However, the learning models that we use don’t always correspond to any sort of ‘natural’ approach to learning, and modelling learning doesn’t solve our fundamental communication problem with computers.

Machine learning does not get rid of the difficulty of explaining relationships in data, and this results in a conflict between interpretability and accuracy when building and analysing learned models.

The Conflict Between Interpretability and Accuracy

There are essentially only two objectives in data science when it comes to the use of machine learning models:

1. Application, whereby we use the trained model to perform a task, ideally as accurately and effectively as possible.

2. Interpretation, whereby we use the trained model to gain insight into our data via the learned relationships between the feature and response variables.

As we just discussed, we humans have little to no idea how we actually go about recognising stuff, but we are really good at it.

In other words, the internal logic of our brains is accurate and great for application, but it isn’t very interpretable.

It shouldn’t come as a surprise then, that generally the most accurate machine learning methods are the least interpretable.

RoboCat: First Contact

So-called black-box models, such as neural networks, give us little information regarding their decision-making processes; the algebraic complexity of the functions they learn tends to lose any meaning with respect to the original set of feature variables.

On the other hand, models that lend themselves to interpretability such as linear regression and decision trees tend to fall short in the accuracy department, as they often fail to capture any nuanced or complicated relationships within a dataset.

We could broadly summarise this relationship as follows:

With data of sufficient complexity, there is a natural tradeoff between the interpretability of a decision algorithm and its accuracy in application.

In the same way that computers don’t have a natural affinity for understanding cats in terms of legs and eyes and ears, humans don’t possess an intrinsic understanding of higher order numerical relationships.

Our inability to meaningfully interpret the complex decision boundary of a neural network, and our inability to explain to a computer what a cat is, are two sides of the same coin.

Things simply get lost in translation in either direction.

Finding a Balance

Our prehistoric caveman brains seem to be quite fond of interpreting linear relationships and decision boundaries.

There’s no doubt that linear regression is a highly interpretable algorithm: if x increases by 1, y increases by m, and we can all go home.

However, lines are simple and highly biased, and thus they don’t usually make for great learning algorithms.

Sure, we can tweak our definition of linear and expand our bases to include polynomial, exponential, and whatever-else terms, but at some point the debt must be paid, and we lose that sense of natural meaning in our parameters.

Barney and Fred, excited about Linear Regression.

Another simple-to-understand but fundamentally weak classifier is the decision tree: by greedily splitting feature space into rectangles, we end up with a pretty diagram describing the logic behind the decision-making process — and a fairly useless model for all but the most basic of relationships.

However, recall that tree models lend themselves nicely to ensemble methods, and that Random Forest is a particularly powerful approach for aggregating a large number of individually weak trees into a strong predictive model.

Random Forest and Feature Importance

It might seem surprising to learn that Random Forests are able to defy this interpretability-accuracy tradeoff, or at least push it to its limit.

After all, there is an inherently random element to a Random Forest’s decision-making process, and with so many trees, any inherent meaning may get lost in the woods.

However, in precisely the same way that the trees work together to reduce predictive error, they work together to accurately represent feature importance.

To understand how they do so, it is first necessary to look at a natural method for interpreting feature importance within a single tree.

Feature Importance in a Single Tree

Recall that a single tree aims to reduce error in a locally optimal way as it splits feature space.

A classification tree uses a measure of impurity to score the current separation of classes, while a regression tree uses the residual squared error.

We’ll work with the idea of a classification tree to make our visualisations nice, but the regression case is the same after swapping out the error function.

Greedily reducing impurity (or entropy) is a fast and stable way to iteratively split regions and thereby reduce the classification error of a decision tree.

A natural way to measure the impact a feature has within this decision-making process is to look at the amount of entropy removed from the system by that feature: that is, the amount of information or accuracy that was gained by decisions made on the value of that feature alone.

The visualisation below demonstrates this process as we split feature space and build a decision tree.

We begin with an initial entropy value (D), calculate the reduction in entropy achieved at each split, and then sum up the change attributed to each feature variable across the tree.

We start with D = 1.31 and split feature space until we reach zero entropy. 1.011 of the reduction is due to decisions made on y, while x is only responsible for 0.299 of the reduction.
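The bookkeeping behind those numbers can be sketched in a few lines of Python. This is a minimal illustration, not a full tree builder: the toy labels and the perfect split are invented for the example. Each split’s reduction in weighted entropy (its information gain) is what gets credited to the feature the split was made on.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Reduction in weighted entropy when `parent` is split into `left` and `right`."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# Toy region: a split on some feature separates the two classes perfectly,
# so the full initial entropy of the region is credited to that feature.
parent = ["cat", "cat", "dog", "dog"]
gain = information_gain(parent, ["cat", "cat"], ["dog", "dog"])
print(round(gain, 3))  # 1.0: one full bit of entropy removed by this split
```

Summing these gains per feature over every split in the tree gives exactly the per-feature attributions described above.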

I’ve said it before and I’ll say it again: decision trees are weak classifiers.

A slight change in the training data could mean we end up with a wildly different tree, and thus different estimations for our feature importances.

Consider this slight alteration of the original data in our visualisation: now the reduction in entropy due to x is 1.011, while the reduction due to y is 0.299.

The variable importances have switched!

Reducing Variance with Many Trees

If this method for calculating feature importance is so volatile, it is not much use to us.

The problem is that the variance of the measure is too high, and this is where Random Forest comes in. Recall that Random Forest works in two steps: first it reduces the bias that individual trees have towards their locally optimal splitting strategy, essentially by doing away with that strategy and splitting over randomly selected features; then it aggregates many trees to reduce the overall variance of the model.

This reduction in variance stabilises the model, reduces its bias towards choices of training data, and leads to less variable and more accurate predictions.

If, as we do with our predictions, we aggregate our measures of feature importance over the trees in a Random Forest by taking the mean value for the change in entropy or accuracy attributed to each feature variable, we achieve exactly the same effect.
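That aggregation can be sketched directly. The per-tree entropy reductions below are invented for illustration; the point is that individual trees disagree wildly, while their mean (normalised to sum to 1, as is conventional) is far more stable.

```python
def aggregate_importances(per_tree):
    """Mean of per-tree importance vectors, normalised so the result sums to 1."""
    n_trees = len(per_tree)
    n_features = len(per_tree[0])
    means = [sum(tree[i] for tree in per_tree) / n_trees for i in range(n_features)]
    total = sum(means)
    return [m / total for m in means]

# Hypothetical entropy reductions for features (x, y) from three trees.
# Tree 1 and tree 2 disagree completely about which feature matters more.
per_tree = [[0.3, 1.0], [1.0, 0.3], [0.4, 0.9]]
print(aggregate_importances(per_tree))
```

No single tree is trusted; the forest-level importance is just the average attribution across all of them.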

Intuitively, splitting on random features gives every feature in the model a chance to show its decision making power at all possible points throughout the levels of the tree, and aggregation reduces the variability of the final outcome.

Essentially, in the same way that Random Forest increases the accuracy of our final predictions, it increases the accuracy of this measure of feature importance.

If you don’t believe me, here’s a tiny glimpse of evidence:

A, B, C are all i.i.d.

There are some definite issues with this approach when pitting continuous variables against categorical variables; continuous variables have way more ‘room’ to split on, and so can get a one-up over categorical variables without necessarily being more important.

This method also doesn’t really violate the idea of an interpretability-accuracy tradeoff, because it only tells us how variables stack up against each other; it doesn’t tell us what will happen to our decision if we increase or decrease our feature values (like good old linear regression does).

But hey, it’s useful and I never said Random Forests were perfect.

Regardless, now you know the power of the feature_importances_ attribute on your scikit-learn RandomForestClassifier… but you still can’t explain what a cat is to a computer.
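For completeness, here is a minimal sketch of that attribute in use. The dataset is synthetic, generated purely for illustration; scikit-learn’s `feature_importances_` is the impurity-based measure discussed above, averaged over the trees and normalised to sum to 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 4 features, only 2 of which are actually informative.
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance score per feature, normalised to sum to 1.
for i, imp in enumerate(forest.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```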

Oh well.

Later!

P.S. This blog is messy and unpolished, but I wanted to put something out. I’ll spend the next day or so tidying it up. Please email me for any desired points of clarification or feedback at thomasggrigg@gmail.com.

Blog posts in the works: Kernel Trick, Bias-Variance Tradeoff, and Deep Diving into Why Random Forests Work (which I touched on a little here).