Of course not!
For instance, suppose we added another feature to the grades formula above: hours worked (at a job outside of school).
If we broadly assume that the working hours are necessary for financial support, then the working hours would likely have a causal effect on study hours, and not vice versa.
Therefore, hours studied would be a causal descendant of hours worked.
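To make "causal descendant" concrete, here is a minimal sketch that stores an assumed causal graph as a dictionary and walks it from cause to effect. The variable names and the edges are hypothetical, chosen to match the assumption that work hours drive study hours; they are not taken from any fitted model.

```python
# Hypothetical causal graph for the grades example.
# Each key is a cause; its list holds the variables it directly affects.
causal_graph = {
    "hours_worked": ["hours_studied"],
    "hours_studied": ["grades"],
    "grades": [],
}


def descendants(graph, node):
    """Return every variable reachable from `node` along cause -> effect edges."""
    found = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(graph.get(current, []))
    return found


worked_descendants = descendants(causal_graph, "hours_worked")
studied_descendants = descendants(causal_graph, "hours_studied")
print(sorted(worked_descendants))
print(sorted(studied_descendants))
```

Under this assumed graph, hours studied shows up downstream of hours worked, and not the other way around.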
Partial Dependence Plots

One tool that is useful for causal interpretation is the partial dependence plot (PDP).
PDPs are used to glean a feature’s marginal impact on the model’s predicted outcomes.
To massively simplify, we can think of it as just plotting the average predicted outcome (vertical axis) for each potential value of a feature (horizontal axis).
For a PDP to be useful for causal inference, none of the other variables in the model can be a causal descendant of the variable in question.
Otherwise, interactions with those descendants can cloud the interpretation.
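The averaging described above can be sketched in a few lines of NumPy. This is a simplified illustration, not any particular library's implementation: `partial_dependence_curve` and `toy_predict` are hypothetical names, and the toy function stands in for whatever fitted model you are inspecting.

```python
import numpy as np


def partial_dependence_curve(predict, X, feature_idx, grid):
    """Average prediction when column `feature_idx` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # overwrite the feature for every row
        averages.append(predict(X_mod).mean())
    return np.array(averages)


def toy_predict(X):
    """Stand-in for a fitted model's predict method (an assumption)."""
    return 3.0 * X[:, 0] + X[:, 1] ** 2


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
grid = np.linspace(-2.0, 2.0, 5)  # candidate values for feature 0

pdp = partial_dependence_curve(toy_predict, X, feature_idx=0, grid=grid)
print(pdp)
```

Plotting `grid` on the horizontal axis against `pdp` on the vertical axis gives the PDP for that feature.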
For example, the following PDP shows that average predicted bike rentals generally go up with higher temperatures.
[Figure: I too don’t like to bike when it’s humid, or windy.]

Let’s dive into another example using that old chestnut, the Boston housing dataset.
The dataset presents a target of median home value in Boston (MEDV) and several features, such as per capita crime rate by town (CRIM) and nitric oxides concentration (NOX, in parts per 10 million (pp10m)).
It doesn’t take much domain expertise to inspect these and the rest of the available features to conclude that none could reasonably be a causal descendant of NOX.
It’s more likely that NOX is affected by one or more of the other features — “the proportion of non-retail business acres per town” (INDUS), for one — than for NOX to affect INDUS.
This assumption allows us to use a PDP for causal inference:

[Figure: When nitric oxide concentration increased past 0.67 pp10m, Bostonians said NO to higher house prices.]

Note that the vertical axis of this plot is centered so that the mean is 0.
What we may infer from the above is that median home values seem to be insensitive to NOX levels until around 0.67 pp10m, at which point they drop by about $2,000.
Individual Conditional Expectation Plots

But what if we are unsure of the causal direction of our features?
One tool that can help is an Individual Conditional Expectation (ICE) plot.
Instead of plotting the average prediction based on the value of a feature, it plots a line for each observation across possible values of the feature.
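Here is a rough NumPy sketch of that idea: one predicted curve per observation, with the average of those curves recovering the PDP. The names `ice_curves` and `toy_predict` are hypothetical, and the toy model (which includes a deliberate interaction) is an assumption for illustration only.

```python
import numpy as np


def ice_curves(predict, X, feature_idx, grid):
    """Return an (n_rows, n_grid) array holding one prediction curve per observation."""
    curves = np.empty((X.shape[0], len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # same value for every row
        curves[:, j] = predict(X_mod)
    return curves


def toy_predict(X):
    """Stand-in model: the effect of feature 0 flips sign with feature 1."""
    return X[:, 0] * np.where(X[:, 1] > 0, 1.0, -1.0)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
grid = np.linspace(0.0, 1.0, 3)

curves = ice_curves(toy_predict, X, feature_idx=0, grid=grid)
pdp = curves.mean(axis=0)  # averaging the ICE curves gives the PDP
```

In this toy setup the individual curves slope up for some rows and down for others, which is exactly the kind of detail the averaged PDP hides.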
Let’s dig into ICE plots by revisiting our NOX example.
[Figure: Nice ICE, baby.]

This ICE plot seems to support what we saw with the PDP: the individual curves all appear similar in shape and direction and, like before, drop down in ‘level’ around the same NOX value.
But we theorized earlier that NOX is a causal descendant of one or more of the other features in the dataset, so the ICE plot only serves to confirm what the PDP showed.

What if we explored a different feature, “weighted distances to five Boston employment centers” (DIS)?
One may argue that a feature such as CRIM could be a causal descendant of DIS.
If we look at an ICE plot for DIS:

We find a mix of patterns!
At higher levels of MEDV, there is a downward trend as DIS increases.
However, at lower levels of MEDV, we observe some curves showing DIS having a brief positive effect on MEDV, up to around DIS = 2 or so, and then becoming a negative effect.
The takeaway is that the ICE plot helps us identify that this feature is likely indirectly affecting the target, due to interactions with one or more other features.
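One rough way to put a number on a "mix of patterns" is to shift every ICE curve so it starts at zero and then measure how much the curves fan out. This is a sketch under assumptions, not an established test: `centered_ice`, `max_spread`, and both toy models are hypothetical.

```python
import numpy as np


def centered_ice(predict, X, feature_idx, grid):
    """ICE curves shifted so that every curve equals 0 at the first grid value."""
    curves = np.empty((X.shape[0], len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature_idx] = value
        curves[:, j] = predict(X_mod)
    return curves - curves[:, [0]]


def max_spread(curves):
    """Largest standard deviation across observations at any grid value."""
    return curves.std(axis=0).max()


def additive_model(X):
    """Toy model with no interaction between the two features."""
    return X[:, 0] + X[:, 1]


def interacting_model(X):
    """Toy model where the effect of feature 0 depends on feature 1."""
    return X[:, 0] * X[:, 1]


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
grid = np.linspace(0.0, 2.0, 5)

spread_additive = max_spread(centered_ice(additive_model, X, 0, grid))
spread_interacting = max_spread(centered_ice(interacting_model, X, 0, grid))
print(spread_additive, spread_interacting)
```

Centered curves that sit on top of each other suggest the feature acts on its own; curves that fan out suggest its effect runs through interactions with other features.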
For another application of ICE plots, let us consider an example using the ubiquitous “auto mpg” dataset.
The following plot shows that acceleration has some causal effect on MPG, but likely through interactions with other features.
Notice the difference in the behavior of the lines at the top (somewhat increasing in MPG), middle (decreasing), and lower third (increasing again) of the plot!

If we look at the other features in the dataset, we find one for origin, which is the geographical origin of the auto.
This feature is arguably a causal ancestor of all the other features: you need to have a place to build the car before you can build it!
(A gross oversimplification, I know, but still.)
As such, its causal relationship with MPG will likely involve many interactions with other features.
Can ICE plots still be useful here, even though the feature is ‘far upstream’?
You bet!
Let’s start by looking at a trusty boxplot:

[Figure: American cars guzzle gas. At least in this dataset.]
This plot shows a very noticeable difference in MPG between autos from these three regions.
But does this tell the whole story?

Consider the following two ICE plots.
The first shows [US (1) or Europe (0)] versus MPG:

[Figure: US (1) vs. Europe (0)]

The second shows [Japan (1) or Europe (0)] versus MPG:

[Figure: Japan (1) vs. Europe (0)]

While at first, via the boxplot, it seemed that there is a significant difference in MPG attributable to origin, the ICE plots show that the pure impact might be a little smaller when considering interactions with other features: the slopes of most of these lines are flatter than the boxplot would have us imagine.
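The gap between what a boxplot shows and what ICE slopes show can be reproduced on synthetic data. The sketch below does not use the auto mpg dataset: the data-generating numbers, the variable names, and the plain least-squares model are all assumptions chosen so that origin affects MPG mostly through weight.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Synthetic data (assumed): US cars are heavier, and weight drives most of MPG.
origin_us = rng.integers(0, 2, size=n).astype(float)
weight = 2.5 + 1.0 * origin_us + rng.normal(0.0, 0.3, size=n)
mpg = 45.0 - 6.0 * weight - 1.0 * origin_us + rng.normal(0.0, 1.0, size=n)

# What a boxplot compares: the raw difference in group means.
raw_gap = mpg[origin_us == 1].mean() - mpg[origin_us == 0].mean()

# Fit a simple linear model with an intercept, origin, and weight.
features = np.column_stack([np.ones(n), origin_us, weight])
coef, *_ = np.linalg.lstsq(features, mpg, rcond=None)


def predict(F):
    """Predictions from the fitted least-squares coefficients."""
    return F @ coef


# ICE-style comparison: flip origin for every car, hold everything else fixed.
as_us = features.copy()
as_us[:, 1] = 1.0
as_other = features.copy()
as_other[:, 1] = 0.0
ice_slopes = predict(as_us) - predict(as_other)
ice_gap = ice_slopes.mean()

print(raw_gap, ice_gap)
```

Because the model holds weight fixed while flipping origin, the per-car change in predicted MPG comes out much smaller than the raw gap between the groups, which mirrors the flatter lines in the ICE plots.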
Boxing It Up

When dealing with so-called black-box algorithms, we need to use smart methods to interpret the results.
One way to do that is by inferring causality.
Some tools can help with that:

- Partial Dependence Plots (PDPs)
- Individual Conditional Expectation Plots (ICE Plots)
- Your, or your team’s, domain knowledge!

While we increasingly have many fancy tools to employ in the name of data science, they can’t replace hard-earned domain knowledge and well-honed critical thinking skills.

Thanks for reading!
Work files here.
Please feel free to reach out!