Understanding how to explain predictions with “explanation vectors”

Pol Ferrando, Jan 10

In a recent post I introduced three existing approaches to explain individual predictions of any machine learning model.

After the posts focused on LIME and Shapley values, now it’s the turn of Explanation vectors, a method presented by David Baehrens, Timon Schroeter and Stefan Harmeling in 2010.

As we have seen in the mentioned posts, explaining a decision of a black box model implies understanding what input features made the model give its prediction for the observation being explained.

Intuitively, a feature has a lot of influence on the model decision if small variations in its value cause large variations of the model’s output, while a feature has little influence on the prediction if big changes in that variable barely affect the model’s output.

Since a model is a scalar function, its gradient points in the direction of the greatest rate of increase of the model’s output, so it can be used as a measure of features’ influence.

In classification tasks, if c is the predicted class for an instance, the gradient of the conditional probability P(Y≠c|X=x) evaluated at the instance points in the direction in which the data point has to be moved to change its predicted label, so it provides a qualitative understanding of the most influential features in the model decision.

Similarly, although Baehrens et al. left the generalization to regression tasks to future work, it seems natural to use the gradient of the regression function, because it points in the direction in which small local variations of the data point produce the largest variations of the model’s output.

Therefore, explanations can be defined as gradient vectors that characterize how a data point has to be moved to change its prediction, which is the reason why they are called explanation vectors.
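The intuition above can be sketched with a finite-difference gradient. This is only an illustration: `toy_model` is a hypothetical scalar function, not a model from the original paper, and central differences are just one assumed way to approximate a gradient when it is not available in closed form.

```python
import numpy as np

def numerical_gradient(model, x, eps=1e-5):
    """Central-difference estimate of the gradient of a scalar model at x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        grad[j] = (model(x + step) - model(x - step)) / (2 * eps)
    return grad

# Hypothetical toy model: strong dependence on feature 0,
# weak dependence on feature 1, no dependence on feature 2.
def toy_model(x):
    return 3.0 * x[0] + 0.1 * x[1] ** 2

point = np.array([1.0, 2.0, 5.0])
gradient = numerical_gradient(toy_model, point)
# Components are approximately 3.0, 0.4 and 0.0, so feature 0 is the most influential.
```

The magnitude of each component ranks the features by local influence, and the sign says whether increasing the feature raises or lowers the output.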

Method for classification

Baehrens et al. propose to explain a prediction of any classifier by the gradient, at the point of interest, of the conditional probability of the class not being the predicted one given its feature values.

Therefore, the explanation is a vector (called explanation vector) which characterizes how the data point has to be moved to change the predicted label of the instance.

This definition can be directly applied to probabilistic (or soft) classifiers, whose outputs can be interpreted as probabilities, because they explicitly estimate the class conditional probabilities.

However, hard classifiers (e.g., support vector machines) directly estimate the decision rule, that is, they assign the predicted class label without producing a probability estimation.

For this type of classifier, the method proposes to use a kernel density estimator (also called a Parzen-Rosenblatt window) to estimate the class conditional probabilities, so that the hard classifier can be approximated by a probabilistic one.

Probabilistic classifiers

Let X=ℝᵖ be the feature space.

Let X=(X₁,…,Xₚ)∈ℝᵖ be a vector of p continuous random variables and Y be a discrete random variable with possible values (class labels) in {1,…,k}.

Let P(X,Y) be the joint distribution, which is unknown most of the time.

Let f: ℝᵖ → [0,1]ᵏ such that f₁(x)+f₂(x)+…+fₖ(x)=1, ∀ x∈X, be the “probabilistic” classifier being explained.

Furthermore, we assume that every component fᵢ of f, i∈{1,…,k}, is first-order differentiable with respect to x over the entire input space.

Finally, let y∈ℝᵖ be the observation being explained.

Each component fᵢ of a prediction f(x) is an estimation of the conditional probability P(Y=i|X=x).

“Probabilistic” classifiers can be turned into “hard” classifiers using the Bayes rule, which is the optimal decision rule for the 0-1 loss function:

$$\hat{y}(x) = \arg\min_{i \in \{1,\dots,k\}} \{1 - f_i(x)\} = \arg\max_{i \in \{1,\dots,k\}} f_i(x)$$

Note that {1-fᵢ(x)} is an estimation of P(Y≠i|X=x).

That is, the predicted class is the one which has the highest probability.

The explanation vector of a data point y∈ℝᵖ is defined as the gradient at x=y of the conditional probability that Y is not the predicted class given X=x:

$$\zeta(y) = \nabla_x P(Y \neq \hat{y}(y) \mid X = x)\,\Big|_{x=y} \approx \nabla_x \{1 - f_{\hat{y}(y)}(x)\}\,\Big|_{x=y}$$

Thus, the explanation vector ζ(y) is a p-dimensional vector that points in the direction in which the data point has to be moved to change its predicted class.

A positive (negative) sign of a component implies that increasing the corresponding feature would lower (raise) the probability that y is assigned to its predicted class ŷ(y).

Furthermore, the larger the absolute value of a component, the more influential that feature is in the class label prediction.
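As a minimal sketch of this definition, the explanation vector of a probabilistic classifier can be approximated by differentiating 1 − f_c(x) numerically, where c is the class predicted at the point being explained. The softmax model, its weights `W` and `b`, and the helper names below are hypothetical choices made for illustration; any function returning class probabilities could be plugged in instead.

```python
import numpy as np

def softmax_classifier(x, W, b):
    """Hypothetical probabilistic classifier: softmax over linear scores."""
    scores = W @ x + b
    scores = scores - scores.max()  # shift for numerical stability
    expo = np.exp(scores)
    return expo / expo.sum()

def explanation_vector(predict_proba, y, eps=1e-5):
    """Central-difference gradient of 1 - f_c(x) at x = y, with c the class predicted at y."""
    y = np.asarray(y, dtype=float)
    c = int(np.argmax(predict_proba(y)))  # Bayes rule: most probable class
    zeta = np.zeros_like(y)
    for j in range(y.size):
        step = np.zeros_like(y)
        step[j] = eps
        p_plus = 1.0 - predict_proba(y + step)[c]
        p_minus = 1.0 - predict_proba(y - step)[c]
        zeta[j] = (p_plus - p_minus) / (2 * eps)
    return c, zeta

W = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # 3 classes (0-based), 2 features
b = np.zeros(3)
y = np.array([1.0, 0.5])
predicted, zeta = explanation_vector(lambda x: softmax_classifier(x, W, b), y)
# predicted is class 0; zeta[0] is negative (increasing feature 0 reinforces class 0)
# and zeta[1] is positive (increasing feature 1 pushes the point away from class 0).
```

Note that the classes are indexed from 0 here, whereas the text labels them 1,…,k.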

One issue with explanation vectors is that they can become a zero vector.

If this happens because of a local maximum or minimum, we can learn from the eigenvectors of the Hessian the features that are relevant for the model decision even though we will not obtain an orientation for them.

However, if the classifier outcome is a probability distribution which is flat in some neighborhood of y, no meaningful explanation can be obtained.

In summary, the explanation vectors method is well suited to classifiers whose estimated class probability function is not completely flat.
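The zero-gradient case at a local extremum can be illustrated with a small sketch. The function `toy_probability` is a hypothetical stand-in for a class probability that peaks along feature 0 and ignores feature 1, and the finite-difference Hessian is an assumed approximation rather than part of the original method.

```python
import numpy as np

def numerical_hessian(func, x, eps=1e-4):
    """Central-difference estimate of the Hessian of a scalar function at x."""
    x = np.asarray(x, dtype=float)
    p = x.size
    hessian = np.zeros((p, p))
    for a in range(p):
        for b in range(p):
            step_a = np.zeros(p)
            step_b = np.zeros(p)
            step_a[a] = eps
            step_b[b] = eps
            hessian[a, b] = (
                func(x + step_a + step_b) - func(x + step_a - step_b)
                - func(x - step_a + step_b) + func(x - step_a - step_b)
            ) / (4 * eps ** 2)
    return 0.5 * (hessian + hessian.T)  # enforce symmetry

# Hypothetical probability surface: maximal at x[0] = 0, constant in x[1].
def toy_probability(x):
    return np.exp(-x[0] ** 2)

point = np.array([0.0, 0.7])  # the gradient is zero here
eigvals, eigvecs = np.linalg.eigh(numerical_hessian(toy_probability, point))
relevant = np.abs(eigvals) > 1e-3  # directions with non-negligible curvature
```

The eigenvector with a clearly non-zero eigenvalue lies along feature 0, which marks that feature as relevant, but its sign is arbitrary, so no orientation is obtained.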

In the case of binary classification, Baehrens et al. define the explanation vector as the local gradient of the probability function P(Y = 1 | X = x) of the learned model for the positive class.

Formally, the explanation vector of a data point y∈ℝᵖ is:

$$\zeta(y) = \nabla_x f(x)\,\Big|_{x=y}$$

where f: ℝᵖ→ [0,1] is a binary classifier.

Note that f(x) is an estimation of P(Y = 1 | X = x).

Therefore, the explanation vector points in the direction of the steepest ascent from the data point to higher probabilities for the class 1.

Also, the sign of each component indicates whether the predicted probability of being 1 would increase or decrease when the corresponding feature of y is increased locally.

Finally, note that the general definition of the explanation vector differs from the binary one when the predicted class is 1. In that case the negative vector -ζ(y) may be especially helpful, because it indicates how to move the data point so that it is assigned to class 0.
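For a concrete binary case, consider a logistic model f(x) = sigmoid(w·x + b), whose gradient has the closed form f(x)(1 − f(x))w. The sketch below assumes such a model with hypothetical weights; it is not the Gaussian Process Classifier used in the paper.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def binary_explanation_vector(w, b, y):
    """Gradient of f(x) = sigmoid(w.x + b) evaluated at x = y."""
    p = sigmoid(np.dot(w, y) + b)
    return p * (1.0 - p) * w

w = np.array([1.5, -2.0, 0.0])  # hypothetical weights
b = 0.0
y = np.array([0.0, 0.0, 3.0])
zeta = binary_explanation_vector(w, b, y)
# Components: 0.375, -0.5, 0.0
```

Increasing the first feature raises the probability of class 1, increasing the second lowers it, and the third has no local influence. Negating `zeta` gives the direction toward class 0.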

The picture below shows an example of how explanation vectors are applied to explain predictions of a binary classifier: we have labeled training data (panel (a)) that we use to train a model (in this case, a Gaussian Process Classifier), which assigns a probability of being in the positive class to every data point of the feature space (panel (b)).

Then, we compute explanation vectors of the data points we want to explain in order to understand what features made the model give its predictions.

For instance, in panels (c) and (d) we can see that explanation vectors along the hypotenuse and at the corners of the triangle produced by the data have both components different from zero, while explanation vectors along the edges only have one non-zero component.

Thus, both explanatory variables influence the model decision for observations along the hypotenuse and at the corners of the triangle, while only one feature was relevant to the model prediction for observations along the edges.

Furthermore, the length of the explanation vectors (panel (c)) represents the degree of importance of the relevant features, so we can compare explanations between observations.

Example of how explanation vectors are applied to model predictions learned by a Gaussian Process Classifier (GPC), which provides probabilistic predictions.

Panel (a) shows the training points and their labels (class +1 in red, class -1 in blue).

Panel (b) shows the trained model, which assigns a probability of being in the positive class to every data point.

Panels (c) and (d) show the explanation vectors and the direction of the explanation vectors, respectively, together with the contour map of the model.

Source: http://www.jmlr.org/papers/volume11/baehrens10a/baehrens10a.pdf

Hard classifiers

Let X=ℝᵖ be the feature space.

Let X=(X₁,…,Xₚ)∈ℝᵖ be a vector of p continuous random variables and Y be a discrete random variable with possible values (class labels) in {1,…,k}.

Let P(X,Y) be the joint distribution, which is unknown.

Let f: X=ℝᵖ → {1,…,k} be the “hard” classifier being explained.

Finally, let y∈ℝᵖ be the observation being explained.

Let x¹,…, xⁿ∈ℝᵖ be training points with their respective labels.

Let Iᵢ = {j∈{1,…,n} | f(xʲ) = i} be the set of indexes of the training set whose predicted labels are i.

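For every class i, the conditional probability P(Y=i|X=x) can be approximated by the following quotient of kernel density estimators:

$$\hat{P}_\sigma(Y = i \mid X = x) = \frac{\sum_{j \in I_i} k_\sigma(x - x^j)}{\sum_{j=1}^{n} k_\sigma(x - x^j)}$$

where k_σ is a kernel with bandwidth σ (a Gaussian kernel in Baehrens et al.).

Consequently, we can define a classifier whose components are estimations of P(Y=i|X=x):

$$\hat{f}_\sigma(x) = \big(\hat{P}_\sigma(Y = 1 \mid X = x), \dots, \hat{P}_\sigma(Y = k \mid X = x)\big)$$

Note that we can use the trained classifier f to generate as much labeled data as we want for constructing this approximated (probabilistic) classifier.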

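Now that we have a probabilistic classifier (which is an approximation of our original hard classifier), we can use the Bayes rule to get a predicted class as we did in the previous section:

$$\hat{y}_\sigma(x) = \arg\max_{i \in \{1,\dots,k\}} \hat{P}_\sigma(Y = i \mid X = x)$$

where P̂_σ denotes the kernel estimate of the class conditional probabilities with bandwidth σ.

Then, we can define an estimated explanation vector of a data point y∈ℝᵖ as follows:

$$\hat{\zeta}(y) = \nabla_x \hat{P}_\sigma(Y \neq f(y) \mid X = x)\,\Big|_{x=y} = -\nabla_x \hat{P}_\sigma(Y = f(y) \mid X = x)\,\Big|_{x=y}$$

where f: ℝᵖ →{1,…,k} is our “hard” classifier.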

Note that the actual predicted class f(y) is used instead of the predicted class of the approximated probabilistic classifier.

This is done to ensure that the estimated explanation vector points in the direction to where the observation has to be moved to change the actual label predicted by f, instead of the one assigned by the approximated classifier (which could be different).
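The construction for hard classifiers can be sketched as follows. The sketch assumes a Gaussian kernel, a hypothetical threshold rule `hard_classifier`, a regular grid of points labeled by that rule, and a central-difference gradient; none of these specific choices is prescribed by the text.

```python
import numpy as np

def parzen_class_probs(x, train_X, train_labels, n_classes, sigma):
    """Kernel estimate of P(Y=i | X=x) from points labeled by the hard classifier."""
    sq_dist = np.sum((train_X - x) ** 2, axis=1)
    # Gaussian kernel; its normalizing constant cancels in the quotient.
    weights = np.exp(-0.5 * sq_dist / sigma ** 2)
    per_class = np.array([weights[train_labels == i].sum() for i in range(n_classes)])
    return per_class / weights.sum()

def estimated_explanation_vector(classifier, y, train_X, train_labels,
                                 n_classes, sigma, eps=1e-5):
    """Central-difference gradient of the estimated P(Y != f(y) | X=x) at x = y."""
    y = np.asarray(y, dtype=float)
    c = classifier(y)  # actual label from the hard classifier, not the approximation
    zeta = np.zeros_like(y)
    for j in range(y.size):
        step = np.zeros_like(y)
        step[j] = eps
        p_plus = 1.0 - parzen_class_probs(y + step, train_X, train_labels, n_classes, sigma)[c]
        p_minus = 1.0 - parzen_class_probs(y - step, train_X, train_labels, n_classes, sigma)[c]
        zeta[j] = (p_plus - p_minus) / (2 * eps)
    return zeta

# Hypothetical hard classifier: class 1 if the first feature is non-negative, else class 0.
def hard_classifier(x):
    return int(x[0] >= 0.0)

axis = np.linspace(-1.0, 1.0, 21)
grid = np.array([[a, b] for a in axis for b in axis])
labels = np.array([hard_classifier(p) for p in grid])  # labels generated by f itself
y = np.array([0.2, 0.0])
zeta_hat = estimated_explanation_vector(hard_classifier, y, grid, labels, 2, sigma=0.3)
# zeta_hat[0] is negative: decreasing the first feature moves y toward the other class.
# zeta_hat[1] is essentially zero: the second feature plays no role in this rule.
```

Classes are indexed from 0 here. If a query point is very far from every labeled point, all kernel weights can underflow to zero and the quotient becomes undefined, so in practice the labeled points should cover the region around y.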

Finally, the single hyperparameter σ should be chosen such that the “new” predicted class labels (i.e., the classes finally assigned by the approximated probabilistic classifier) are as similar as possible to the actual predictions made by the original hard classifier f on a test set.

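Formally, if z¹, …, zᵐ∈ℝᵖ are test data points and f(z¹), …, f(zᵐ) are their respective labels predicted by the classifier, then the chosen value for σ is:

$$\hat{\sigma} = \arg\min_{\sigma} \sum_{j=1}^{m} \mathbb{1}\{ f(z^j) \neq \hat{y}_\sigma(z^j) \}$$

where ŷ_σ(zʲ) is the class assigned to zʲ by the approximated probabilistic classifier with bandwidth σ.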
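A simple way to carry out this selection is a search over a finite list of candidate bandwidths, counting how often the approximated classifier disagrees with the hard classifier on the test points. The sketch below assumes a Gaussian kernel, a hypothetical linear rule `hard_classifier`, randomly drawn points, and an arbitrary candidate list; all of these are illustrative assumptions.

```python
import numpy as np

def parzen_predict(x, train_X, train_labels, n_classes, sigma):
    """Class assigned by the kernel-based approximation with bandwidth sigma."""
    sq_dist = np.sum((train_X - x) ** 2, axis=1)
    weights = np.exp(-0.5 * sq_dist / sigma ** 2)  # Gaussian kernel
    per_class = [weights[train_labels == i].sum() for i in range(n_classes)]
    return int(np.argmax(per_class))

def select_sigma(classifier, train_X, test_X, n_classes, candidates):
    """Pick the candidate sigma with the fewest disagreements with the hard classifier."""
    train_labels = np.array([classifier(p) for p in train_X])
    disagreements = []
    for sigma in candidates:
        mismatches = sum(
            parzen_predict(z, train_X, train_labels, n_classes, sigma) != classifier(z)
            for z in test_X
        )
        disagreements.append(mismatches)
    best = candidates[int(np.argmin(disagreements))]
    return best, disagreements

# Hypothetical hard classifier with a diagonal decision boundary.
def hard_classifier(x):
    return int(x[0] + x[1] >= 0.0)

rng = np.random.default_rng(0)
train_X = rng.uniform(-1.0, 1.0, size=(300, 2))
test_X = rng.uniform(-1.0, 1.0, size=(100, 2))
candidates = [0.05, 0.1, 0.3, 1.0, 3.0]
best_sigma, disagreements = select_sigma(hard_classifier, train_X, test_X, 2, candidates)
```

A finer candidate grid or a one-dimensional optimizer could replace the list, at the cost of more evaluations of the hard classifier.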