A Practical Guide to Interpreting and Visualising Support Vector MachinesSVM’s are often considered ‘Black Boxes’.
In this article we cover techniques to visualise learned SVM models and their performance on real world data.
Hugo DolanBlockedUnblockFollowFollowingJan 12Image Shot by Hugo DolanAbout the authorHugo Dolan is an undergraduate Financial Mathematics student at University College Dublin.
This is mostly based and motivated by recent data analytics and machine learning experiences in the NFL Punt Analytics Kaggle Competition and the being part of the team who won the Citadel Dublin Data Open, along with material from Stanford’s CS229 online course.
This article contains the following sections:Introduction to Linear Models, SVM’s and KernelsInterpreting high dimensional engineered feature spaces which are utilising SVM kernels…Techniques for evaluating the performance of high dimensional classification boundariesPractical options for working with large class imbalancesHow much data you need to train an SVMSome assumptions I’ll make:This article will assume familiarity with basic ML models such as Logistic and Linear Regression.
It will also assume you know how to draw some of the graphs I discuss (I have a guide for this if your stuck!).
We will also assume that you know what a decision function and objective / cost function is and that you have some basic linear algebra knowledge.
If not its still worth a read you can always bookmark this and come back later for a deeper pass on some of the more mathematical parts of this article.
Introduction to Linear Models, SVM’s and KernelsIn machine learning linear classifiers are any model in which there is a single hypothesis function which maps between model inputs and predicted outputs.
Many models such as Logistic regression, Naive Bayes and Discriminant Analysis to name a few, are all examples of linear models.
The primary advantage of linear models over neural networks (a non linear model) is that the feature weights directly correspond to the importance of the feature within the model.
Thus it is easy to understand what the model has ‘learned’.
Training a L1 Regularised Regression model it is immediately obvious that most of our features in our dataset are totally irrelevant to predicting our output.
It is clear that features 0,1 make positive contributions to the model, whilst the presence of features 2,3,4 in a given example result in negative contributions to the outputAt the core of any linear model is a dot product between the input example and the parameter / weight vector.
In the case of linear regression this is the entire hypothesis function.
Where as logistic regression feeds the dot product through a sigmoid function such that the output is between 0 and 1 and hence is suitable for binary classification problems.
When considering classification problems the downfall of linear models is that ultimately the decision boundary is a straight line, plane or hyperplane with coefficients equal to the models weights / parameters and thus can only classify data which is linearly separable, which could be a big limitation when working on more complex analytics problems.
As we can see the simple linear model cannot separate the two ‘Noisy Hyperbola’ as it can only fit a ‘straight’ plane / line through the data.
The second example uses a non linear model (actually a kernel trick, we’ll get to this soon)The Support Vector Machine (SVM) is the only linear model which can classify data which is not linearly separable.
You might be asking how the SVM which is a linear model can fit a linear classifier to non linear data.
Intuitively with a simple linear regression model we may manually engineer x, x², x³,… features to attempt to achieve a fit to a non linear set of data points.
Whilst feature X is the only independent variable we have to predict y, which has an inherently non linear relationship with x, we can engineer x² and x³ features to improve our fit to y.
Transferring this intuition to our SVM, when we engineer the x² feature we are essentially multiplying feature x by itself.
So suppose we engineer features from our dataset by multiplying combinations of features x1,x2,x3… together, then theoretically we *could* end up with a space in which your engineered features are linearly separable.
Taking the previous simple example look at how the data below is transformed to an almost linear trend in the x³ feature space.
As a further intuition towards the previous example, we can see by transforming the x-axis from the original feature space of x up to the feature space of x³ how the model could be seen as a linear relationship between x³ and y.
Unfortunately to achieve this with a complex dataset requires creating more than just a 3 dimensional space (features x, x²,x³) but in fact extremely high dimensional feature spaces which would be computationally very expensive to calculate for every example in our dataset.
Below I show an example of a function ø(x) which takes our original features x and combines them to create many 2nd order polynomial features.
Before we proceed: I will use the notation of x to denote data points / training examples with superscripts to denote a particular data point and subscripts to denote a particular feature.
This is a typically high dimensional space, if we had 100 features originally then this function would produce 100 * 100 engineered features.
This is computationally expensive, in this case Big-O(n²) time complexity, think of having to write two nested for-loops to generate all the combinations produced by ø(x).
Luckily for us there is a way out of this computational complexity conundrum!.When we derive the optimisation problem for an SVM (The complex looking formula which tells us how to derive and update our weights for maximisation during coordinate ascent) it turns out that our feature vectors for our training input x appears in only a single place within the entire optimisation formula (highlighted in red).
This dot product was for our original feature space so now lets replace it with our engineer feature space using our function ø.
So how does this help reduce the computational complexity?.Well by definition of the dot product, we take the i-th entry of ø(x(i)) and multiply it by the i-th entry of ø( x(j)) and then sum all of these up to obtain a single scaler.
Applying this we get:As if by magic we can remove the need to compute ø(x) completely by simple algebraic manipulation by the kernel trick.
Now we have all the benefits of a high dimensional feature space without the additional computational complexityThe kernel trick is a really simple rearrangement of the original equation we can see that we’ve totally removed ø(x) and only have to perform computations using our original input features, but still have the effect of computing a high dimensional space.
All we have to do now is substitute the dot product involving ø(x) with the kernel equivalent ( K(x^i, x^j) ):Simple substitution, note that our Kernel is using x and z here just to remove the superscript notation.
Similarly when we want to use our model to make predictions we never explicitly calculate the weights for our high dimensional space but instead use the kernel trick to make our predictions:In summary we can use the kernel trick to transform a non linear data set into a data set which is linearly separable, just in a higher dimensional space.
Sklearn come prepackage with a number of kernels in the SVC implementation, including Radius Basis Kernel (RBF) and Polynomial Kernels, each have their own hyper parameters which can be adjusted experimentally using cross validation to achieve the best results.
A slight hitch, interpreting a high dimensional engineered feature space…So remember how we said the great benefit of a linear model was that the weights / parameters of the model could be interpreted as the importance of the features.
Well thats gone once we engineer a high or infinite dimensional feature set, the weights of the model implicitly correspond to the high dimensional space which isn’t useful in aiding our understanding.
Instead what we can do is fit a logistic regression model which estimates the probability of label y being 1, given the original features, where f(x) is the SVM decision function:If this looks familiar it is, think logistic regression!We use maximum likelihood estimation to fit the parameters of this logistic regression model, the technique is called Platt Scaling, the original paper  is definitely worth reading if your curious about the inner workings.
So how does this help us understand how the SVM works?.Well we simply fit the model and select a point in our dataset to evaluate it at, then perturb one feature at a time through a range of values, whilst keeping the other features fixed.
We can use this to draw a graph of the sensitivity of the model to each feature.
SKlearn comes with this feature built into the SVC model, you just have to make sure the probability=true, when initialising and then use the clf.
predict_proba(X) function to obtain the probabilities.
In practice I found that rather than just evaluating round a single point it is often better to sample a collection of relevant points eg.
40 negative examples and average the probability distribution by feature, to get something more representative.
Here’s an example I did when I was working on the NFL Punt Analytics Kaggle Competition, examining effects of various factors on concussions:I took all the negative examples and averaged the probabilities across them, i’ve highlighted the areas in red for each feature where players were most likely to receive a concussion.
A trick if you’ve a bunch of one hot encoded variables like player role, is to aggregate them into a bar chart and just look at the net change in probability between when the feature is present and not present.
There’s also a great example applied of this applied to marketing data  which you can find here.
I’d also like to thank Georgi who took the time to answer some of my questions about the paper.
Techniques for evaluating performanceWhen your dealing with a high dimensional models involving an SVM, it would be nice to be able to visualise how the model is classifying data points without purely relying on metrics like F1 scores or ROC AUC.
Whilst some may use techniques such as principle component analysis to visualise classification, in doing so we loose dimensions of our feature space and thus distorts the visual we are looking to achieve.
A nice technique I found is called ‘Histogram of projects’ , it involves graphing the distribution of output of the SVM decision function for your training and test sets.
The decision function is easy to obtain in SKlearn’s SVC implementation simply call decision_function(X).
You will want to keep track of your datasets labels so you can colour code your histogram of projections as below:A histogram of projection is fairly simple to interpret.
The histogram x axis identifies the distance of a specific training example from the decision boundary of the SVM (indicated by the central dashed line).
An SVM has a margin of separation equal to 1 either side of the the decision boundary, this is an enforced constraint of the dual optimisation problem (The ‘Support Vectors’ are data points which lie along these margins).
You’ll notice in the above model there is some leakage into the margin areas and indeed cross over from one class into the class on the opposite side of the decision boundary.
This is because we’ve set the regularisation hyper-parameter C > 0 (It allows a tradeoff between some misclassification and minimising the SVM objective function).
Despite working with high dimensional feature spaces, this chart visualises successfully visualises the decision boundary region and all classification without loss of dimensionality.
All the metrics seen in a confusion matrix (ie number of True Positives, False Positives, True Negatives and False Negatives) can also be seen through the histogram.
It also enables us to observe whether a model is generalising well to the test set.
If the test set has a similar distribution of decision function outputs to the training set then we might say that the model has good performance on generalisation.
The model may also be used to determine whether given the selected hyper-parameters whether the data set is linearly separable.
Practical Options For Dealing With Imbalanced DataWhen a dataset has a disproportionate number of examples for one class relative to another we say it is imbalanced.
An example of a real world dataset which has an imbalance of over 3000:1This is a problem if we want to build a ML model to predict occurrences of the minority, as we can achieve high levels of accuracy by simply misclassifying all minority examples as the majority class.
This occurs commonly in real world data, whether it is identifying malignant tissue, credit card fraud or concussions in a sport, due to the relative rarity of the incidents we wish to correctly identify.
There are two commonly accepted practices for rectifying ML models working with imbalanced data:Over-sampling the minority class / under-sampling the majority classIncreasing the weighting for minority examples in the cost functionOption 1 : SMOTEThere are two ways in which we can resample data, either by removing existing examples (Under-sampling) or adding new examples (Over-sampling).
The most commonly accepted method is to oversample the minority class using an algorithm called SMOTE (Synthetic Minority Oversampling Technique) Its much simpler than the name suggests, for each minority point in the dataset, it selects k nearest other minority examples (typically 5) and interpolates randomly new minority examples along the line ‘joining’ existing minority examples.
This is a reasonable thing to do as we are simply making the assumption that by interpolating between similar existing examples we will obtain new examples of the same class.
We can see how SMOTE has generated new data points along the lines of existing pointsThis tends to improve the performance of the model significantly and helps to generalises the decision boundary for minority examples.
Option 2 : Introducing weights into the objective functionAnother procedure which can be employed is to assign higher weights in the objective function for misclassification of minority examples.
This will ‘incentivise’ the algorithm to correctly classify minority classes.
I’ve no personal experience using this method, but it can be use in conjunction with option 1.
This is a link to a good paper here  which details many approaches to dealing with the class imbalanceHow much data do I need to train an SVMA reasonable rule of thumb is around 10x as many training examples as features as a minimum.
If you have a large amount of training data your best off using fewer than 50,000 training examples as the SVC implementation in sklearn has O(n³) complexity, meaning the time of convergence to a solution grows cubicly with the number of training examples, it can get pretty slow on even a decent laptop or kaggle container.
Its often worthwhile training on a smaller dataset first, and tuning your model’s hyper-parameter’s.
You can keep a small cross validation test set for model selection.
You could be surprised at how well your model generalises when you test on your remaining dataset, despite the small proportion of actual data you may have used.
Note: If your new to this, a tip is to use sklearn’s train test split module and to fix the random seed so that your results are reproducible if you happen to go back to edit some earlier code and rerun the training / model selection procedure.
ConclusionI hope this article has been helpful, if you’ve any comments / questions leave them below, i’ll try to answer them as best I can.
You can also find me on LinkedIn.
Paper References Nalbantov, Georgi & C.
Bioch, Jan & Groenen, Patrick.
Solving and Interpreting Binary Classification Problems in Marketing with SVMs.
 Cherkassky, Vladimir & Dhar, Sauptik.
Simple Method for Interpretation of High-Dimensional Nonlinear SVM Classification Models.
 Platt, J.
Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods.
Available at: http://citeseer.
 Batuwita, Rukshan & Palade, Vasile.
Class Imbalance Learning Methods for Support Vector Machines.
 Chawla, N.
, et al.
“SMOTE: Synthetic Minority Over-Sampling Technique.
” Journal of Artificial Intelligence Research, www.