+ is 1.5 and that is exactly the value of ζ in this case.

Understanding C

As I mentioned above, C is a hyperparameter and it can be tuned to avoid both overfitting and underfitting.

As C increases, the tendency of the model to overfit increases.
As C decreases, the tendency of the model to underfit increases.

Why we use +1 and -1 for the support vector planes

It is not necessary that we always choose +1 and -1.

So let’s choose any arbitrary value of k here.

Only restriction is that it should be greater than 0.

We can’t choose distinct values for our planes, i.e. we can’t take +k1 and -k2, as we want our positive and negative planes to be equally distant from our separating plane.

Now our updated margin is 2*k / ||W||. For k = 5 we get 10/||W||.

So now we will use 10/||W|| instead of 2/||W||. That is the only difference, and since k is a constant here it doesn’t really matter what value we choose, as it will not affect our optimisation problem.
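As a quick numeric sketch of this (the weight vector below is made up for illustration), the distance between the planes W^T*X + b = +k and W^T*X + b = -k is 2k/||W||, so changing k only rescales the margin by a constant:

```python
import numpy as np

w = np.array([3.0, 4.0])  # hypothetical weight vector, ||w|| = 5

for k in (1, 5):
    # Margin between the planes w.x + b = +k and w.x + b = -k
    margin = 2 * k / np.linalg.norm(w)
    print(f"k = {k}: margin = {margin}")  # 0.4 for k = 1, 2.0 for k = 5
```

Scaling k by 5 scales the margin by 5, but rescaling (W, b) by the same factor undoes it, so the optimum is unchanged.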

So we use +1 and -1 for simplifying the mathematical calculations.

Loss Function — Hinge Loss

The loss function used in SVM is hinge loss.

In simple terms, we can understand hinge loss as a function whose value is non-zero up to a certain point, let’s say ‘z’, and equal to zero after that point.

We looked at the equation for Soft-Margin SVM.

Here the second term, which contains ζ and C, is the loss term.

Now we will look at how we derived this term.

Let Y(W^T * X + b) = Z — (i)

// Here we are just substituting Y(W^T * X + b) with Z so that it is more readable

So from (i) we can say that:
If Z >= 1 then the point is correctly classified, and
If Z < 1 then the point is misclassified.

If you didn’t understand the above substitution, let me clarify it further.

Suppose you have 2 points x1 and x2 where x1 is positive and x2 is negative.

Now, for the point x2, which lies on the negative side of the plane, the value of (W^T * X + b) will be negative and its Y value will be -1.

So, Y*(W^T * X + b) = -1 * (-ve value) = +ve value.

Similarly, for a positive point x1, (W^T * X + b) will be positive and its Y value will also be positive.

So, Y*(W^T * X + b) = +1 * (+ve value) = +ve value.

Now if you have another point x3 which is positive but lies on the negative side of the plane, then (W^T * X + b) will be negative but the class label Y is still positive.

So, Y*(W^T * X + b) = +1 * (-ve value) = -ve value.

So the crux here is that Y*(W^T * X + b) will only be positive if the point is correctly classified, and we have just substituted Y*(W^T * X + b) as Z.
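We can sanity-check this sign argument with a tiny sketch (the plane and the three points below are made up for illustration):

```python
import numpy as np

w, b = np.array([1.0, 1.0]), 0.0    # hypothetical separating plane: x + y = 0

points = [
    (np.array([ 2.0,  1.0]), +1),   # x1: positive point on the positive side
    (np.array([-2.0, -1.0]), -1),   # x2: negative point on the negative side
    (np.array([-2.0, -1.0]), +1),   # x3: positive label, but on the negative side
]

# Z = Y * (w . x + b): positive iff the point is correctly classified
zs = [y * (w @ x + b) for x, y in points]
print(zs)  # positive for x1 and x2, negative for the misclassified x3
```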

Now that you are comfortable with the concept of Z (hopefully), let’s look at our loss function.

Our loss function is fairly simple, but if you are not able to see how it works right away, I will break it down for you.

As explained earlier:
If Z >= 1 then the point is correctly classified, and
If Z < 1 then the point is misclassified.

So we will consider 2 cases here.

Case 1 — (Z ≥ 1)

If Z ≥ 1 then 1-Z will be less than or equal to 0, so Max(0, 1-Z) = 0.

It makes sense intuitively: if Z ≥ 1 then it means we have correctly classified the point, and therefore our loss is 0.

Case 2 — (Z < 1)

If Z < 1 then 1-Z will be greater than 0, so Max(0, 1-Z) = 1-Z.

Final Step

As we already know:

Y(W^T * X + b) = 1 - ζ

So we can rewrite it as:

1 - Y(W^T * X + b) = ζ

And Y(W^T * X + b) = Z, so 1 - Z = ζ.

From the above cases we can see that the term we want to minimise is 1-Z.
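The two cases above collapse into one line of code. A minimal sketch of the hinge loss:

```python
def hinge_loss(z):
    """Hinge loss: Max(0, 1 - z). Zero once z >= 1, linear below that."""
    return max(0.0, 1.0 - z)

print(hinge_loss(2.5))   # Case 1: z >= 1, loss is 0.0
print(hinge_loss(-0.5))  # Case 2: z < 1, loss is 1 - z = 1.5 (this is ζ)
```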

Equation 1

This is exactly what we have written here.

We just substituted 1-Z with ζ.

Dual form of SVM —

The above Equation 1 that we derived is the primal form of SVM.

However in order to leverage the power of kernels we use the dual form of SVMs.

So let’s look at the dual form of SVM.

Equation 2

The reason for using the dual form of SVM is that it allows us to leverage the power of kernels, which is a key feature of SVM. If you are not familiar with kernels then don’t worry about it too much, I will explain kernels in the next section.

But for now just understand that we use the dual form of SVM in order to leverage the power of kernels.

It is proven mathematically that Equation 2 is equivalent to Equation 1.

The mathematical proof of how we actually got to this dual form is beyond the scope of this article as it is a bit mathematically intense.

If you want to understand the maths behind it, you can check out the following video.

In this video, Prof. Patrick Henry Winston provides a brilliant mathematical explanation, and I would highly suggest watching it to better understand the concept of SVMs.

The most important thing to note here is that the value of αi will be non-zero only for support vectors.

So we basically only care about the support vectors.
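To see this in practice, here is a small scikit-learn sketch (the toy blobs are made up for illustration): a fitted `SVC` stores only the support vectors, and `dual_coef_` holds the non-zero αi * yi values for exactly those points.

```python
import numpy as np
from sklearn.svm import SVC

# Two hypothetical Gaussian blobs, labelled -1 and +1
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 2, rng.randn(50, 2) + 2])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only support vectors have non-zero alpha; every other point
# contributes nothing to the decision function.
print(clf.support_vectors_.shape[0], "support vectors out of", len(X))
print(clf.dual_coef_.shape)  # one dual coefficient per support vector
```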

We can update our Equation 2 as follows —

Equation 3

Earlier we were using Xi^T . Xj, i.e. we were taking the dot product of Xi and Xj, which is similar to the cosine similarity function (they coincide when the vectors are unit-normalised).

So we can just replace this dot product with any other function of Xi and Xj.

This is called the kernel trick.
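A minimal sketch of why the trick works (the vectors and the feature map below are illustrative): for the quadratic kernel K(a, b) = (a · b)², evaluating the kernel on the original 2-D vectors gives exactly the dot product of an explicit 3-D feature map, without ever constructing that map.

```python
import numpy as np

def phi(v):
    # Explicit feature map for the kernel (a . b)**2 in 2-D:
    # phi(v) = (v1^2, sqrt(2)*v1*v2, v2^2)
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

a, b = np.array([1.0, 2.0]), np.array([3.0, 4.0])

k_trick = (a @ b) ** 2        # kernel trick: stay in 2-D
k_explicit = phi(a) @ phi(b)  # explicit mapping to 3-D, then dot product

print(k_trick, k_explicit)    # the two values match
```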

Now let’s understand what the hell a kernel is.

Kernel and its types —

In the above Equation 3 we can replace K with any kernel function.

Now you must be wondering: how does that change anything? Why does it even matter which function we use?

So let’s try to answer those questions.

Suppose you have a dataset which is not linearly separable.

Non-linearly separable data

Now how would you use SVM to separate this data?

There is no way that we can fit a plane which can separate these 2 classes.

Enter Kernels….

The main use of a kernel function is that it allows us to project our dataset onto a higher dimension, where we can fit a plane to separate the classes.

So we can project our above dataset onto a higher dimension and then we can find a plane that can separate the 2 classes.

This is exactly the reason why SVM was super popular in the early 90s.
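As an illustrative sketch with scikit-learn (toy data, not from the article): on concentric circles a linear SVM is stuck near chance level, while an RBF-kernel SVM separates the classes almost perfectly.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by any plane in 2-D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

print("linear accuracy:", linear.score(X, y))  # roughly chance level
print("rbf accuracy:   ", rbf.score(X, y))     # near 1.0
```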

Types of Kernels —

The 2 most popular types of kernels are —

Polynomial Kernel
Radial Basis Function (RBF) Kernel

Polynomial Kernel —

So for a quadratic kernel we will have something like this —

RBF Kernel —

Here d is the distance between x1 and x2, i.e. d = ||x1-x2||, and σ is a hyperparameter.
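A direct sketch of the RBF formula as described above, K(x1, x2) = exp(-d² / (2σ²)) with d = ||x1 - x2|| (note that many libraries parameterise it as gamma = 1/(2σ²) instead of σ):

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0):
    d = np.linalg.norm(x1 - x2)           # d = ||x1 - x2||
    return np.exp(-d ** 2 / (2 * sigma ** 2))

x1, x2 = np.array([0.0, 0.0]), np.array([3.0, 4.0])

print(rbf_kernel(x1, x1))            # identical points -> similarity 1.0
print(rbf_kernel(x1, x2, sigma=5))   # similarity decays with distance
```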

nu-SVM —

nu is a hyperparameter that we can use to define the percentage of error that is acceptable.

0 <= nu <= 1 // The value of nu is between 0 and 1

Let's understand it with an example.

Suppose nu = 0.01 and N (number of data points) = 100,000.

* Percentage of errors <= 1%
* Number of support vectors >= 1% of N, i.e. 1,000

So with the help of the nu hyperparameter we can do 2 things —

We can control the error percentage for our model.

We can’t control the exact number of support vectors, but we can determine a lower bound on it.
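A quick sketch with scikit-learn's `NuSVC` (the overlapping toy blobs are made up): nu upper-bounds the fraction of margin errors and lower-bounds the fraction of support vectors, so with nu = 0.1 at least 10% of the points become support vectors.

```python
import numpy as np
from sklearn.svm import NuSVC

# Two heavily overlapping hypothetical blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) - 1, rng.randn(100, 2) + 1])
y = np.array([-1] * 100 + [1] * 100)

clf = NuSVC(nu=0.1).fit(X, y)

# nu is a lower bound on the fraction of support vectors
frac_sv = clf.support_vectors_.shape[0] / len(X)
print("fraction of support vectors:", frac_sv)  # >= nu = 0.1
```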

And with that we have come to the end of this article.

Thanks a ton for reading it.

You can clap if you want. IT’S FREE.

My Twitter and LinkedIn.