For the love of regression

Syed Misbah · Jan 3

Regression is used to model the relationship between a dependent variable and one or more independent variables.
The idea is to predict a continuous value of a given data point by generalizing on data.
Refresher

To give a quick refresher on what a typical regression model looks like, here is an illustration.

In the illustration above:
y — the value of the dependent variable
β₀ — the y-intercept, i.e. where the line intersects the Y-axis
β₁ — the slope or gradient of the line
x — the value of the independent variable
u — the residual, or noise caused by unexplained factors

The cost function and gradient descent

In ML, cost functions are used to estimate how badly models are performing.
Put simply, a cost function is a measure of how wrong the model is in terms of its ability to estimate the relationship between X and y.
Mean Squared Error

This is typically expressed as a difference or distance between the predicted value and the actual value.
The objective of a ML model, therefore, is to find parameters, weights or a structure that minimizes the cost function.
By tweaking the above equation a little we get the below equation:

J(θ) = (1/2m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

where J is the cost, θ is the parameter vector, h_θ(x⁽ⁱ⁾) is the predicted value and y⁽ⁱ⁾ is the actual value of the i-th observation.
Our goal is to minimize the cost function, which in turn improves the model's accuracy.
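To make the cost concrete, here is a minimal sketch of the squared-error cost for simple linear regression in plain Python. The toy data and parameter values are made up for illustration:

```python
def predict(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x"""
    return theta0 + theta1 * x

def cost(theta0, theta1, xs, ys):
    """J(theta) = (1 / 2m) * sum of squared errors over the m observations."""
    m = len(xs)
    return sum((predict(theta0, theta1, x) - y) ** 2
               for x, y in zip(xs, ys)) / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(cost(0.0, 2.0, xs, ys))  # perfect fit -> 0.0
print(cost(0.0, 1.0, xs, ys))  # errors 1, 2, 3 -> 14/6 ≈ 2.333
```

A better parameter guess produces a lower cost, which is exactly the signal gradient descent exploits in the next section.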
Minimizing the cost function: gradient descent

Gradient descent is an efficient optimization algorithm that attempts to find a local or global minimum of a function.
Gradient Descent

If you want to read in detail about how gradient descent works, here is a simplified yet exhaustive explanation.
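As a quick sketch of the idea, here is batch gradient descent for simple linear regression in plain Python. The learning rate and iteration count are illustrative choices, not tuned values:

```python
def gradient_descent(xs, ys, lr=0.05, iters=2000):
    """Fit y ≈ t0 + t1 * x by repeatedly stepping down the cost gradient."""
    m = len(xs)
    t0, t1 = 0.0, 0.0
    for _ in range(iters):
        # Partial derivatives of J(theta) w.r.t. t0 and t1
        errs = [(t0 + t1 * x) - y for x, y in zip(xs, ys)]
        g0 = sum(errs) / m
        g1 = sum(e * x for e, x in zip(errs, xs)) / m
        # Step against the gradient, scaled by the learning rate
        t0 -= lr * g0
        t1 -= lr * g1
    return t0, t1

t0, t1 = gradient_descent([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# For data generated by y = 2x, the estimates converge near (0, 2)
```

Each iteration moves the parameters a small step in the direction that decreases the cost; with a suitable learning rate the estimates settle at the minimum.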
The fit and the split

Underfitting and Overfitting

When we use unnecessary explanatory variables, it might lead to overfitting.
Overfitting means that our algorithm works well on the training set but performs poorly on the test set. This is also known as the problem of high variance. When our algorithm works so poorly that it cannot fit even the training set well, it is said to underfit the data. This is also known as the problem of high bias.
In the following diagram we can see that fitting a linear regression (the straight line in fig 1) would underfit the data, i.e. it will lead to large errors even on the training set. The polynomial fit in fig 2 is balanced, i.e. it can work well on both the training and test sets, while the fit in fig 3 will lead to low errors on the training set but will not work well on the test set.
Splitting

A commonly confused concept is how data is split while modeling. The ideal way to split data is shown in the illustration above, including 5-fold cross validation.
What is cross validation and why should I care about it?

An ideal model is one which captures most of the patterns in the data correctly, without picking up too much of the noise. In other words, it is low on both bias and variance.
However, while splitting the data set into train and test randomly, there exists a chance of introducing sampling bias.
If the model is trained on the randomly sampled (and hence possibly biased) training set, it might overfit; in other words, it might not be a good generalized model.
It will perform poorly on data it hasn't seen before.
In K Fold cross validation, the data is divided into k subsets.
Now the holdout method is repeated k times, such that each time one of the k subsets is used as the test/validation set and the other k − 1 subsets are put together to form the training set.
The error estimation is averaged over all k trials to get total effectiveness of our model.
This significantly reduces bias, as we are using most of the data for fitting, and also significantly reduces variance, as most of the data is also being used in the validation set.
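The splitting-and-averaging procedure can be sketched in plain Python. The "model" below is just the mean of the training targets, so the point is the fold logic, not the estimator; the data is a toy example:

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k roughly equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
errors = []
for train, test in kfold_indices(len(ys), k=3):
    # "Fit" on the k-1 training folds: here, just take the mean target
    mean_pred = sum(ys[i] for i in train) / len(train)
    # Evaluate on the held-out fold
    mse = sum((ys[i] - mean_pred) ** 2 for i in test) / len(test)
    errors.append(mse)

# Average the k per-fold errors to get the overall CV estimate
cv_error = sum(errors) / len(errors)
```

Every observation ends up in the test set exactly once, which is what makes the averaged error a less biased estimate than a single random split.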
Regularization — What, why and how?

Before we move on to ridge, lasso and Elastic Net regression, it is important to know what regularization is and why it is important.
Regularization helps solve the overfitting problem, i.e. a model performing well on training data but poorly on validation (test) data. It does this by adding a penalty term to the objective function and using that term to control model complexity.
Regularization addresses the trade off between model accuracy and complexity.
A highly complex model will tend to overfit, while a low complexity model will tend to underfit.
Regularization is generally useful in the following situations:
- Large number of variables, e.g. multiple ACV columns in sales data
- Low ratio of number of observations to number of variables
- High multicollinearity

Let’s take a look at the two ways of regularizing your model.
L1 Regularization — Lasso regression

In L1/lasso regularization, we minimize the objective function after adding a penalty term proportional to the sum of the absolute values of the coefficients.
Thus the cost function in lasso regression becomes:

J(θ) = (1/2m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + λ Σⱼ |βⱼ|

As we can see, the absolute values of all the β’s are added to the cost function as a penalty. λ is the regularization parameter: the higher λ is, the stronger the penalty.
Additionally, the intercept term is not regularized.
A disadvantage of lasso is that it shrinks all coefficients, large or small, toward zero by the same amount, so the penalty’s effect depends on the scale of each variable. Hence, it requires standardization of the variables.
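This constant-amount shrinkage is the soft-thresholding operator, the building block of lasso’s coordinate-descent updates. A minimal sketch in plain Python, with an illustrative threshold value:

```python
def soft_threshold(beta, lam):
    """Pull beta toward zero by lam; betas smaller than lam become exactly 0."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0

print(soft_threshold(3.0, 1.0))   # -> 2.0  (large coefficient, shrunk by lam)
print(soft_threshold(-0.5, 1.0))  # -> 0.0  (small coefficient, dropped entirely)
```

This is why lasso zeroes out small coefficients, and also why unstandardized variables with small-scale coefficients get penalized disproportionately.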
L2 Regularization — Ridge regression

In L2 regularization, we minimize the objective function after adding a penalty term proportional to the sum of the squares of the coefficients. Thus the cost function of ridge regression is:

J(θ) = (1/2m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + λ Σⱼ βⱼ²

Ridge regression only suppresses the magnitude of the betas; it does not drop any variable.
Choosing λ

If we choose λ = 0, we get back the standard linear regression (OLS) estimates. If λ is chosen to be very large, it will lead to underfitting. Thus it is highly important to determine a desirable value of λ.
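For a single centered feature with no intercept, ridge has a closed-form solution that makes the effect of λ easy to see. A sketch with toy data (the one-feature formula β = Σxy / (Σx² + λ) follows from setting the penalized cost’s derivative to zero):

```python
def ridge_beta(xs, ys, lam):
    """One-feature ridge estimate on centered data: beta = sum(xy) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [-1.0, 0.0, 1.0]
ys = [-2.0, 0.0, 2.0]

print(ridge_beta(xs, ys, 0.0))   # lam = 0 -> 2.0, the OLS estimate
print(ridge_beta(xs, ys, 2.0))   # lam = 2 -> 1.0, shrunk but not zero
```

Note the ridge behaviors described above: λ = 0 recovers OLS exactly, and larger λ shrinks the coefficient toward zero without ever making it exactly zero.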
Elastic Net — the holy grail?

Elastic Net regression is preferred over both ridge and lasso regression when one is dealing with highly correlated independent variables.
It is a combination of both Lasso and ridge regression.
The cost function in the case of Elastic Net regression is:

J(θ) = (1/2m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + λ (α Σⱼ |βⱼ| + (1 − α)/2 Σⱼ βⱼ²)

Elastic Net has an additional tuning parameter, α, which controls the mix of the two penalties. When α = 1 the model is fully lasso, α = 0 gives a fully ridge model, and α = 0.5 produces an equal compromise between the two.
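The mixing behavior is easy to see by computing just the penalty term. A sketch in plain Python, using the common λ(α·L1 + (1 − α)/2·L2) parameterisation (exact scaling conventions vary between libraries):

```python
def elastic_net_penalty(betas, lam, alpha):
    """Elastic-net penalty: a convex mix of the L1 and L2 penalty terms."""
    l1 = sum(abs(b) for b in betas)       # lasso part
    l2 = sum(b * b for b in betas)        # ridge part
    return lam * (alpha * l1 + 0.5 * (1 - alpha) * l2)

betas = [1.0, -2.0]
print(elastic_net_penalty(betas, 1.0, 1.0))  # alpha = 1, pure lasso: 3.0
print(elastic_net_penalty(betas, 1.0, 0.0))  # alpha = 0, pure ridge: 2.5
print(elastic_net_penalty(betas, 1.0, 0.5))  # alpha = 0.5, equal compromise
```

Sliding α between 0 and 1 interpolates smoothly between ridge-style shrinkage and lasso-style variable selection.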
What should I use then?

There is no ‘one size fits all’ solution to every problem.
However, ElasticNet does come as close as possible to such a solution.
Ridge: Since ridge only suppresses the betas, all of them will be non-zero, meaning all variables are used to predict the response.

Lasso: Unlike ridge, lasso sets many coefficients to exactly zero. This means that only a subset of our original variables is used to predict the response. In this way lasso does both prediction and variable selection (i.e. it finds our “important”/“relevant” variables).

Elastic Net: Lasso is great. It does prediction in high-dimensional situations (a large number of highly correlated variables) and does variable selection (i.e. sets certain beta components to zero). But it has a problem: if a set of predictor variables is highly correlated, lasso tends to pick a single one from the correlated group, and the particular variable selected may not be all that relevant. ElasticNet will force groups of correlated variables to enter and exit together, aiding interpretation, in a way giving the best of both worlds.
The one possible issue with ElasticNet is that the hyperparameter search (α and λ) is computationally intensive.
Bottom line: If you’re computationally rich, then the best bet is to use ElasticNet and get the most robust model.
Implementation in R

This post has become too long to contain an in-depth implementation of Elastic Net in R/Python.
I will be covering it in a separate post.
However, this link contains a beautiful and exhaustive implementation which should pretty much cover the basics.
I’ve tried to cover as much breadth as possible and given the relevant links for going in depth.
However, if you need any help with any of the topics mentioned above, feel free to reach out and ask for help.
Please do let our team @ Decoding Data know your thoughts, questions or suggestions in the comment section below.