Run cross-validation over a set of candidate λ values and pick the one that minimizes the cross-validated prediction error.
Luckily, Python’s scikit-learn can do this for us.
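The original code chunk did not survive extraction, so here is a minimal sketch of cross-validated λ selection with scikit-learn's RidgeCV. The data is synthetic, standing in for the prostate dataset (the coefficients and sizes below are hypothetical):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data: 100 observations, 8 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
beta = np.array([1.5, -2.0, 0.0, 0.0, 3.0, 0.0, 0.5, 0.0])
y = X @ beta + rng.normal(scale=0.5, size=100)

# RidgeCV tries each candidate lambda (called `alpha` in scikit-learn)
# and keeps the one with the lowest cross-validated error.
alphas = np.logspace(-3, 3, 13)
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("best alpha:", ridge.alpha_)
```

Note that scikit-learn names the penalty strength `alpha` rather than λ.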
LASSO

Lasso, or Least Absolute Shrinkage and Selection Operator, is very similar in spirit to Ridge Regression.
It also adds a penalty for non-zero coefficients to the loss function, but unlike Ridge Regression which penalizes sum of squared coefficients (the so-called L2 penalty), LASSO penalizes the sum of their absolute values (L1 penalty).
As a result, for high values of λ, many coefficients are exactly zeroed under LASSO, which is never the case in Ridge Regression.
Another important difference between them is how they tackle the issue of multicollinearity between the features.
In Ridge Regression, the coefficients of correlated variables tend to be similar, while in LASSO one of them is usually zeroed and the other is assigned the entire impact.
Because of this, Ridge Regression is expected to work better if there are many large parameters of about the same value, i.e. when most predictors truly impact the response.
LASSO, on the other hand, is expected to come out on top when there are a small number of significant parameters and the others are close to zero, i.e. when only a few predictors actually influence the response.
In practice, however, one doesn’t know the true values of the parameters.
So, the choice between Ridge Regression and LASSO can be based on out-of-sample prediction error.
Another option is to combine these two approaches in one; see the next section!

LASSO's loss function adds an L1 penalty term to the residual sum of squares. Unlike in Ridge Regression, this minimization problem cannot be solved analytically.
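The formula itself did not survive formatting; a reconstruction of the LASSO objective, with n observations, p coefficients β_j, and penalty strength λ:

```latex
L_{\text{lasso}}(\hat{\beta}) = \sum_{i=1}^{n} \left( y_i - x_i^{\top}\hat{\beta} \right)^2 + \lambda \sum_{j=1}^{p} |\hat{\beta}_j|
```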
Fortunately, there are numerical algorithms able to deal with it.
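One such numerical algorithm is coordinate descent, which is what scikit-learn's Lasso uses under the hood. A small sketch on synthetic data (hypothetical sizes and coefficients) showing the zeroing behavior described above:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data where only 3 of 10 predictors matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 10))
true_beta = np.zeros(10)
true_beta[:3] = [2.0, -1.5, 1.0]
y = X @ true_beta + rng.normal(scale=0.3, size=120)

# Lasso solves the L1-penalized problem by coordinate descent;
# a sufficiently large alpha sets weak coefficients exactly to zero.
lasso = Lasso(alpha=0.5).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0.0))
print("zeroed coefficients:", n_zero)
```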
Elastic Net

Elastic Net first emerged in response to criticism of LASSO, whose variable selection can be too dependent on the data and thus unstable.
Its solution is to combine the penalties of Ridge Regression and LASSO to get the best of both worlds.
Elastic Net aims to minimize a loss function that includes both the L1 and L2 penalties, weighted by a mixing parameter α that interpolates between Ridge Regression (α = 0) and LASSO (α = 1).
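The formula was lost to formatting; one common parameterization of the Elastic Net objective, consistent with the mixing parameter α described above, is:

```latex
L_{\text{enet}}(\hat{\beta}) = \sum_{i=1}^{n} \left( y_i - x_i^{\top}\hat{\beta} \right)^2
  + \lambda \left( \alpha \sum_{j=1}^{p} |\hat{\beta}_j| + (1-\alpha) \sum_{j=1}^{p} \hat{\beta}_j^2 \right)
```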
The best α can be chosen with scikit-learn's cross-validation-based hyperparameter tuning.
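A minimal sketch of that tuning with ElasticNetCV on synthetic data (hypothetical values). Note that scikit-learn calls the mixing parameter `l1_ratio`, with `l1_ratio=1` corresponding to LASSO and `l1_ratio=0` to Ridge:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic data: two informative predictors out of eight.
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 8))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=150)

# ElasticNetCV jointly tunes the penalty strength and the L1/L2 mix
# by cross-validation.
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, y)
print("l1_ratio:", enet.l1_ratio_, "alpha:", enet.alpha_)
```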
Least Angle Regression

So far we have discussed one subsetting method, Best Subset Regression, and three shrinkage methods: Ridge Regression, LASSO and their combination, Elastic Net.
This section is devoted to an approach located somewhere in between subsetting and shrinking: Least Angle Regression (LAR).
This algorithm starts with a null model, with all coefficients equal to zero, and then works iteratively, at each step moving the coefficient of one of the variables towards its least squares value.
More specifically, LAR starts with identifying the variable most correlated with the response.
Then it moves the coefficient of this variable continuously toward its least squares value, thus decreasing its correlation with the evolving residual.
As soon as another variable “catches up” in terms of correlation with the residual, the process is paused.
The second variable then joins the active set, i.e. the set of variables with non-zero coefficients, and their coefficients are moved together in a way that keeps their correlations with the residual tied and decreasing.
This process is continued until all the variables are in the model, and ends at the full least-squares fit.
The name “Least Angle Regression” comes from the geometrical interpretation of the algorithm, in which the new fit direction at a given step makes the smallest angle with each of the features that already have non-zero coefficients.
The code chunk below applies LAR to the prostate data.
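That chunk did not survive extraction; a minimal stand-in using scikit-learn's `lars_path` on synthetic data (the prostate dataset itself is not loaded here, so sizes and coefficients are hypothetical):

```python
import numpy as np
from sklearn.linear_model import lars_path

# Synthetic data: variables 0 and 1 drive the response.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# lars_path traces the full LAR coefficient path: variables enter
# the active set one at a time, and their coefficients move jointly
# toward the full least-squares solution.
alphas, active, coefs = lars_path(X, y, method="lar")
print("entry order of variables:", active)
```

The `active` list records the order in which variables join the active set; the most correlated predictor enters first.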
Principal Components Regression

We have already discussed methods for choosing variables (subsetting) and decreasing their coefficients (shrinkage).
The last two methods explained in this article take a slightly different approach: they squeeze the input space of the original features into a lower-dimensional space.
Namely, they use X to create a small set of new features Z that are linear combinations of X and then use those in regression models.
The first of these two methods is Principal Components Regression.
It applies Principal Components Analysis, a method that produces a set of new features which are uncorrelated with each other and have high variance (so that they can explain the variance of the target), and then uses them as features in simple linear regression.
This makes it similar to Ridge Regression, as both of them operate on the principal components space of the original features (for a PCA-based derivation of Ridge Regression see Hastie et al. in Sources at the bottom of this article).
The difference is that PCR discards the components with the least informative power, while Ridge Regression simply shrinks them more strongly.
The number of components to retain can be viewed as a hyperparameter and tuned via cross-validation, as is the case in the code chunk below.
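The original chunk is missing; a minimal sketch of PCR with the number of components tuned by cross-validation, on synthetic stand-in data (hypothetical sizes):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: 120 observations, 8 features.
rng = np.random.default_rng(4)
X = rng.normal(size=(120, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=120)

# PCR = PCA (keep the k highest-variance components) followed by
# ordinary linear regression; k is tuned by cross-validation.
pcr = Pipeline([("pca", PCA()), ("ols", LinearRegression())])
search = GridSearchCV(pcr, {"pca__n_components": [2, 4, 6, 8]}, cv=5)
search.fit(X, y)
print("best number of components:", search.best_params_["pca__n_components"])
```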
Partial Least Squares

The final method discussed in this article is Partial Least Squares (PLS).
Similarly to Principal Components Regression, it also uses a small set of linear combinations of the original features.
The difference lies in how these combinations are constructed.
While Principal Components Regression uses only X themselves to create the derived features Z, Partial Least Squares additionally uses the target y.
Hence, while constructing Z, PLS seeks directions that have high variance (as these can explain variance in the target) and high correlation with the target.
This stands in contrast to the principal components approach, which focuses on high variance only.
Under the hood of the algorithm, the first of the new features, z1, is created as a linear combination of all features X, where each of the Xs is weighted by its inner product with the target y.
Then, y is regressed on z1 giving PLS β-coefficients.
Finally, all X are orthogonalized with respect to z1.
Then the process starts anew for z2 and goes on until the desired number of components in Z is obtained.
This number, as usual, can be chosen via cross-validation.
It can be shown that although PLS shrinks the low-variance components in Z as desired, it can sometimes inflate the high-variance ones, which might lead to higher prediction errors in some cases.
This seems to be the case for our prostate data: PLS performs the worst among all discussed methods.
Recap & Conclusions

With many, possibly correlated features, linear models fail in terms of prediction accuracy and interpretability due to the large variance of the model's parameters.
This can be alleviated by reducing the variance, which can only happen at the cost of introducing some bias.
Yet, finding the best bias-variance trade-off can optimize the model's performance.
Two broad classes of approaches that help achieve this are subsetting and shrinkage.
The former selects a subset of variables, while the latter shrinks the coefficients of the model towards zero.
Both approaches reduce the model's complexity, which leads to the desired decrease in the parameters' variance.
This article discussed several subsetting and shrinkage methods:

- Best Subset Regression iterates over all possible feature combinations to select the best one;
- Ridge Regression penalizes the squared coefficient values (L2 penalty), enforcing them to be small;
- LASSO penalizes the absolute values of the coefficients (L1 penalty), which can force some of them to be exactly zero;
- Elastic Net combines the L1 and L2 penalties, enjoying the best of Ridge and LASSO;
- Least Angle Regression fits in between subsetting and shrinkage: it works iteratively, adding “some part” of one of the features at each step;
- Principal Components Regression performs PCA to squeeze the original features into a small subset of new features and then uses those as predictors;
- Partial Least Squares also summarizes the original features into a smaller subset of new ones, but unlike PCR, it also makes use of the target to construct them.
As you will see from the applications to the prostate data if you run the code chunks above, most of these methods perform similarly in terms of prediction accuracy.
The first five methods' errors range between 0.467 and 0.517, beating least squares' error of 0. The last two, PCR and PLS, perform worse, possibly because there are not that many features in the data, so the gains from dimensionality reduction are limited.
Thanks for reading! I hope you have learned something new :)

Sources

Hastie, T., Tibshirani, R., & Friedman, J. The elements of statistical learning: data mining, inference, and prediction. New York: Springer.