Concept of Regularization

This is because a linear model has been trained on a non-linear dataset.

Hence, fitting a straight line to non-linear data oversimplifies the model.

Overfitting

Overfitting is a phenomenon in which the model learns too much from the training dataset, including its noise.

An overfitting model performs really well on the data on which it was trained, but its accuracy tends to decrease when it is applied to a new set of data.

Let us consider the same example.

In this case, let us use a Polynomial Regression model.

By using the Polynomial Feature Transformer (PolynomialFeatures in scikit-learn), the powers of each element in an array can be computed up to a specified degree.

These values, which are stored in a multi-dimensional array, can then be fed to the Polynomial Regression model.
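To make the transformer's output concrete, here is a minimal sketch using scikit-learn's PolynomialFeatures. The input values, the degree of 3 and the choice of include_bias=False are assumptions made for illustration, not settings taken from the article.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One feature, three samples (illustrative values, not from the article's dataset).
x = np.array([[1.0], [2.0], [3.0]])

# degree=3 asks for x, x^2 and x^3; include_bias=False drops the constant column of ones.
transformer = PolynomialFeatures(degree=3, include_bias=False)
x_poly = transformer.fit_transform(x)

print(x_poly)
# [[ 1.  1.  1.]
#  [ 2.  4.  8.]
#  [ 3.  9. 27.]]
```

Each row holds the powers of one input value, and this is the array that would be passed on to the regression model.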

As the specified degree in the Polynomial Feature Transformer increases, the model learns more than required and produces an output that is very “specific” to the training dataset.
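The effect of raising the degree can be sketched on synthetic data. The example below assumes a noisy sine curve and uses NumPy's least-squares polynomial fit instead of the article's dataset. The training error can only go down as the degree grows, while the error on unseen points is printed for comparison and will often start to rise.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a sine curve (synthetic, for illustration only)."""
    x = np.sort(rng.uniform(0.0, 1.0, n))
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def mse(y_true, y_hat):
    """Mean squared error between two arrays."""
    return float(np.mean((y_true - y_hat) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit of the given degree, using the training data only.
    poly = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_mse[degree] = mse(y_train, poly(x_train))
    test_mse[degree] = mse(y_test, poly(x_test))
    print(f"degree={degree}: train MSE={train_mse[degree]:.4f}, "
          f"test MSE={test_mse[degree]:.4f}")
```

A widening gap between the training and test columns is the practical symptom of the high variance described in this section.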

An overfitting model is said to have high variance.

Overfitting occurs when the coefficients of the basis functions are large and cancel each other out.

Regularization prevents Overfitting!

Regularization techniques reduce the error by fitting a function appropriately on the given training set, so that overfitting is avoided.

These techniques essentially shrink the coefficients (β) of each feature, thereby reducing the chances of the values cancelling each other out.

An example of a regression equation (Linear Regression) is as follows:

y = β0 + β1x1 + β2x2 + … + βpxp

Types of Regularization

L1
L2
Elastic-net

L1 (Lasso Regularization):

The idea behind L1 regularization is to reduce the dataset to only the most important features that would impact the “target variable”.

L1 regularization adds a penalty equal to the sum of the absolute values of the coefficients.
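As a sketch, assuming a residual-sum-of-squares loss over n samples and p features, and writing the penalty strength as λ, the Lasso objective can be stated as:

```latex
\min_{\beta}\; \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2}
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The first term is the usual fitting error; the second term is the L1 penalty. The intercept β0 is conventionally left out of the penalty.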

With this penalty added, some of the feature coefficients become exactly 0, and the remaining features are the most useful ones.

This method of regularization can be seen as a method of feature selection.
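Why some coefficients land exactly on 0 can be seen in a simplified setting. The sketch below assumes orthonormal features, in which case the Lasso solution for each coefficient is the "soft-thresholded" least-squares coefficient. The coefficient values and the penalty strength are hypothetical.

```python
def soft_threshold(b, lam):
    """Lasso solution for one coefficient when the features are orthonormal.

    Minimises 0.5 * (beta - b)**2 + lam * abs(beta) over beta.
    """
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

# Hypothetical least-squares coefficients for four features.
ols_coefficients = [3.0, -0.4, 0.9, -2.5]
lam = 1.0

lasso_coefficients = [soft_threshold(b, lam) for b in ols_coefficients]
selected = [j for j, beta in enumerate(lasso_coefficients) if beta != 0.0]

print(lasso_coefficients)  # [2.0, 0.0, 0.0, -1.5]
print(selected)            # [0, 3]
```

Coefficients whose magnitude is below the penalty strength are set to 0, so only features 0 and 3 survive, which is the feature-selection behaviour described above.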

L2 (Ridge Regularization):

L2 regularization adds a penalty equal to the sum of the squared values of the coefficients.
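As a sketch, assuming a residual-sum-of-squares loss over n samples and p features, the Ridge objective can be written as:

```latex
\min_{\beta}\; \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2}
\;+\; \lambda \sum_{j=1}^{p} \beta_j^{2}
```

The first term is the usual fitting error and the second term is the L2 penalty.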

The λ in the equation is a hyper-parameter that controls the intensity of the penalty.

When λ→0, the results are similar to those of an ordinary linear regression.

When λ→∞, all coefficients are shrunk towards 0.

As the penalty is applied, the coefficients do not drop abruptly to 0; rather, they shrink slowly towards 0.

Hence, unlike L1, L2 cannot be used for feature selection.
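The behaviour of the L2 penalty at both extremes of λ can be checked numerically. This is a minimal sketch on synthetic data with no intercept, using the closed-form ridge solution rather than the scikit-learn estimator. The data and the λ values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 50 samples, 4 features, no intercept (illustration only).
X = rng.normal(size=(50, 4))
true_beta = np.array([3.0, 0.0, -2.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=50)

def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^(-1) X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Ordinary least squares, for comparison with the lam -> 0 case.
ols = np.linalg.lstsq(X, y, rcond=None)[0]

norms = []
for lam in (1e-8, 1.0, 100.0, 1e8):
    beta = ridge_coefficients(X, y, lam)
    norms.append(float(np.linalg.norm(beta)))
    print(f"lambda={lam:g}: coefficients={beta}")
```

With a tiny λ the coefficients match ordinary least squares; as λ grows they all shrink together, yet none of them becomes exactly 0.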

In both cases, the bigger the penalization, the smaller the coefficients become.

Elastic-net:

Elastic-net regularization is a combination of both L1 and L2 regularization.

The penalty applied (P) is a weighted sum of the L1 and L2 penalties. The λ in this case is a shared parameter which sets the ratio between L1 and L2.

So, the result would be a hybrid of L1 and L2 regularization.
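One common way to write such a hybrid penalty is P = λ·Σ|β| + (1 − λ)·Σβ², with λ between 0 and 1. That exact form is an assumption here, and the function name and coefficient values below are hypothetical. scikit-learn's ElasticNet expresses a similar idea through its alpha and l1_ratio parameters, with an extra factor of 0.5 on the L2 term.

```python
def elastic_net_penalty(coefficients, ratio):
    """Blend of the L1 and L2 penalties; `ratio` plays the role of the shared lambda.

    Assumed form: ratio * sum(|b|) + (1 - ratio) * sum(b**2).
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie between 0 and 1")
    l1 = sum(abs(b) for b in coefficients)
    l2 = sum(b * b for b in coefficients)
    return ratio * l1 + (1.0 - ratio) * l2

beta = [2.0, -1.0, 0.5]  # hypothetical coefficients

for ratio in (0.0, 0.5, 1.0):
    print(ratio, elastic_net_penalty(beta, ratio))
# 0.0 5.25   (pure L2)
# 0.5 4.375  (equal mix)
# 1.0 3.5    (pure L1)
```

Setting the ratio to 1 recovers the L1 penalty and setting it to 0 recovers the L2 penalty; anything in between gives the hybrid.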

The geometrical representation of the regularizations is shown below:

[Figure: geometrical representation of the regularizations]

Implementation of Regularization

Import the required libraries:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
    import warnings
    warnings.filterwarnings('ignore')
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import Lasso, Ridge
    from sklearn.pipeline import make_pipeline

Load the dataset (for this example, the Ames Housing dataset has been used):

    data = pd.…
    data.head()

Split the data into ‘train data’ and ‘test data’:

    from sklearn.model_selection import train_test_split
    X = data.iloc[:, :-1]
    y = data.SalePrice
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=9, train_size=0.…)


To perform Linear Regression:

    from sklearn.linear_model import LinearRegression, Lasso, Ridge
    from sklearn.metrics import mean_squared_error
    regressor = LinearRegression()  # Linear model
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
    mean_squared_error(y_test, y_pred)


To perform Lasso Regression:

    lasso_model = Lasso(alpha=140, max_iter=100000, random_state=9)
    lasso_model.fit(X_train, y_train)  # Lasso model
    y_pred = lasso_model.predict(X_test)
    mean_squared_error(y_test, y_pred)


To perform Ridge Regression:

    ridge_model = Ridge(alpha=0.00002, max_iter=100000, random_state=9)
    ridge_model.fit(X_train, y_train)  # Ridge model
    y_pred = ridge_model.predict(X_test)
    mean_squared_error(y_test, y_pred)

In each case, the mean_squared_error is found between the predicted ‘target variable’ and the actual ‘target variable’ from the test data for model validation.
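The article imports PolynomialFeatures and make_pipeline; one way they might be combined with the three estimators is sketched below. Everything specific in it is an assumption for illustration: the synthetic sine data standing in for the housing data, the degree of 9, the added StandardScaler step, the alpha values and the 0.75 train size.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for a real dataset: a noisy sine curve with one feature.
rng = np.random.default_rng(9)
X = rng.uniform(0.0, 1.0, size=(60, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=9, train_size=0.75
)

# Hypothetical settings chosen for illustration, not tuned values.
models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.01, max_iter=100000),
    "ridge": Ridge(alpha=1.0),
}

test_mse = {}
for name, estimator in models.items():
    # Degree-9 polynomial features, scaled so the penalty treats each power alike.
    model = make_pipeline(
        PolynomialFeatures(degree=9, include_bias=False),
        StandardScaler(),
        estimator,
    )
    model.fit(X_train, y_train)
    test_mse[name] = mean_squared_error(y_test, model.predict(X_test))

for name, value in test_mse.items():
    print(f"{name}: test MSE = {value:.4f}")
```

Comparing the three printed values on held-out data is the same validation step the article performs; which model comes out ahead depends on the data and on the penalty strengths chosen.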

Thanks for reading this article.

Feel free to contact me via LinkedIn, or email me at the address listed on my LinkedIn profile.

