Implementation of Backward feature Selection with Multiple Linear Regression

Pallavi Nikam, Mar 1

For Prediction of Housing Prices using Boston Housing Dataset

[Figure: Housing Price Prediction Dataset]

This tutorial covers the implementation of multiple linear regression with the backward elimination technique.

Caution: if you are expecting a detailed treatment of either multiple regression or backward elimination, you may be disappointed.

This article is focused purely on the implementation of backward elimination.

Still, I would prefer to give a short overview of backward elimination.

It is one of the Dimension Reduction Techniques.

A common piece of advice from ML experts is to keep your model as simple as possible.

Don't fall for building complex models with all the available data. I agree with that.

Including all the available data may turn into garbage in, garbage out.

You need to care about the data you are providing to build the model.

It is necessary to identify redundant data and drop it.

There are a few techniques that help to achieve this, known as dimension reduction techniques.

Backward elimination is one of them: we eliminate nonsignificant features one by one by analyzing the impact of each feature on the target variable.

Let's dive into the implementation process in detail.

I am using the Boston Housing Dataset available at https://medium. ... html

The dataset has 506 cases.

The dataset contains 13 features which may affect housing prices.

The details of the features are as follows.

[Figure: Details of Features/Attributes of Dataset]

Attributes 1 to 13 are the predictors, and the 14th attribute, MEDV, is the target variable (housing prices).

Data Processing

We start by processing the dataset and preparing it for model building.

There are no missing values in the dataset.

Therefore I am not dropping any rows or columns.

The dataset is small, and I am not applying normalization or standardization; ordinary least squares does not require scaled features.

There is no categorical variable except ‘CHAS’, and that is already encoded.

It looks like the dataset does not need much processing.

Let's visualize some features to get good insight into the data.

The histogram of the target variable, housing prices, looks as below.

[Figure: Histogram of Housing Prices]

The conclusion drawn from this histogram is that the values are fairly well distributed except at the upper end, where there is an abnormal spike at 50 (pointed out by the blue mark).

It looks like housing prices are censored at 50.

To avoid biasing the model, I chose to drop the observations whose price equals 50 (that is, $50,000, since MEDV is recorded in thousands of dollars). This leaves 490 of the 506 observations.
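The dropping step described above could be sketched as follows. This is only an illustration: the helper name drop_censored, its parameters and the tiny toy frame are made up for the example, and it assumes the data sits in a pandas DataFrame with a MEDV column.

```python
import pandas as pd

def drop_censored(df, target="MEDV", cap=50.0):
    """Return a copy of df without the rows whose target equals the cap.

    Hypothetical helper: `target` and `cap` are assumptions based on the
    description above (prices appear to be censored at 50).
    """
    return df[df[target] != cap].reset_index(drop=True)

# Toy data standing in for the real dataset (values are invented)
toy = pd.DataFrame({
    "RM": [6.5, 7.1, 8.0, 5.9],
    "MEDV": [24.0, 50.0, 50.0, 18.5],
})

cleaned = drop_censored(toy)
print(len(toy), "->", len(cleaned))  # row count before and after
```

On the real data the same call would be applied to the full DataFrame before separating predictors and target.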

I analyzed the behavior of each feature against price and concluded that a multiple regression model is appropriate.

Almost every attribute shows some correlation, weak or strong, with the target variable.

I am not including all the scatter plots here, as the details of each feature and its relation to the target variable would lengthen the article, and I consider them slightly off topic.

Let's view the correlations with a heatmap.

[Figure: Heatmap - Correlation]

The conclusions from the heatmap:

“CHAS”, a dummy variable that is 1 if the tract bounds the Charles River, has the lowest correlation with housing prices.

“LSTAT”, the percentage of lower-status population, has the highest correlation.

The negative coefficient indicates that housing prices are lower in areas with a higher percentage of lower-status population.

A few feature pairs, (RAD and TAX), (DIS and NOX) and (INDUS and NOX), are highly correlated with each other (referred to as multicollinearity).

There is another dimension reduction technique in which we keep the features that are highly correlated with the target variable and drop the features with low correlation.

Also, when two features are highly correlated with each other (multicollinearity), we keep only one of them.

But here we will not use these results for dimension reduction; we will use them to check whether backward feature selection gives similar results.
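The two readings taken from the heatmap (correlation of each feature with the target, and pairs of features that are highly correlated with each other) can also be pulled out numerically. The sketch below runs on synthetic data; the function name correlation_report and the 0.7 cut-off are assumptions chosen for illustration, not values from this analysis.

```python
import numpy as np
import pandas as pd

def correlation_report(df, target, pair_threshold=0.7):
    """Return (feature-vs-target correlations sorted by absolute value,
    list of feature pairs whose absolute correlation >= pair_threshold)."""
    corr = df.corr()  # Pearson correlation matrix
    with_target = corr[target].drop(target)
    order = with_target.abs().sort_values(ascending=False).index
    with_target = with_target.reindex(order)

    features = [c for c in df.columns if c != target]
    pairs = []
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            r = float(corr.loc[a, b])
            if abs(r) >= pair_threshold:
                pairs.append((a, b, round(r, 2)))
    return with_target, pairs

# Synthetic data: "b" is nearly a copy of "a", "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)
c = rng.normal(size=200)
y = 2 * a - c + 0.5 * rng.normal(size=200)
df = pd.DataFrame({"a": a, "b": b, "c": c, "y": y})

with_target, pairs = correlation_report(df, target="y")
print(with_target)
print(pairs)
```

The pair list plays the role of the multicollinearity reading above, and the sorted series plays the role of the "highest and lowest correlation with the target" reading.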

I hope we now have a good enough idea of the dataset, so I will go ahead with building the model and implementing backward selection.

Let's first fit a model with all the features and calculate its performance, so that we can compare the model before and after elimination.

First, split the data into train and test sets.

    import numpy as np

    # X holds the 13 predictors and y the target MEDV
    X = BostonHousing_df. ... values

    # splitting data into train and test dataset
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

Using scikit-learn's LinearRegression() for model building:

    from sklearn.linear_model import LinearRegression
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)

For the performance of the model, I am using RMSE (Root Mean Square Error).

    y_pred = regressor.predict(X_test)
    print("RMSE: %.2f" % np.sqrt(((y_pred - y_test) ** 2).mean()))

The output is RMSE = 4.24.
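Since the same RMSE expression is used again later, it may help to see it as a small function with a worked example. This is a sketch with made-up numbers; the function name rmse is an assumption, not part of the code above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: sqrt of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Errors are -2 and 0, squared errors 4 and 0, mean 2, root about 1.414
example = rmse([3.0, 5.0], [1.0, 5.0])
print(round(example, 3))
```

Because RMSE is in the same units as the target, a value of 4.24 here corresponds to roughly $4,240 on the MEDV scale of thousands of dollars.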

Now let's start the process of backward feature selection to identify nonsignificant features.

The steps followed for backward feature selection are as below; it is a cyclic process that repeats until we get the desired results.

[Figure: Process Flow For Backward Feature Elimination]

Before going further, I will briefly explain the p-value.

The p-value is one of the performance measures.

For each feature, it is the probability of seeing a coefficient at least as extreme as the one estimated if the feature actually had no effect on the target variable.

A high p-value indicates that the feature is nonsignificant and has little impact on the target variable, while a low p-value indicates that the feature is significant and there is evidence of a real effect on the target variable.

The standard threshold is 0.05: a feature with a p-value above it is a candidate for removal from the model.

Along with the p-value, adjusted R-squared (Adj. R-squared in the summary) can be used as a performance parameter in backward elimination.

If adjusted R-squared improves after a feature is eliminated, that supports the conclusion that the feature is nonsignificant.
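To see why adjusted R-squared can improve when a feature is dropped, here is the standard formula as a small function. The numbers in the example are invented purely to show the penalty for extra features.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    where n is the number of observations and p the number of features."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Same R^2 of 0.75 on 101 observations, with 10 versus 11 features
with_10 = adjusted_r2(0.75, 101, 10)
with_11 = adjusted_r2(0.75, 101, 11)
print(round(with_10, 4), round(with_11, 4))
```

A feature that does not raise plain R-squared enough to offset this penalty lowers adjusted R-squared, so removing it raises the value again.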

The Stepwise Backward Elimination

1. Model Fitting

Use all features to build the model in the beginning.

I am using all 13 features and storing them in the X_opt array.

This will become the optimal array: in the end, we want X_opt to hold only the optimal set of features.

    import statsmodels.api as sm

    # adding an extra column of ones for the constant (intercept) term,
    # because sm.OLS does not add an intercept by itself
    X1 = np.append(arr=np.ones((490, 1)).astype(int), values=X, axis=1)

    # Optimal array
    X_opt = X1[:, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]]
    regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
    regressor_OLS.summary()

Here the model uses the OLS (Ordinary Least Squares) method from the statsmodels library.

I chose statsmodels OLS() over scikit-learn's LinearRegression() because LinearRegression() does not report the performance parameters (p-values, adjusted R-squared and so on).

It has no built-in function for them, so you would need to write one, whereas statsmodels provides all of them through the summary() method.


2. Prediction

The output is as below, and we can see that the x4 feature has the highest p-value.

[Figure: Performance Parameter Summary]

x4 refers to the ‘CHAS’ variable, and as we have seen in the heatmap, it has the lowest correlation with the target variable.

So we get a similar result from backward feature selection.

3. Remove Predictor

As x4 is identified as a nonsignificant feature, we will remove it from the array, and our optimal array X_opt is as below.

    X_opt = X1[:, [0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13]]

4. Fit the Model

The next step is to fit the model with the new optimised array and find out whether there are any more nonsignificant variables to remove.

    regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
    regressor_OLS.summary()

[Figure: Performance Parameter Summary]

As we can see, there are still features with a p-value above the threshold, so we will repeat the process.


2.1 Prediction

x3 has the highest p-value, which signifies that x3 is nonsignificant.

x3 is “INDUS”, the proportion of non-retail business acres per town.

Here backward elimination suggests eliminating it.

We can relate this to the observations from the heatmap: even though INDUS has a moderate correlation coefficient with the target variable, it is highly correlated with a few other features (multicollinearity).

If two variables are highly correlated, they carry much the same information about the target, so dropping one of them will not make much difference.

3.1 Remove Predictor

As x3 is identified as a nonsignificant feature, we will remove it from the array, and our optimal array X_opt is as below.

    X_opt = X1[:, [0, 1, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13]]

4.1 Fit the Model

Now we have to fit the model again with X_opt and see the results.

    regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
    regressor_OLS.summary()

[Figure: Performance Parameter Summary]

The summary table tells us that all the remaining features have p < 0.05, i.e. they are below the significance level.

So X_opt now contains only significant variables, and we have the desired features to build our final model.

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X_opt, y, test_size=0.25, random_state=0)

    from sklearn.linear_model import LinearRegression
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)

Now that we have built the model, let's predict the values for the test data and calculate the performance of the model.

    y_pred = regressor.predict(X_test)
    print("RMSE: %.2f" % np.sqrt(((y_pred - y_test) ** 2).mean()))

The RMSE is 4.18.

We can see there is a decrease in RMSE (from 4.24 to 4.18), which indicates that we have a slightly better model after eliminating the nonsignificant features.

However, the decrease is not large, and we still have scope to work on model performance.

There are a few disadvantages to this technique:

Done by hand, it is practical only when the number of features is small (at most about 15), and it is a time-consuming process.

One suggestion is to build a function for backward elimination instead of repeating the same process manually to get the optimised matrix.

Feature selection cannot guarantee improved performance.

Still it is good to use in some cases.

I hope you have got enough understanding of the implementation of Backward feature elimination.

If you need any help with the code or want to give any feedback, please use the comment section.

Thanks for reading! May peace be with you.
