Feature Selection with sklearn and Pandas

Introduction to Feature Selection methods and their implementation in Python

Abhini Shetye, Feb 10

Feature selection is one of the first and most important steps in any machine learning task.

A feature, in the context of a dataset, simply means a column.

When we get a dataset, not every column (feature) necessarily has an impact on the output variable.

If we add these irrelevant features to the model, they will only make the model worse (Garbage In, Garbage Out).

This is why feature selection is needed.

When it comes to implementing feature selection in Pandas, numerical and categorical features have to be treated differently.

Here we will first discuss numeric feature selection.

Hence before implementing the following methods, we need to make sure that the DataFrame only contains Numeric features.

Also, the following methods are discussed for a regression problem, which means both the input and output variables are continuous in nature.
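One way to make sure a DataFrame holds only numeric columns is pandas' select_dtypes. The sketch below uses a small made-up DataFrame; the column names are hypothetical and are not taken from the Boston dataset.

```python
import pandas as pd

# Hypothetical toy data: two numeric columns and one categorical column.
raw = pd.DataFrame({
    "rooms": [3, 4, 2],
    "zone": ["north", "south", "north"],
    "price": [21.5, 34.0, 18.2],
})

# Keep only the numeric columns; the text column "zone" is left out.
numeric_only = raw.select_dtypes(include="number")
print(list(numeric_only.columns))
```

Categorical columns dropped this way are not lost for good; they simply need a different selection technique than the ones covered here.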

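Feature selection can be done in multiple ways, but there are broadly 3 categories of it:

1. Filter Method
2. Wrapper Method
3. Embedded Method

About the dataset:

We will be using the built-in Boston dataset, which can be loaded through sklearn. (Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the code below needs an older release.)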

We will be selecting features using the above listed methods for the regression problem of predicting the “MEDV” column.

In the following code snippet, we will import all the required libraries and load the dataset.

    #importing libraries
    from sklearn.datasets import load_boston
    import pandas as pd
    import numpy as np
    import matplotlib
    import matplotlib.pyplot as plt
    import seaborn as sns
    import statsmodels.api as sm
    %matplotlib inline
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso

    #Loading the dataset
    x = load_boston()
    df = pd.DataFrame(x.data, columns = x.feature_names)
    df["MEDV"] = x.target
    X = df.drop("MEDV", axis=1)   #Feature Matrix
    y = df["MEDV"]                #Target Variable
    df.head()

1. Filter Method:

As the name suggests, in this method you filter the features and keep only the relevant subset.

The model is built after selecting the features.

The filtering here is done using a correlation matrix, most commonly with Pearson correlation.

Here we will first plot the Pearson correlation heatmap and see the correlation of independent variables with the output variable MEDV.
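For reference, the Pearson correlation coefficient between a feature x and the output y over n samples is given by the standard formula below, where \bar{x} and \bar{y} denote the sample means. This is the quantity that pandas' corr() computes by default.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```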

We will only select features which have an absolute correlation above 0.5 with the output variable.

The correlation coefficient takes values between -1 and 1:

- A value closer to 0 implies weaker correlation (exactly 0 implying no linear correlation)
- A value closer to 1 implies stronger positive correlation
- A value closer to -1 implies stronger negative correlation

    #Using Pearson Correlation
    plt.figure(figsize=(12,10))
    cor = df.corr()
    sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
    plt.show()

    #Correlation with output variable
    cor_target = abs(cor["MEDV"])
    #Selecting highly correlated features
    relevant_features = cor_target[cor_target>0.5]
    relevant_features

As we can see, only the features RM, PTRATIO and LSTAT are highly correlated with the output variable MEDV.

Hence we will drop all other features apart from these.

However this is not the end of the process.

One of the assumptions of linear regression is that the independent variables are not highly correlated with each other (no multicollinearity).

If some of these variables are highly correlated with each other, we need to keep only one of them and drop the rest.

So let us check the correlation of selected features with each other.

This can be done either by visually checking it from the above correlation matrix or from the code snippet below.

    print(df[["LSTAT","PTRATIO"]].corr())
    print(df[["RM","LSTAT"]].corr())

From the above code, it is seen that the variables RM and LSTAT are highly correlated with each other (-0.613808).

Hence we would keep only one variable and drop the other.

We will keep LSTAT since its correlation with MEDV is higher than that of RM.

After dropping RM, we are left with two features, LSTAT and PTRATIO.

These are the final features given by Pearson correlation.
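The two filtering steps above (keep features correlated with the target, then drop features that are redundant with each other) can be sketched as one small function. This is only an illustration on synthetic data: the helper name filter_select, the column names, and the 0.6 cut-off for mutual correlation are assumptions made for the example, not part of the walkthrough above.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration, not the Boston dataset).
rng = np.random.default_rng(0)
n = 1000
a = rng.normal(size=n)
d = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.5 * rng.normal(size=n),   # near-duplicate of "a"
    "c": rng.normal(size=n),             # unrelated to the target
    "d": d,
})
toy["target"] = 2 * a + 2 * d + 0.5 * rng.normal(size=n)

def filter_select(df, target, min_target_corr=0.5, max_mutual_corr=0.6):
    """Keep features correlated with the target, then drop redundant ones."""
    cor = df.corr().abs()
    # Step 1: features whose |correlation| with the target exceeds the threshold.
    cor_target = cor[target].drop(target)
    candidates = cor_target[cor_target > min_target_corr]
    # Step 2: visit candidates from strongest to weakest and skip any that is
    # too correlated with a feature that has already been kept.
    kept = []
    for name in candidates.sort_values(ascending=False).index:
        if all(cor.loc[name, k] <= max_mutual_corr for k in kept):
            kept.append(name)
    return kept

selected = filter_select(toy, "target")
print(selected)
```

Here "c" fails the first step, and "b" passes it but is then discarded because it is nearly a copy of "a", which mirrors the choice between RM and LSTAT made above.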

2. Wrapper Method:

A wrapper method needs one machine learning algorithm and uses its performance as the evaluation criterion.

This means you feed the features to the selected machine learning algorithm and, based on the model performance, you add or remove features.

This is an iterative and computationally expensive process, but it is usually more accurate than the filter method.

There are different wrapper methods such as Backward Elimination, Forward Selection, Bidirectional Elimination and RFE.

We will discuss Backward Elimination and RFE here.
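Forward Selection is not covered further in this article, but for contrast, here is a minimal sketch of it using scikit-learn's SequentialFeatureSelector (available from scikit-learn 0.24). The synthetic data and the choice of three features are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption): only the first three columns drive y.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 6))
y_toy = (3 * X_toy[:, 0] - 2 * X_toy[:, 1] + 1.5 * X_toy[:, 2]
         + 0.1 * rng.normal(size=200))

# Start from no features and greedily add the one that most improves the
# cross-validated score, until three features have been chosen.
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
)
sfs.fit(X_toy, y_toy)
chosen = np.flatnonzero(sfs.get_support())
print(chosen)
```

Setting direction="backward" instead starts from all features and removes them one at a time, judged by the same cross-validated score.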

i. Backward Elimination

As the name suggests, we feed all the possible features to the model at first.

We check the performance of the model and then iteratively remove the worst performing features one by one until the overall performance of the model is in an acceptable range.

The performance metric used here to evaluate feature performance is the p-value.

If the p-value is above 0.05 then we remove the feature, else we keep it.

We will first run one iteration here just to get an idea of the concept and then we will run the same code in a loop, which will give the final set of features.

Here we are using the OLS model, which stands for “Ordinary Least Squares”.

This model is used for performing linear regression.

    #Adding constant column of ones, mandatory for sm.OLS model
    X_1 = sm.add_constant(X)
    #Fitting sm.OLS model
    model = sm.OLS(y,X_1).fit()
    model.pvalues

As we can see, the variable ‘AGE’ has the highest p-value of 0.9582293, which is greater than 0.05.

Hence we will remove this feature and build the model once again.

This is an iterative process and can be performed in one go with the help of a loop.

This approach is implemented below, and gives the final set of variables: CRIM, ZN, CHAS, NOX, RM, DIS, RAD, TAX, PTRATIO, B and LSTAT.

    #Backward Elimination
    cols = list(X.columns)
    pmax = 1
    while (len(cols)>0):
        p = []
        X_1 = X[cols]
        X_1 = sm.add_constant(X_1)
        model = sm.OLS(y,X_1).fit()
        p = pd.Series(model.pvalues.values[1:],index = cols)
        pmax = max(p)
        feature_with_p_max = p.idxmax()
        if(pmax>0.05):
            cols.remove(feature_with_p_max)
        else:
            break
    selected_features_BE = cols
    print(selected_features_BE)

Output:

    ['CRIM', 'ZN', 'CHAS', 'NOX', 'RM', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

ii. RFE (Recursive Feature Elimination)

The Recursive Feature Elimination (RFE) method works by recursively removing attributes and building a model on those attributes that remain.

It uses the importance the fitted model assigns to each feature (for a linear model, the magnitude of its coefficient) to rank the features, and drops the least important ones at each step.

The RFE method takes the model to be used and the number of required features as input.

It then gives the ranking of all the variables, 1 being most important.

It also gives its support, True marking a selected feature and False an eliminated one.
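To make the elimination loop concrete, here is a rough sketch of the idea written by hand and then checked against scikit-learn's RFE. The synthetic data is an assumption for illustration, and the hand-written loop is a simplification of the idea rather than the library's actual implementation.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption): columns 0-2 matter, columns 3-5 are noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 6))
y_toy = (3 * X_toy[:, 0] - 2 * X_toy[:, 1] + 1.5 * X_toy[:, 2]
         + 0.1 * rng.normal(size=200))

# Hand-rolled elimination: refit, drop the feature with the smallest
# absolute coefficient, and repeat until three features remain.
remaining = list(range(X_toy.shape[1]))
while len(remaining) > 3:
    fitted = LinearRegression().fit(X_toy[:, remaining], y_toy)
    weakest = int(np.argmin(np.abs(fitted.coef_)))
    del remaining[weakest]
print(remaining)

# The same selection through scikit-learn's RFE.
rfe_toy = RFE(LinearRegression(), n_features_to_select=3).fit(X_toy, y_toy)
print(rfe_toy.support_)
```

Comparing coefficient magnitudes is only meaningful when the features are on similar scales, which holds for this toy data but not necessarily for a real dataset.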

    model = LinearRegression()
    #Initializing RFE model
    rfe = RFE(model, n_features_to_select=7)
    #Transforming data using RFE
    X_rfe = rfe.fit_transform(X,y)
    #Fitting the data to model
    model.fit(X_rfe,y)
    print(rfe.support_)
    print(rfe.ranking_)

Output:

    [False False False True True True False True True False True False True]
    [2 4 3 1 1 1 7 1 1 5 1 6 1]

Here we took a LinearRegression model with 7 features and RFE gave the feature ranking above, but the choice of the number 7 was arbitrary.

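Now we need to find the optimum number of features, for which the score on held-out data (the R² returned by model.score) is the highest.

We do that with a loop starting with 1 feature and going up to 12 (np.arange(1,13) stops at 12, so the full set of 13 features is not tried).

We then take the number for which the score is highest.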

    #no of features
    nof_list = np.arange(1,13)
    high_score = 0
    #Variable to store the optimum features
    nof = 0
    score_list = []
    for n in range(len(nof_list)):
        X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 0)
        model = LinearRegression()
        rfe = RFE(model, n_features_to_select=nof_list[n])
        X_train_rfe = rfe.fit_transform(X_train,y_train)
        X_test_rfe = rfe.transform(X_test)
        model.fit(X_train_rfe,y_train)
        score = model.score(X_test_rfe,y_test)
        score_list.append(score)
        if(score>high_score):
            high_score = score
            nof = nof_list[n]
    print("Optimum number of features: %d" %nof)
    print("Score with %d features: %f" % (nof, high_score))

Output:

    Optimum number of features: 10
    Score with 10 features: 0.663581

As seen from the above code, the optimum number of features is 10.

We now feed 10 as the number of features to RFE and get the final set of features given by the RFE method, as follows:

    cols = list(X.columns)
    model = LinearRegression()
    #Initializing RFE model
    rfe = RFE(model, n_features_to_select=10)
    #Transforming data using RFE
    X_rfe = rfe.fit_transform(X,y)
    #Fitting the data to model
    model.fit(X_rfe,y)
    temp = pd.Series(rfe.support_,index = cols)
    selected_features_rfe = temp[temp==True].index
    print(selected_features_rfe)

Output:

    Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'DIS', 'RAD', 'PTRATIO', 'LSTAT'], dtype='object')

3. Embedded Method

Embedded methods are iterative in the sense that they take care of each iteration of the model training process and carefully extract those features which contribute the most to the training for a particular iteration.

Regularization methods are the most commonly used embedded methods; they add a penalty on the size of the feature coefficients to the training objective.

Here we will do feature selection using Lasso regularization.

If a feature is irrelevant, lasso penalizes its coefficient and shrinks it to 0.

Hence the features with coefficient = 0 are removed and the rest are taken.
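The zeroing behaviour is easy to see on a small example. The sketch below fits scikit-learn's Lasso on synthetic data where only three of six columns influence the target; the data and the fixed alpha of 0.2 are assumptions chosen for illustration, not tuned values.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (an assumption): columns 0-2 matter, columns 3-5 are noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 6))
y_toy = (3 * X_toy[:, 0] - 2 * X_toy[:, 1] + 1.5 * X_toy[:, 2]
         + 0.1 * rng.normal(size=200))

# alpha sets the strength of the L1 penalty; 0.2 is an arbitrary choice here.
lasso_toy = Lasso(alpha=0.2).fit(X_toy, y_toy)

# Features whose coefficient was shrunk exactly to zero are the ones removed.
kept = np.flatnonzero(lasso_toy.coef_ != 0)
dropped = np.flatnonzero(lasso_toy.coef_ == 0)
print("kept:", kept, "dropped:", dropped)
```

A larger alpha removes more features and a smaller one removes fewer, which is why the strength of the penalty is usually chosen by cross-validation rather than fixed by hand.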

    reg = LassoCV()
    reg.fit(X, y)
    print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
    print("Best score using built-in LassoCV: %f" %reg.score(X,y))
    coef = pd.Series(reg.coef_, index = X.columns)

    print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")

    imp_coef = coef.sort_values()
    import matplotlib
    matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
    imp_coef.plot(kind = "barh")
    plt.title("Feature importance using Lasso Model")

Here the Lasso model has taken all the features except NOX, CHAS and INDUS.

Conclusion:

We saw how to select features using multiple methods for numeric data and compared their results.

This raises the question of which method to choose in which situation.

The following points will help you make this decision.

The filter method is less accurate.

It is great while doing EDA, and it can also be used for checking multicollinearity in data.

Wrapper and Embedded methods give more accurate results, but as they are computationally expensive, these methods are suited to cases where you have fewer features (~20).

In the next blog we will have a look at some more feature selection methods for selecting numerical as well as categorical features.