Model-based feature importance

Vishal Singh, Jan 3

In an earlier post, I discussed a model-agnostic feature selection technique called forward feature selection, which extracts the features that matter most for optimizing a chosen KPI.

It had one caveat though — large time complexity.

To circumvent that issue, feature importance can be obtained directly from the model being trained.

In this post, I will consider 2 classification and 1 regression algorithms to explain model-based feature importance in detail.

Logistic Regression

An inherently binary classification algorithm, it tries to find the best hyperplane in k-dimensional space that separates the 2 classes, minimizing logistic loss.

[Figure: logistic loss expression]

The k-dimensional weight vector can be used to get feature importance.

Large positive values of w_j signify higher importance of the jth feature in the prediction of positive class.

Large negative values signify higher importance in the prediction of negative class.

This can be seen from the expression of logistic loss.
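For reference, a common form of the logistic loss is sketched below. It assumes labels encoded as y_i in {-1, +1}, an L2 penalty, and C multiplying the data term as in scikit-learn's formulation; the expression originally shown here may have used different notation.

```latex
\min_{w,\,b}\; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(w^{\top} x_i + b\right)\right)\right)
```

For a positive point (y_i = +1) each term shrinks as w^T x_i grows, so a large positive w_j on a feature that is large for positive points lowers the loss; the argument is mirrored for negative points.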

SGD reduces the loss by learning large positive weights for features that are more important in predicting that a data point belongs to the positive class, and large negative weights for features that favor the negative class.
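The idea of reading importance off the weight vector can be sketched as follows. This is a minimal illustration, not the code from the post: the data is synthetic (an assumption), and scikit-learn's LogisticRegression with its default solver stands in for an SGD-trained model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration, not the Glass dataset):
# feature 0 pushes towards the positive class, feature 1 towards the
# negative class, and feature 2 is pure noise.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 3))
logits = 6.0 * X[:, 0] - 6.0 * X[:, 1]
y = (logits + rng.normal(0.0, 0.5, size=500) > 0).astype(int)

model = LogisticRegression(C=10.0, max_iter=1000)
model.fit(X, y)

weights = model.coef_[0]                 # one weight per feature (binary case)
ranking = np.argsort(-np.abs(weights))   # most important first, by |w_j|

for j in ranking:
    direction = "positive" if weights[j] > 0 else "negative"
    print(f"feature {j}: w = {weights[j]:+.2f} (favours the {direction} class)")
```

Comparing weights this way is only meaningful when the features are on a common scale, which is why the features are scaled before the model is fitted.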

In order to illustrate the above concept, let us try to extract the top 4 features from the Glass Identification dataset, which has 9 attributes, namely:

Refractive index
% Na content
% Mg content
% Al content
% Si content
% K content
% Ca content
% Ba content
% Fe content

All are real-valued.

Entire code can be found here.

The objective is to predict the type of glass from amongst 7 classes, given the above features.

In order to keep things simple to understand, only points belonging to 2 classes are chosen.

This leaves us with a binary classification problem making logistic regression an ideal candidate to solve it.

The final dataset has 146 points, each having 9 attributes.

The objective is to predict whether a given glass composition is float-processed or non-float-processed.

EDA reveals the scale difference between features, hence min-max scaling is used to squash all feature values in the interval [0, 1].
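Min-max scaling can be sketched in a few lines of NumPy. The function name and the toy values below are made up for illustration; they are not taken from the dataset.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X to the interval [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Guard against constant columns to avoid division by zero.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

# Toy values on very different scales (made up for illustration).
X = np.array([[1.51, 13.0, 70.0],
              [1.52, 14.5, 73.0],
              [1.53, 12.0, 72.0]])
X_scaled = min_max_scale(X)
print(X_scaled)
```

In a train/test setting, the minimum and range would normally be computed on the training split only and then reused for the test split.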

Next, the best hyperparameter is found using a grid search on C (the term multiplying the logistic loss; find more here).

The best hyperparameter offers a log loss of 0.54 and an accuracy of about 70%.
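A grid search over C might look like the sketch below. The synthetic data, the candidate values of C and the 5-fold cross-validation are all assumptions made for illustration; the post does not state which grid or validation scheme was used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in with the same shape as the filtered Glass data
# (146 points, 9 features); the real data is not reproduced here.
X, y = make_classification(n_samples=146, n_features=9,
                           n_informative=4, random_state=0)

# Candidate values of C are an assumption; larger C weights the
# logistic loss more heavily relative to the regularization term.
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="neg_log_loss", cv=5)
search.fit(X, y)

best_C = search.best_params_["C"]
best_log_loss = -search.best_score_   # scikit-learn maximizes the negated loss
print(f"best C = {best_C}, cross-validated log loss = {best_log_loss:.3f}")
```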

Not bad considering the small amount of data that we have.

The most important features as found using parameters learned by SGD are enumerated here for convenience.

Random Forest Classifier

Random forest is an ensemble model using decision trees as base learners.

The base learners are high variance, low bias models.

The variance of the overall model is reduced by aggregating the decisions taken by all base learners to predict the response variable.

The idea is to ensure that each base learner learns a different aspect of data.

This is achieved via both row and column sampling.

In a classification setting the aggregation is done by taking a majority vote.

At each node of a decision tree, the feature to be used for splitting the dataset is decided based on information gain (I.G.) or the computationally cheaper Gini impurity reduction.

The feature that maximizes I.G. (or the reduction in Gini impurity) is selected as the splitting feature.

Data is then divided amongst its children according to the value of splitting feature.

If the feature is categorical, data belonging to each category of splitting feature goes to a separate child.

In the case of a numerical feature, the best threshold value of the feature (the one used to decide in favor of this feature to be used as splitting feature) is used to split data into two parts, each going to one child.
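The split criterion for a numerical feature can be sketched in plain Python. The helper names and the toy data are illustrative assumptions; real tree implementations are far more optimized.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(values, labels, threshold, impurity):
    """Impurity reduction from splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(labels) - weighted

def best_threshold(values, labels, impurity):
    """Threshold with the largest impurity reduction for this feature."""
    # The largest value is excluded: it would leave the right child empty.
    candidates = sorted(set(values))[:-1]
    return max(candidates, key=lambda t: split_gain(values, labels, t, impurity))

# Toy feature that separates the two classes perfectly (made up).
values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]

t = best_threshold(values, labels, entropy)
print(t, split_gain(values, labels, t, entropy), split_gain(values, labels, t, gini))
```

With `impurity=entropy` the gain is the information gain; with `impurity=gini` it is the Gini impurity reduction. On this toy data both criteria pick the same threshold.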

[Figure: f_j is chosen as the splitting feature]

The information gain due to a feature, summed across all the levels of the decision tree, determines its feature importance.

This can also be seen from the fact that at every node splitting is done on the feature which maximizes Information Gain.

Random forests comprise multiple decision trees, thus the feature importance of feature j is the normalized sum of the I.G. brought about by feature j across all trees.

Let us head back to the Glass Identification dataset and see what features are deemed important by Random Forest.

Random forest performs significantly better than logistic regression at solving this task.

It gives above 90% accuracy and a log loss of 0.22.

Scikit-learn’s random forest model has a feature_importances_ attribute that gives the Gini impurity reduction caused by each feature across all levels, averaged over the trees and normalized so that the values sum to 1.
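Reading these importances off a fitted forest can be sketched as below. The data is synthetic (an assumption): only feature 2 carries signal, so it should come out on top. The 33 trees mirror the value reported in the post; everything else is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption): the label depends on feature 2 only.
rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 4))
y = (X[:, 2] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=33, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_   # normalized, sums to 1
top = np.argsort(importances)[::-1]         # most important feature first

# The forest-level values are the per-tree importances averaged over trees.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

for j in top:
    print(f"feature {j}: importance = {importances[j]:.3f}")
```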

The only hyperparameter of interest here is the number of base learners.

A grid search is performed which gives 33 base learners as an optimal value.

Random forests seldom overfit; they usually saturate as the number of base learners increases, adding computational overhead without deteriorating performance.

The results have been summarised here.

Bayesian linear model

First, a little introduction to Bayesian models and how they differ from the conventional, very frequently used machine learning models (also called frequentist models) such as KNN, logistic regression, SVM, decision trees etc.

What most of us are used to is this:

[Figure: inference using a frequentist model]

Bayesian models, instead of predicting 1 output for a given input, give a distribution from which the output may be sampled.

A Bayesian model shows the amount of uncertainty in the response variable by giving its probability distribution.

Usually, the response variable is modeled as Gaussian.

[Figure: inference using a Bayesian model]

Here β (the model parameters or weight vector) is also sampled from distributions, one for each parameter.

As stated here, the objective is to determine the posterior probability distribution for the model parameters given the inputs, X, and outputs, y.

Courtesy: Will Koehrsen

Posterior probability distributions are estimated by taking samples using sampling methods like Markov Chain Monte Carlo (MCMC).

As more samples are drawn, the estimates converge to the true value.

Mean values of model parameters obtained from posterior distributions can be used to make point estimates about them and in turn about the mean of the normal distribution of the response variable.

Response variable can itself be estimated using its mean value.

The certainty of these estimates depends on individual distributions and increases with the amount of data fed to the model.
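How the posterior tightens with more data can be sketched without MCMC. The example below swaps in the closed-form posterior of a linear model with a zero-mean Gaussian prior on the weights and a known noise standard deviation; this is a simplifying assumption and not the sampling setup used in the post. The data, the prior width and the function names are illustrative.

```python
import numpy as np

def posterior(X, y, noise_std=1.0, prior_std=10.0):
    """Closed-form Gaussian posterior over weights for y = X w + noise.

    Assumes a N(0, prior_std^2 I) prior on w and known noise_std.
    """
    d = X.shape[1]
    precision = X.T @ X / noise_std**2 + np.eye(d) / prior_std**2
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_std**2
    return mean, cov

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.0])   # made-up ground truth

def simulate(n):
    """Draw n synthetic points from the assumed linear model."""
    X = rng.normal(size=(n, 3))
    y = X @ true_w + rng.normal(scale=1.0, size=n)
    return X, y

mean_small, cov_small = posterior(*simulate(20))
mean_large, cov_large = posterior(*simulate(2000))

std_small = np.sqrt(np.diag(cov_small))   # posterior std per weight, 20 points
std_large = np.sqrt(np.diag(cov_large))   # posterior std per weight, 2000 points

print("20 points:  ", mean_small.round(2), std_small.round(3))
print("2000 points:", mean_large.round(2), std_large.round(3))
```

The posterior means serve as point estimates of the weights, and the posterior standard deviations shrink as the number of points grows.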

The model weights can be interpreted in a manner similar to that of linear regression or logistic regression.

The higher the absolute value of a feature weight, the greater its importance.

If two or more features have similar weights, the one whose value is more certain, as indicated by its distribution, should be given higher importance, since the model is more confident about its value than it is about the others.
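One crude way to turn this rule into a ranking is sketched below. The bucketing tolerance, the function name and the posterior summaries are all hypothetical; the post does not prescribe a specific tie-breaking procedure.

```python
def rank_features(names, means, stds, tie_tol=0.1):
    """Rank by |posterior mean|; among similar weights, prefer lower std.

    Weights whose absolute means fall into the same bucket of width
    tie_tol are treated as "similar" (a heuristic, chosen for simplicity).
    """
    def sort_key(i):
        bucket = round(abs(means[i]) / tie_tol)
        return (-bucket, stds[i])   # bigger weight first, then more certain
    order = sorted(range(len(names)), key=sort_key)
    return [names[i] for i in order]

# Hypothetical posterior summaries for three features.
names = ["A", "B", "C"]
means = [1.52, -1.48, 0.30]
stds = [0.60, 0.10, 0.05]

ranking = rank_features(names, means, stds)
print(ranking)
```

Here A and B have similar absolute weights, so B is ranked first because its posterior is narrower; C comes last despite being the most certain, because its weight is much smaller.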

Using the above insights, let us examine which features are more important for predicting the published rating of an electronic device, given a bunch of information about it.

The dataset can be found here.

Using 2 chains and 2000 draws from the posterior, the following distributions were obtained:

[Figure: posterior distributions]

Since there is very little data (167 training points), we have a high standard deviation, showing high model uncertainty.

As the data increases, i.e. as we gather more evidence, the model estimate improves and eventually washes out the prior, which in this case is assumed to be Gaussian.

In the limit of infinite data, the Bayesian estimate converges to the frequentist one.

Mean weight values are used for getting feature importance.

That’s it for this post.

Do let me know other good techniques to get feature importance in the comments section.

Until next time…Farewell.







