If the former is the case, we need to filter variables based on their predictive power or influence on the target variable.
Point Biserial Correlation
If the target variable is binary, point biserial correlation is one good way to select variables.
A point-biserial correlation is used to measure the strength and direction of the association that exists between one continuous variable and one dichotomous variable.
It is a special case of Pearson’s product-moment correlation, which is normally applied to two continuous variables; here, one of the variables is a nominal binary variable.
Like other correlation coefficients, the point biserial ranges from -1 to 1, where 0 is no relationship and -1 or 1 is a perfect negative or positive relationship.
We calculate this coefficient for every continuous variable with the target variable and then select the most highly correlated variables based on a threshold on the absolute value of the coefficient (say, 0.5).
Using this, I was able to filter out 50 variables from around 15,000 variables in my dataset.
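The filtering step described above can be sketched as follows. This is a minimal illustration, assuming SciPy's `pointbiserialr` and a pandas DataFrame of continuous features; the helper name `select_by_point_biserial`, the toy data and the 0.5 threshold are hypothetical and only meant to show the shape of the computation.

```python
import numpy as np
import pandas as pd
from scipy.stats import pointbiserialr


def select_by_point_biserial(X, y, threshold=0.5):
    """Return the continuous columns whose |r_pb| with binary y meets the threshold."""
    selected = []
    for col in X.columns:
        # pointbiserialr returns (correlation, p-value); only the correlation is used here
        r, _ = pointbiserialr(y, X[col])
        if abs(r) >= threshold:
            selected.append(col)
    return selected


# Toy data, made up for illustration: one informative column, one pure-noise column
rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, size=200))
X = pd.DataFrame({
    "signal": y * 2.0 + rng.normal(0, 0.5, size=200),
    "noise": rng.normal(0, 1, size=200),
})

selected = select_by_point_biserial(X, y, threshold=0.5)
print(selected)
```

Comparing the absolute value against the threshold keeps strongly negative associations as well as strongly positive ones.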
Chi Square
The next technique is the chi-squared test, which is used to test the independence of two events.
Given data for two events, we can compute the observed and expected counts, and the test measures how much the two deviate from each other.
The Chi Square statistic is commonly used for testing relationships between categorical variables.
The null hypothesis of the Chi-Square test is that no relationship exists between the categorical variables in the population.
You can read more about how to conduct the test here : https://www.
So I won't go into the details of the test, but the idea is to use it to filter the categorical independent variables.
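As a rough sketch of that idea, the snippet below runs a chi-squared test of independence between each categorical column and the target using SciPy's `chi2_contingency` on a contingency table. The helper name `select_by_chi_square`, the toy data and the choice to filter on a p-value cutoff `alpha` (rather than on the raw statistic) are assumptions made for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def select_by_chi_square(X_cat, y, alpha=0.05):
    """Keep categorical columns for which independence from y is rejected at level alpha."""
    selected = []
    for col in X_cat.columns:
        # Observed counts of each (category, target class) pair
        table = pd.crosstab(X_cat[col], y)
        chi2, p_value, dof, expected = chi2_contingency(table)
        if p_value < alpha:
            selected.append(col)
    return selected


# Toy data, made up for illustration
y = pd.Series([0] * 50 + [1] * 50, name="target")
X_cat = pd.DataFrame({
    # "related" mostly follows the target; "unrelated" is split evenly within each class
    "related": ["a"] * 45 + ["b"] * 5 + ["a"] * 5 + ["b"] * 45,
    "unrelated": ["x", "y"] * 50,
})

selected = select_by_chi_square(X_cat, y)
print(selected)
```

A small p-value means the observed counts deviate from what independence would predict, so the column is kept as potentially informative.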
Feature Variance
Another technique used to filter out variables that won’t add value to our model is the variance threshold method, which removes all features whose variance does not meet some threshold.
This method can be used before any of the above methods as a starting point and doesn’t really have much to do with the target variable.
Generally, it removes all the zero-variance features, that is, all the features that have the same value in every sample.
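One way to apply this, assuming scikit-learn is available, is its `VarianceThreshold` transformer. The toy DataFrame and the second threshold value of 0.2 below are made up for illustration.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data, made up for illustration
X = pd.DataFrame({
    "constant": [1, 1, 1, 1],         # variance 0
    "almost_constant": [0, 0, 0, 1],  # variance 0.1875
    "varied": [1, 5, 3, 9],           # variance 8.75
})

# threshold=0.0 drops only the features that take a single value
selector = VarianceThreshold(threshold=0.0)
selector.fit(X)
kept = X.columns[selector.get_support()].tolist()

# A higher threshold also drops features that barely vary
strict_selector = VarianceThreshold(threshold=0.2)
strict_selector.fit(X)
kept_strict = X.columns[strict_selector.get_support()].tolist()

print(kept)
print(kept_strict)
```

Because variance depends on the scale of each feature, a non-zero threshold is easier to reason about when the features share comparable units.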
Feature Importance
We can use the feature importance attribute of a tree-based classifier to get the importance of each feature with respect to the target variable.
We can then select the top n features based on feature importance, which often gives good results with whichever classification algorithm we might want to use.
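A minimal sketch of this step, assuming scikit-learn's `RandomForestClassifier` as the tree-based model (any estimator exposing `feature_importances_` would do). The synthetic dataset, the column names and `top_n = 3` are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, made up for illustration: 8 features, of which 3 are informative
X_arr, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances, one value per feature
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

top_n = 3
top_features = importances.head(top_n).index.tolist()
print(importances)
print(top_features)
```

Impurity-based importances can favour high-cardinality features, so it is worth sanity-checking the ranking before relying on it.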
Linear Discriminant Analysis
Next, let us look at some transformation techniques, where we cannot retain our original variables but may get a much higher accuracy.
LDA and PCA are two such techniques, and although PCA is known to perform well in a lot of cases, we will mostly focus on LDA and its application.
But first, let's quickly look at some of the main differences between LDA and PCA.
Linear Discriminant Analysis often outperforms PCA in a multi-class classification task when the class labels are known.
In some of these cases, however, PCA performs better.
This is usually when the sample size for each class is relatively small.
A good example is the comparison of classification accuracies in image recognition.
Linear Discriminant Analysis projects a high-dimensional dataset onto a lower-dimensional space.
The goal is to do this while keeping a decent separation between classes and reducing computing resources and costs.
We can use the explained variance ratio attribute of the LDA class to decide the number of components which best suits our needs.
If we are not interested in retaining the original feature set, LDA provides a nice way to transform our data into a new space with a chosen number of dimensions, which can be a better input for a classification algorithm.
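To make the choice of components concrete, here is a minimal sketch that reads the explained variance ratio from scikit-learn's `LinearDiscriminantAnalysis`. The wine dataset and the 95% cumulative target are assumptions chosen for illustration, not values from the discussion above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Example data: 178 samples, 13 features, 3 classes
X, y = load_wine(return_X_y=True)

# LDA yields at most (number of classes - 1) components, so 2 here
lda = LinearDiscriminantAnalysis()
X_lda = lda.fit_transform(X, y)

# Share of between-class variance captured by each discriminant component
ratios = lda.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches the target
target = 0.95
n_components = int(np.argmax(cumulative >= target)) + 1

lda_reduced = LinearDiscriminantAnalysis(n_components=n_components)
X_reduced = lda_reduced.fit_transform(X, y)
print(ratios, n_components, X_reduced.shape)
```

The transformed array `X_reduced` can then be passed to a classifier in place of the original features.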
Now, let's put together all the techniques we discussed for the non-transformation case as a series of steps in code.
The function variable_selection returns a new dataframe containing only the columns selected after passing through the various filters.
The inputs to the function such as variance threshold, point biserial coefficient threshold, chi-square statistics threshold and feature importance threshold should be chosen carefully to get the desired output.
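A minimal sketch of how these filters might be chained, assuming pandas, SciPy and scikit-learn. The name `variable_selection_sketch`, its parameters, the default thresholds and the toy data are illustrative assumptions rather than the original function; in particular, the chi-square step here filters on a p-value cutoff, and categorical columns are integer-coded for the tree model.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, pointbiserialr
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold


def variable_selection_sketch(df, target, cont_cols, cat_cols,
                              var_threshold=0.0, pb_threshold=0.5,
                              chi2_alpha=0.05, importance_threshold=0.05):
    """Hypothetical pipeline: variance -> point biserial -> chi-square -> importance."""
    y = df[target]

    # 1. Drop continuous columns whose variance does not exceed the threshold
    vt = VarianceThreshold(threshold=var_threshold).fit(df[cont_cols])
    cont_kept = [c for c, keep in zip(cont_cols, vt.get_support()) if keep]

    # 2. Keep continuous columns with |point-biserial r| at or above the threshold
    cont_kept = [c for c in cont_kept
                 if abs(pointbiserialr(y, df[c])[0]) >= pb_threshold]

    # 3. Keep categorical columns for which independence from the target is rejected
    cat_kept = [c for c in cat_cols
                if chi2_contingency(pd.crosstab(df[c], y))[1] < chi2_alpha]

    candidates = cont_kept + cat_kept
    if not candidates:
        return df[[]]

    # 4. Fit a tree-based model on the survivors and filter by importance
    encoded = df[candidates].copy()
    for c in cat_kept:
        encoded[c] = encoded[c].astype("category").cat.codes
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(encoded, y)
    selected = [c for c, imp in zip(candidates, model.feature_importances_)
                if imp >= importance_threshold]
    return df[selected]


# Toy data, made up for illustration
rng = np.random.default_rng(0)
target_values = np.array([0] * 50 + [1] * 50)
df = pd.DataFrame({
    "const": np.ones(100),
    "signal": target_values * 2.0 + rng.normal(0, 0.5, size=100),
    "noise": rng.normal(0, 1, size=100),
    "cat_related": ["a"] * 45 + ["b"] * 5 + ["a"] * 5 + ["b"] * 45,
    "cat_unrelated": ["x", "y"] * 50,
    "target": target_values,
})

result = variable_selection_sketch(
    df, "target",
    cont_cols=["const", "signal", "noise"],
    cat_cols=["cat_related", "cat_unrelated"],
)
print(result.columns.tolist())
```

Running the variance filter first keeps constant columns out of the correlation step, where they would produce undefined coefficients.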
Note that this is just a framework for variable selection, which is essentially a manual exercise for the most part.
I wrote this code because I felt a lot of the process can be automated or semi-automated to save time.
My advice is to use this function as a starting point and make whatever changes you deem fit for a particular dataset and problem statement and then use it.
Good luck!