Introduction to Data Preprocessing in Machine Learning

We can’t say that Blue < Green, as it doesn’t make sense to compare colors that have no inherent order. The important thing to note here is that we need to preprocess ordinal and nominal categorical variables differently.

Handling Ordinal Categorical Variables

First of all, we need to create a dataframe.

df_cat = pd.DataFrame(data=[['green', 'M', 10.1, 'class1'],
                            ['blue', 'L', 20.1, 'class2'],
                            ['white', 'M', 30.1, 'class1']])
df_cat.columns = ['color', 'size', 'price', 'classlabel']

Here the columns 'size' and 'classlabel' are ordinal categorical variables, whereas 'color' is a nominal categorical variable. There are two simple and neat techniques for transforming ordinal CVs.

1. Using the map() function

size_mapping = {'M': 1, 'L': 2}
df_cat['size'] = df_cat['size'].map(size_mapping)

Here M will be replaced with 1 and L with 2.

2. Using LabelEncoder

from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
df_cat['classlabel'] = class_le.fit_transform(df_cat['classlabel'].values)

Here class1 will be represented by 0 and class2 by 1.

Incorrect Way of Handling Nominal Categorical Variables

The biggest mistake most people make is failing to differentiate between ordinal and nominal CVs. If you use the same map() function or a LabelEncoder with nominal variables, the model will assume that there is some sort of relationship between the nominal values. For example, if we use map() to encode the colors,

col_mapping = {'Blue': 1, 'Green': 2}

then according to the model Green > Blue, which is again a senseless assumption, so the model will give you results based on this non-existent relationship. Although you will still get results with this method, they won't be optimal.

Correct Way of Handling Nominal Categorical Variables

The correct way to handle nominal CVs is One-Hot Encoding. The easiest way to apply One-Hot Encoding is the get_dummies() function.

pd.get_dummies(df_cat[['color', 'size', 'price']])

Here we have passed 'size' and 'price' along with 'color', but the
get_dummies() function is smart enough to consider only the string columns, so it will transform just the 'color' variable. Now, you must be wondering what exactly One-Hot Encoding is, so let's try to understand it.

One-Hot Encoding

In One-Hot Encoding, we essentially create n columns, where n is the number of unique values the nominal variable can take. For example, if color can take the values Blue, Green, and White, we create three new columns: color_blue, color_green, and color_white. If the color is green, then color_blue and color_white will be 0 and color_green will be 1. So out of the n columns, exactly one column has the value 1 and all the rest are 0.

One-Hot Encoding is a neat trick, but there is one problem associated with it: Multicollinearity. As it is a heavy-sounding word, you might assume it must be difficult to understand, so let me just validate your newly formed belief: multicollinearity is indeed a slightly tricky but extremely important concept in statistics. The good thing is that we don't need to understand all its nitty-gritty details; we just need to focus on how it affects our model. So let's dive into the concept of Multicollinearity and its impact.

Multicollinearity and Its Impact

Multicollinearity occurs in a dataset when we have features that are strongly dependent on each other. For example, in this case the features color_blue, color_green, and color_white are all dependent on each other, and that can affect our model. The main impact is that it can cause the decision boundary to change, which can have a huge effect on the model's results. In addition, if we have multicollinearity in our dataset, we can no longer use the weight vector to calculate feature importance. I think this much information is
enough in the context of Machine Learning; however, if you are still not convinced, you can visit the link below to understand the maths and logic behind multicollinearity.

12.1 - What is Multicollinearity? | STAT 501 (newonlinecourses.science.psu.edu): "As stated in the lesson overview, multicollinearity exists whenever two or more of the predictors in a regression model…"

Now that we have understood what multicollinearity is, let's try to understand how to identify it. The easiest method is to plot a pairplot and observe the relationships between the features. If you see a linear relationship between two features, they are strongly correlated with each other and there is multicollinearity in your dataset.

[Figure: pair plot of the features]

Here (Weight, BP) and (BSA, BP) are closely related. You can also use the correlation matrix to check how closely related the features are.

[Figure: correlation matrix]

We can observe that there is a strong correlation (0.950) between Weight and BP, and also between BSA and BP (0.875).

Simple Hack to Avoid Multicollinearity

We can use drop_first=True in order to avoid the problem of multicollinearity.

pd.get_dummies(df_cat[['color', 'size', 'price']], drop_first=True)

Here drop_first will drop the first dummy column for color, so color_blue will be dropped and we will only have color_green and color_white. The important thing to note is that we don't lose any information: if color_green and color_white are both 0, the color must have been blue. So we can infer all the information from these two columns alone, and the strong correlation between the three dummy columns is broken.
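Putting the snippets above together, here is a minimal runnable sketch of the whole pipeline, using the article's own df_cat example. The column names, mappings, and function calls all come from the text; only the variable names onehot and onehot_df are mine.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# The example dataframe from the article
df_cat = pd.DataFrame(
    data=[['green', 'M', 10.1, 'class1'],
          ['blue', 'L', 20.1, 'class2'],
          ['white', 'M', 30.1, 'class1']],
    columns=['color', 'size', 'price', 'classlabel'])

# Ordinal: map() encodes 'size' with its natural order (M < L)
size_mapping = {'M': 1, 'L': 2}
df_cat['size'] = df_cat['size'].map(size_mapping)

# Ordinal: LabelEncoder encodes 'classlabel' (class1 -> 0, class2 -> 1)
class_le = LabelEncoder()
df_cat['classlabel'] = class_le.fit_transform(df_cat['classlabel'].values)

# Nominal: get_dummies() expands only the string column 'color';
# the numeric 'size' and 'price' columns pass through unchanged
onehot = pd.get_dummies(df_cat[['color', 'size', 'price']])
print(sorted(onehot.columns))
# ['color_blue', 'color_green', 'color_white', 'price', 'size']

# drop_first=True drops color_blue; blue is still recoverable as
# color_green == 0 and color_white == 0, breaking the collinearity
onehot_df = pd.get_dummies(df_cat[['color', 'size', 'price']],
                           drop_first=True)
print(sorted(onehot_df.columns))
# ['color_green', 'color_white', 'price', 'size']
```

One caveat: recent pandas versions return the dummy columns as booleans rather than 0/1 integers; pass dtype=int to get_dummies() if you want the 0/1 form described in the text.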