Why using a mean for missing data is a bad idea.
Alternative imputation algorithms.
Kacper Kubara, Jun 24
Photo by Franki Chamaki on Unsplash
We all know the pain when the dataset we want to use for Machine Learning contains missing data.
The quick and easy workaround is to substitute a mean for numerical features and use a mode for categorical ones.
Some people go even further and just insert 0's or discard the data before training the model.
In this article, I will explain why using a mean or mode can significantly reduce the model’s accuracy and bias the results.
I will also point you to a few alternative imputation algorithms, each with a Python library that you can use out of the box.
A key fact to note is that the drawbacks of using a mean apply when the missing data is MAR (Missing At Random).
A great explanation of what MAR, MCAR and MNAR are can be found here.
Mean and mode ignore feature correlations
Let’s have a look at a very simple example to visualize the problem.
The following table has 3 variables: Age, Gender and Fitness Score.
It shows Fitness Score results (0–10) achieved by people of different ages and genders.
Table with correct, non-missing data
Now let’s assume that some of the data in Fitness Score is actually missing, so that after using mean imputation we can compare the results in both tables.
Mean Imputation of the Fitness_Score
The imputed values don’t really make sense; in fact, they can have a negative effect on accuracy when training our ML model.
For example, a 78-year-old woman now has a Fitness Score of 5.1, which is typical for people aged between 42 and 60.
Mean imputation doesn’t take into account the fact that Fitness Score is correlated with the Age and Gender features.
It only inserts 5.1, the mean of the Fitness Score, while ignoring potential feature correlations.
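To make the problem concrete, here is a minimal sketch of mean imputation in plain Python. The records are invented for illustration (they are not the table from this article), and the names `records`, `fill_value` and `imputed` are hypothetical. Every gap receives the same number, regardless of the person's age or gender.

```python
from statistics import mean

# Hypothetical records (age, gender, fitness score); None marks a missing score.
# These values are made up for illustration and are not the article's table.
records = [
    (20, "M", 9.0),
    (25, "F", 8.0),
    (45, "M", 5.0),
    (50, "F", None),
    (70, "M", 2.0),
    (78, "F", None),
]

# The fill value is computed from the observed scores only;
# Age and Gender play no part in it.
observed = [score for _, _, score in records if score is not None]
fill_value = mean(observed)  # 6.0 for this toy data

# Replace every missing score with the same global mean.
imputed = [
    (age, gender, score if score is not None else fill_value)
    for age, gender, score in records
]

for age, gender, score in imputed:
    print(age, gender, score)
```

In this toy data the 50-year-old and the 78-year-old end up with an identical score, which is exactly the loss of correlation described above.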
Mean reduces the variance of the data
Based on the previous example, the variance of the real Fitness Score and of its mean-imputed equivalent will differ.
The figure below presents the variance in those two cases:
Fitness Score variance of the real and mean imputed data
As we can see, the variance was reduced after using mean imputation (the change is that big because the dataset is very small).
Going deeper into the mathematics, a smaller variance leads to a narrower confidence interval in the probability distribution.
This does nothing but introduce a bias into our model.
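The shrinkage is easy to reproduce. The sketch below uses invented scores (not the article's data) and hypothetical variable names. It compares the sample variance of the complete data, of the observed values only, and of the mean-imputed column.

```python
from statistics import mean, variance

# Hypothetical "true" scores, invented for illustration only.
true_scores = [9.0, 8.0, 7.0, 5.0, 4.0, 2.0, 1.0, 4.0]
missing_idx = {1, 4, 6}  # positions we pretend were never recorded

# Keep only the values we would actually have observed.
observed = [s for i, s in enumerate(true_scores) if i not in missing_idx]
fill_value = mean(observed)

# Mean imputation: every gap gets the mean of the observed values.
imputed = [fill_value if i in missing_idx else s
           for i, s in enumerate(true_scores)]

var_true = variance(true_scores)   # sample variance of the complete data
var_observed = variance(observed)  # sample variance of what we observed
var_imputed = variance(imputed)    # sample variance after mean imputation

print(var_true, var_observed, var_imputed)
```

The imputed points sit exactly on the mean, so they add nothing to the sum of squared deviations while increasing the sample count. That is why the variance of the imputed column is always lower than the variance of the observed values.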
Alternative Imputation Algorithms
Fortunately, there are a lot of brilliant alternatives to mean and mode imputation.
Many of them are based on existing Machine Learning algorithms.
The following list briefly describes the most popular methods, as well as a few lesser-known imputation techniques.
MICE (Multivariate Imputation by Chained Equations)
According to , it is the second most popular imputation method, right after the mean.
Initially, a simple imputation (e.g. mean) is performed to replace the missing data for each variable, and we also note the positions of the missing values in the dataset.
Then, we take each feature in turn and predict its missing data with a Regression model.
The remaining features are used as independent variables (predictors) for our Regression model.
The process is iterated multiple times, updating the imputed values on each pass.
A common number of iterations is 10, but it depends on the dataset.
A more detailed explanation of the algorithm can be found here.
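The steps above can be sketched with NumPy. This is a simplified, deterministic illustration that assumes numeric features and plain linear regression. Full MICE implementations typically add randomness to the predictions and produce several imputed datasets. The function name `mice_impute` and the demo data are hypothetical.

```python
import numpy as np

def mice_impute(X, n_iter=10):
    """Chained-equations sketch: X is a 2-D float array, np.nan marks gaps."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)  # note the positions of the missing values
    # Step 1: simple imputation, here the column mean.
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue  # nothing to impute in this feature
            # The remaining features act as predictors (plus an intercept).
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(others)), others])
            # Fit on rows where feature j was observed.
            coef, *_ = np.linalg.lstsq(A[~rows], filled[~rows, j], rcond=None)
            # Overwrite only the originally missing entries.
            filled[rows, j] = A[rows] @ coef
    return filled

# Hypothetical data: column 0 is age, column 1 is a score that falls with age.
X_demo = np.array([
    [20.0, 8.0],
    [30.0, 7.0],
    [40.0, 6.0],
    [50.0, np.nan],
    [60.0, 4.0],
    [70.0, np.nan],
])
mice_result = mice_impute(X_demo)
print(mice_result)
```

Because the toy score is an exact linear function of age, the regression recovers the missing scores from age instead of pulling them towards the global mean.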
KNN
This popular imputation technique is based on the K-Nearest Neighbours algorithm.
For a given instance with missing data, KNN Impute finds the n most similar neighbours and replaces the missing element with the mean or mode of those neighbours.
The choice between mode and mean depends on whether the feature is continuous or categorical.
A great paper for a more in-depth understanding is here.
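A minimal NumPy sketch of the idea follows. It handles numeric features only (so it uses the mean, never the mode), and it assumes that neighbours are taken from complete rows and that Euclidean distance is measured on the features the incomplete row actually has. The name `knn_impute` and the demo data are hypothetical.

```python
import numpy as np

def knn_impute(X, n_neighbors=2):
    """Fill np.nan entries with the mean of the nearest complete rows."""
    X = np.asarray(X, dtype=float)
    complete = X[~np.isnan(X).any(axis=1)]  # donor rows without gaps
    filled = X.copy()
    for i, row in enumerate(X):
        gaps = np.isnan(row)
        if not gaps.any():
            continue
        # Euclidean distance using only the features this row has.
        diff = complete[:, ~gaps] - row[~gaps]
        dist = np.sqrt((diff ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:n_neighbors]
        # Continuous features: average the neighbours' values.
        filled[i, gaps] = complete[nearest][:, gaps].mean(axis=0)
    return filled

# Hypothetical data: column 0 is age, column 1 is a fitness score.
X_knn = np.array([
    [22.0, 9.0],
    [25.0, 8.0],
    [48.0, 5.0],
    [52.0, 4.0],
    [75.0, 2.0],
    [80.0, 1.0],
    [78.0, np.nan],
    [24.0, np.nan],
])
knn_result = knn_impute(X_knn, n_neighbors=2)
print(knn_result[6:], sep="\n")
```

In practice the features would usually be scaled first, so that a feature with a large range (such as age) does not dominate the distance.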
MissForest
It is a non-standard but fairly flexible imputation algorithm.
It uses RandomForest at its core to predict the missing data.
It can be applied to both continuous and categorical variables, which gives it an advantage over other imputation algorithms.
Have a look at what the authors of MissForest wrote about its implementation.
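The sketch below illustrates the core idea with scikit-learn's `RandomForestRegressor`. It is a simplification and not the authors' implementation: it assumes numeric features only and a fixed number of passes, whereas MissForest also handles categorical variables and uses its own stopping criterion. The name `forest_impute` and the demo data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forest_impute(X, n_iter=3, random_state=0):
    """Iteratively predict np.nan entries with a random forest per feature."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from a simple mean fill, then refine it.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            others = np.delete(filled, j, axis=1)  # predictors
            model = RandomForestRegressor(n_estimators=50,
                                          random_state=random_state)
            # Train on rows where feature j was observed.
            model.fit(others[~rows], filled[~rows, j])
            # Overwrite only the originally missing entries.
            filled[rows, j] = model.predict(others[rows])
    return filled

# Hypothetical data: column 0 is age, column 1 is a fitness score.
X_rf = np.array([
    [22.0, 9.0],
    [25.0, 8.0],
    [48.0, 5.0],
    [52.0, 4.0],
    [75.0, 2.0],
    [80.0, 1.0],
    [78.0, np.nan],
    [24.0, np.nan],
])
rf_result = forest_impute(X_rf)
print(rf_result[6:], sep="\n")
```

Since a forest averages observed target values, the imputed scores always stay within the range of the observed scores.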
Fuzzy K-means Clustering
It is a lesser-known imputation technique, but it proves to be more accurate and faster than the basic clustering algorithms according to .
It computes clusters of instances and fills in each missing value depending on which cluster the instance with missing data belongs to.
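Here is a rough NumPy sketch of the idea, under several assumptions that are not taken from the cited study: centroids are fitted on complete rows only, the fuzzifier is m = 2, and each gap is filled with the membership-weighted average of the centroid values. The name `fuzzy_kmeans_impute` and the demo data are hypothetical.

```python
import numpy as np

def fuzzy_kmeans_impute(X, n_clusters=2, m=2.0, n_iter=50, seed=0):
    """Fill np.nan entries using fuzzy cluster memberships."""
    X = np.asarray(X, dtype=float)
    gaps = np.isnan(X)
    complete = X[~gaps.any(axis=1)]
    rng = np.random.default_rng(seed)
    start = rng.choice(len(complete), n_clusters, replace=False)
    centroids = complete[start]

    def memberships(points, centers):
        # Squared distances, shape (n_points, n_clusters).
        # The small constant avoids division by zero.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        inv = (d2 + 1e-12) ** (-1.0 / (m - 1.0))
        return inv / inv.sum(axis=1, keepdims=True)

    # Standard fuzzy c-means updates on the complete rows.
    for _ in range(n_iter):
        w = memberships(complete, centroids) ** m
        centroids = (w.T @ complete) / w.sum(axis=0)[:, None]

    filled = X.copy()
    for i in np.where(gaps.any(axis=1))[0]:
        seen = ~gaps[i]
        # Memberships are computed from the observed features only.
        u = memberships(X[i:i + 1, seen], centroids[:, seen])[0]
        # Each gap becomes a membership-weighted mix of centroid values.
        filled[i, gaps[i]] = u @ centroids[:, gaps[i]]
    return filled

# Hypothetical data: column 0 is age, column 1 is a fitness score.
X_fkm = np.array([
    [22.0, 9.0],
    [25.0, 8.0],
    [24.0, 8.5],
    [75.0, 2.0],
    [80.0, 1.0],
    [78.0, 1.5],
    [77.0, np.nan],
    [23.0, np.nan],
])
fkm_result = fuzzy_kmeans_impute(X_fkm)
print(fkm_result[6:], sep="\n")
```

Unlike hard clustering, an instance that lies between two clusters receives a blend of both centroids instead of the value of a single one.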
Python Imputation Libraries
Unfortunately, at the time of writing none of these imputation methods is available from the Scikit-Learn library.
There are quite a few Python libraries available for imputation anyway, but they don’t have that many contributors and the documentation might be sparse.
fancyimpute
The most popular Python package, with approx. 700 stars on GitHub.
Its available imputation algorithms are: SimpleFill, KNN, SoftImpute, IterativeSVD, MatrixFactorization, NuclearNormMinimization, BiScaler.
The link is here.
impyute
With approximately 100 stars on GitHub, it can handle KNN, MICE, Expectation Maximization, Last Observation Carried Forward, Moving Window and WIP algorithms.
missingpy
Though it is not the most often used package for imputation (30 stars), it implements MissForest and KNN imputation.
Thank you for reading the article.
If you have any questions, please let me know.
You can also find me on LinkedIn or go to my personal website.
References
http://www.
org/paper/Towards-Missing-Data-Imputation%3A-A-Study-of-Fuzzy-Li-Deogun/ede4ae4e86e1f88efda17ed84bbc07a5d692033f