As stated in the header of the file, these codes denote invalid and missing data, respectively.
Well, it’s time to face the truth about Data Analytics: there are no perfect datasets! Missing values are one of the most common issues every data analyst must face, along with invalid data.
Missing values are usually just blank spaces between separators in CSV files, whereas invalid values must be detected “by hand” once the data are imported, most often appearing as outliers.
Once again I feel lucky to be working with these datasets, because the abnormal values have already been coded and, apparently, I should pay no attention to them.
Despite what we might think, missing values can be a valuable source of information; it’s just a question of paying attention to when they occur and observing patterns to unveil what they are hiding.
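Handling coded abnormal values is easy to automate at import time. Here is a minimal sketch with pandas: the column names and the `-999` code below are invented for illustration, not the actual codes used in the Toronto files. `read_csv` turns blank fields into NaN automatically, and `na_values` does the same for any coded value we list.

```python
import io
import pandas as pd

# Hypothetical excerpt: blank fields are missing values and -999 is
# an assumed code for invalid measurements (not the real file's code).
csv_data = io.StringIO(
    "timestamp,NO2,PM25,O3\n"
    "2012-01-01 00:00,21.0,,33.5\n"
    "2012-01-01 01:00,-999,12.4,34.1\n"
    "2012-01-01 02:00,19.5,11.8,-999\n"
)

# na_values maps the invalid-data code to NaN on import;
# empty fields become NaN without any extra configuration.
df = pd.read_csv(csv_data, na_values=[-999], parse_dates=["timestamp"])
print(df.isna().sum())
```

After this single call, both kinds of abnormal values are represented uniformly as NaN, so the rest of the analysis does not need to know about the original codes.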
Let’s consider again the Toronto Downtown air quality analysis for 2012.
The original data for the three pollutants under study, before changing the codes of the abnormal values, look like the figure below.
Since the levels of the three pollutants range between 0 and 100, it’s very easy to spot the abnormal values in the plots.
What can we see from these plots?

- Both NO2 and PM25 have the same missing values.
- PM25 has 78 periods of invalid values. Most of them span only one measure, the longest being 9 consecutive invalid measures.
- NO2, on the other hand, has only 4 periods of invalid measures; however, one of them, in August, spans 61 consecutive invalid measures.
- 3 out of the 4 NO2 periods of missing values start at the same time as those for PM25.
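Counting these periods and their lengths by eye does not scale, but once the abnormal values are NaN it takes only a small helper. A sketch (the toy series below stands in for a pollutant column; the values are made up):

```python
import pandas as pd

def nan_runs(series: pd.Series) -> list:
    """Return the length of each run of consecutive missing values."""
    runs, current = [], 0
    for missing in series.isna():
        if missing:
            current += 1
        elif current:       # a run just ended
            runs.append(current)
            current = 0
    if current:             # a run reaching the end of the series
        runs.append(current)
    return runs

# Toy stand-in for a pollutant column (invented values):
s = pd.Series([1.0, None, None, 3.0, None, 5.0, None, None, None])
print(nan_runs(s))       # → [2, 1, 3]
print(len(nan_runs(s)))  # number of missing periods
print(max(nan_runs(s)))  # longest run
```

Applied to a real pollutant column, `len(nan_runs(...))` gives the number of periods (e.g. 78 for PM25) and `max(nan_runs(...))` the longest run (e.g. 61 for NO2).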
Regarding the abnormal values of O3, we can see a rather strange behaviour, given their regularity.
The difference between consecutive invalid values is always 183.
The separation between the first two observations, which appear more separated in the plot, is 366: exactly double! The difference between consecutive missing values is always 366.
The separation between the two groups of missing values is 1526, a number not divisible by 366, so in a certain way we lose the regularity this time.
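Regularity like this can be checked directly on the positions of the flagged values. A sketch with invented positions (chosen to mimic a regular spacing, not the real O3 data): mark the abnormal observations as NaN, take their integer positions, and difference them.

```python
import numpy as np
import pandas as pd

# Toy series with invalid values at made-up positions:
o3 = pd.Series(np.arange(1000, dtype=float))
o3.iloc[[0, 366, 549, 732, 915]] = np.nan

# Integer positions of the flagged values, then the gaps between them.
positions = np.flatnonzero(o3.isna().to_numpy())
gaps = np.diff(positions)
print(gaps)  # → [366 183 183 183]
```

A constant value in `gaps` (or a small set of multiples of one base value) is exactly the kind of pattern that suggests the abnormal values come from a scheduled process, such as periodic sensor calibration, rather than random failures.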
At this point the curiosity of the data analyst should kick in, prompting many questions to clarify these insights and extract valuable knowledge from them.
But those missing values are not suitable to work with as they stand, since they are, in fact, unknown information, something inappropriate for a discipline as precise as data analytics.
Once we have understood them, we need to take some action on them in order to best analyze the data and obtain the most accurate results.
The actions to take will depend on the use and meaning of the data, bearing in mind that this process can affect the subsequent results.
Many strategies can be used to transform missing values: set all of them to 0; set them to the mean, or the median, of all the values for that attribute in the dataset, or of just the nearest ones; use machine learning algorithms to predict their values; and so on.
You could even consider a more radical solution: simply removing the rows that contain missing values. However, this strategy is not suitable for time series analysis, where every time step must have a value.
More often than not, dropping missing values is not the best idea, because you may end up throwing away useful information. So whatever your strategy is, be careful: consider all the options and understand how each of them will affect your data and, as a consequence, your results.
Missing values in datasets are not desirable, but they can offer us valuable information.
After understanding their nature and the reason they exist, we need to act on them in order to properly perform the required analysis.