An outlying black cat (photo by author)Using SQL to detect outliersRobert de GraafBlockedUnblockFollowFollowingJun 10SQL doesn’t have the features of a language like R or Python, but that doesn’t mean you can’t use it to perform an initial clean of your data by looking for abnormal points or outliers.
Many data scientists are used to a workflow where they suck up there data from a SQL compliant database before doing the real work in R or Python.
This is obviously inefficient as it means that flawed, incorrect or data that is ultimately unuseable for some other reason needs to be moved to the real anayltics environment.
While it is true that SQL lacks libraries of statistical tools and distributions, this doesn’t mean it can’t be used to do useful data analysis, just that the toolset is restricted to tools that you can code on the fly.
One important task that you can perform without much hassle in SQL once you’ve worked out how is detecting outliers and other data anomalies.
We will look at two ways that are friendly to code in SQL.
One such method is using the Median Absolute Deviation to detect data outliers.
This method is actually more robust than using z-scores as people often do, as it doesn’t make an assumption regarding the distribution of the data.
Conceptually, this method has the virtue of being very simple.
The Median Absolute Deviation of any observation is the ratio of the Absolute Difference of the observation and the median of all observations and median Absolute Difference of all observations and the median of all observations.
A critical value for the Median Absolute Deviation of 5 has been suggested by workers in Nonparametric statistics, as a Median Absolute Deviation value of just under 1.
5 is equivalent to one standard deviation, so MAD = 5 is approximately equal to 3 standard deviations.
Coding this is SQL, a disadvantage is that you will need to employ a Common Table Expression or other workaround because the approach requires further manipulation of the result of an aggregate function.
The other part that is mildly challenging, is that there isn’t a straightforward ‘median’ function in a major SQL implementation yet.
In the code below, I’ve used the ‘Percentile_Cont’ function, which I comment on more below.
There are three steps, needing two CTEs to capture intermediate steps —Find the median of the variable of interest.
You need this value for further calculations which being an aggregate is difficult without the CTE.
Calculation the median deviation.
The deviation is just the absolute difference between the median weight found above and the individual weight.
Find the median of this value.
Finally find the deviation ofHerewith the code, which creates a table of the weights (in kg) and ages (in years) of pre-schoolers, as might be recorded at pediatric practice, or municipal infant health service.
The final table shows the weight of each pre-schooler and its deviance from the median weight of all pre-schoolers in the table:CREATE TABLE KidWts (Name nvarchar(20),Age int ,Weight float);INSERT INTO KidWts VALUES(‘Lily’,3,15), (‘Muhammad’,30,98), (‘Daphne’, 3, 16), (‘Max’, 2, 12),(‘Chloe’,1,11),(‘Jackie’,2,14),(‘Albert’,3,17);WITH MedianTab (MedianWt)AS(SELECT PERCENTILE_CONT(0.
5) WITHIN GROUP (ORDER BY Weight)OVER () as MedianWeightFROM KidWts),DispersionTab (AbsDispersion)AS(SELECT PERCENTILE_CONT(0.
5) WITHIN GROUP (ORDER BY (Abs(Weight-MedianWt)))OVER () as AbsDispersionFROM MedianTab JOIN KidWts on 1=1)Select DISTINCT *,ABS((Weight-MedianWt)/AbsDispersion) FROM KidWts Join DispersionTab on 1=1JOIN MedianTab on 1=1This query returns the following results:The results of the query — Muhammad looks a little out of placeLooking at the table, one result sticks out — Muhammad’s 98kg doesn’t, 41.
5 deviances from the median, is way out of place.
After a little investigation, we realise that the cause is that the Muhammad in question, is Muhammad Ali, with 98kg being his weigh-in before his 1974 fight with George Foreman.
Not an exceptional weight for a heavyweight boxer, but right out of place for a preschooler.
Possibly a boxing tragic DBA used boxing stats to test out the databases functions before real data was entered and didn’t clean out the test data fully— an extreme example, but not impossible.
The MAD value of 41.
5 is likely larger than what you would see in real life, and intended to illustrate the kind of data that is simply in your dataset by mistake — one of the ways that an anomaly or outlier can arise in a real life data set.
Looking at the code in more detail, the Percentile_Cont function is in the latest ANSI SQL standard, but that doesn’t mean that it works in every implementation of SQL — I tested it on SQL Server, and my information is that this function is also available in Oracle and PostgreSQL.
MySQL doesn’t have this function, so you need to use an alternative method of finding the median.
You might like to choose your favourite from the following article — Joe Celko’s on the median in SQL- but be aware that it was written prior to CTEs, so that some of these solutions can be improved by replacing temporary tables or views with CTEs.
As the Joe Celko article illustrates, there are quite a few ways to calculate the median in SQL — find the one that works for you, and update the code accordingly.
Robert de Graaf is the author of Managing Your Data Science Projects, due for publication by Apress Publishing mid July 2019.
You can follow Robert on Twitter: https://twitter.
com/RobertdeGraaf2.. More details