How to deal with outliers in a noisy population?

Dario Makaric, Jun 11

Defining outliers can be a straightforward task.
On the other hand, deciding what to do with them always requires some deeper study.
Motivation

Data can be noisy.
When you have a small (relative to the population size) random sample, especially from a noisy population, it can be quite a challenge, if not impossible, to build a model that generalizes well.
Imagine you built a simple linear model that performed poorly on your data.
Then, you decided to compute studentized residuals and remove all the observations that fall outside a chosen range.
As a result, your model fits the data far better.
Finally, you put it into production and the fit is a nightmare.
Noisy data is not a rare occurrence and it can greatly influence your findings.
Knowing how to recognize noisy data and how to deal with it is essential for accurate findings.
Simulation is a great tool for exploring and better understanding the issue.
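The scenario described above can be sketched in a few lines of R. This is only an illustration under assumptions of my own: the population (a linear trend with heavy-tailed t-distributed errors), the sample sizes, the seed, and the cutoff of 2 for the studentized residuals are all hypothetical choices, not values from a real project.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical noisy population: linear trend plus heavy-tailed errors
n <- 60
x <- runif(n, 0, 10)
y <- 5 + 2 * x + 3 * rt(n, df = 2)
sample_df <- data.frame(x = x, y = y)

fit_all <- lm(y ~ x, data = sample_df)

# Drop observations whose studentized residual exceeds 2 in absolute value
# (the cutoff of 2 is an assumption, not a universal rule)
keep <- abs(rstudent(fit_all)) <= 2
fit_trimmed <- lm(y ~ x, data = sample_df[keep, ])

# In-sample statistics before and after trimming
r2_all <- summary(fit_all)$r.squared
r2_trimmed <- summary(fit_trimmed)$r.squared

# "Production": fresh data from the same population, extreme values included
x_new <- runif(1000, 0, 10)
y_new <- 5 + 2 * x_new + 3 * rt(1000, df = 2)
new_df <- data.frame(x = x_new, y = y_new)
rmse_new <- sqrt(mean((y_new - predict(fit_trimmed, newdata = new_df))^2))

cat("R-squared, all data:             ", round(r2_all, 3), "\n")
cat("R-squared, trimmed data:         ", round(r2_trimmed, 3), "\n")
cat("Trimmed model, in-sample RSE:    ", round(sigma(fit_trimmed), 2), "\n")
cat("Trimmed model, RMSE on new data: ", round(rmse_new, 2), "\n")
```

Comparing the last two numbers shows how far the in-sample error of the trimmed model can be from the error on data that still contains the values we removed.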
Simulating outliers and a simple linear model

If you haven’t already, you should build your own simulations.
It is a great way to cement what you have just learned from a statistics textbook or a course.
You can get your hands dirty with examples you get to create from scratch and, even better, modify any way you like.
For this purpose, I like to use RStudio.
Let’s say we are advising a used car dealership on the pricing of the used cars relative to their mileage.
Let’s assume that we have 100 observations of the two variables and that their relationship can be modeled well using simple linear regression.
I will define the true population regression line with the following coefficients:

    b0 <- 21345
    b1 <- -0.1349

Let’s create the independent variable mileage by sampling it from a normal distribution with a mean of 76950.45 and a standard deviation of 16978.58. (These values are chosen based on a dataset available here.)

    mileage <- rnorm(n = 100, mean = 76950.45, sd = 16978.58)

Next, I will introduce noise with the random error term.
I will sample it from a normal distribution with a standard deviation of 500 and then change a single value in order to create a suitable outlier.

    eps <- rnorm(n = 100, mean = 0, sd = 500)
    eps[1] <- 15000  # overwrite one element only; index 1 is an assumption, the original index is not shown

Changing this single value has increased the standard deviation of the random error term three-fold:

    > sd(eps)
    [1] 1586.312

Time for the dependent variable:

    price <- b0 + b1 * mileage + eps

Let’s plot the data:

Figure 1: Scatter plot of mileage versus price showing the true regression line in black, the least-squares line in blue and two outliers colored yellow and red.
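A plot along the lines of Figure 1 could be produced roughly as follows. This is a sketch, not the code behind the original figure: the seed, the colors, and the choice to place the outlier on the highest-mileage car (`outlier_idx`, so that it has high leverage) are my assumptions.

```r
set.seed(1)  # arbitrary seed; the article does not state one

b0 <- 21345
b1 <- -0.1349
mileage <- rnorm(n = 100, mean = 76950.45, sd = 16978.58)
eps <- rnorm(n = 100, mean = 0, sd = 500)

# Assumption: plant the outlier on the highest-mileage car
outlier_idx <- which.max(mileage)
eps[outlier_idx] <- 15000
price <- b0 + b1 * mileage + eps

fit <- lm(price ~ mileage)

# Red for the planted outlier, grey for everything else
point_col <- rep("grey40", length(price))
point_col[outlier_idx] <- "red"

plot(mileage, price, col = point_col, pch = 19,
     xlab = "Mileage", ylab = "Price")
abline(a = b0, b = b1, col = "black", lwd = 2)  # true regression line
abline(fit, col = "blue", lwd = 2)              # least-squares line
legend("topright", legend = c("True line", "Least-squares line"),
       col = c("black", "blue"), lwd = 2)

print(coef(fit))  # compare with b0 and b1
```

Because the planted error is positive and sits at the far right of the mileage range, the fitted slope ends up less negative than the true `b1`.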
We can clearly see the two outliers in Figure 1.
Both have unusual mileage values compared with the rest of the data, so both are high-leverage points.
But if we removed just the red point, we would see that it is the only influential one:

Figure 2: Scatter plot after the removal of the influential point.

Although the yellow point does pull the least-squares line by decreasing its slope, its influence is so small that it can be ignored.
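Leverage and influence can also be checked numerically instead of by eye. The sketch below uses hat values for leverage and Cook’s distance for influence, both from base R. The seed, the placement of the outlier on the highest-mileage car, and the "twice the average leverage" cutoff are assumptions on my part (the cutoff is a common rule of thumb, not a strict test).

```r
set.seed(1)  # arbitrary seed

b0 <- 21345
b1 <- -0.1349
mileage <- rnorm(n = 100, mean = 76950.45, sd = 16978.58)
eps <- rnorm(n = 100, mean = 0, sd = 500)
outlier_idx <- which.max(mileage)  # assumption: outlier on the highest-mileage car
eps[outlier_idx] <- 15000
used_cars <- data.frame(mileage = mileage, price = b0 + b1 * mileage + eps)

fit <- lm(price ~ mileage, data = used_cars)

# Leverage: how unusual an observation's mileage is
lev <- hatvalues(fit)
p <- length(coef(fit))
high_leverage <- which(lev > 2 * p / nrow(used_cars))  # rule-of-thumb cutoff

# Influence: how much the fit changes when an observation is dropped
cook <- cooks.distance(fit)
most_influential <- unname(which.max(cook))

# Refit without the most influential observation and compare slopes
fit_without <- lm(price ~ mileage, data = used_cars[-most_influential, ])

cat("High-leverage observations:", high_leverage, "\n")
cat("Most influential observation:", most_influential, "\n")
cat("Slope with / without it:",
    round(coef(fit)[["mileage"]], 4), "/",
    round(coef(fit_without)[["mileage"]], 4), "\n")
```

Several observations may be flagged as high leverage, but only one has a Cook’s distance that dwarfs the rest, which matches the distinction between the yellow and the red point.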
If we compare the two models’ statistics we can see some very different values.
For example, after we remove the influential point the residual standard error drops from 1518 to 514. Both values are close to the corresponding standard deviations of the error term: 1586 with the outlier and 500 without it.
Here are the two plots of residuals:

Figure 3: Left: Scatter plot of the residuals before removing the influential point. Right: Scatter plot of the residuals after removing the influential point.
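The comparison of the two models can be reproduced along these lines. It is a sketch under the same assumptions as before (arbitrary seed, outlier planted on the highest-mileage car), so the exact numbers will differ from the 1518 and 514 quoted above.

```r
set.seed(1)  # arbitrary seed

b0 <- 21345
b1 <- -0.1349
mileage <- rnorm(n = 100, mean = 76950.45, sd = 16978.58)
eps <- rnorm(n = 100, mean = 0, sd = 500)
outlier_idx <- which.max(mileage)  # assumption: outlier on the highest-mileage car
eps[outlier_idx] <- 15000
used_cars <- data.frame(mileage = mileage, price = b0 + b1 * mileage + eps)

fit_with <- lm(price ~ mileage, data = used_cars)
fit_without <- lm(price ~ mileage, data = used_cars[-outlier_idx, ])

# Residual standard error and R-squared of both fits
comparison <- data.frame(
  model = c("with outlier", "without outlier"),
  rse = c(sigma(fit_with), sigma(fit_without)),
  r_squared = c(summary(fit_with)$r.squared, summary(fit_without)$r.squared)
)
print(comparison)

# Residuals against fitted values, side by side
old_par <- par(mfrow = c(1, 2))
plot(fitted(fit_with), resid(fit_with), pch = 19,
     xlab = "Fitted price", ylab = "Residual", main = "With outlier")
abline(h = 0, lty = 2)
plot(fitted(fit_without), resid(fit_without), pch = 19,
     xlab = "Fitted price", ylab = "Residual", main = "Without outlier")
abline(h = 0, lty = 2)
par(old_par)
```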
One thing we don’t see in this example, but which should be mentioned, is that an outlier can have a large residual even when it does not influence the least-squares line.
This can still be a problem: as we saw above, a large residual inflates the residual standard error, which we use to compute confidence intervals and p-values.
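To see this case in a simulation, we can plant the same large error on a car whose mileage is close to the average, where leverage is lowest. This is a hypothetical variation on the example above, with an arbitrary seed; here "clean" means data without any planted outlier.

```r
set.seed(1)  # arbitrary seed

b0 <- 21345
b1 <- -0.1349
mileage <- rnorm(n = 100, mean = 76950.45, sd = 16978.58)
eps <- rnorm(n = 100, mean = 0, sd = 500)

# Assumption: the outlier is the car closest to the mean mileage (lowest leverage)
low_lev_idx <- which.min(abs(mileage - mean(mileage)))
eps_out <- eps
eps_out[low_lev_idx] <- 15000

price_clean <- b0 + b1 * mileage + eps
price_out <- b0 + b1 * mileage + eps_out

fit_clean <- lm(price_clean ~ mileage)
fit_out <- lm(price_out ~ mileage)

# The slope barely moves ...
slope_shift <- coef(fit_out)[["mileage"]] - coef(fit_clean)[["mileage"]]

# ... but the residual standard error and the interval for the slope grow
ci_width_clean <- diff(confint(fit_clean, "mileage")[1, ])
ci_width_out <- diff(confint(fit_out, "mileage")[1, ])

cat("Change in slope:", signif(slope_shift, 3), "\n")
cat("RSE clean / with outlier:",
    round(sigma(fit_clean)), "/", round(sigma(fit_out)), "\n")
cat("95% CI width for slope, clean / with outlier:",
    signif(ci_width_clean, 3), "/", signif(ci_width_out, 3), "\n")
```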
Solving the example issue

Now we need to decide whether or not to remove the influential point altogether.
Always be extra careful about this problem.
You need a very good reason to remove it.
If we look at the R-squared of the two models, we see a big difference.
The model with the influential point has an R-squared of around 60%, as opposed to 95% once the influential point is removed.
This difference might seem like a good enough reason for the removal but, on its own, it is not.
In order to understand this better, we need to understand a simple fact that is often overlooked.
Our goal is not to predict the past, but the future.
The small window of the above data consists of 100 individual observations.
Every statistic calculated on that window is a point estimate of the true population statistic.
Simple linear regression is really good at capturing the relationship between two variables inside that window, but not necessarily outside.
You should be very careful with using your model for extrapolation.
When we choose one out of these two models, we essentially make an assumption about the population.
The RSE of the model without the influential point is about three times smaller.
Since interval widths scale with the RSE, a 95% confidence interval is going to be roughly three times narrower.
In essence, our predictions would claim that much more confidence.
There is nothing in the data telling us we can make such a confident assumption about the population.
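The effect on interval width can be checked directly. The sketch below compares 95% prediction intervals from both models for one car; I use prediction intervals here because they apply to the price of an individual car, which is my choice rather than something stated above. The seed, the outlier placement, and the 75,000 mileage of `typical_car` are hypothetical.

```r
set.seed(1)  # arbitrary seed

b0 <- 21345
b1 <- -0.1349
mileage <- rnorm(n = 100, mean = 76950.45, sd = 16978.58)
eps <- rnorm(n = 100, mean = 0, sd = 500)
outlier_idx <- which.max(mileage)  # assumption: outlier on the highest-mileage car
eps[outlier_idx] <- 15000
used_cars <- data.frame(mileage = mileage, price = b0 + b1 * mileage + eps)

fit_with <- lm(price ~ mileage, data = used_cars)
fit_without <- lm(price ~ mileage, data = used_cars[-outlier_idx, ])

# Hypothetical car we want to price
typical_car <- data.frame(mileage = 75000)

pi_with <- predict(fit_with, newdata = typical_car,
                   interval = "prediction", level = 0.95)
pi_without <- predict(fit_without, newdata = typical_car,
                      interval = "prediction", level = 0.95)

width_with <- pi_with[, "upr"] - pi_with[, "lwr"]
width_without <- pi_without[, "upr"] - pi_without[, "lwr"]

cat("Interval width with outlier:   ", round(width_with), "\n")
cat("Interval width without outlier:", round(width_without), "\n")
cat("Ratio:", round(width_with / width_without, 2), "\n")
```

Choosing the narrower interval is only justified if we believe the population really does not produce such extreme prices.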
To get an idea of what the population looks like, we need to do some research.
Unlike in a simulation, in the real world we will never know what the true population statistics are.
So we need to form a hypothesis about the population.
Let’s say our obvious first stop is our client, the car auction owner.
The owner tells us that the particular car was an unusual old car that caught the attention of a small group of buyers who wanted it badly.
They bid between themselves and drove the price unusually high.
Further, the owner assures us it was the only such case in the company’s twenty-something-year history and that it won’t happen again.
They deal with regular, everyday cars.
Armed with this knowledge we go back to our model and confidently remove the outlier.
In terms of model statistics, that means we are confident that the model’s residual standard error is 514, which in turn means we can assume the true error term standard deviation is close to that number (since the data is simulated, we know that it is equal to 500).
This estimate of the population is crucial if we want to get the other point estimates right.
Had the owner not been so confident, we should not have removed the outlier.
Instead, we could explore the used car market and try to find an external estimate of the true error term standard deviation.
If our model, or any other for that matter, ignored extreme values that we can’t confidently assume won’t occur in production, it would perform poorly when put in production.
Poorly removed outliers will lead to overly confident estimates.
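A simulation makes the cost of a wrong removal concrete. In the sketch below the model is fitted without the outlier and then scored on two hypothetical production scenarios: one where extreme prices never recur and one where they recur in 1% of sales. The helper names `simulate_sales` and `rmse_on`, the 1% rate, the seed, and the sample sizes are all assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed

b0 <- 21345
b1 <- -0.1349
mileage <- rnorm(n = 100, mean = 76950.45, sd = 16978.58)
eps <- rnorm(n = 100, mean = 0, sd = 500)
outlier_idx <- which.max(mileage)  # assumption: outlier on the highest-mileage car
eps[outlier_idx] <- 15000
used_cars <- data.frame(mileage = mileage, price = b0 + b1 * mileage + eps)

# Model fitted after removing the outlier
fit_without <- lm(price ~ mileage, data = used_cars[-outlier_idx, ])

# Hypothetical production data; extreme_rate is the share of sales
# that receive the same 15000 error as the planted outlier
simulate_sales <- function(n, extreme_rate) {
  new_mileage <- rnorm(n, mean = 76950.45, sd = 16978.58)
  new_eps <- rnorm(n, mean = 0, sd = 500)
  new_eps[runif(n) < extreme_rate] <- 15000
  data.frame(mileage = new_mileage, price = b0 + b1 * new_mileage + new_eps)
}

# Root mean squared prediction error of a model on new data
rmse_on <- function(model, new_data) {
  sqrt(mean((new_data$price - predict(model, newdata = new_data))^2))
}

calm_market <- simulate_sales(10000, extreme_rate = 0)
wild_market <- simulate_sales(10000, extreme_rate = 0.01)

cat("In-sample RSE:                 ", round(sigma(fit_without)), "\n")
cat("RMSE, extremes never recur:    ", round(rmse_on(fit_without, calm_market)), "\n")
cat("RMSE, extremes in 1% of sales: ", round(rmse_on(fit_without, wild_market)), "\n")
```

If the owner is right, the in-sample RSE is an honest estimate of the production error. If extreme sales do recur, even rarely, the same RSE understates the error by a wide margin.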
Conclusion

It is tempting to ignore or remove outliers.
Don’t do it without a very good reason.
Always do external research in order to form a hypothesis about the true population statistics.
This is probably easier said than done.
The real world is not as easy to figure out as simulated data in this example.
Nevertheless, I argue that this example should be used as motivation for further study.
It should motivate you to approach a real-world problem not only with a formula that determines which outliers in your data should be removed, but also by experimenting and hypothesizing about the true population statistics.
Only good insight into what the population looks like can answer the question of which outliers to remove.