Photo by Campaign Creators on UnsplashHow to Tackle Your Next Regression ProblemThe Key Ideas to Consider for Your Next Regression ModelTom AllportBlockedUnblockFollowFollowingMay 24Tom Allport, Jammie Wang, Jack StalfortThis article is going to walk through a few of the ideas you need to be thinking about when building a regression model.
In order to produce a well fitting regression model for your data there are several assumptions you need to check, as well as testing for outliers and multicollinearity.
After reading this article you should have a better of idea of how to implement a well fitting regression model.
When looking for relationships in your data, it’s no good just fitting a regression and hoping for the best.
Linear regression has many assumptions that need to be met in order for the model to be accurate.
Some of the common assumptions are:Linearity: A linear relationship exists between the dependent and predictor variables.
If no linear relationship exists, then linear regression is an inaccurate representation of the dataNo multicollinearity: Predictor variables are not collinear, meaning they aren’t highly correlatedNo autocorrelation (serial correlation in time): Autocorrelation is when a variable is correlated with itself across observationsHomoscedasticity: There is no pattern in the residuals, meaning that the variance is constantNormally distributed: Residuals, independent, and dependent variables must be normally distributedResidual average is zero, indicating that data is evenly spread across the regression lineDataset DistributionsThe first assumption that a linear regression model makes is the independent variable or predictor (X) and the dependent variable or outcome (y) have a linear relationship.
Linear regression assumes that the independent and dependent variables are normally distributed, which can be checked in many different ways; the most simple way is by observing a histogram of the values.
Other ways of testing normality include a goodness of fit test, Shapiro–Wilk test, which are numerical tests, or generating a Q-Q plot to test normality graphically.
Sometimes, a sample size of 30 or greater is the threshold for determining normality due to the Central Limit Theorem, which states that adding independent and identically distributed random variable will result in a normalized distribution when the sample size is at least 30, even if the random variable themselves are not normal.
However, there is no true right answer for the minimum sample size (lots of people use 50 + 8k where k is the number of parameters in the model).
The sample size is also a trade off between fitting well and time/cost.
Not enough samples will result in a regression line that doesn’t fit well, but more samples require time and money to be invested into gathering data.
However, the predictor and response variables do not necessarily have to be normal.
A data set that is not normally distributed can be fixed by applying a transformation to it in order to make it normal.
If a data set is skewed due to the presence of outliers, then these points must also be appropriately dealt with to restore normality, which we can do with distribution plots.
Residuals in linear regression are assumed to be normally distributed.
A non-normal residual distribution is the main statistical indicator that there is something “wrong” with the data set, which may include missing variables or non-normal independent/dependent variables.
Furthermore, homoscedasticity of the residual variance is also assumed; the residuals are assumed to have a constant variance.
This is important because it demonstrates that the errors are evenly distributed, meaning the linear regression is the “best” fit for all points in the data set.
An outlier can affect the normality of the residuals because each data point moves the line towards it.
Therefore, looking at the residuals is the best indicator of how good of a fit that linear regression line is, and what we should do to fix it if there are any normality issues.
OutliersAn outlier is a point in a data set which does not follow the expected trend of that data set.
Outliers have the ability to change the slope of the regression.
For linear regression this change in slope will have an effect of the output as the effect of the variable has been distorted by the point.
When considering outliers we need to consider their influence, which is determined by the leverage and residual of the point.
Leverage is the distance the point is away from the centre of the data set, the further the point is from the centre the higher the leverage and is a number between 0 and 1.
Points with low leveragePoint on far right has high leverageLooking at the plots above, we can see that the furthest right point in Fig 1 has a smaller leverage than the furthest right point in Fig 2 as the distance from the centre of the points is greater.
In order to see if a point is an outlier we also check the residual value for the point.
Residuals are defined for each point in the data set as the distance of the point from the true value.
However, these values vary based on the data and it can be easier to interpret the studentized (standardized) residuals.
Influence is the effect of the point on the slope of the regression.
Points with high leverage have the potential to have greater influence on the slope of the regression.
Consider two people sat on a seesaw, the further the person is sat away from the centre the easier it is for the person to move up and down on the seesaw but the mass of the person also matters.
The size of the data set also effects the influence of the outlier, with larger data sets the impact of an outlier is reduced.
A common methods of calculating a points influence is Cook’s Distance.
Once you have determined there is an outlier in the data set there several methods to deal with them effectively.
The first option is to remove the point from the data set.
This has the benefit of removing the influence of the point completely however, if the data set is small or the point is of high leverage, alternative methods should be considered as it may also introduce bias into the regression and find a false relationship in the data.
Another method is to assign a new value to point, such as the mean of similar values, which is known as imputation.
This method is typically used when the value is known to be incorrect and can be beneficial when the data set used is small as it doesn’t remove values.
An alternative is to transform the entire data set rather than using the recorded values.
This can be useful when the outlier is known to be the true value of the point and the point has high leverage.
Diagnostic PlotsDiagnostic plots commonly graph residual trends in 4 different ways and are incredibly useful for verifying that the assumptions for linear regression have been met.
In this section, we will cover each plot.
Residuals vs Fitted PlotThe residual vs fitted plot is mainly used to check that the relationship between the independent and dependent variables is indeed linear.
Good residual vs fitted plots have fairly random scatter of the residuals around a horizontal line, which indicates that the model sufficiently explains the linear relationship.
In the figure above, Case 1 is a good example of a “good” model as the line through the points is close to horizontal.
Case 2 is clearly a parabolic arch, which indicates that there’s some other relationship between the independent and dependent variables that hasn’t been explored yet.
Normal Q-Q PlotThe normal Q-Q plot examines the normality of the residuals.
Given that one of the main assumptions for a good linear regression model is the normality of the residuals, this plot can tell us a lot about the relationship between the variables, as discussed before.
A “good” linear regression model will have residuals that closely follow the dashed line, like in Case 1.
Clearly, there is something significantly wrong with Case 2 since the residuals do not follow that linear pattern, and are exponential instead.
Scale-Location PlotAlso called the Spread-Location plot, the Scale-Location plot examines the homoscedasticity of the residuals.
As discussed before, verifying that the variance of the residuals remains constant ensures that a good linear regression model has been produced.
A “good” model will have random scatter around a horizontal line, indicating that the variance has remained constant.
Case 1 is a good example of this, while Case 2 shows that the variance of the residuals gradually increases as the points slowly grow farther apart.
Residuals vs Leverage PlotThe Residuals vs Leverage plot allow us to find cases that significantly impact the linear regression line.
To do this, the plot generates a line that indicates Cook’s distance, which is defined as the sum of all the changes in the regression model when an observation is removed from it.
Any point outside of Cook’s distance is considered to be influential.
The general cutoff for Cook’s distance is 0.
The larger the Cook’s distance score for a point, the more influential it is to the regression line.
This method is very good at identifying outliers and leverage points.
When looking at the plot, we simply want to remove any data points located outside of Cook’s distance.
For example, in Case 1, there are no points outside of the dashed line indicating Cook’s distance, so there is no need to remove any outliers or leverage points.
However, in Case 2, there is a data point far outside of the Cook’s distance threshold, indicating that it is a point of high influence and should be dealt with.
MulticollinearityOne big assumption about linear regression models is that our predictor variables are independent.
That means that each predictor variable doesn’t depend on any other independent variable.
Only the response variable does that, which is why it is also called the dependent variable.
However, say we have a model and we graph 2 of our predictors against each other:These variables don’t look independent, but we assumed that they were.
So our assumptions have not been met.
When two or more variables we assume to be independent are actually dependent on each other, we have multicollinearity.
There are two main types of multicollinearity.
One is called “structural multicollinearity”.
This is when we make new predictor variables based on old ones.
A common example of this would be making a predictor variable X² from X – clearly X² will be correlated with X.
So how do we fix this?.One way is to remove X² from the model.
But some quadratic models need this squared component.
So we do something called centring the predictors.
To centre the predictors we find the mean of X and subtract it from the values of our observations.
This causes approximately half of the observations to have negative values, but when we square X we will still get a positive value.
So now the graph of X versus X² is in the shape of a parabola, lowering R² between the two predictors.
The more common form of multicollinearity is called “data based multicollinearity”.
It happens when seemingly uncorrelated variables are correlated.
In the first form of multicollinearity, X² was correlated with X as is it is a function of X.
However, if variables we expect to be independent, such as pulse and body square area, are correlated, then we have data based multicollinearity.
This type is the most common and problematic, and happens for two main reasons.
One is when the people collecting the data rely solely on observational data instead of having a mix of observational data and experimental data.
The other, more broad, case happens with a poorly designed experiment, meaning that the researchers collect data in a way that certain variables might be influencing each other.
So we know what multicollinearity is, but why is this a problem for our model?.Say we are trying to predict y with variables X₁, X₂, and X₃.
If we made a model based on just X₁ and X₂ we get y = 4 + X₁+ 3X₂.
If instead we made a model based on X₁ and X₃, we get y = 5 – 5X₁ + 7X₃.
Notice how the coefficients all change?.These models still predict y accurately, but they go about it in a very different way.
So the major flaw multicollinearity brings to your model is that one variable’s influence cannot be isolated.
More concisely, multicollinearity doesn’t affect the accuracy of the model, it makes it hard to find the effect one variable has.
In our first model an increase of one unit in X₁ caused y to increase by one unit, but our second model had an increase of one unit in X₁ decreasing y by 5 units.
So if your goal is to just make a model to predict something, multicollinearity can be ignored.
However, if you want to understand each variables’ effect on the dependent variable, multicollinearity needs to be addressed.
So we know what multicollinearity is and how it affects our model, but how do we detect it?.One way is to make a plot of the correlation coefficients between each variable and look for high ones.
But this method assumes one variable can only be dependent on just one other one.
What if a more complex relationship exists between multiple predictors, like X₁= 2X₂ + 3X₃?.A better way of detecting multicollinearity is a method called Variance Inflation Factors (VIFs).
If multicollinearity is present, the coefficients will vary drastically between each model.
If a model is made with just one independent variable, the variance of that variable’s coefficient is at a minimum, and it is called the baseline variance.
If we then regress that independent variable against all other independent variables, another model can be created.
This model will have a certain R² value.
It can be shown that the baseline variance increases by a factor given by this equation involving the R² of this large regression.
This is called a VIF, and it can be calculated for each independent variable.
A VIF less than four is considered good: assumed no multicollinearity for that variable.
A VIF between 4 and 10 indicates multicolinearity might be present between variables and might need some investigation.
A VIF above 10 is a strong indicator of multicollinearity between variables in your model.
These VIFs can be used to correct multicollinearity.
One way to do this is to remove some of the predictors with a high VIF.
If these variables just cannot be removed because we have scientific reason to keep them in there, we still have some options.
We could go collect some more data for the predictors with a high VIF, and then recreate the VIFs based on this new and more complete data set.
If that still doesn’t change it then we will have to remove some of those variables.
ConclusionNow we have seen several methods that you should consider when fitting your regression model that will help improve its preformance.
Key VocabResiduals, fitted values, QQ-plot, multicollinearity, sample size, leverage, influence, outliers, Cook’s Distance, VIF.. More details