Data Science Modeling: How to Use Linear Regression with Python
Taking a look at R², Mean Squared Error, and more
By Brian Henriquez, Chris Kazakis, and Dean Sublett (May 24)

Introduction and Objectives
Linear regression is a widely used technique in data science because of the relative simplicity of implementing and interpreting a linear regression model.
This tutorial will walk through simple and multiple linear regression models of the 80 Cereals dataset using Python and will discuss some relevant regression metrics, but we do not assume prior experience with linear regression in Python.
The 80 Cereals dataset can be found here.
Here are some objectives:
Understand the meaning and limitations of R²
Learn about evaluation metrics for linear regression and when to use them
Implement a simple and multiple linear regression model with the 80 Cereals dataset

Exploring the Data
After downloading the dataset, import the necessary Python packages and the cereals dataset itself. In the output from cereal.head(), we see that each row is a brand of cereal, and each column is a nutritional (protein, fat, etc.) or identifying feature (manufacturer, type) of the cereal.
Notice that rating is the response or dependent variable.
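If you are following along, the loading step can be sketched as below. The filename cereal.csv and the two sample rows are assumptions we have made for illustration, not values quoted from the dataset:

```python
import io
import pandas as pd

# In practice the 80 Cereals dataset would be read straight from disk:
#   cereal = pd.read_csv("cereal.csv")
# Here we parse a two-row stand-in with the columns used in this tutorial
# (values are illustrative, not quoted from the dataset).
sample_csv = io.StringIO(
    "name,calories,fiber,sugars,cups,rating\n"
    "Bran Cereal,70,10.0,6,0.33,68.4\n"
    "Corn Cereal,100,1.0,2,1.00,45.9\n"
)
cereal = pd.read_csv(sample_csv)
print(cereal.head())  # each row is a cereal; each column is a feature
```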
Next, we created a pairs plot of the correlations between each feature of the dataset, and from this visualization we selected three predictor variables: calories, fiber, and sugars.
The plot displaying every correlation is too large to share here, but we can take a closer look with a smaller pairs plot that includes only our predictor variables. In this smaller pairplot, we can see three scatter plots with fitted least squares lines, one for each predictor variable against the response variable. Now that we’re more familiar with the data, we can begin setting up our linear regression models.
Note: For the sake of easily conveying the concepts, we do not calculate the R² and adjusted R² values using a test/train split of the data.
But please recognize that using a test/train split of randomly selected observations is considered the best practice, and this is how we present our errors and AIC/BIC near the end of the tutorial.
The Linear Regression ModelWe want to discuss R² and its significance to linear regression models.
But to understand exactly what R² is, first we need to understand what a linear model is.
Let’s look at a scatter plot comparing the calories in a serving of a cereal and its rating. We can clearly see that servings of cereal with more calories generally receive poorer ratings.
If we assume there’s some relationship between these two variables, then we can construct a model that predicts a cereal’s rating based on the number of calories.
To verify that the relationship is, in fact, linear, we can plot the residuals of our model on a graph and look for patterns.
A clear pattern in the residuals might suggest that another model, such as a quadratic or logarithmic one, may better describe the relationship between the two variables. Let’s check the residuals: there isn’t a clear pattern, so we have no evidence that a non-linear equation would fit better.
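The residual check can be sketched on made-up numbers (standing in for the calories and rating columns): fit the line, subtract the predictions, and look at what is left over. For an ordinary least squares fit with an intercept, the residuals sum to roughly zero:

```python
import numpy as np
from scipy import stats

# Toy data standing in for calories (x) and rating (y)
x = np.array([70.0, 90.0, 100.0, 110.0, 120.0, 140.0])
y = np.array([68.0, 55.0, 46.0, 40.0, 35.0, 30.0])

fit = stats.linregress(x, y)
predicted = fit.intercept + fit.slope * x
residuals = y - predicted  # plot these against x to look for patterns

# Least squares residuals (with an intercept) sum to approximately 0
print(residuals.sum())
```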
For linear regression, we’ll be interested in the formula y = b_0 + b_1 * x, where x is the predictor variable for the response variable y. To make a model, we can use the scipy linregress method.
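A minimal sketch of the call, on made-up points rather than the cereal columns (with the real data, x and y would be cereal["calories"] and cereal["rating"]):

```python
from scipy import stats

# Hypothetical predictor/response pairs lying exactly on y = 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

result = stats.linregress(x, y)
# slope (b_1), intercept (b_0), and correlation coefficient (R)
print(result.slope, result.intercept, result.rvalue)
```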
And we get the following output:
LinregressResult(slope=-0.06030617024600228, …)
The first item is b_1 (the slope), the second is b_0 (the intercept), and the third value is the R value, also known as the correlation coefficient. The R value ranges from -1 to 1 and measures the strength of the relationship between an explanatory variable and a response variable. The R value for calories versus rating is -.689, which shows there is a strong negative relationship between the two variables.
The further the R value is from 0, the better a model is at predicting values.
R²
By squaring R, we get the coefficient of determination, R².
R² is a value that represents what percentage of the variation in the y variable can be explained by the variation in the x variable.
A high R² value indicates a stronger model.
Let’s look at some R² values in our dataset. We print the following:
R² of model with Cup Predictor: 0.0412740112014871
R² of model with Calories Predictor: 0.4752393123451636
These R² values indicate that calories is a better predictor of rating than cups is.
Simple linear regression is useful, but oftentimes we want to see how several variables can be used to predict a single variable. Let’s get a 2D array of predictors from cereal by taking a slice from it with all our variables of interest. calories, fiber, and sugars appeared to be good predictors when we reviewed the correlation pairs plot earlier, so let’s look at a model using those three. We get the following output:
R²: 0.8483669504178866
R² Adjusted: 0.8070124823500374
We find that the R² value has increased from .475 in the one-variable model (with calories as the predictor) to .848. This seems to indicate that the predictive power of our model has increased.
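Under the hood, a multiple regression is an ordinary least squares fit over several columns at once. Here is a sketch using numpy’s lstsq on toy numbers; with the real data, X and y would be cereal[["calories", "fiber", "sugars"]] and cereal["rating"]:

```python
import numpy as np

# Toy design matrix standing in for cereal[["calories", "fiber", "sugars"]]
X = np.array([[100.0,  2.0,  6.0],
              [ 70.0, 10.0,  5.0],
              [110.0,  1.0, 14.0],
              [ 90.0,  4.0,  3.0],
              [120.0,  0.0, 12.0]])
y = np.array([34.0, 68.0, 30.0, 50.0, 28.0])  # toy ratings

# Prepend a column of ones for the intercept, then solve least squares
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# R² = 1 - SSE / SST
predicted = A @ coef
ss_res = ((y - predicted) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot
print(r_squared)
```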
However, let’s add a poor predictor, cups, to this multiple linear regression model and see what happens. This code gives the following output:
R²: 0.8490487016343364
R² Adjusted: 0.788668182288071
Recall that the number of cups per serving of cereal appeared to have almost no correlation with consumer rating in the single-variable case. But when we add it to the model, the overall R² increases to .849, which implies the predictive power of the model improved.
However, based on what we know, this four-variable model should be no better than the three-variable model.
By virtue of how the R² value is calculated, adding more variables to a model will never decrease, and will almost always increase, the R² value. So we need to compare the adjusted R² values, which mitigate the increase in R² due to additional variables.
The formula for the adjusted R² is
adjusted R² = 1 − (1 − R²)(N − 1) / (N − p − 1)
where N is the total sample size and p is the number of predictors. Using this, we find that the three-variable model has an adjusted R² of .807, while the four-variable model has an adjusted R² of .789. Therefore, the three-variable model is better by this metric.
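The formula translates directly into a small helper (a sketch, with N and p as defined above):

```python
def adjusted_r_squared(r_squared, n_samples, n_predictors):
    """Adjusted R² = 1 - (1 - R²) * (N - 1) / (N - p - 1)."""
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Adding a predictor that barely raises R² can still lower the adjusted
# value, as happened with the cups example above.
print(adjusted_r_squared(0.50, 20, 2))
```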
R² is one of the most important metrics for evaluating how well a linear model fits data, so it’s important to have an intuitive understanding of what it means.
Knowing the limitations of R² and how those limitations can be mitigated is equally as important when implementing linear regression models.
Mean Squared Error (MSE)
Regression models have a number of different evaluation metrics. One of the most popular metrics, and the one we’ll discuss first, is mean squared error (MSE).
MSE = (1/n) · Σ (yᵢ − ŷᵢ)²
MSE is an evaluation metric that measures the average of the squared differences between the observed and predicted values.
In other words, MSE tells us how accurate or inaccurate our linear regression model is — the lower the MSE, the “better” the model is at predicting values.
Let’s find the MSE of our regression model: our variable mse returns roughly 26.63. Another evaluation metric at our disposal is root mean squared error (RMSE), which is simply the square root of our MSE. Using the square root function from the Python math module, sqrt(mse) returns 5.1607. It’s important to note that our RMSE value shares the same units as the response variable (we took the square root of squared errors). Our RMSE value of 5.1607 falls relatively low on the 0–100 range of the rating variable, so our multiple linear regression model is “good” at predicting the rating of a cereal brand.
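Both metrics are short enough to sketch in plain Python (scikit-learn’s mean_squared_error would do the same job on the real predictions):

```python
import math

def mse(observed, predicted):
    """Mean of the squared differences between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Square root of the MSE; same units as the response variable."""
    return math.sqrt(mse(observed, predicted))

y_true = [60.0, 50.0, 40.0]  # toy ratings
y_pred = [58.0, 53.0, 41.0]
print(mse(y_true, y_pred), rmse(y_true, y_pred))
```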
But there are other errors we may use.
Mean Absolute Error (MAE)
The next regression evaluation metric we’ll consider is mean absolute error (MAE).
MAE = (1/n) · Σ |yᵢ − ŷᵢ|
Since MSE squares the differences between observed and predicted values, larger disparities between actual and predicted values are “punished” more harshly by MSE than by MAE.
Because of the squared terms, MSE is more sensitive to outliers than MAE is.
If we decided that outliers in our dataset were not significant for analyzing the data, we might turn to MAE before MSE, since the outliers’ residuals would not be exaggerated by squaring. Let’s find the MAE: our mae variable returns 3. Our MAE is relatively small given the 0–100 range of rating, so our MAE indicates that our model is reasonably accurate in its predictions.
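A plain-Python sketch of MAE (scikit-learn’s mean_absolute_error is the library equivalent), with toy numbers chosen so one large miss shows how differently MAE and MSE treat it:

```python
def mae(observed, predicted):
    """Mean of absolute differences; errors are not squared, so a single
    large miss influences the score less than it does with MSE."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

# One large miss (10 off) dominates MSE far more than it dominates MAE
y_true = [60.0, 50.0, 40.0]
y_pred = [58.0, 53.0, 50.0]
print(mae(y_true, y_pred))
```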
Mean Absolute Percentage Error (MAPE)
The final regression evaluation metric we’ll consider is the mean absolute percentage error (MAPE).
MAPE = (100% / n) · Σ |(yᵢ − ŷᵢ) / yᵢ|
MAPE gives the accuracy of predictive models as a percentage.
Notice the similarity in the MAE and MAPE formulas.
Like MAE, MAPE is not greatly influenced by outliers.
However, utilize MAPE with caution, because:
MAPE is prone to division-by-zero errors (see the denominator within the summation);
MAPE can grow very large if the actual values are very small (again, see the division operation in the summation);
MAPE is biased toward predictions that are smaller than the observed values.
Let’s find the MAPE for our model. Our MAPE function returns the following percentage: 8.5%. So our forecast is “off” by about 8.5% on average.
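The MAPE function used above can be sketched as follows (toy numbers; with the real model, the arguments would be the observed and predicted ratings):

```python
def mape(observed, predicted):
    """Mean absolute percentage error. Undefined when any observed value
    is zero (division by zero), per the cautions above."""
    return 100.0 * sum(abs((o - p) / o)
                       for o, p in zip(observed, predicted)) / len(observed)

y_true = [50.0, 40.0, 80.0]
y_pred = [45.0, 44.0, 80.0]
print(mape(y_true, y_pred))  # percentage "off" on average
```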
AIC and BIC
AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are objective methods of evaluating your regression models and determining the best subset of predictors (i.e., which model fits better).
When you add parameters to your model, it will always fit a little better.
But then you have the risk of losing information on the real underlying pattern.
Thus, there is a trade-off between the number of parameters and the amount of error your model accounts for.
AIC and BIC evaluate the ability of the models to account for additional variation in the variable you’re predicting but without overfitting the model.
AIC
AIC allows you to estimate the amount of information lost in your models so that you can compare which models work best and pick the more appropriate subset of predictors.
More specifically, the AIC value looks at the relative distance between the true likelihood function of the data and the fitted likelihood function of your model.
The smaller that distance, then the closer the model is to the true representation of your data.
AIC is denoted by this formula:
AIC = N · ln(SSE / N) + 2K
where N is the number of observations, SSE is the model’s sum of squared errors, and K is the number of parameters fit, plus 1. If we compare the fit of two models using the AIC method, the model with the lower AIC value has the better fit.
Let’s find the AIC values of our two multiple regression models that we used earlier.
One has three predictors and the other has four.
First we’ll define the values we’ll plug into the formula, and then we’ll run the formula. This gives the following output:
AIC of Model with Three Predictors: 60.51438447233831
AIC of Model with Four Predictors: 62.31365180026097
From what we see, the model with three predictors has a lower AIC value and thus is a better fit than the model with four predictors (though not by much in this example).
BIC
BIC is similar to AIC, but it is much stricter in penalizing your model for adding more parameters. It is denoted by this formula:
BIC = N · ln(SSE / N) + K · ln(N)
where N and K are defined as before. If we compare the fit of two models using the BIC method, the model with the lower BIC value has the better fit, just as with the AIC method.
Let’s find the BIC values for the same two models we just used.
Here the only difference is the penalty we multiply the number of parameters by. This gives the following output:
BIC of Model with Three Predictors: 63.60473936129743
BIC of Model with Four Predictors: 66.17659541145987
From what we see here, the model with three predictors has a lower BIC value and thus is a better fit than the model with four predictors.
Since the BIC penalty is stricter than the AIC penalty, the BIC values are larger than the AIC values for their respective models.
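Both criteria translate into short helpers. The least-squares forms below, with SSE the sum of squared errors and K as defined above, are a sketch consistent with the formulas in this section, not a reproduction of the tutorial’s exact code:

```python
import math

def aic(sse, n_obs, k_params):
    """AIC = N * ln(SSE / N) + 2K for a least squares model."""
    return n_obs * math.log(sse / n_obs) + 2 * k_params

def bic(sse, n_obs, k_params):
    """BIC = N * ln(SSE / N) + K * ln(N); the penalty grows with N."""
    return n_obs * math.log(sse / n_obs) + k_params * math.log(n_obs)

# Toy comparison: identical fit quality, one extra parameter.
# AIC charges +2 per extra parameter; BIC charges +ln(N).
print(aic(400.0, 16, 4), aic(400.0, 16, 5))
print(bic(400.0, 16, 4), bic(400.0, 16, 5))
```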
Due to the difference in penalty, AIC can choose a model with more parameters than BIC.
It’s recommended that you use AIC and BIC together and make decisions on your models based on both sets of results.
In this case, the AIC and BIC agreed with each other and chose the same models.
Key Vocabulary
To sum up, we discussed:
R²: an indicator of how well the linear regression model predicts the response variable
Adjusted R²: an indicator of how well the multiple linear regression model accounts for the variance in the dependent variable while correcting for the number of parameters in the model
MSE (mean squared error): an evaluation metric that greatly punishes outliers; probably the first error you’ll calculate, and used when outliers represent a true phenomenon of the dataset
RMSE (root mean squared error): the square root of MSE; shares the same units as the response variable, so RMSE may be more “interpretable” than MSE
MAE (mean absolute error): an evaluation metric that lessens the significance of outliers when measuring error; used when outliers do not represent a true phenomenon of the dataset
MAPE (mean absolute percentage error): a measure of a regression model’s accuracy as a percentage; prone to runtime errors or unusually large values when the response variable takes on small values
AIC (Akaike Information Criterion): an evaluation of the amount of information lost in different models that penalizes an increase in parameters. Regardless of the size of your data, it always has a chance of choosing too big a model. Best used in conjunction with BIC.
BIC (Bayesian Information Criterion): similar to AIC but penalizes more heavily. Regardless of the size of your data, it always has a chance of choosing too small a model. Best used in conjunction with AIC.
Conclusion
In this tutorial, we showed how to implement simple and multiple linear regression models with Python, along with different methods of evaluating these models and their error.
When working with your own datasets, you can choose to use any of these methods to evaluate your regression models and error.
However, it may be in your best interest to use a number of these and see how their results align or differ in order to decide which of your models has the best representation of your data.
Now you should be more comfortable with implementing your own linear regression models and be more aware of the similarities and differences among all the discussed regression metrics.