Plenty of times, I suppose?Now, how many times have you used the statsmodels library to examine the model by running goodness-of-fit tests?It is very common in a Python-based data science learning track, to go like this,The answer to the question “Is something missing” is yes!Often, there is plenty of discussion about regularization, bias-variance trade-off, or scalability (learning and complexity curves) plots.
But, is there sufficient discussion around the following plots and lists?Residuals vs.
predicting variables plotsFitted vs.
residuals plotHistogram of the normalized residualsQ-Q plot of the normalized residualsShapiro-Wilk normality test on the residualsCook’s distance plot of the residualsVariance inflation factor (VIF) of the predicting featuresIt is clear that you have to wear the hat of a statistician, not only a data mining professional, for this part of the machine learning pipeline.
The issue with Scikit-learnIt can be safely assumed that the majority of statisticians-turned-data scientists run the goodness-of-fit tests regularly on their regression models.
But many young data scientists and analysts depend heavily, for data-driven modeling, on ML-focused packages like Scikit-learn, which, although being an awesome library and virtually a silver bullet for machine learning and prediction tasks, do not support easy and fast evaluation of model quality based on standard statistical tests.
Therefore, it is imperative that good data science pipeline, in addition to using an ML-focused library like Scikit-learn, include some standardized set of code to evaluate the quality of the model using statistical tests.
In this article, we show such a standard set of evaluations for a multivariate linear regression problem.
We will use the statsmodels library for regression modeling and statistical tests.
A brief overview of linear regression assumptions and the key visual testsThe assumptionsThe four key assumptions that need to be tested for a linear regression model are,Linearity: The expected value of the dependent variable is a linear function of each independent variable, holding the others fixed (note this does not restrict you to use a nonlinear transformation of the independent variables i.
you can still model f(x) = ax² + bx + c, using both x² and x as predicting variables.
Independence: The errors (residuals of the fitted model) are independent of each other.
Homoscedasticity (constant variance): The variance of the errors is constant with respect to the predicting variables or the response.
Normality: The errors are generated from a Normal distribution (of unknown mean and variance, which can be estimated from the data).
Note, this is not a necessary condition to perform linear regression unlike the top three above.
However, without this assumption being satisfied, you cannot calculate the so-called ‘confidence’ or ‘prediction’ intervals easily as the well-known analytical expressions corresponding to Gaussian distribution cannot be used.
For multiple linear regression, judging multicollinearity is also critical from the statistical inference point of view.
This assumption assumes minimal or no linear dependence between the predicting variables.
Outliers can also be an issue impacting the model quality by having a disproportionate influence on the estimated model parameters.
What plots are can be checked?We can never know the true errors, no matter how much data we have.
We can only estimate and draw inference about the distribution from which the data is generated.
Therefore, the proxy of true errors are the residuals, which are just the difference between the observed values and the fitted values.
Bottom line — we need to plot the residuals, check their random nature, variance, and distribution for evaluating the model quality.
This is the visual analytics needed for goodness-of-fit estimation of a linear model.
Apart from this, multicollinearity can be checked from the correlation matrix and heatmap, and outliers in the data (residual) can be checked by so-called Cook’s distance plots.
Example of regression model quality evaluationThe entire code repo for this example can be found in the author’s Github.
We are using the concrete compressive strength prediction problem from the UCI ML portal.
The concrete compressive strength is a highly complex function of age and ingredients.
Can we predict the strength from measurement values of these parameters?Scatterplot of variables to check for linearityWe can simply check the scatterplot for visual inspection of the assumption of linearity.
Pairwise scatter plots and correlation heatmap for checking multicollinearityWe can use the pairplot function from the seaborn library to plot the pairwise scatterplots of all combinations.
Furthermore, if the data is loaded in Pandas, we can easily compute the correlation matrix and pass that onto the special plotting function of statsmodels to visualize the correlation as a heatmap.
Model fitting using statsmodel.
ols() functionThe main model fitting is done using the statsmodels.
It is an amazing linear model fit utility which feels very much like the powerful ‘lm’ function in R.
Best of all, it accepts R-style formula for constructing the full or partial model (i.
involving all or some of the predicting variables).
You may question, in the age of big data, why bother about creating a partial model and not throw all the data in?.That is because confounding or hidden bias may be present in the data which can be addressed only by controlling for certain factors.
In any case, the summary of the model fitted through this model already provides rich statistical information about the model such as t-statistics and p-values corresponding to all the predicting variables, R-squared, and adjusted R-squared, AIC and BIC, etc.
predicting variables plotsNext, we can plot the residuals versus each of the predicting variables to look for independence assumption.
If the residuals are distributed uniformly randomly around the zero x-axes and do not form specific clusters, then the assumption holds true.
In this particular problem, we observe some clusters.
residuals plot to check homoscedasticityWhen we plot the fitted response values (as per the model) vs.
the residuals, we clearly observe that the variance of the residuals increases with response variable magnitude.
Therefore, the problem does not respect homoscedasticity and some kind of variable transformation may be needed to improve model quality.
Histogram and Q-Q plot of normalized residualsTo check the assumption of normality of the data generating process, we can simply plot the histogram and the Q-Q plot of the normalized residuals.
Additionally, we can run the Shapiro-Wilk test on the residuals to check for the Normality.
Outlier detection using Cook’s distance plotCook’s distance essentially measures the effect of deleting a given observation.
Points with a large Cook’s distance need to be closely examined for being potential outliers.
We can plot the Cook’s distance using a special outlier influence class from statsmodels.
Variance influence factorsThe OLS model summary for this dataset shows a warning for multicollinearity.
But how to check which factors are causing it?We can compute the variance influence factors for each predicting variable.
It is the ratio of variance in a model with multiple terms, divided by the variance of a model with one term alone.
Again, we take advantage of the special outlier influence class in statsmodels.
Other residuals diagnosticsStatsmodels have a wide variety of other diagnostics tests for checking model quality.
You can take a look at these pages.
Residual diagnostics testsGoodness-of-fit testsSummary and thoughtsIn this article, we covered how one can add essential visual analytics for model quality evaluation in linear regression — various residual plots, normality tests, and checks for multicollinearity.
One can even think of creating a simple suite of functions capable of accepting a scikit-learn type estimator and generating these plots for the data scientist to quickly check the model quality.
Currently, although scikit-learn does not have detailed statistical tests or plotting capabilities for the model quality evaluation, Yellowbrick is a promising Python library which can add intuitive visualization capability on scikit-learn objects.
We can hope that in the near future, statistical tests can be added to scikit-learn ML estimators directly.
If you have any questions or ideas to share, please contact the author at tirthajyoti[AT]gmail.
Also, you can check the author’s GitHub repositories for other fun code snippets in Python, R, or MATLAB and machine learning resources.
If you are, like me, passionate about machine learning/data science, please feel free to add me on LinkedIn or follow me on Twitter.
Tirthajyoti Sarkar – Sr.
Principal Engineer – Semiconductor, AI, Machine Learning – ON…Georgia Institute of Technology Master of Science – MS, Analytics This MS program imparts theoretical and practical…www.