Simulating (Replicating) R regression plot in Python using sklearn
vikashraj luhaniwal
Feb 22
When it comes to data science and machine learning workloads, R and Python are the most popular and powerful languages.
Python is often treated as a general-purpose language with an easy-to-understand syntax, whereas R is geared towards statistical analysis and offers around 12,000 packages.
There are dozens of articles available comparing Python and R from a subjective point of view.
I am not going to favour one language over the other here.
In this post, we will discuss the replication of R regression plots in Python using sklearn.
Most of R's functionality can be converted to Python easily and directly, but for some features it is surprisingly hard to find an equivalent without writing custom functions.
The plot() function for regression models in R does not have a direct Python equivalent for all of its plots.
Let us discuss it using the faithful dataset available in R.
The dataset contains 272 observations of two variables: eruptions (eruption time in minutes) and waiting (waiting time to the next eruption).
It records the waiting time between eruptions and the duration of each eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
Now let us fit a simple linear regression model in R to this dataset to predict waiting time from eruption time.
Each of the plots that R's plot() function produces for this model has its own significance for validating the assumptions of linear regression.
We are not going to dive deep into them here.
Let's focus on the Python code for fitting the same linear regression model.
Import all the necessary libraries and load the required data.
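As a minimal sketch of the loading step, assume the dataset has first been exported from R, for example with write.csv(faithful, "faithful.csv", row.names = FALSE). The file name and the helper load_faithful below are illustrative assumptions, not part of the original workflow.

```python
import pandas as pd


def load_faithful(path="faithful.csv"):
    """Load the faithful data from a CSV export (hypothetical file name).

    The file is assumed to have the two columns of R's faithful dataset:
    'eruptions' and 'waiting'.
    """
    df = pd.read_csv(path)
    missing = {"eruptions", "waiting"} - set(df.columns)
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    return df[["eruptions", "waiting"]]
```

Calling load_faithful() with no arguments reads faithful.csv from the current directory and returns a two-column DataFrame.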
Now let's fit a linear regression model on the faithful dataset using sklearn.
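The fit itself might look like the sketch below. The helper name fit_waiting_model is an assumption made for illustration; it takes the eruptions and waiting columns, fits sklearn's LinearRegression, and returns the fitted values and residuals that the diagnostic plots need.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def fit_waiting_model(eruptions, waiting):
    """Fit waiting ~ eruptions and return (model, fitted values, residuals)."""
    # sklearn expects a 2-D feature matrix, so reshape the single predictor.
    X = np.asarray(eruptions, dtype=float).reshape(-1, 1)
    y = np.asarray(waiting, dtype=float)
    model = LinearRegression().fit(X, y)
    fitted = model.predict(X)
    residuals = y - fitted  # observed minus predicted
    return model, fitted, residuals
```

With a DataFrame df holding the two columns, this could be called as fit_waiting_model(df["eruptions"], df["waiting"]); model.intercept_ and model.coef_ then hold the estimated coefficients.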
Now let us focus on all the regression plots one by one using sklearn.
Residual plot
It is the first plot generated by the plot() function in R and is also known as the residuals vs fitted plot.
It is useful in validating the assumption of linearity, by drawing a scatter plot of residuals against fitted values.
If the plot depicts any specific or regular pattern, the relationship between the target variable and the predictors is assumed to be non-linear.
The absence of any pattern is a sign of a linear relationship between the selected features and the target variable.
The same plot can be obtained in Python using the residplot() function available in Seaborn.
Here, the first and second arguments are the fitted (predicted) values and the target variable, respectively.
lowess=True ensures that a lowess (smoothed) regression line is drawn, and the line_kws argument lets us customize the attributes of this line.
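A sketch of such a call is shown below. The wrapper residuals_vs_fitted, the axis labels, and the line styling are illustrative choices. Note that seaborn relies on statsmodels for the lowess line, so that package needs to be installed.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


def residuals_vs_fitted(fitted, target, ax=None):
    """Draw a residuals vs fitted plot with a lowess smoother."""
    if ax is None:
        _, ax = plt.subplots()
    # residplot regresses y on x and plots the residuals of that fit;
    # with x = fitted values and y = target these are the model residuals.
    sns.residplot(
        x=np.asarray(fitted, dtype=float),
        y=np.asarray(target, dtype=float),
        lowess=True,
        line_kws={"color": "red", "lw": 1},
        ax=ax,
    )
    ax.set(xlabel="Fitted values", ylabel="Residuals",
           title="Residuals vs Fitted")
    return ax
```

The function returns the matplotlib Axes, so plt.show() or ax.figure.savefig(...) can be used afterwards to display or save the figure.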
QQ plot
This plot shows whether the residuals (errors) are normally distributed.
If the points lie close to the normal line, the residuals are assumed to be normally distributed.
In Python, the same plot can be produced using the probplot() function available in scipy.stats.
Here, the residuals are passed as an argument to the function.
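One way this could look is sketched below. The wrapper qq_plot and the plot title are illustrative. R draws this plot from standardized residuals, whereas the sketch passes in whatever residuals it is given, so the vertical scale may differ from R's output.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats


def qq_plot(residuals, ax=None):
    """Draw a normal Q-Q plot of the residuals on a matplotlib Axes."""
    if ax is None:
        _, ax = plt.subplots()
    # probplot draws the ordered residuals against normal quantiles plus a
    # least-squares reference line; r is the correlation of that line fit.
    (_, _), (_, _, r) = stats.probplot(
        np.asarray(residuals, dtype=float), dist="norm", plot=ax
    )
    ax.set_title("Normal Q-Q")
    return ax, r
```

A value of r close to 1 indicates that the points lie close to the reference line.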
Scale-Location plot
Generally, it is used to check the homoscedasticity of the residuals.
It is a plot of the square root of the absolute standardized residuals against the fitted values.
If it depicts no specific pattern, the fitted regression model upholds the homoscedasticity assumption.
The same plot can be obtained in Python using the regplot() function available in Seaborn.
Here, the first and second arguments are the fitted values and the square-rooted standardized residuals, respectively.
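sklearn does not expose standardized residuals, so the sketch below computes them with numpy as e_i / (s * sqrt(1 - h_ii)), where h_ii is the leverage and s the residual standard error. Both helper names are assumptions made for illustration, and the model is assumed to include an intercept, as sklearn's LinearRegression does by default.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


def standardized_residuals(X, residuals):
    """Return e_i / (s * sqrt(1 - h_ii)) for a linear model with intercept."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    residuals = np.asarray(residuals, dtype=float)
    # Design matrix with an intercept column.
    design = np.column_stack([np.ones(len(X)), X])
    # Leverage h_ii is the diagonal of the hat matrix H = D @ pinv(D).
    leverage = np.einsum("ij,ji->i", design, np.linalg.pinv(design))
    n, p = design.shape
    sigma = np.sqrt(np.sum(residuals ** 2) / (n - p))  # residual std. error
    return residuals / (sigma * np.sqrt(1.0 - leverage))


def scale_location_plot(fitted, std_resid, ax=None):
    """Plot sqrt(|standardized residuals|) against fitted values."""
    if ax is None:
        _, ax = plt.subplots()
    sqrt_abs = np.sqrt(np.abs(np.asarray(std_resid, dtype=float)))
    sns.regplot(
        x=np.asarray(fitted, dtype=float),
        y=sqrt_abs,
        lowess=True,  # smoothed trend line (requires statsmodels)
        scatter_kws={"alpha": 0.6},
        line_kws={"color": "red", "lw": 1},
        ax=ax,
    )
    ax.set(xlabel="Fitted values",
           ylabel="sqrt(|Standardized residuals|)",
           title="Scale-Location")
    return ax
```

Given the predictor column, the residuals, and the fitted values of the model, a call might read scale_location_plot(fitted, standardized_residuals(eruptions, residuals)).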
Leverage plot
Generally, it is used to assess the impact of outliers on the regression fit.
Currently, I have not figured out how to draw it in Python for a model fitted with sklearn.
Once I figure it out, I will update this post.
With statsmodels it is quite easy to draw using the built-in leverage plot, but I am not going to discuss it here.
If you already know how to draw this plot for a model fitted with sklearn, let me know in the comments and I'll add it in!
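One possible approach, offered only as a sketch: leverage values are the diagonal of the hat matrix, which depends on the predictors alone and can therefore be computed with numpy no matter which library fitted the model. The helper names below are hypothetical, an intercept is assumed, and the Cook's distance contours and smoother that R overlays on this plot are left out.

```python
import numpy as np
import matplotlib.pyplot as plt


def leverage_and_std_residuals(X, y, fitted):
    """Return (leverage, standardized residuals) for a fit with intercept."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    residuals = np.asarray(y, dtype=float) - np.asarray(fitted, dtype=float)
    # Design matrix with an intercept column.
    design = np.column_stack([np.ones(len(X)), X])
    # Diagonal of the hat matrix H = D @ pinv(D).
    leverage = np.einsum("ij,ji->i", design, np.linalg.pinv(design))
    n, p = design.shape
    sigma = np.sqrt(np.sum(residuals ** 2) / (n - p))  # residual std. error
    std_resid = residuals / (sigma * np.sqrt(1.0 - leverage))
    return leverage, std_resid


def residuals_vs_leverage(leverage, std_resid, ax=None):
    """Scatter standardized residuals against leverage."""
    if ax is None:
        _, ax = plt.subplots()
    ax.scatter(leverage, std_resid, alpha=0.6)
    ax.axhline(0, color="grey", linestyle=":", linewidth=1)
    ax.set(xlabel="Leverage", ylabel="Standardized residuals",
           title="Residuals vs Leverage")
    return ax
```

With a fitted sklearn model this might be called as leverage, std_resid = leverage_and_std_residuals(X, y, model.predict(X)), where model, X and y are whatever names the fitting step used. Points with both high leverage and a large standardized residual are the ones most likely to influence the fit.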