R-Squared Recipe
Calculating R-squared from scratch.
Andrew H
Jun 18

R-squared can be abstruse to students learning data science.
Instead of introducing the mathematical formulas involved, I thought it might be refreshing to show how it is calculated from scratch and to explain each step in plain English.
I’m going for a cooking-show theme here.
Ingredients:
A dataset that contains at least 1 serving of an independent variable (X) and exactly 1 serving of a dependent variable (Y).
A linear regression line to drizzle on top of the data.
A horizontal line of the average Y to drizzle on the data.
Python, other software, or even just pen and paper.

Dataset and Linear Regression
X, in this example, will be the integers 0–9; Y will be the first 10 numbers of the Fibonacci sequence.
After plotting the data points, we will drizzle our Ordinary Least Squares (OLS) regression line on top.
More info about linear regressions can be found here.
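But the main idea behind OLS is that it predicts where the Y's will be for each X, choosing the straight line that makes the sum of squared vertical distances between the predictions and the actual Y's as small as possible.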
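As a rough sketch of this step, the closed-form OLS fit for a single X can be written in plain Python. The Y values are an assumption: the article does not list its exact Fibonacci terms, so this uses the ten terms starting 1, 2, 3, 5. `fibonacci` and `fit_ols` are hypothetical helper names, not part of any library.

```python
def fibonacci(n, a=1, b=2):
    """Return n Fibonacci terms starting from the pair a, b (assumed start)."""
    terms = []
    for _ in range(n):
        terms.append(a)
        a, b = b, a + b
    return terms


def fit_ols(xs, ys):
    """Return (slope, intercept) of the least-squares line for one predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Covariation of X with Y, and variation of X around its own mean.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


xs = list(range(10))  # X: the integers 0 through 9
ys = fibonacci(10)    # Y: assumed to be 1, 2, 3, 5, ..., 89
slope, intercept = fit_ols(xs, ys)

# The regression line's prediction of Y for each X.
predictions = [intercept + slope * x for x in xs]
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

With a different choice of starting terms the slope and intercept change, but the recipe stays the same.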
Our dish is looking good so far!

Horizontal Average Y Line
Next, we will splash another line on our data.
This line will be used to measure total variance later.
If we theoretically only had Y data (and no X), the best predictive model we would be able to make would be to guess the average of Y every time.
This is a key step in cooking up r-squared, as you will see in a minute.
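To see why the average is the natural baseline, a small sketch can compare the sum of squared errors for the mean against a few other constant guesses. The Y values are assumed stand-ins (the article does not list its exact values), and `sse_for_constant` is a hypothetical helper.

```python
ys = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]  # assumed Y values, not from the article


def sse_for_constant(ys, guess):
    """Sum of squared errors when every prediction is the same constant."""
    return sum((y - guess) ** 2 for y in ys)


y_mean = sum(ys) / len(ys)
baseline_sse = sse_for_constant(ys, y_mean)

# Try a few other constant guesses around the mean and report
# whether each one does at least as badly as guessing the mean.
other_guesses = [y_mean - 5, y_mean - 1, y_mean + 1, y_mean + 5]
for guess in other_guesses:
    print(guess, sse_for_constant(ys, guess) >= baseline_sse)
```

Every alternative constant gives a larger sum of squared errors than the mean, which is why the horizontal line at the average Y serves as the "no X" model.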

Squared differences between the actual data points and each line
If we measure the difference between each data point and the linear regression line, square each difference, and add them up, we get the variance that the regression model leaves unexplained (strictly speaking, the residual sum of squares).
In this example, that sum is 1753.
If we measure the difference between each data point and the horizontal line, square each difference, and add them up, we get the total variance that exists in the data (the total sum of squares), which here is 7524.
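The two sums can be sketched in a few lines of plain Python. The Y values are an assumption (the article does not list its exact Fibonacci terms), so the totals printed here need not equal the 1753 and 7524 quoted in the article; `ss_res` and `ss_tot` are names chosen for this sketch.

```python
xs = list(range(10))                       # X: the integers 0 through 9
ys = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]   # assumed Y values, not from the article

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Closed-form OLS fit for a single predictor.
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
sxx = sum((x - x_mean) ** 2 for x in xs)
slope = sxy / sxx
intercept = y_mean - slope * x_mean
predictions = [intercept + slope * x for x in xs]

# Squared differences between each point and the regression line.
ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))

# Squared differences between each point and the horizontal average line.
ss_tot = sum((y - y_mean) ** 2 for y in ys)

print(f"regression line: {ss_res:.1f}, horizontal line: {ss_tot:.1f}")
```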

Final Step and Check
R-squared is defined as the “proportion of the variance for a dependent variable that’s explained by an independent variable”, so all we have to do is put the difference of the 2 variances over the total variance to find R-squared: (7524 − 1753) / 7524 ≈ 0.767.
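Written as a formula, the same step looks like this. The symbols are shorthand introduced here rather than the article's own notation: SS_tot is the sum of squared differences from the horizontal line, and SS_res is the sum of squared differences from the regression line.

```latex
R^2 = \frac{SS_{\text{tot}} - SS_{\text{res}}}{SS_{\text{tot}}}
    = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}
    = \frac{7524 - 1753}{7524}
    = \frac{5771}{7524}
    \approx 0.767
```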
Not a bad score!
Just to make sure we did it correctly, let’s check our answer against sklearn’s “r2_score” function.
We have a match.
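A check along those lines might look like the sketch below. It assumes scikit-learn is installed and skips the comparison if it is not. The Y values are again assumed stand-ins, so the score printed here can differ slightly from the one in the article.

```python
try:
    from sklearn.metrics import r2_score
except ImportError:  # scikit-learn is optional for this sketch
    r2_score = None

xs = list(range(10))                       # X: the integers 0 through 9
ys = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]   # assumed Y values, not from the article

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Closed-form OLS fit, then the line's prediction for each X.
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean
predictions = [intercept + slope * x for x in xs]

# R-squared from scratch: (total - unexplained) / total.
ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
ss_tot = sum((y - y_mean) ** 2 for y in ys)
manual_r2 = (ss_tot - ss_res) / ss_tot

# Library answer for comparison (None when scikit-learn is unavailable).
sklearn_r2 = r2_score(ys, predictions) if r2_score is not None else None

print("from scratch:", round(manual_r2, 4))
print("sklearn:     ", sklearn_r2)
```

`r2_score` takes the true values first and the predictions second; swapping them gives a different number, which is an easy mistake to make when checking by hand.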
Hope this helped.