Is the relationship best fit with a linear regression?

Source of original data: Penn State.
First, let's bring in the data and a few important modules for the analysis:

```python
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

data = pd.read_excel("….xlsx")  # filename truncated in the original
df = data[['age', 'length']]
df
```

There are 77 instances in the data set.
Below is the head of the original data:

Now let's visualize the scatter plot. We'll be trying to predict length from age, so the axes are in their respective positions:

```python
x = df['age']
y = df['length']
plt.scatter(x, y)
plt.title('Scatterplot of Length vs Age')
plt.show()
```

As you can see, there looks to be a pattern: as the fish get older, their length clearly tends to increase.
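As a quick sanity check on that visual impression, we can quantify the linear association with a Pearson correlation coefficient. Here's a sketch on made-up numbers (not the actual fish data), just to show the calculation:

```python
import numpy as np

# Hypothetical age/length pairs standing in for the real data set
age = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 6], dtype=float)
length = np.array([65, 70, 95, 100, 125, 130, 145, 150, 160, 170], dtype=float)

# Pearson's r: +1 is a perfect positive linear relationship, 0 is none
r = np.corrcoef(age, length)[0, 1]
print(round(r, 3))
```

A value close to +1, as here, supports fitting a line in the first place.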
To gain further perspective on the data, let’s break down each axis’s information into its own univariate distribution histogram.
Y - Length of Fish:

```python
stdy = y.std()
meany = y.mean()
sns.distplot(y)
print("The std of y is: " + str(stdy) + " The mean of y is: " + str(meany))
```

X - Age of Fish:

```python
stdx = x.std()
meanx = x.mean()
sns.distplot(x)
print("The std of x is: " + str(stdx) + " The mean of x is: " + str(meanx))
```

We can see that the average length of the fish population is 143.6 cm and the average age is 3.
The standard deviations are somewhat different, with age being a bit more volatile than length.
Linear Regression:

A linear regression is one of the simplest forms of predictive model. Simply put, it measures the relationship between two variables by fitting a linear equation to a dataset. One variable is considered the explanatory variable (age), and the other the dependent variable (length). The fit minimizes the sum of squared vertical distances between the line and every data point.
Below, we’re using the scipy stats module to calculate the key metrics for our line.
Outputted beneath are the intercept and the slope, respectively.
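Under the hood, this is ordinary least squares, which for a single predictor has a simple closed form. Here's a minimal sketch on hypothetical numbers showing the slope and intercept computed by hand:

```python
import numpy as np

# Hypothetical data standing in for the fish measurements
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([70, 100, 125, 145, 158, 165], dtype=float)

# Closed-form OLS: slope = cov(x, y) / var(x); intercept from the means
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print(intercept, slope)
```

These match what a library routine would return for the same inputs.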
```python
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
intercept, slope
```

Now we will plot this line on the original scatterplot:

```python
def linefitline(b):
    return intercept + slope * b

line = linefitline(x)
plt.scatter(x, y)
plt.title('Scatterplot of Length vs Age - Linear Regression')
plt.plot(x, line, c='g')
plt.show()
```

```python
r2_lin = r_value * r_value
print('The r-squared value is: ' + str(r2_lin))
```

The r-squared value for this regression line shows that the age of the fish explains 73% of the variation in length.
R-squared measures the proportion of the variance in the dependent variable that the model explains. The remaining 27% of the variance in length could be explained by other independent variables such as food availability, water quality, sunlight, fish genetics, etc.
If we had the data on all those attributes, we could run a multivariable regression and have a more predictive model.
But, alas, we live in a world with limited data.
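To make the r-squared idea concrete, it can be computed directly as one minus the ratio of residual variance to total variance. A small sketch with made-up observations and predictions:

```python
import numpy as np

# Hypothetical observed values and model predictions
y = np.array([70, 100, 125, 145, 158, 165], dtype=float)
y_pred = np.array([75, 95, 120, 150, 160, 163], dtype=float)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - y_pred) ** 2)      # variance the model fails to explain
ss_tot = np.sum((y - y.mean()) ** 2)    # total variance around the mean
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))
```

This is the same quantity sklearn's r2_score returns.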
Visually, our simple linear model doesn't pass through the middle of each cluster of y points. For years 1, 2, 5 and 6, it goes above the median of the cluster.
This is most likely because the majority of the population is in years 3 and 4, and this shifts the line upwards.
It seems like the optimal line would need to curve to match the data more accurately.
That’s where the polynomial regression comes in.
Polynomial Regression:

Simply put, polynomial regression models can bend. They can be constructed to the nth degree to minimize squared error and maximize r-squared. Depending on the degree, the line of best fit can have more or fewer bends: the higher the degree, the more the curve can flex.
Below is some code to create the new line and graph it on our scatterplot. The part `p = np.poly1d(np.polyfit(x, y, 2))` is where we can adjust the degree. This fit is a quadratic function because the highest power is two. As we can see on the plot below, the new polynomial model matches the data more accurately.
```python
x = df['age']
y = df['length']
plt.scatter(x, y)
plt.title('Scatterplot of Length vs Age')
p = np.poly1d(np.polyfit(x, y, 2))
xp = np.linspace(x.min(), x.max(), 100)  # smooth grid of ages for drawing the curve
plt.plot(xp, p(xp), c='r')
plt.show()
```

```python
r2 = r2_score(y, p(x))
print('The r-squared value is: ' + str(r2))
```

The r-squared value is 0.80, compared to the 0.73 value we saw in the simple linear model.
This means that, in this new model, 80% of the variance in length is explained by age.
We can now experiment with changing the degree of our model to see if we can find a better-fitting line.
However, we must keep in mind that over-fitting is a risk we will face the higher we go.
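One way to see the diminishing returns is to sweep the degree and watch r-squared: on the training data it can only go up, but the gains shrink while the curve starts chasing noise. A sketch on synthetic fish-like data (made up, not the Penn State set):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: length saturates with age, plus measurement noise
age = np.linspace(1, 6, 40)
length = 190 * (1 - np.exp(-0.5 * age)) + rng.normal(0, 8, age.size)

def r_squared(y, y_pred):
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Higher degree never lowers training r^2, but improvements taper off
for deg in range(1, 7):
    p = np.poly1d(np.polyfit(age, length, deg))
    print(deg, round(r_squared(length, p(age)), 3))
```

The big jump from degree 1 to 2 reflects the real curvature; the tiny gains afterwards are mostly the polynomial memorizing noise.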
Below is an example of a polynomial raised to the 6th degree:

```python
x = df['age']
y = df['length']
plt.scatter(x, y)
plt.title('Scatterplot of Length vs Age')
p = np.poly1d(np.polyfit(x, y, 6))
xp = np.linspace(x.min(), x.max(), 100)
plt.plot(xp, p(xp), c='b')
plt.show()
```

It bends too far toward the outliers and hugs the data too closely.
The r-squared value for this model is 0.804, not much higher than the quadratic model's.
For this data set, most would agree that the quadratic function matches best.
Thanks for reading! Hope this helped.