A linear relationship.
There looks to be a positive correlation between Richmond house prices and their building area.
This gives us the go-ahead to begin building our linear regression model.
On to the next one!

2. Applying a linear regression model

In this section we attempt to fit a linear regression model to the observed relationship between Richmond house prices and their building area.
We will clean our data for any outliers or erroneous points and then make sense of the regression output.
First, let’s visualise our data with a box plot.
Figure 4: Box plot of building area for Richmond houses

We can establish the following points from Figure 4:

1. There looks to be a minimum value of 0 m²
2. There are 5 outliers from 200 m² onward

Let’s tackle point 1 first. It’s quite peculiar that there is a house in Richmond with a Building Area of 0 m², so let’s have a look.
Table 5: House with Building Area of 1 m²

The Building Area is in fact 1 m², but the property has a Land size of 0 m².
After investigating this property online it is clear this isn’t a pantry, but much larger than that.
Hence this data point is incorrect, so let’s filter it out.
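As a rough illustration of this filtering step, here is a small sketch. The records, the field names and the `MIN_PLAUSIBLE_AREA` threshold are all assumptions made for the example; in the article the point was removed after checking the property by hand, not by applying a threshold.

```python
# Hypothetical records; the field names echo the article's tables but the rows are made up.
houses = [
    {"building_area": 1, "land_size": 0, "price": 1_200_000},
    {"building_area": 95, "land_size": 180, "price": 1_050_000},
    {"building_area": 140, "land_size": 250, "price": 1_500_000},
]

# Assumed cut-off (m²) below which a building area is treated as a data-entry error.
MIN_PLAUSIBLE_AREA = 10

# Keep only the rows whose building area is plausible.
cleaned = [h for h in houses if h["building_area"] >= MIN_PLAUSIBLE_AREA]
print(f"Removed {len(houses) - len(cleaned)} suspicious row(s)")
```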
Next we have the 5 outliers from 200 m² onward.
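To identify these outliers we use the 1.5 × IQR rule: a value is flagged as an outlier if it lies more than 1.5 × IQR below the first quartile or above the third quartile.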
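The 1.5 × IQR rule can be sketched in a few lines. The `building_area` values below are invented for illustration, and the quartile method (`"inclusive"`) is an assumption: different tools compute quartiles slightly differently, so the fences may not match a given box plot exactly.

```python
from statistics import quantiles

# Hypothetical building areas in m², NOT the Richmond data.
building_area = [70, 85, 95, 100, 110, 120, 125, 140, 150, 160, 250, 300]

# Quartiles (Q1, median, Q3); "inclusive" treats the data as the full population.
q1, _median, q3 = quantiles(building_area, n=4, method="inclusive")
iqr = q3 - q1

# Fences: anything beyond 1.5 × IQR from the quartiles is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [a for a in building_area if a < lower_fence or a > upper_fence]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=({lower_fence}, {upper_fence})")
print("Outliers:", outliers)
```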
Table 6: Richmond house outliers based on Building Area

Although these houses are deemed high outliers based on the IQR computation, I am going to keep them in the data set. The reason is that, if we return to the scatter plot in Figure 3, these values are important in capturing the linear relationship between house price and building area.
Now that we have cleaned our data and assessed all the outliers, we can build our regression model using least squares regression!

Figure 5: Linear Regression Plot

We can see our regression line fits quite nicely through the data, so let’s see the results:

Table 7: OLS Regression Results

Our regression model is as follows:

Price ($) = 14140 + 10620 * BuildingArea

Key takeaways from OLS regression results:

1. Based on the adjusted R-squared, our regression model accounts for 70.6% of the variability in Price around its mean.
2. We have an intercept of 14,140, which at a significance level of 0.05 is not statistically significant (its p-value is above 0.05).
3. The coefficient of Building Area is 10,620, which is statistically significant (its p-value is very small).
Let’s elaborate on each point in detail.
Takeaway 1:

Although we have identified that our model explains 70.6% of the total variance in Price around its mean, we cannot simply leave it at that.
It is important we check the residual plot, to examine if any of the explanatory power of our model exists in our residuals, as these are meant to be random and unpredictable.
With regression there exist two components: a deterministic component and a stochastic component. For the stochastic component, this means the differences between our observed prices and the prices predicted by the model (the residuals) should be random.
Figure 6: Standardised Residual Plot

We can see that the points on the residual plot are scattered randomly around zero, and we are unable to discern any pattern. This means we cannot use one price residual to determine the value of the next price residual, hence no explanatory power is left in the residuals. (This is good!)

Takeaway 2:

We have a statistically insignificant intercept of 14,140. How do we interpret this?

Well, for our model, the intercept is the predicted price of a house in Richmond when the building area is 0 m². This means that for a house that has no floor area, its price will be $14,140.
My belief is that whether or not the intercept is significant, it should be neglected, as a house with no floor area selling for $14,140 is unrealistic.
However, for scenarios where the building area becomes more realistic, e.g. 30 m², the intercept will be important in establishing a baseline house price.
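Plugging a few building areas into the fitted equation shows how much of each prediction comes from the intercept. The coefficients are the ones reported above; the helper `predicted_price` and the example areas are my own choices for illustration.

```python
# Coefficients taken from the regression equation in the article.
INTERCEPT = 14_140  # $
SLOPE = 10_620      # $ per m² of building area


def predicted_price(building_area_m2: float) -> float:
    """Predicted price in $ from Price = 14140 + 10620 * BuildingArea."""
    return INTERCEPT + SLOPE * building_area_m2


for area in (0, 30, 100):
    price = predicted_price(area)
    share = INTERCEPT / price  # fraction of the prediction contributed by the intercept
    print(f"{area:>3} m² -> ${price:,.0f} (intercept share: {share:.1%})")
```

For a 30 m² building area the equation gives 14,140 + 10,620 × 30 = $332,740.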
Takeaway 3:

We find that the coefficient of Building Area is 10,620 and is statistically significant. How do we interpret this?

Firstly, as the coefficient is greater than zero and is statistically significant, there is a positive relationship between Richmond house prices and their building area.

Secondly, with all other variables held constant, a one-unit (1 m²) increase in Building Area increases the predicted price of a house in Richmond by $10,620. However, it is important we take into account the uncertainty of the regression coefficient (i.e. the 95% confidence interval), so we could express it as:

Holding all other variables constant, increasing the floor area by 1 m² will increase the price of a house in Richmond by approximately $9,524 to $11,700.
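The same interval can be scaled to larger changes in floor area. The sketch below simply multiplies the reported point estimate and 95% confidence bounds by the extra area; the function name `price_change_range` is hypothetical, and it only reflects uncertainty in the slope, not the scatter of individual house prices around the line.

```python
# Values reported in the article for the Building Area coefficient ($ per m²).
SLOPE = 10_620
CI_LOW, CI_HIGH = 9_524, 11_700  # 95% confidence interval bounds


def price_change_range(extra_area_m2: float) -> tuple[float, float, float]:
    """Return (low, point estimate, high) for the change in predicted price in $."""
    return (CI_LOW * extra_area_m2, SLOPE * extra_area_m2, CI_HIGH * extra_area_m2)


low, point, high = price_change_range(10)
print(f"+10 m²: about ${point:,.0f} (95% CI ${low:,.0f} to ${high:,.0f})")
```

So a 10 m² increase in floor area would correspond to roughly $95,240 to $117,000 under this model.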
Finally…

I hope this article has given you an understanding of the processes involved in preparing the data for linear regression, and also how to make sense of the results.
This example was used to illustrate an application of simple linear regression, but in future articles I plan to apply different methods such as multiple linear regression.
If you have any feedback on my work, how I can improve or anything you feel is incorrect, please let me know!

Thank you for reading!