We can start with this.

This heatmap shows the correlations, both direct and inverse, between all our different variables.

We want as little correlation between our predictors as possible, i.e. to keep only the lightest colored cells.
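For readers following along, here is a minimal sketch of how such a correlation matrix can be computed with pandas. The data is synthetic and the columns are only a small, assumed subset of a housing dataset; the commented-out seaborn call is one common way to draw the heatmap, not necessarily how our figure was produced.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (values are made up for illustration).
rng = np.random.default_rng(0)
n = 500
sqft_living = rng.normal(2000, 600, n)
df = pd.DataFrame({
    "sqft_living": sqft_living,
    # bathrooms deliberately tracks sqft_living, so the two are collinear
    "bathrooms": sqft_living / 900 + rng.normal(0, 0.4, n),
    "yr_built": rng.integers(1900, 2015, n).astype(float),
})
df["price"] = 250 * df["sqft_living"] + rng.normal(0, 100_000, n)

# Pearson correlation between every pair of columns,
# from -1 (inverse) to +1 (direct).
corr = df.corr()
print(corr.round(2))

# If seaborn is installed, the matrix can be drawn as a heatmap:
# import seaborn as sns
# sns.heatmap(corr, cmap="coolwarm", center=0, annot=True)
```

Cells near zero are the "light" ones we want between predictors; strong values in the price row or column are the ones we want to keep.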

So for starters, let’s just remove the worst culprits of collinearity, especially those that don’t even correlate well with price in the first place.
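One way to make this pruning repeatable is a small helper like the one below. This is a sketch of a heuristic, not the exact procedure we followed (we picked columns by eye): for every pair of predictors correlated above a threshold, it drops whichever one correlates less with price. The function name `drop_collinear`, the 0.8 threshold, the `sqft_above` column, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def drop_collinear(df, target="price", threshold=0.8):
    """For each predictor pair correlated above `threshold`, drop the one
    that correlates less with the target. Returns (reduced_df, dropped)."""
    corr = df.corr().abs()
    predictors = [c for c in df.columns if c != target]
    dropped = set()
    for i, a in enumerate(predictors):
        for b in predictors[i + 1:]:
            if a in dropped or b in dropped:
                continue  # one of the pair is already gone
            if corr.loc[a, b] > threshold:
                weaker = a if corr.loc[a, target] < corr.loc[b, target] else b
                dropped.add(weaker)
    return df.drop(columns=sorted(dropped)), sorted(dropped)

# Synthetic data: sqft_above is a noisy copy of sqft_living.
rng = np.random.default_rng(0)
n = 2000
sqft_living = rng.normal(2000, 600, n)
df = pd.DataFrame({
    "sqft_living": sqft_living,
    "sqft_above": 0.8 * sqft_living + rng.normal(0, 250, n),
    "yr_built": rng.integers(1900, 2015, n).astype(float),
    "price": 250 * sqft_living + rng.normal(0, 100_000, n),
})

reduced, dropped = drop_collinear(df)
print("dropped:", dropped)
print("kept:", list(reduced.columns))
```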

Much better.

You’ll notice however that we still have some clusters of red, particularly in the intersection of price, bedrooms, bathrooms, and sqft_living.

These are unfortunately also our best predictors of price.

We removed many variables with weaker price predicting capability and stronger collinearity, but we need at least some strong predictors of price, so this is as far as we dare go.

As we now move into training and testing our first models, we’ll be making two noteworthy tweaks to our data that will increase the accuracy of our predictions significantly.

The first is to subset our data to the bottom 90% of house prices, i.e. everything at or below the 90th percentile.

The top 10% includes many price outliers that would inevitably skew our model’s predictions, so we make a judgement call here: better to predict most home prices with better accuracy than all home prices with skewed accuracy.
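In pandas, this kind of cut can be sketched in a couple of lines. The prices below are randomly generated, and the `price` column name is assumed.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed prices standing in for the real data.
rng = np.random.default_rng(1)
df = pd.DataFrame({"price": rng.lognormal(mean=13, sigma=0.5, size=1000)})

# Keep only the rows at or below the 90th percentile of price.
cutoff = df["price"].quantile(0.90)
trimmed = df[df["price"] <= cutoff]

print(f"cutoff: {cutoff:,.0f}")
print(f"rows before: {len(df)}, rows after: {len(trimmed)}")
```

Note that a model trained this way has never seen the most expensive homes, so its predictions for them should not be trusted.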

The second change is to predict for the logarithm of price instead of price itself.

The reason for this comes down to our algorithm of choice: linear regression.

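Linear regression works best when the variable being predicted has a fairly normal distribution, i.e. a central cluster that you can run a straight line through in order to get fairly close to every single point. Strictly speaking, the normality assumption applies to the model’s errors rather than to the raw data, but a heavily skewed target often leads to skewed errors.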

Price does not have a fairly normal distribution.

But the logarithm of price does.
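A quick way to see the difference is to compare skewness before and after the transform. The sketch below uses synthetic log-normal prices, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic prices with a long right tail (not the real data).
rng = np.random.default_rng(2)
price = pd.Series(rng.lognormal(mean=13, sigma=0.5, size=5000))

# Natural log of price: the target we actually train on.
log_price = np.log(price)

# Skewness near 0 indicates a roughly symmetric distribution.
print(f"skew of price:     {price.skew():.2f}")
print(f"skew of log price: {log_price.skew():.2f}")

# Predictions made on the log scale are converted back with exp().
recovered = np.exp(log_price)
```

Keep in mind that errors measured on the log scale are relative rather than absolute, so they need converting back before being quoted in dollars.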

Back to the question at hand: how to choose our predictors?

One strategy is to build simple linear regression models with each predictor individually and then pick the top n performers.
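Here is a rough sketch of that one-predictor-at-a-time strategy, scored by cross-validated R². It is not our original code; the helper name `rank_single_predictors`, the synthetic data, and the choice of 5-fold cross-validation are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def rank_single_predictors(X, y, top_n=2):
    """Fit one simple linear regression per column and return the
    top_n columns by mean cross-validated R^2."""
    scores = {}
    for col in X.columns:
        r2 = cross_val_score(
            LinearRegression(), X[[col]], y, cv=5, scoring="r2"
        ).mean()
        scores[col] = r2
    return pd.Series(scores).sort_values(ascending=False).head(top_n)

# Synthetic predictors and a log-price target driven mostly by sqft_living.
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "sqft_living": rng.normal(2000, 600, n),
    "bedrooms": rng.integers(1, 6, n).astype(float),
    "yr_built": rng.integers(1900, 2015, n).astype(float),
})
log_price = (
    12 + 0.0004 * X["sqft_living"] + 0.05 * X["bedrooms"]
    + rng.normal(0, 0.2, n)
)

top = rank_single_predictors(X, log_price, top_n=2)
print(top)
```

The weakness of this approach is that each predictor is judged in isolation, so two highly collinear columns can both land in the top n.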

We initially built our own functions from scratch to automatically build and test simple models for us.

However, we then discovered that sklearn contains an entire feature selection module that already does this even better. Ah, the joys of Python!

Simply pass in a linear regression object and declare that you want to use n features to make your predictions, and voila: RFE (recursive feature elimination) tells you exactly which ones to use.

This is no trivial task, since the collinearity issue means that adding a new predictor can reduce the usefulness of the previous ones.
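For reference, a minimal RFE sketch looks something like this. The data is synthetic, and standardizing the predictors first is our own suggestion here: RFE with a linear model ranks features by coefficient size, which only makes sense when the columns share a scale.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: two useful predictors and one pure-noise column.
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "sqft_living": rng.normal(2000, 600, n),
    "bathrooms": rng.normal(2, 0.7, n),
    "noise": rng.normal(0, 1, n),
})
log_price = (
    12 + 0.0004 * X["sqft_living"] + 0.1 * X["bathrooms"]
    + rng.normal(0, 0.2, n)
)

# Put every column on the same scale so coefficients are comparable.
X_scaled = StandardScaler().fit_transform(X)

# Recursively drop the weakest feature until only 2 remain.
selector = RFE(estimator=LinearRegression(), n_features_to_select=2)
selector.fit(X_scaled, log_price)

selected = list(X.columns[selector.support_])
print("selected:", selected)
print("ranking:", dict(zip(X.columns, selector.ranking_)))
```

Because the model is refit after each elimination, RFE accounts for how the remaining predictors interact, which the one-at-a-time approach cannot do.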

For now we can just be grateful that this particular trail has already been blazed by those brave souls that were optimizing their linear regression models before us.

Our final model’s accuracy looks like this.

By plotting the quantiles of our model’s error values at different price points against those of a standard normal distribution (a Q-Q plot), we’re able to get a direct visual of what is in reality a 9-dimensional model (because we used 9 predictors to get these results).

At most points it’s remarkably accurate, although at the low end and high end of real estate prices it trails off somewhat for some reason.
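The points behind a plot like this can be computed with nothing more than numpy and the standard library. The sketch below uses made-up residuals; the helper name `qq_points` is hypothetical, and the plotting position `(i - 0.5) / n` is one common convention among several.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical, observed) quantiles for a normal Q-Q plot.
    Observed residuals are sorted and standardized."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = len(r)
    observed = (r - r.mean()) / r.std(ddof=1)
    # Plotting positions strictly between 0 and 1.
    probs = (np.arange(1, n + 1) - 0.5) / n
    std_normal = NormalDist()
    theoretical = np.array([std_normal.inv_cdf(p) for p in probs])
    return theoretical, observed

# Made-up residuals standing in for (actual - predicted) log prices.
rng = np.random.default_rng(3)
residuals = rng.normal(0, 0.2, 500)

theoretical, observed = qq_points(residuals)

# Points hugging the line y = x suggest normally distributed errors.
fit = np.corrcoef(theoretical, observed)[0, 1]
print(f"Q-Q correlation: {fit:.4f}")
```

Plotting `observed` against `theoretical` gives the chart itself; points that bend away from the diagonal at either end are the tails where the model's errors stop looking normal.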

We had one week to build this project.

Given more time, how could it be improved further?

We have several ideas.

Zipcodes have some correlation with price, but higher zipcode numbers don’t necessarily correlate with higher price.

By making dummies of zipcodes we can cause a modelable relationship to emerge, giving us one more useful predictor.
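As a sketch of the idea, pandas can produce these dummy columns directly. The rows below are invented, and the `zipcode` column name is assumed.

```python
import pandas as pd

# A few invented rows; zipcode is a label, not a quantity.
df = pd.DataFrame({
    "price": [450_000, 900_000, 620_000, 870_000],
    "zipcode": [98101, 98004, 98052, 98004],
})

# One 0/1 column per zipcode. drop_first=True removes one category so the
# dummies are not perfectly collinear with the regression intercept.
encoded = pd.get_dummies(df, columns=["zipcode"], drop_first=True)
print(encoded.columns.tolist())
```

Each remaining dummy's coefficient is then read as the price difference relative to the dropped zipcode.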

Similarly, we used latitude and longitude to engineer a distance-from-downtown-Seattle feature.

However, there are several employers in the greater Seattle area that drive significantly higher real estate prices in their respective locales, so engineering a distance feature for each of these might improve accuracy significantly.
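A distance feature like this can be sketched with the haversine formula, which gives great-circle distance between two latitude/longitude points. Everything specific below is an assumption: the `lat` and `long` column names, the approximate downtown coordinates, and the second hub, which is a hypothetical placeholder rather than any particular employer.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between points in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Reference points (approximate / hypothetical, for illustration only).
HUBS = {
    "downtown": (47.6062, -122.3321),  # approximate downtown Seattle
    "hub_east": (47.6101, -122.2015),  # hypothetical second hub
}

# Invented house locations.
df = pd.DataFrame({
    "lat": [47.6062, 47.6101, 47.5112],
    "long": [-122.3321, -122.2015, -122.2570],
})

# One distance column per hub.
for name, (hub_lat, hub_lon) in HUBS.items():
    df[f"dist_{name}_km"] = haversine_km(df["lat"], df["long"], hub_lat, hub_lon)

print(df.round(2))
```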

We might also seek out more data to train our model on, or transform the sale dates into a more usable format so that we might use them to model the seasonality of house sales.
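For the sale dates, a first step might look like the sketch below. The `date` column name and the ISO-style date strings are assumptions; a real dataset may store dates in another format that needs its own parsing.

```python
import pandas as pd

# Invented sale dates as strings.
df = pd.DataFrame({"date": ["2014-05-02", "2014-12-15", "2015-03-20"]})

# Parse the strings into real datetimes, then pull out usable parts.
df["date"] = pd.to_datetime(df["date"])
df["sale_year"] = df["date"].dt.year
df["sale_month"] = df["date"].dt.month

# Month is cyclical rather than linear, so dummies are one simple way to
# let the model learn a separate seasonal effect for each month.
month_dummies = pd.get_dummies(df["sale_month"], prefix="month")
print(df)
print(month_dummies.columns.tolist())
```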

All this and more would be possible with just slightly more time.

In conclusion, predictor selection is somewhere in between an art and a science.

Figuring out how to balance bias and variance is partially a matter of using established tools and methods, but also simply of using good judgement and intuition.

I hope that while reading this you’ve learned something new and gained some useful insight into the feature selection process!