Machine Learning Project: Predicting Boston House Prices With Regression

    print("Minimum price: ${}".format(minimum_price))
    print("Maximum price: ${}".format(maximum_price))
    print("Mean price: ${}".format(mean_price))
    print("Median price ${}".format(median_price))
    print("Standard deviation of prices: ${}".format(std_price))

Feature Observation

Data science is the process of making assumptions and hypotheses about the data and testing them by performing some tasks.

Initially, we can make the following intuitive assumptions for each feature:

Houses with more rooms (higher ‘RM’ value) will be worth more. Usually, houses with more rooms are bigger and can fit more people, so it is reasonable that they cost more money. We therefore expect ‘RM’ and price to be positively related.

Neighborhoods with more lower class workers (higher ‘LSTAT’ value) will be worth less. If the percentage of lower working class people is higher, they likely have low purchasing power and, therefore, their houses will cost less. We expect ‘LSTAT’ and price to be inversely related.

Neighborhoods with a higher student-to-teacher ratio (higher ‘PTRATIO’ value) will be worth less. If the ratio of students to teachers is higher, the neighborhood likely has fewer schools. This could be due to lower tax income, which in turn could be because people in that neighborhood earn less money. If people earn less money, it is likely that their houses are worth less. We expect ‘PTRATIO’ and price to be inversely related.

We’ll find out if these assumptions are accurate through the project.

Exploratory Data Analysis

Scatterplot and Histograms

We will start by creating a scatterplot matrix that will allow us to visualize the pair-wise relationships and correlations between the different features. It is also quite useful for getting a quick overview of how the data is distributed and whether or not it contains outliers.

    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline

    # Calculate and show pairplot
    # (the 'size' argument was renamed to 'height' in seaborn 0.9)
    sns.pairplot(data, height=2.5)
    plt.tight_layout()

We can spot a linear relationship between ‘RM’ and house prices ‘MEDV’. In addition, we can infer from the histogram that the ‘MEDV’ variable seems to be normally distributed but contains several outliers.

Correlation Matrix

We are now going to create a correlation matrix to quantify and summarize the relationships between the variables. This correlation matrix is closely related to the covariance matrix. In fact, it is a rescaled version of the covariance matrix, computed from standardized features. It is a square matrix (with the same number of columns and rows) that contains the Pearson’s r correlation coefficients.

    import numpy as np

    # Calculate and show correlation matrix
    # ('cols' is assumed to hold the column names of 'data', defined earlier)
    cm = np.corrcoef(data.values.T)
    sns.set(font_scale=1.5)
    hm = sns.heatmap(cm,
                     cbar=True,
                     annot=True,
                     square=True,
                     fmt='.2f',
                     annot_kws={'size': 15},
                     yticklabels=cols,
                     xticklabels=cols)

To fit a regression model, the features of interest are the ones with a high correlation with the target variable ‘MEDV’. From the previous correlation matrix, we can see that this condition is met for our selected variables.
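To make the link between the correlation and covariance matrices concrete, here is a minimal sketch that checks it numerically. It runs on synthetic data rather than the Boston dataset, and the variable names (X, X_std) are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic stand-in data: 200 samples, 3 correlated columns (not the Boston data).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base + rng.normal(scale=0.5, size=(200, 1)) for _ in range(3)])

# Standardize each column: subtract the mean, divide by the sample standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The covariance matrix of the standardized columns ...
cov_of_standardized = np.cov(X_std, rowvar=False)

# ... should match the Pearson correlation matrix of the raw columns.
corr = np.corrcoef(X, rowvar=False)

print(np.allclose(cov_of_standardized, corr))
```

The diagonal of the result is all ones, since every feature is perfectly correlated with itself, which is also what the diagonal of the heatmap shows.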

Developing a Model

In this second section of the project, we will develop the tools and techniques necessary for a model to make a prediction. Being able to make accurate evaluations of each model’s performance through the use of these tools and techniques helps to greatly reinforce the confidence in the predictions.

Defining a Performance Metric

It is difficult to measure the quality of a given model without quantifying its performance on the training and testing data. This is typically done using some type of performance metric, whether it is through calculating some type of error, the goodness of fit, or some other useful measurement. For this project, we will calculate the coefficient of determination, R², to quantify the model’s performance. The coefficient of determination for a model is a useful statistic in regression analysis, as it often describes how “good” that model is at making predictions.

The values for R² typically range from 0 to 1 and describe the proportion of the variance in the target variable that the model accounts for. A model with an R² of 0 is no better than a model that always predicts the mean of the target variable, whereas a model with an R² of 1 perfectly predicts the target variable. Any value between 0 and 1 indicates what fraction of the variance of the target variable can be explained by the features using this model. A model can also have a negative R², which indicates that it is worse than one that always predicts the mean of the target variable, by an arbitrarily large margin.
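The three cases described above (a perfect model, a mean-only model, and a model worse than the mean) can be reproduced with a small hand-written version of the formula R² = 1 - SS_res / SS_tot. This is only a sketch on made-up numbers; the function name r2 and the sample values are assumptions for illustration.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared errors of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared errors of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, -0.5, 2.0, 7.0, 4.2]  # made-up target values
mean_value = sum(y_true) / len(y_true)

perfect = r2(y_true, y_true)                       # every prediction is exact
mean_only = r2(y_true, [mean_value] * len(y_true)) # always predict the mean
bad = r2(y_true, [7.0, 4.2, 3.0, -0.5, 2.0])       # predictions worse than the mean

print(perfect, mean_only, bad)
```

The first value is 1, the second is 0, and the third is negative because its squared errors exceed those of the mean-only baseline.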

    # Import 'r2_score'
    from sklearn.metrics import r2_score

    def performance_metric(y_true, y_predict):
        """ Calculates and returns the performance score between
            true (y_true) and predicted (y_predict) values based on the metric chosen. """

        score = r2_score(y_true, y_predict)

        # Return the score
        return score

Shuffle and Split Data

For this section we will take the Boston housing dataset and split the data into training and testing subsets. Typically, the data is also shuffled into a random order when creating the training and testing subsets to remove any bias in the ordering of the dataset.

    # Import 'train_test_split'
    from sklearn.model_selection import train_test_split

    # Shuffle and split the data into training and testing subsets
    X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)

    # Success
    print("Training and testing split was successful.")

Training and Testing

You may ask now: what is the benefit of splitting a dataset into some ratio of training and testing subsets for a learning algorithm?

It is useful for evaluating our model once it is trained. We want to know if it has learned properly from the training split of the data. There can be three different situations:

1) The model didn’t learn well from the data and can’t predict even the outcomes of the training set. This is called underfitting, and it is caused by high bias.

2) The model learned the training data too well, up to the point that it memorized it and is not able to generalize to new data. This is called overfitting, and it is caused by high variance.

3) The model has the right balance between bias and variance: it learned well and is able to correctly predict the outcomes on new data.
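The three situations can be illustrated with toy models on synthetic data. The sketch below is not the decision tree used in this project: the mean predictor, the nearest-point memorizer and the straight-line fit are stand-ins chosen for illustration, and all names in it are assumptions.

```python
import random

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Synthetic data (an assumption for illustration): y = 2x + Gaussian noise.
random.seed(0)
xs = [i / 10 for i in range(100)]
points = [(x, 2 * x + random.gauss(0, 2)) for x in xs]
random.shuffle(points)
train, test = points[:80], points[80:]

def fit_mean(train):
    """High bias: ignore x and always predict the mean training target."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_memorizer(train):
    """High variance: return the target of the nearest training point."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def fit_line(train):
    """Least-squares straight line, a reasonable match for this data."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
             / sum((x - mean_x) ** 2 for x, _ in train))
    return lambda x: mean_y + slope * (x - mean_x)

scores = {}
for name, fit in [("mean", fit_mean), ("memorizer", fit_memorizer), ("line", fit_line)]:
    model = fit(train)
    # Score each model on the data it was fitted on and on the held-out data.
    scores[name] = tuple(
        r2([y for _, y in subset], [model(x) for x, _ in subset])
        for subset in (train, test)
    )
    print(f"{name:10s} train R2 = {scores[name][0]:.3f}, test R2 = {scores[name][1]:.3f}")
```

The mean predictor scores about zero on the training set (underfitting), while the memorizer scores a perfect 1 on the training set but lower on the held-out points (overfitting). Only the gap between training and test scores reveals the second problem, which is why the test split is needed.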

Analyzing Model’s Performance

In this third section of the project, we’ll take a look at several models’ learning and testing performances on various subsets of training data. Additionally, we’ll investigate one particular algorithm with an increasing 'max_depth' parameter on the full training set to observe how model complexity affects performance. Graphing the model's performance based on varying criteria can be beneficial in the analysis process, such as visualizing behavior that may not have been apparent from the results alone.

Learning Curves

The following code cell produces four graphs for a decision tree model with different maximum depths. Each graph visualizes the learning curves of the model for both training and testing as the size of the training set is increased. Note that the shaded region of a learning curve denotes the uncertainty of that curve (measured as the standard deviation). The model is scored on both the training and testing sets using R², the coefficient of determination.

    # Produce learning curves for varying training set sizes and maximum depths
    vs.ModelLearning(features, prices)

Learning the Data

If we take a close look at the graph with a max depth of 3:

As the number of training points increases, the training score decreases. In contrast, the test score increases.

As both scores (training and testing) tend to converge from about the 300-point threshold, having more training points will not benefit the model.

In general, with more columns (features) for each observation, we’ll get more information and the model will be able to learn better from the dataset and, therefore, make better predictions.
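The vs.ModelLearning helper is not listed in this article. As a rough stand-in (not the author's implementation), the numbers behind one such learning curve can be computed with scikit-learn's learning_curve function. The sketch below uses synthetic data in place of (features, prices), and the names X and y are assumptions.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for (features, prices): 400 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 3))
y = 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=2.0, size=400)

# 10 shuffled splits, each holding out 20% of the data for testing.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# Score a depth-3 tree with R2 at 5 increasing training set sizes.
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeRegressor(max_depth=3, random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=cv,
    scoring="r2",
)

# Each row holds the 10 split scores for one training set size.
for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"{int(n):4d} training points: train R2 = {tr:.3f}, test R2 = {te:.3f}")
```

The standard deviation of each row (for example train_scores.std(axis=1)) gives the kind of uncertainty band that the shaded regions represent.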

Complexity Curves

The following code cell produces a graph for a decision tree model that has been trained and validated on the training data using different maximum depths. The graph produces two complexity curves: one for training and one for validation. Similar to the learning curves, the shaded regions of both complexity curves denote the uncertainty in those curves, and the model is scored on both the training and validation sets using the performance_metric function.

    # Produce complexity curves for varying maximum depths
    vs.ModelComplexity(X_train, y_train)

Bias-Variance Tradeoff

If we analyze how bias and variance vary with the maximum depth, we can infer that:

With a maximum depth of one, the graph shows that the model does not score well on either the training or the validation data, which is a symptom of underfitting and, therefore, high bias. To improve performance, we should increase the model’s complexity, in this case by increasing the max_depth hyperparameter.

With a maximum depth of ten, the graph shows that the model learns the training data almost perfectly (with a score close to one) but returns poor results on the validation data, which is an indicator of overfitting: the model is not able to generalize well to new data. This is a problem of high variance. To improve performance, we should decrease the model’s complexity, in this case by decreasing the max_depth hyperparameter.

Best-Guess Optimal Model

From the complexity curve, we can infer that the best maximum depth for the model is 4, as it is the one that yields the best validation score. In addition, at greater depths the training score keeps increasing while the validation score tends to decrease, which is a sign of overfitting.
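The vs.ModelComplexity helper is not listed here either. A comparable complexity curve can be sketched with scikit-learn's validation_curve, again on synthetic data rather than the Boston dataset. The names X, y and depths are assumptions, and the best depth found on this data need not match the value of 4 reported above.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, validation_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data: 400 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 3))
y = 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=2.0, size=400)

depths = list(range(1, 11))

# Train and validate one tree per depth on 10 shuffled 80/20 splits.
train_scores, valid_scores = validation_curve(
    DecisionTreeRegressor(random_state=0),
    X, y,
    param_name="max_depth",
    param_range=depths,
    cv=ShuffleSplit(n_splits=10, test_size=0.2, random_state=0),
    scoring="r2",
)

mean_train = train_scores.mean(axis=1)
mean_valid = valid_scores.mean(axis=1)

# The depth with the highest mean validation score is the best guess.
best_depth = depths[int(np.argmax(mean_valid))]

for depth, tr, va in zip(depths, mean_train, mean_valid):
    print(f"max_depth={depth:2d}: train R2 = {tr:.3f}, validation R2 = {va:.3f}")
print("Best depth by validation score:", best_depth)
```

Reading the printed table from top to bottom shows the pattern discussed above: the training score keeps rising with depth, while the validation score peaks and then levels off or drops.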

Evaluating Model’s Performance

In this final section of the project, we will construct a model and make a prediction on the client’s feature set using an optimized model from fit_model.

Grid Search

The grid search technique exhaustively generates candidates from a grid of parameter values specified with the param_grid parameter, which is a dictionary (or a list of dictionaries) with the values of the hyperparameters to evaluate. One example can be:

    param_grid = [
        {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
        {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
    ]

In this example, two grids should be explored: one with a linear kernel and C values of [1, 10, 100, 1000], and a second one with an RBF kernel and the cross product of C values in [1, 10, 100, 1000] and gamma values in [0.001, 0.0001].

When fitting it on a dataset, all the possible combinations of parameter values are evaluated and the best combination is retained.
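To see what "all the possible combinations" means for the example grid, the candidates can be enumerated by hand. The helper below is a hypothetical illustration written with the standard library only. scikit-learn offers a ParameterGrid class for the same purpose.

```python
from itertools import product

param_grid = [
    {"C": [1, 10, 100, 1000], "kernel": ["linear"]},
    {"C": [1, 10, 100, 1000], "gamma": [0.001, 0.0001], "kernel": ["rbf"]},
]

def expand(grid):
    """Yield every parameter combination from a list of grids."""
    for sub_grid in grid:
        names = sorted(sub_grid)
        # Cross product of the value lists within one sub-grid.
        for values in product(*(sub_grid[name] for name in names)):
            yield dict(zip(names, values))

candidates = list(expand(param_grid))
print(len(candidates))  # 4 linear + 4 * 2 RBF = 12 candidates
for candidate in candidates:
    print(candidate)
```

With cross-validation, each of these 12 candidates is fitted once per split, so the cost of a grid search grows quickly with the number of values per hyperparameter.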

Cross-Validation

K-fold cross-validation is a technique used for making sure that our model is well trained, without using the test set. It consists of splitting the data into k partitions of equal size. For each partition i, we train the model on the remaining k-1 partitions and evaluate it on partition i. The final score is the average of the k scores obtained.
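The procedure can be sketched in a few lines of plain Python. The function names and the toy "model" (which predicts the mean of the training targets) are assumptions for illustration. In practice the data is usually shuffled before it is split into folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def k_fold_score(evaluate, n_samples, k):
    """Average evaluate(train_idx, val_idx) over the k folds."""
    scores = [evaluate(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

# Toy usage: the "model" predicts the mean training target, and the score is
# the negative mean absolute error on the held-out fold.
targets = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 11.0, 10.0, 12.0]

def evaluate(train_idx, val_idx):
    prediction = sum(targets[i] for i in train_idx) / len(train_idx)
    return -sum(abs(targets[i] - prediction) for i in val_idx) / len(val_idx)

folds = list(k_fold_indices(len(targets), 5))
average_score = k_fold_score(evaluate, len(targets), 5)
print(folds[0])
print(average_score)
```

Every sample lands in the validation fold exactly once, so all of the data contributes to both training and evaluation.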

When evaluating different hyperparameters for estimators, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally.

This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance.

To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.

However, by partitioning the available data into three sets (training, validating and testing sets), we drastically reduce the number of samples which can be used for learning the model, and the resulting model may not be sufficiently well trained (underfitting).

By using k-fold validation, we make sure that the model uses all the training data available for tuning. It can be computationally expensive, but it allows us to train models even if little data is available.

The main purpose of k-fold validation is to get a less biased estimate of how the model generalizes to new data.

Fitting a Model

The final implementation requires that we bring everything together and train a model using the decision tree algorithm. To ensure that we are producing an optimized model, we will train the model using the grid search technique to optimize the 'max_depth' parameter for the decision tree. The 'max_depth' parameter can be thought of as how many questions the decision tree algorithm is allowed to ask about the data before making a prediction.

In addition, the implementation uses ShuffleSplit() for an alternative form of cross-validation (see the 'cv_sets' variable). The ShuffleSplit() implementation below will create 10 ('n_splits') shuffled sets, and for each shuffle, 20% ('test_size') of the data will be used as the validation set.

    # Import 'make_scorer', 'DecisionTreeRegressor', 'GridSearchCV', and 'ShuffleSplit'
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import make_scorer
    from sklearn.model_selection import GridSearchCV, ShuffleSplit

    def fit_model(X, y):
        """ Performs grid search over the 'max_depth' parameter for a
            decision tree regressor trained on the input data [X, y]. """

        # Create cross-validation sets from the training data
        cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)

        # Create a decision tree regressor object
        regressor = DecisionTreeRegressor()

        # Create a dictionary for the parameter 'max_depth' with a range from 1 to 10
        params = {'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}

        # Transform 'performance_metric' into a scoring function using 'make_scorer'
        scoring_fnc = make_scorer(performance_metric)

        # Create the grid search cv object --> GridSearchCV()
        grid = GridSearchCV(estimator=regressor, param_grid=params, scoring=scoring_fnc, cv=cv_sets)

        # Fit the grid search object to the data to compute the optimal model
        grid = grid.fit(X, y)

        # Return the optimal model after fitting the data
        return grid.best_estimator_

Making Predictions

Once a model has been trained on a given set of data, it can now be used to make predictions on new sets of input data.

In the case of a decision tree regressor, the model has learned what the best questions to ask about the input data are, and can respond with a prediction for the target variable.

We can use these predictions to gain information about data where the value of the target variable is unknown, such as data the model was not trained on.

Optimal Model

The following code snippet finds the maximum depth that returns the optimal model.

    # Fit the training data to the model using grid search
    reg = fit_model(X_train, y_train)

    # Produce the value for 'max_depth'
    print("Parameter 'max_depth' is {} for the optimal model.".format(reg.get_params()['max_depth']))

Predicting Selling Prices

Imagine that we were a real estate agent in the Boston area looking to use this model to help price homes owned by our clients that they wish to sell. We have collected the following information from three of our clients: the number of rooms, the neighborhood poverty level and the student-teacher ratio, with the values listed in the client_data matrix below.

What price would we recommend each client sell his/her home at?

Do these prices seem reasonable given the values for the respective features?

To find out the answers to these questions, we will execute the following code snippet and discuss its output.

    # Produce a matrix for client data
    # Columns: 'RM', 'LSTAT', 'PTRATIO'
    client_data = [[5, 17, 15],  # Client 1
                   [4, 32, 22],  # Client 2
                   [8, 3, 12]]   # Client 3

    # Show predictions
    for i, price in enumerate(reg.predict(client_data)):
        print("Predicted selling price for Client {}'s home: ${:,.2f}".format(i+1, price))

From the statistical calculations done at the beginning of the project, we found the following information:

    Minimum price: $105000.0
    Maximum price: $1024800.0
    Mean price: $454342.9447852761
    Median price $438900.0
    Standard deviation of prices: $165340.27765266786

Given these values, we can conclude:

The selling price for client 3 is near one million dollars, which is near the maximum of the dataset. This is a reasonable price given its features (8 rooms, very low poverty level and low student-teacher ratio); the house may be in a wealthy neighborhood.

The selling price for client 2 is the lowest of the three and, given its features, is reasonable, as it is near the minimum of the dataset.

For client 1, we can see that its features are intermediate between those of the other two and, therefore, its price is quite near the mean and median.

These results are consistent with our initial assumptions about the features: ‘RM’ has a positive relationship with the dependent variable, the house price. In contrast, ‘LSTAT’ and ‘PTRATIO’ have an inverse relationship with it.

Model’s Sensitivity

An optimal model is not necessarily a robust model. Sometimes, a model is either too complex or too simple to sufficiently generalize to new data. Sometimes, a model could use a learning algorithm that is not appropriate for the structure of the data given. Other times, the data itself could be too noisy or contain too few samples to allow a model to adequately capture the target variable, i.e., the model is underfitted.

The code cell below runs the fit_model function ten times with different training and testing sets to see how the prediction for a specific client changes with respect to the data it's trained on.

    vs.PredictTrials(features, prices, fit_model, client_data)

We obtained a range in prices of nearly $70k. I believe that this is quite a large deviation, as it represents roughly 16% of the median house price.

Model’s Applicability

Now, we use these results to discuss whether the constructed model should or should not be used in a real-world setting. Some questions that are worth answering are:

How relevant today is data that was collected in 1978? How important is inflation?

Data collected in 1978 is not of much value in today’s world. Society and the economy have changed a lot and, in addition, inflation has had a great impact on prices.

Are the features present in the data sufficient to describe a home? Do you think factors like the quality of appliances in the home, the square footage of the plot area, the presence of a pool, etc. should factor in?

The dataset considered is quite limited. There are many features, like the size of the house in square feet, the presence of a pool and others, that are very relevant when considering a house price.

Is the model robust enough to make consistent predictions?

Given the wide spread in the predicted price range, we can conclude that it is not a robust model and, therefore, not appropriate for making predictions.

Would data collected in an urban city like Boston be applicable in a rural city?

Data collected from a big urban city like Boston would not be applicable in a rural city, as for equal feature values prices are much higher in the urban area.

Is it fair to judge the price of an individual home based on the characteristics of the entire neighborhood?

In general, it is not fair to estimate or predict the price of an individual home based on the features of the entire neighborhood. In the same neighborhood there can be huge differences in prices.

Conclusion

Throughout this article, we built a machine learning regression project from end to end, and we gained several insights about regression models and how they are developed.

This was the first of the machine learning projects that will be developed in this series. If you liked it, stay tuned for the next article, which will be an introduction to the theory and concepts of unsupervised learning.