They collect data on the characteristics of each property and use machine learning algorithms to make predictions.

In this article, I’ll demonstrate a similar analysis using a data set included in Kaggle’s “House Prices” competition.

Exploratory Data Analysis

First, let’s take a look at the response variable “Sale Price”.

It’s positively skewed; most houses sold for between $100,000 and $250,000, but some sold for substantially more.

Figure 1: Observed sale price

The data set contains 80 features that describe characteristics of the property, including the number of bathrooms, basement square footage, year built, garage square footage, etc.

The heat map (Figure 2) shows the correlation between each pair of features and between each feature and the response variable “SalePrice”.

This tells us how useful each feature is likely to be in predicting the Sale Price and indicates where there may be multicollinearity.

Not surprisingly, the overall quality of the home (“OverallQual”) is highly correlated with Sale Price.

In contrast, the year the home was sold “YrSold” has little correlation with the Sale Price.
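The ranking that the heat map conveys visually can also be read off numerically. The following is a minimal sketch, assuming a pandas data frame with a `SalePrice` column; the helper name `rank_correlations` and the toy values are made up for illustration and are not taken from the competition data.

```python
import pandas as pd


def rank_correlations(frame: pd.DataFrame, target: str = "SalePrice") -> pd.Series:
    """Return Pearson correlations of numeric features with the target, strongest first."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # order by absolute strength so strong negative correlations are not buried
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


# toy data (hypothetical values)
toy = pd.DataFrame({
    "OverallQual": [4, 5, 6, 7, 8, 9],
    "YrSold": [2008, 2006, 2010, 2007, 2009, 2008],
    "SalePrice": [110000, 135000, 160000, 200000, 260000, 340000],
})
ranked = rank_correlations(toy)
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations near the top, where a plain descending sort would push them to the bottom.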

Figure 2: Heat map showing the correlation among features and sale price

Data Cleaning

Dealing with NAs

There are lots of NAs in this data set; some features are almost all NAs, while many others have just a few.

We can remove features that offer little information such as Utilities.

    df.Utilities.describe()

All but one property is assigned the “AllPub” category for Utilities, so we can just remove that feature.

Due to the lack of variation, the feature has little correlation with our response Sale Price (Figure 2), so we’re not that worried about losing it.
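One way to flag such near-constant features before inspecting them one by one is to look at the share taken by the most common value in each column. This is a minimal sketch, assuming a pandas data frame; the helper name `dominant_share`, the 0.99 cut-off, and the toy values are illustrative assumptions rather than part of the original analysis.

```python
import pandas as pd


def dominant_share(frame: pd.DataFrame) -> pd.Series:
    """Fraction of non-missing rows taken by the most frequent value of each column."""
    # value_counts sorts by frequency, so the first entry is the dominant value
    return frame.apply(lambda col: col.value_counts(normalize=True).iloc[0])


# toy data (hypothetical values)
toy = pd.DataFrame({
    "Utilities": ["AllPub"] * 199 + ["NoSeWa"],
    "MSZoning": ["RL"] * 150 + ["RM"] * 40 + ["FV"] * 10,
})
shares = dominant_share(toy)
near_constant = shares[shares > 0.99].index.tolist()
print(near_constant)
```

A column flagged this way is only a candidate for removal; whether it is safe to drop still depends on how it relates to the response.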

    df = df.drop(['Utilities'], axis=1)

Few NAs are random: the lack of information usually has something to do with the record itself rather than with a collection error.

For example, NA recorded for GarageType probably means there isn’t a garage on the property.

In this data set there are both categorical and continuous features pertaining to garages.

We can fill them in accordingly with 0 and “None” for properties that have NAs for those features, indicating a lack of garage space.

    # Garage categorical features to 'None'
    for i in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
        df[i] = df[i].fillna('None')
    # Garage continuous features to 0
    for i in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
        df[i] = df[i].fillna(0)

NAs for other features don’t have a clear explanation associated with the lack of information.

In this case, we can look at how often each value of the feature occurs and fill in the most probable one.

Let’s look at the frequency distribution for the feature “MSZoning”, which describes the zoning classification.

Figure 3: Frequency of zoning classification

The classification for residential low density (RL) is by far the most common.

A pragmatic approach to addressing the four NAs in this feature will be to simply replace NAs with “RL”.

    df['MSZoning'] = df['MSZoning'].fillna(df['MSZoning'].mode()[0])

Data Transformation

To maximize the performance of our model, we want to normalize our features and response variable.

As we saw in Figure 1, our response variable is positively skewed.

By applying a log transformation, Sale Price now resembles a normal distribution (Figure 4).
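To illustrate the effect, here is a small sketch on synthetic data rather than the competition data. The log-normal sample, the seed, and the helper `sample_skewness` are assumptions made purely for illustration.

```python
import numpy as np


def sample_skewness(x: np.ndarray) -> float:
    """Moment-based sample skewness: the mean of the cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


rng = np.random.default_rng(0)
# synthetic right-skewed "prices" (log-normal); not the Kaggle data
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

logged = np.log1p(prices)    # log(1 + x)
restored = np.expm1(logged)  # exp(x) - 1 undoes log1p

print(f"skew before: {sample_skewness(prices):.2f}")
print(f"skew after:  {sample_skewness(logged):.2f}")
```

The `np.expm1` step matters later on: predictions made on the log scale have to be mapped back to dollars before they are reported.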

    resp = np.log1p(resp)  # transform by log(1+x)

Figure 4: Log transformed response variable Sale Price

We’ll have to check all the continuous features for skew as well.

    # identify numerical features
    num_feats = df.dtypes[df.dtypes != 'object'].index
    # quantify skew
    skew_feats = df[num_feats].skew().sort_values(ascending=False)
    skewness = pd.DataFrame({'Skew': skew_feats})
    # keep only features whose absolute skew exceeds 0.75
    skewness = skewness[abs(skewness) > 0.75].dropna()
    skewed_features = skewness.index

Skewness will vary a lot among the features we want to transform.

A Box-Cox transformation provides a flexible way of transforming features that may each require a different transformation.

The function boxcox will estimate the optimal lambda value (a parameter in the transformation) and return the transformed feature.
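For reference, the one-parameter Box-Cox transformation of a strictly positive value x is usually written as shown below. This is the standard textbook definition, added here for context rather than taken from the original analysis.

```latex
x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\ln x, & \lambda = 0
\end{cases}
```

With lambda = 0 the transformation reduces to a log transform, and with lambda = 1 it only shifts the values by one. That is why a single function can cover features with very different degrees of skew.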

    # add one to all skewed features, so we can log transform if needed
    df[skewed_features] += 1
    # conduct boxcox transformation
    from scipy.stats import boxcox
    # apply to each of the skewed features
    for i in skewed_features:
        df[i], lmbda = boxcox(df[i], lmbda=None)

One-Hot Encoding

Finally, we’ll need to one-hot encode (or dummy code) our categorical variables so they can be interpreted by the model.
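As a small illustration of what one-hot encoding does, here is a sketch on a toy frame. The values are made up; only the use of `pd.get_dummies` mirrors the analysis.

```python
import pandas as pd

# toy data (hypothetical values)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "MSZoning": ["RL", "RH", "RL"],
})

# numeric columns pass through unchanged; each category of a text column
# becomes its own indicator column named <column>_<category>
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```

If the training and test sets are encoded separately, their columns can differ when a category appears in only one of them, so it is generally safer to encode them together or to align the columns afterwards.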

    df = pd.get_dummies(df)

Modeling

We’re going to fit two widely applied machine learning models to the training data and evaluate their relative performance using cross-validation.

Random Forest Regressor

To ensure our random forest regressor makes the best predictions it can, we’re going to optimize its hyperparameter values.

We want to estimate the optimal values for:

- n_estimators: number of trees in the forest
- max_features: maximum number of features to consider at each split
- max_depth: maximum number of levels in each tree
- min_samples_split: minimum number of samples required to split a node
- min_samples_leaf: minimum number of samples required at each leaf node
- bootstrap: whether each tree is built on a bootstrap sample or on the whole data set

    n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
    # 'auto' is accepted only by older scikit-learn releases (removed in 1.3);
    # 1.0 is the equivalent setting for regressors
    max_features = ['auto', 'sqrt', 'log2']
    max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
    max_depth.append(None)
    min_samples_split = [2, 5, 10]
    min_samples_leaf = [1, 2, 4]
    bootstrap = [True, False]
    grid_param = {'n_estimators': n_estimators,
                  'max_features': max_features,
                  'max_depth': max_depth,
                  'min_samples_split': min_samples_split,
                  'min_samples_leaf': min_samples_leaf,
                  'bootstrap': bootstrap}

If we used GridSearchCV from scikit-learn to identify the optimal hyperparameters, we would be evaluating 6,480 candidate models, or 32,400 fits with five-fold cross-validation.

That would be very computationally expensive, so instead we’ll use RandomizedSearchCV, which evaluates a specified number of candidate models (n_iter) with hyperparameters sampled at random from our defined parameter space.

We’re going to do k-fold cross-validation using five folds.
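The candidate and fit counts quoted above follow directly from the size of the search space. Here is a quick arithmetic check; the per-parameter counts are read off the lists defined earlier, and `n_iter = 500` is the value passed to the randomized search.

```python
from math import prod

# number of candidate values per hyperparameter in the random forest search space
n_values = {
    "n_estimators": 10,      # np.linspace(200, 2000, num=10)
    "max_features": 3,       # 'auto', 'sqrt', 'log2'
    "max_depth": 12,         # 11 values from np.linspace(10, 110, num=11) plus None
    "min_samples_split": 3,
    "min_samples_leaf": 3,
    "bootstrap": 2,
}
cv_folds = 5
n_iter = 500

grid_candidates = prod(n_values.values())  # every combination in the grid
grid_fits = grid_candidates * cv_folds     # each candidate is fit once per fold
random_fits = n_iter * cv_folds            # fits needed by the randomized search

print(grid_candidates, grid_fits, random_fits)
```

The randomized search therefore needs 2,500 fits, roughly 8% of the 32,400 required by the exhaustive grid.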

    from sklearn.ensemble import RandomForestRegressor
    # the model prior to hyperparameter optimization
    RFR = RandomForestRegressor(random_state=1)
    from sklearn.model_selection import RandomizedSearchCV
    RFR_random = RandomizedSearchCV(estimator=RFR, param_distributions=grid_param,
                                    n_iter=500, cv=5, verbose=2,
                                    random_state=42, n_jobs=-1)
    RFR_random.fit(train, resp)
    print(RFR_random.best_params_)

Now we have a model with hyperparameters well suited to our data.

    Best_RFR = RandomForestRegressor(n_estimators=1000, min_samples_split=2,
                                     min_samples_leaf=1, max_features='sqrt',
                                     max_depth=30, bootstrap=False)

We want a precise measurement of how the home prices predicted by the model differed from the actual prices of the homes sold.

We’ll calculate the root mean squared error (RMSE) for the model through k-fold cross-validation.

Given five folds, we’ll report the mean of the five per-fold RMSE values.

    from sklearn.model_selection import KFold, cross_val_score
    n_folds = 5
    def rmse_cv(model):
        # use the KFold object itself as the splitter so the folds are shuffled
        kf = KFold(n_folds, shuffle=True, random_state=42)
        rmse = np.sqrt(-cross_val_score(model, train, resp,
                                        scoring="neg_mean_squared_error", cv=kf))
        return rmse.mean()
    rmse_cv(Best_RFR)

The random forest model does fairly well, with a mean RMSE of 0.149.
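Because the response was transformed with log(1 + x), this RMSE is measured on the log scale rather than in dollars. One rough reading, sketched below on the assumption that prices are large relative to 1, is that a log-scale error of e corresponds to a multiplicative error of about exp(e) in price.

```python
import math

rmse_log = 0.149  # mean cross-validated RMSE reported for the random forest

# a log-scale error of this size corresponds to predictions off by roughly this factor
typical_factor = math.exp(rmse_log)
relative_error = math.expm1(rmse_log)  # exp(x) - 1

print(f"factor: {typical_factor:.3f}, relative error: {relative_error:.1%}")
```

Read this way, the typical prediction is off by around 16% of the sale price. This is an approximation meant for intuition, not a substitute for computing the error in dollars.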

Let’s try another model to see if we can obtain better predictions.

Gradient Boosting Regressor

We’ll conduct the same evaluation using RandomizedSearchCV to identify the optimal hyperparameters.

The gradient boosting regressor we’ll use from “xgboost” has the following hyperparameters we’ll want to optimize:

- n_estimators: number of trees
- subsample: fraction of samples used per tree
- max_depth: maximum number of levels in each tree
- min_child_weight: minimum sum of weights of all observations required in a child
- colsample_bytree: fraction of features used per tree
- learning_rate: learning rate or step size shrinkage
- gamma: minimum reduction of the cost function required to make a split

    n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
    subsample = [.6, .7, .8, .9, 1]
    max_depth = [int(x) for x in np.linspace(10, 50, num=10)]
    min_child_weight = [1, 3, 5, 7]
    colsample_bytree = [.6, .7, .8, .9, 1]
    learning_rate = [.01, .015, .025, .05, .1]
    gamma = [.05, .08, .1, .3, .5, .7, .9, 1]
    rand_param = {'n_estimators': n_estimators,
                  'subsample': subsample,
                  'max_depth': max_depth,
                  'colsample_bytree': colsample_bytree,
                  'min_child_weight': min_child_weight,
                  'learning_rate': learning_rate,
                  'gamma': gamma}

Using the same approach employed for the random forest model, we’ll run the randomized hyperparameter search using k-fold cross-validation.

    from xgboost import XGBRegressor
    # base estimator prior to hyperparameter optimization (default settings assumed)
    Boost = XGBRegressor()
    Boost_random = RandomizedSearchCV(estimator=Boost, param_distributions=rand_param,
                                      n_iter=500, cv=5, verbose=2,
                                      random_state=42, n_jobs=-1)
    Boost_random.fit(train, resp)

We can now calculate the RMSE for the tuned model and compare xgboost’s performance to the random forest model.

    Best_Boost = XGBRegressor(subsample=.7, n_estimators=1600, min_child_weight=3,
                              max_depth=41, learning_rate=.025, gamma=.05,
                              colsample_bytree=.6)
    # evaluate rmse
    rmse_cv(Best_Boost)

Our gradient boosting regression model outperformed the random forest model, with an RMSE value of 0.131.

Making Final Predictions

I took a pragmatic approach to modeling in this analysis; there are additional modeling techniques that can marginally increase the prediction accuracy, such as stacking or applying a suite of alternate models (e.g. Lasso, ElasticNet, KernelRidge).

We’ll just apply the best model from this analysis (gradient boosting regression) to the test set and evaluate its performance.

    # fit to the training data
    Best_Boost.fit(train, resp)
    # transform predictions using exponential function
    ypred = np.expm1(Best_Boost.predict(test))
    # make a data frame to hold predictions, and submit to Kaggle
    sub = pd.DataFrame()
    sub['Id'] = test['Id']
    sub['SalePrice'] = ypred
    sub.to_csv('KaggleSub.csv', index=False)

The gradient boosting regression model achieved an RMSE value of 0.1308 on the test set, not bad!

Conclusion

We can make reasonable predictions about the price a house will sell for based on characteristics of the property.

Key steps include assigning appropriate values for NAs, normalizing variables, optimizing hyperparameters for candidate models, and choosing the best model.

I appreciate any feedback and constructive criticism.

The code associated with this analysis can be found on github.com/njermain.