Closing the Sale: Predicting Home Prices via Linear Regression

However, the results are difficult to interpret (selling price in log or square-root dollars), and we aren’t able to predict which outliers will land in the training set versus the test set. Nevertheless, as data scientists we must make certain assumptions and work within the confines of the data that is currently available.

Imports, Data Cleansing, and EDA

Cleaning and EDA are important for this challenge, as this data set contains many ordinal / categorical features that may be important in categorization and will need to be converted to numerical values. As a baseline, I imported the following libraries to clean, explore and model the training data.

One of my first preprocessing steps was to convert all object types into numerical features that can be used in a Linear Regression. According to the data dictionary, there were several categorical features that were ordinal in nature. Using my personal discretion, I converted values that would seem important to a prospective home buyer onto a scale starting at 0 (if NA was an option) or 1 (if NA was not an option) and counting upward (depending on the number of levels in the feature). While I could have used the get_dummies method, using a range of values (which can then be scaled) allows for more nuance and a reduced number of features overall.

Converting some of the categorical features

Additionally, there were many missing / null values within this data set that needed to be imputed. For features with a large number of missing values (e.g. ‘Lot Frontage’), it was more desirable to impute these with “real” versus “dummy” values (e.g., 0s). This can be done by using the mean, median, mode, or some other function of correlated features. For the sake of simplicity and time, I did not use a calculation to impute all missing values. I instead focused on the variables with the largest proportion of nulls. For all other features, I used .fillna(0). Using a basic, and somewhat arbitrary, assumption, I imputed the missing values for lot frontage via the following:

I prepared the non-ordinal categorical features to be transformed via .get_dummies by converting certain numeric features into strings:

Converting numerical features to string types

Finally, in order to corroborate my initial assumptions about feature importance, I created a Seaborn heatmap showing Pearson’s correlation coefficient for each feature in relation to my target feature: sale price.

Feature Engineering & Selection

A key skill for all data scientists is the art and science of feature selection. This requires both rigorous statistical testing and subject matter expertise (or intuition) to filter signals from noise. One interesting point that arose while conducting my initial data exploration was an apparent correlation between parcel ID and sale price. Through a scatter plot, we can spot both the bifurcation in the parcel IDs (those beginning with 5 vs. those beginning with 9) and the tendency for homes with a PID beginning in 9 to be clustered at lower values. While this may be statistically significant, I did not conduct a hypothesis test to validate it.
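To make the cleaning steps described earlier more concrete, here is a minimal sketch on a toy DataFrame. It is not the code used for this project. Every column name other than ‘Lot Frontage’, the quality labels (Po through Ex), and the sample values are assumptions for illustration. The median imputation is a stand-in as well, since the actual lot frontage rule is not reproduced in the text.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data. Column names (apart from
# "Lot Frontage") and all values are hypothetical.
df = pd.DataFrame({
    "Exter Qual": ["TA", "Gd", "Ex", "Fa", "TA", "Gd"],       # NA not an option
    "Garage Qual": ["TA", np.nan, "Gd", "TA", np.nan, "Ex"],  # NA is an option
    "Lot Frontage": [60.0, np.nan, 80.0, np.nan, 70.0, 90.0],
    "MS SubClass": [20, 60, 20, 30, 60, 20],                  # numeric code, really a category
    "Neighborhood": ["NAmes", "Edwards", "NAmes", "OldTown", "Edwards", "NAmes"],
    "SalePrice": [150000, 120000, 250000, 90000, 140000, 300000],
})

# Ordinal encoding: start at 1 when NA is not a valid level...
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["Exter Qual"] = df["Exter Qual"].map(quality_scale)

# ...and at 0 when NA (e.g. "no garage") is a valid level.
quality_scale_with_na = {"NA": 0, **quality_scale}
df["Garage Qual"] = df["Garage Qual"].fillna("NA").map(quality_scale_with_na)

# Impute the heavily missing feature with a "real" value.
# Assumption: the median is used here purely as an example.
df["Lot Frontage"] = df["Lot Frontage"].fillna(df["Lot Frontage"].median())

# Cast numeric codes to strings so get_dummies treats them as categories.
df["MS SubClass"] = df["MS SubClass"].astype(str)

# One-hot encode the non-ordinal categorical features.
model_df = pd.get_dummies(df, columns=["MS SubClass", "Neighborhood"], dtype=int)

# Any remaining nulls get the simple "dummy" fill.
model_df = model_df.fillna(0)

# Pearson correlation of every feature with the target.
corr = (
    model_df.corr(method="pearson")["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)
print(corr)

# With seaborn installed, the same values could be drawn as a heatmap:
# import seaborn as sns
# sns.heatmap(corr.to_frame(), annot=True)
```

Keeping the ordinal features as a single scaled column, rather than one dummy column per level, is what holds the feature count down in this sketch: only the two non-ordinal columns expand under get_dummies.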