Log Book — Practical guide to Linear & Polynomial Regression in RThis is a practical guide to linear and polynomial regression in R.
I have tried to cover the basics of theory and practical implementation of those with the King County Data-set.
Dip Ranjan ChatterjeeBlockedUnblockFollowFollowingJun 26I am relatively new in my journey within the Data Science realm and this article is sort of a note on a topic that I make.
These might be of help to other fellow enthusiasts.
Please note that as these are my study notes there are sections which are referenced form other sources.
I have given a detailed reference at the end of all the sources which have helped me.
The Regression EquationSimple linear regression estimates exactly how much Y will change when X changes by a certain amount.
With the correlation coefficient, the variables X and Y are interchangeable.
With regression, we are trying to predict the Y variable from X using a linear relationship (i.
, a line):Y = b0 + b1 XThe symbol b0 is known as the intercept (or constant), and b1 as the slope for X.
Both appear in R output as coefficients.
The Y variable is known as the response or dependent variable since it depends on X.
The X variable is known as the predictor or independent variable.
The machine learning community tends to use other terms, calling Y the target and X a feature vector.
Fitted Values and ResidualsImportant concepts in regression analysis are the fitted values and residuals.
In general, the data doesn’t fall exactly on a line, so the regression equation should include an explicit error term :The fitted values, also referred to as the predicted values, are typically denoted by (Y-hat).
These are given by:where b0 & b1 indicates that the coefficients are estimated versus known.
Multiple Linear RegressionWhen there are multiple predictors, the equation is simply extended to accommodate them:Instead of a line, we now have a linear model — the relationship between each coefficient and its variable (feature) is linear.
We will have a plane which will pass through the sample points right on the “middle”, something like the image below.
In two dimensions, it’s a line, in three a plane, in N, a hyperplane.
Now that we have an idea on what is linear regression is, we will go through the details of different types of regression models to predict the price of the very popular King County housing data set.
You can download the data set from here.
The goal is that we use different models to define a plane which best fits our data with low error rate.
This plane/line will then be used to predict outcomes as new data come in.
King County Housing DataBefore beginning with the analysis, please make sure you have the below R packages installed and loaded:library(doParallel)library(‘randomForest’)library(MASS)library(readr)library(ggplot2)library(corrplot)library(mlbench)library(Amelia)library(plotly)library(reshape2)library(caret)library(caTools)library(dplyr)library(moments)Exploratory Data AnalysisAn example of using regression is in estimating the value of houses.
County assessors must estimate the value of a house for the purposes of assessing taxes.
Real estate consumers and professionals consult popular websites such as Zillow to ascertain a fair price.
Here are a few rows of housing data from King County (Seattle), Washington, from the house data.
frame:house <- read.
csv”)head(house)The data-set itself consists of 21,613 examples with 19 features[not considering the id and date columns].
The features are:Now if you notice the data, few values you see are given as categorical for example waterfront, view, condition, grade, etc.
Now there is also a scope for you to group the zip code variable, as the zip code number on its own does not define anything, but the price of a house is location dependent.
If you know that some zip codes belong to the posh area of the town then you can group them into zipgroup1, next such area as zipgroup2 and so on.
We will not be doing that here as we have very little idea as to which zip code area belongs to which part of town.
But we do have the lat and long values so we will be using those.
But if you want to learn more about converting data into categorical values then you can refer to this link.
It is usually done using the relevel command.
Missing valuesYou can check for missing data using the below command:missmap(house,col=c(‘yellow’,’black’),y.
labels=’’,legend=TRUE)as per this graph there are no missing values.
Now we will remove the id column and the date column (all the information is gathered within one year only).
house <- house[,-c(1,2)]Correlated PredictorsIn multiple regression, the predictor variables are often correlated with each other.
Correlation is a problem because independent variables should be independent, and if two variables are correlated then it means that you are basically using almost same feature multiple times.
This will give incorrect results.
Once you find the correlated variables you can keep one and remove another.
Many methods perform better if highly correlated attributes are removed.
You can check the correlated variables using the below command:corrplot(cor(house),type=”upper”,method=”color”,addCoef.
col = “black”,tl.
col = “black”, number.
cex = 0.
6,mar=c(0,0,0,0))Now that we know which pairs of variables are correlated we can choose which one keep and which one to remove:[sqft_living, grade] : 0.
76 , will keep sqft_living[sqft_living, sqft_above] : 0.
88 , will keep sqft_living[sqft_living, sqft_living15] : 0.
76 , will keep sqft_living[sqft_lot, sqft_lot15] : 0.
72 , will keep sqft_lotSo we are removing grade, sqft_above, sqft_living15, sqft_lot15 as the information carried by them is already provided by the other variables.
house <- house[,-c(10,11,18,19)]DimensionalityIf you want to reduce the dimensionality of your data-set, you apply the function nearZeroVar from the caret package.
It diagnoses predictors that have either one or very few unique values relative to the number of samples and the ratio of the frequency of the most common value to the frequency of the second most common value is large.
As of now we won’t be doing that as we have 14 features left.
Though I would remove the zipcode from the dataset as the location data is contained in lat and long.
house <- house[,-c(13)]SkewnessNext we will try to analyze the skewness of our numeric variables.
Skewness check is done to make the data normal which is one of the assumptions of linear regression.
Some people suggest here that an acceptable range of values for skewness lies between (-2,2).
Consequently, we detect which variables are not within this range and they will be transformed using the log function[we will not transform the categorical data].
apply(house, 2, skewness, na.
rm =TRUE)Transforming the required data to log:house$price <- log(house$price)house$sqft_lot <- log(house$sqft_lot)The difference in the skewness before and after transforming can be checked in the below 2 histograms:Data ModelingNow that we have done our EDA, we will start with the goal to predict the sales price from the other variables.
To begin with we will take all the variables/features to make our model; the argument na.
omit causes the model to drop records that have missing values.
The loaded dataset is separated into training and validation sets using a 70/30 split.
Now, we have a training dataset that we will use to train our models and a validation set to use later to measure the performance of our models.
seed(1000)i <- sample(2, nrow(house), replace=TRUE, prob=c(0.
3))trainData <- house[i==1,]testData <- house[i==2,]model_lm <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +waterfront+ view + sqft_basement + condition + yr_built + lat + long, data = trainData)summary(model_lm)As we can see running a summary of our linear model, the coefficient of determination (or R-Squared) is good.
8% of the variance in the outcome is predictable from the linear regression model that we created.
Next we will analyse the RMSE (or Root Mean Square Error) of our model.
prediction_lm <- predict(model_lm, testData, type=”response”)model_lm_output <- cbind(testData, prediction_lm)# RMSE valueRMSE_lm <- sqrt(mean((model_lm_output$price — model_lm_output$prediction_lm)^2,na.
rm=TRUE))print(RMSE_lm)As we can see above, our RMSE is 0.
It measures the differences between prices predicted by our model and the actual values.
The lower the value, the better it is.
Ours is close to 0 so it is a good indicator.
Now that we have a basic model we need to understand the performance of the model:The most important performance metric from a data science perspective is root mean squared error, or RMSE.
RMSE is the square root of the average squared error in the predicted values:You can calculate the RMSE in the above model byRMSE_house_lm <- sqrt(mean(residuals(house_lm)^2))This measures the overall accuracy of the model, and is a basis for comparing it to other models (including models fit using machine learning techniques).
Similar to RMSE is the residual standard error, or RSE.
In this case we have p predictors, and the RSE is given by:MSE_house_lm <- (mean(residuals(house_lm)^2))The only difference is that the denominator is the degrees of freedom, as opposed to number of records.
In practice, for linear regression, the difference between RMSE and RSE is very small, particularly for big data applications.
In our case the observed price values deviate from the predicted price by approximately 202700 units in average.
This corresponds to an error rate of 202700/mean(house$price) = 37.
5%, which is high.
Residual sum of squares (RSS) is the sum of the squared residuals:rss <- sum(residuals(fit)^2)Residual standard error (RSE) is the square root of (RSS / degrees of freedom):rse <- sqrt( sum(residuals(fit)^2) / fit$df.
residual )Another useful metric that you will see in software output is the coefficient of determination, also called the R-squared statistic.
R-squared ranges from 0 to 1 and measures the proportion of variation in the data that is accounted for in the model.
It is useful mainly in explanatory uses of regression where you want to assess how well the model fits the data.
The formula is:The denominator is proportional to the variance of Y.
The output from R also reports an adjusted R-squared, which adjusts for the degrees of freedom; seldom is this significantly different in multiple regression.
In our case we have R2 as 0.
69 which is ok.
Along with the estimated coefficients, R reports the standard error of the coefficients (SE) and a t-statistic:The t-statistic — and its mirror image, the p-value — measures the extent to which a coefficient is “statistically significant” — that is, outside the range of what a random chance arrangement of predictor and target variable might produce.
The higher the t-statistic (and the lower the p-value), the more significant the predictor.
WARNINGIn addition to the t-statistic, R and other packages will often report a p-value (Pr(>|t|) in the R output) and F-statistic.
Data scientists do not generally get too involved with the interpretation of these statistics, nor with the issue of statistical significance.
Data scientists primarily focus on the t-statistic as a useful guide for whether to include a predictor in a model or not.
High t-statistics (which go with p-values near 0) indicate a predictor should be retained in a model, while very low t-statistics indicate a predictor could be dropped.
We can get some insights from the graphic representation of our linear model:# Graphic representationpar(mfrow=c(2,2))plot(model_lm)Now how do we read these graphs?Residuals vs Fitted: there should be no strong patterns and no outliers since the residuals are randomly distributed around zero.
As we can see in our plot, all our residuals are distributed within (-1,1) and close to 0.
Normal Q-Q: Residuals should be normally distributed around the diagonal line, as it happens in our case.
Scale-Location: It is similar to the first graph, but the residuals in the Y axis are scaled.
There should not be any strong pattern, as it happens in this case although we have a smooth positive diagonal lineResiduals vs Leverage: It is useful to detect outliers.
In our case, it seems that our plot does not contain any outlier beyond the discontinuous red line (Cook´s Distance).
Look at the values of Cook’s distance: any value above 1 indicates a case that might be influencing the model.
In statistics, Cook’s distance or Cook’s D is a commonly used estimate of the influence of a data point when performing a least-squares regression analysis.
In a practical ordinary least squares analysis, Cook’s distance can be used in several ways: to indicate influential data points that are particularly worth checking for validity; or to indicate regions of the design space where it would be good to be able to obtain more data points.
Some texts tell you that points for which Cook’s distance is higher than 1 are to be considered as influential.
Other texts give you a threshold of 4/N or 4/(N−k−1), where N is the number of observations and k the number of explanatory variables.
John Fox, in his booklet on regression diagnostics is rather cautious when it comes to giving numerical thresholds.
He advises the use of graphics and to examine in closer details the points with “values of D that are substantially larger than the rest”.
According to Fox, thresholds should just be used to enhance graphical displays.
In our data we can see at least 2 such points which warrants out attention.
cooksd <- cooks.
distance(model_lm)# Plot the Cook’s Distancesample_size <- nrow(trainData)plot(cooksd, pch=”*”, cex=2, main=”Influential Obs by Cooks distance”) # plot cook’s distancetext(x=1:length(cooksd)+1, y=cooksd, labels=ifelse(cooksd>0.
1, names(cooksd),””), col=”red”) # add labelsWith the above command and the graph we can get the row numbers of the outlier data, if you check these data manually you will see there are discrepancies in this data(one has got a bedroom count of 33), let us see what happens if we remove these:#try to remove the top x outliers to have a looktop_x_outlier <- 2influential <- as.
numeric(names(sort(cooksd, decreasing = TRUE)[1:top_x_outlier]))house2 <- house[-c(15871,12778), ]Again running the model, we get the below RMSE which has a minor improvement on the previous result:Rank Features By ImportanceIt is very difficult to guess which features to include in our model, as we did not know which features to select we took all features in our initial model.
The importance of features can be estimated from data by building a model.
Some methods like decision trees have a built in mechanism to report on variable importance.
For other algorithms, the importance can be estimated using a ROC curve analysis conducted for each attribute.
The varImp is then used to estimate the variable importance, which is printed and plotted.
# prepare training schemecontrol <- trainControl(method=”repeatedcv”, number=10, repeats=3)# train the modelmodel <- train(price~.
, data=trainData, method=”lm”, preProcess=”scale”, trControl=control)# estimate variable importanceimportance <- varImp(model, scale=FALSE)# summarize importanceprint(importance)# plot importanceplot(importance)After running the above command we can see the features which the model thinks are important.
Feature SelectionAutomatic feature selection methods can be used to build many models with different subsets of a dataset and identify those attributes that are and are not required to build an accurate model.
A popular automatic method for feature selection provided by the caret R package is called Recursive Feature Elimination or RFE.
The example below provides an example of the RFE method on the dataset.
A Random Forest algorithm is used on each iteration to evaluate the model.
The algorithm is configured to explore all possible subsets of the attributes.
All 13 attributes are selected in this example, although in the plot showing the accuracy of the different attribute subset sizes, we can see that just 6 attributes gives almost comparable results.
Now the problem with RFE algorithm is the huge amount of time it takes to run (I am talking about 10–12 hours), in order to reduce the time taken we can make use of a cluster (reduced to almost 2 hours, trying checking your task manager during this time).
To tune a predictive model using multiple workers, a separate function is used to “register” the parallel processing technique and specify the number of workers to use.
For example, to use the doParallel package with five cores on the same machine, the package is loaded and them registered (check task manager once, you will notice the difference):library(doParallel)cl <- makePSOCKcluster(5)registerDoParallel(cl)control <- rfeControl(functions=rfFuncs, method=”cv”, number=10)# run the RFE algorithm, once the cluster cl is built then the algorithm is by default run in parallel.
results <- rfe(house[,2:19], house[,1], sizes=c(1:19), rfeControl=control)# summarize the resultsprint(results)# list the chosen featurespredictors(results)# plot the resultsplot(results, type=c(“g”, “o”))# once the job is done, stop the clusterstopCluster(cl)As per the above model output the below 6 variables produce the lowest RMSE:“lat” “long” “sqft_living” “sqft_lot” “yr_built” “view”If you notice that the feature selected in the RFE and the varIMP way are a bit different, different methods will select different subsets of features.
There is likely no “best set” of features just like there is no best model.
Cross-ValidationNow if you have noticed we are using method=’CV’ so next we will see what this CV stands for.
Classic statistical regression metrics (R2, F-statistics, and p-values) are all “insample” metrics — they are applied to the same data that was used to fit the model.
Intuitively, you can see that it would make a lot of sense to set aside some of the original data, not use it to fit the model, and then apply the model to the set aside (holdout) data to see how well it does.
Normally, you would use a majority of the data to fit the model, and use a smaller portion to test the model.
This idea of “out-of-sample” validation is not new, but it did not really take hold until larger data sets became more prevalent; with a small data set, analysts typically want to use all the data and fit the best possible model.
Using a holdout sample, though, leaves you subject to some uncertainty that arises simply from variability in the small holdout sample.
How different would the assessment be if you selected a different holdout sample?Cross-validation extends the idea of a holdout sample to multiple sequential holdout samples.
The algorithm for basic k-fold cross-validation is as follows:1.
Set aside 1/k of the data as a holdout sample.
Train the model on the remaining data.
Apply (score) the model to the 1/k holdout, and record needed model assessment metrics.
Restore the first 1/k of the data, and set aside the next 1/k (excluding any records that got picked the first time).
Repeat steps 2 and 3.
Repeat until each record has been used in the holdout portion.
Average or otherwise combine the model assessment metrics.
The division of the data into the training sample and the holdout sample is also called a fold.
Step-wise RegressionIn our problem, many variables could be used as predictors in a regression.
There is a method called Stepwise regression which sequentially adds or removes features from a model and checks the performance of the model.
Note that, the train() function [caret package] provides an easy workflow to perform stepwise selections using the leaps and the MASS packages.
It has an option named method, which can take the following values:“leapBackward”, to fit linear regression with backward selection“leapForward”, to fit linear regression with forward selection“leapSeq”, to fit linear regression with stepwise selection .
You also need to specify the tuning parameter nvmax, which corresponds to the maximum number of predictors to be incorporated in the model.
For example, you can vary nvmax from 1 to 5.
In this case, the function starts by searching different best models of different size, up to the best 5-variables model.
That is, it searches the best 1-variable model, the best 2-variables model, …, the best 5-variables models.
We’ll use 10-fold cross-validation to estimate the average prediction error (RMSE) of each of the 5 models.
The RMSE statistical metric is used to compare the 5 models and to automatically choose the best one, where best is defined as the model that minimize the RMSE.
Stepwise regression is very useful for high-dimensional data containing multiple predictor variables.
Other alternatives are the penalized regression (ridge and lasso regression) and the principal components-based regression methods (PCR and PLS).
## THIS IS FOR STEPWISE REGRESSIONset.
seed(123)# Set up repeated k-fold cross-validationtrain.
control <- trainControl(method = “cv”, number = 10)# Train the modelstep.
model <- train(price ~.
, data = trainData[,1:14],method = “leapSeq”,tuneGrid = data.
frame(nvmax = 1:13),trControl = train.
model$finalModel)The RMSE with 7 variables is almost similar to the one we found in the beginning with more variables.
Adding more variables, does not necessarily mean we have a better model.
Statisticians use the principle of Occam’s razor to guide the choice of a model: all things being equal, a simpler model should be used in preference to a more complicated model.
Including additional variables generally reduces RMSE and increases R2.
Hence, these are not appropriate to help guide the model choice.
In the 1970s, Hirotugu Akaike, the eminent Japanese statistician, deveoped a metric called AIC (Akaike’s Information Criteria) that penalizes adding terms to a model.
In the case of regression, AIC has the form:AIC = 2P + n log(RSS/n)where p is the number of variables and n is the number of records.
The goal is to find the model that minimizes AIC; models with k more extra variables are penalized by 2k.
AIC, BIC AND MALLOWS CPThe formula for AIC may seem a bit mysterious, but in fact it is based on asymptotic results in information theory.
There are several variants to AIC:AICc: a version of AIC corrected for small sample sizes.
BIC or Bayesian information criteria: similar to AIC with a stronger penalty for including additional variables to the model.
Mallows Cp: A variant of AIC developed by Colin Mallows.
Data scientists generally do not need to worry about the differences among these in-sample metrics or the underlying theory behind them.
How do we find the model that minimizes AIC?.One approach is to search through all possible models, called all subset regression.
This is computationally expensive and is not feasible for problems with large data and many variables.
An attractive alternative is to use stepwise regression about which we learned above, this successively adds and drops predictors to find a model that lowers AIC.
Simpler yet are forward selection and backward selection.
In forward selection, you start with no predictors and add them one-by-one, at each step adding the predictor that has the largest contribution to , stopping when the contribution is no longer statistically significant.
In backward selection, or backward elimination, you start with the full model and take away predictors that are not statistically significant until you are left with a model in which all predictors are statistically significant.
The problem with stepwise methods is that they assess the fit of a variable based on the other variables that were in the model.
Some people use the analogy of getting dressed to describe this problem.
If a stepwise regression method was selecting your clothes, it would decide what clothes you should wear, based on the clothes it has already selected.
Penalized regression is similar in spirit to AIC.
Instead of explicitly searching through a discrete set of models, the model-fitting equation incorporates a constraint that penalizes the model for too many variables (parameters).
Rather than eliminating predictor variables entirely — as with stepwise, forward, and backward selection — penalized regression applies the penalty by reducing coefficients, in some cases to near zero.
Common penalized regression methods are ridge regression and lasso regression.
Stepwise regression and all subset regression are in-sample methods to assess and tune models.
This means the model selection is possibly subject to overfitting and may not perform as well when applied to new data.
One common approach to avoid this is to use cross-validation to validate the models.
In linear regression, overfitting is typically not a major issue, due to the simple (linear) global structure imposed on the data.
For more sophisticated types of models, particularly iterative procedures that respond to local data structure, cross validation is a very important tool.
Ridge and Lasso regressionThis section is taken from this excellent Analytics vidhya article, to know more about the mathematics behind Ridge and Lasso Regression please do go through the link.
Regularization helps to solve over fitting problem in machine learning.
Simple model will be a very poor generalization of data.
At the same time, complex model may not perform well in test data due to over fitting.
We need to choose the right model in between simple and complex model.
Regularization helps to choose preferred model complexity, so that the model is better at predicting.
Regularization is nothing but adding a penalty term to the objective function and control the model complexity using that penalty term.
It can be used for many machine learning algorithms.
Both Ridge and Lasso regression uses L2 and L1 regularizations.
Ridge and Lasso regression are powerful techniques generally used for creating parsimonious models in presence of a ‘large’ number of features.
Here ‘large’ can typically mean either of two things:Large enough to enhance the tendency of a model to overfit (as low as 10 variables might cause overfitting)Large enough to cause computational challenges.
With modern systems, this situation might arise in case of millions or billions of featuresThough Ridge and Lasso might appear to work towards a common goal, the inherent properties and practical use cases differ substantially.
If you’ve heard of them before, you must know that they work by penalizing the magnitude of coefficients of features along with minimizing the error between predicted and actual observations.
These are called ‘regularization’ techniques.
The key difference is in how they assign penalty to the coefficients:Ridge Regression:Performs L2 regularization, i.
adds penalty equivalent to square of the magnitude of coefficientsMinimization objective = LS Obj + α * (sum of square of coefficients)2.
Lasso Regression:Performs L1 regularization, i.
adds penalty equivalent to absolute value of the magnitude of coefficientsMinimization objective = LS Obj + α * (sum of absolute value of coefficients)Note that here ‘LS Obj’ refers to ‘least squares objective’, i.
the linear regression objective without regularization.
Now that we have a little idea of how ridge and lasso regression work, lets try to consolidate our understanding by comparing them and try to appreciate their specific use cases.
I will also compare them with some alternate approaches.
Lets analyze these under three buckets:1.
Key DifferenceRidge: It includes all (or none) of the features in the model.
Thus, the major advantage of ridge regression is coefficient shrinkage and reducing model complexity.
Lasso: Along with shrinking coefficients, lasso performs feature selection as well.
Some of the coefficients become exactly zero, which is equivalent to the particular feature being excluded from the model.
Traditionally, techniques like stepwise regression were used to perform feature selection and make parsimonious models.
But with advancements in Machine Learning, ridge and lasso regression provide very good alternatives as they give much better output, require fewer tuning parameters and can be automated to a large extent.
Typical Use CasesRidge: It is majorly used to prevent overfitting.
Since it includes all the features, it is not very useful in case of exorbitantly high features, say in millions, as it will pose computational challenges.
Lasso: Since it provides sparse solutions, it is generally the model of choice (or some variant of this concept) for modelling cases where the #features are in millions or more.
In such a case, getting a sparse solution is of great computational advantage as the features with zero coefficients can simply be ignored.
Its not hard to see why the stepwise selection techniques become practically very cumbersome to implement in high dimensionality cases.
Thus, lasso provides a significant advantage.
Presence of Highly Correlated FeaturesRidge: It generally works well even in presence of highly correlated features as it will include all of them in the model but the coefficients will be distributed among them depending on the correlation.
Lasso: It arbitrarily selects any one feature among the highly correlated ones and reduced the coefficients of the rest to zero.
Also, the chosen variable changes randomly with change in model parameters.
This generally doesn’t work that well as compared to ridge regression.
Along with Ridge and Lasso, Elastic Net is another useful techniques which combines both L1 and L2 regularization.
It can be used to balance out the pros and cons of ridge and lasso regression.
I encourage you to explore it further.
Now let us try ridge and lasso regression on our dataset:#Ridge Regression Modeltr.
control <- trainControl(method=”repeatedcv”, number = 10,repeats = 10)lambdas <- seq(1,0,-.
001) #Tuning Parameterset.
seed(123)ridge_model <- train(price~.
grid(alpha=0,lambda=lambdas))ridge_model$resultsvarImp(ridge_model)ridge_preds <- predict(ridge_model,newdata = testData)model_rrm_output <- cbind(testData, ridge_preds)RMSE_rrm <- sqrt(mean((model_rrm_output$price – model_rrm_output$ridge_preds)^2,na.
rm=TRUE))print(RMSE_rrm)Next we will try the Lasso regression and see if we can get any better output from that.
#Lasso Regression Modelset.
seed(123)lasso_model <- train(price~.
0001)))varImp(lasso_model)lasso_preds <- predict(lasso_model,newdata = testData)model_lrm_output <- cbind(testData, lasso_preds)RMSE_lrm <- sqrt(mean((model_lrm_output$price – model_lrm_output$ridge_preds)²,na.
rm=TRUE))print(RMSE_lrm)Keep the RMSE score in mind which we will use to compare model performance further down the line.
Weighted RegressionWeighted regression is used by statisticians for a variety of purposes; in particular, it is important for analysis of complex surveys.
Data scientists may find weighted regression useful in two cases:Inverse-variance weighting when different observations have been measured with different precision.
Analysis of data in an aggregated form such that the weight variable encodes how many original observations each row in the aggregated data represents.
For example, with the housing data, older sales are less reliable than more recent sales.
Now that we have tried different modes of linear regression we can move to other methods and see if we can reduce the RMSE even further.
Non Linear RegressionIn some cases, the true relationship between the outcome and a predictor variable might not be linear.
There are different solutions extending the linear regression model for capturing these nonlinear effects, including:Polynomial regression.
This is the simple approach to model non-linear relationships.
It add polynomial terms or quadratic terms (square, cubes, etc) to a regression.
Fits a smooth curve with a series of polynomial segments.
The values delimiting the spline segments are called Knots.
Generalized additive models (GAM).
Fits spline models with automated selection of knots.
Suppose you suspect a nonlinear relationship between the response and a predictor variable, either by a priori knowledge or by examining the regression diagnostics.
Polynomial terms may not flexible enough to capture the relationship, and spline terms require specifying the knots.
Generalized additive models, or GAM, are a technique to automatically fit a spline regression.
So we will try GAM on our data next.
Generalized Additive ModelsIf you plot latitude against price then you will notice that a polynomial line will fit much better than a straight line, you can perform the same analysis for other variables and check their behavior.
plot(house$lat,house$price)The below regression curve fits much better:thegam<-gam(price ~ s(lat),data=trainData)plot(thegam, residuals=TRUE,shade.
col=”gray80")Running GAM on our data, we have:library(mgcv)lm_gam <- gam(price ~ bedrooms + bathrooms + s(sqft_living) + sqft_lot + floors +waterfront+ view + s(sqft_basement) + condition + yr_built + s(lat) + s(long),data=trainData)prediction_lm <- predict(lm_gam, testData, type=”response”)model_lm_output <- cbind(testData, prediction_lm)# RMSE valueRMSE_lm <- sqrt(mean((model_lm_output$price-model_lm_output$prediction_lm)²,na.
rm=TRUE))print(RMSE_lm)The term s(SqFtTotLiving) tells the gam function to find the “best” knots for a spline term.
You can see that the RMSE score is better than normal regression.
Random ForestRandom Forest is an algorithm capable of performing both regression and classification tasks.
In the case of regression, it operates by constructing a multitude of decision trees at training time and outputting the class that is the mean prediction of the individual trees.
We will study the Random Forest algorithm in details later.
As we did before in the linear model, we split the dataset into train and validation sets.
After that, we define the variables included in the model and we run it.
Note we will be taking all the variables in this model:library(randomForest)set.
seed(2000)tree_rf <- price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + condition + grade + sqft_above + yr_built + zipcode + lat + long + sqft_living15 + sqft_lot15model_rf <- randomForest(tree_rf, data=trainData, ntree=100,proximity=T, do.
trace=10)varImpPlot(model_rf)plot(model_rf, main = “”)# Making predictionsprediction_rf <- predict(model_rf, testData)model_rf_output <- cbind(testData, prediction_rf)# RMSERMSE_rf <- sqrt(mean((model_rf_output$price – model_rf_output$prediction_rf)²,na.
rm=TRUE))As we can see above, our RMSE for the Random Forest model is 0.
The RMSE value is the lowest that we have seen so far.
Gradient BoostingGradient Boosting is one of the most powerful techniques for machine learning problems.
It is an ensemble learning algorithm which combines the prediction of several base estimators in order to improve robustness over a single estimator.
As Random Forest, it is also a tree classifier model.
We will learn more about GBM later.
The below code runs a GBM on our data:fitControl <- trainControl(method = “cv”, number = 50)tune_Grid <- expand.
depth = 2, n.
trees = 500, shrinkage = 0.
minobsinnode = 10)set.
seed(825)tree_gbm <- price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + condition + grade + sqft_above + yr_built + zipcode + lat + long + sqft_living15 + sqft_lot15model_gbm <- train(tree_gbm, data = trainData, method = “gbm”, trControl = fitControl, verbose = FALSE)print(model_gbm)prediction_lm <- predict(model_gbm, testData, type=”response”)model_lm_output <- cbind(testData, prediction_lm)# RMSE valueRMSE_lm <- sqrt(mean((model_lm_output$price-model_lm_output$prediction_lm)²,na.
rm=TRUE))print(RMSE_lm)This is also good but not as good as the Random Forest results.
To summarize the results of all the models that we have seen so far in this article:The result is clear Random Forest wins the day.
But please note that it is not necessarily the fact that RF will perform better everytime, with different dataset different model will outperform RF and it is our job to find, tune and use the model which gives the best result.
The article was long but hope you have learnt something from it.
Please go through the below references they contain much more details on topics which I could not capture here.
Until next time.
com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/Books:An Introduction to Statistical Learning with Applications in R — Gareth James | Daniela Witten | Trevor Hastie | Robert TibshiraniThe 100 page machine learning book — Andriy BurkovDiscovering statistics Using r — Andy Field | Jeremy Miles | Zoë FieldPractical Statistics for Data Scientists — Peter Bruce | Andrew BrucePractical Data Science with R — NINA ZUMEL | JOHN MOUNT.