Regression Analysis: Linear Regression (Part I of III)
Sung Kim
Dec 16, 2018
Introduction
Regression analysis is perhaps the most fundamental statistical modelling technique, and one that helped usher in the wave of machine learning. Although its popularity is credited to how simple it is to build and interpret a model, this article will also step into the tougher side, where more of the mathematics lives.
1) Definitions
Simple linear regression has the form Y = aX + b, where Y and X are called the dependent and independent variables. Multiple linear regression extends the equation with additional independent variables, e.g. Y = aX1 + bX2 + c where Y = Weight, X1 = Height, and X2 = Gender.
If the goal is prediction or error reduction, linear regression can be used to fit a predictive model to an observed dataset. Moreover, the technique can be applied to quantify the strength of a relationship when the goal is to explain how much of the variation in Y can be attributed to variation in the Xs.
A note about sample size: a common rule of thumb in linear regression analysis is that at least 20 cases per independent variable are required.
2) Mathematics
We estimate the conditional expectation of the dependent variable given the independent variables, that is, the average value of Y when the Xs are fixed. The main task for a researcher in this process is therefore to find the coefficients (the βs) that produce the line of best fit.
http://dataaspirant.com/2013/08/07/assumptions-for-linear-regression/

Least squares is a statistical method used to determine a line of best fit by minimising the sum of squared errors. The errors are squared so that negative and positive deviations do not cancel each other out. Bear in mind that a sum of squares of exactly zero (although realistically impossible) is not a sign of a good model, as it would only indicate overfitting.
The above equations can be expressed in matrix form, which resembles the structure of a dataset retrieved from a SQL database. This notation is extremely important: if X^TX, where T denotes the transpose of a matrix, is singular (meaning it has no inverse), the unique least-squares estimates (LSEs) cannot be obtained.
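To make the matrix notation concrete, here is a minimal sketch of the least-squares solution via the normal equations, β = (XᵀX)⁻¹Xᵀy. It is written in Python rather than R purely for illustration, and the data values are made up:

```python
import numpy as np

# Toy data: roughly y = 2x + 1 with a little noise (values made up for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Design matrix with an intercept column of ones
X = np.column_stack([np.ones_like(x), x])

# Normal equations: beta = (X^T X)^{-1} X^T y
# (X^T X must be non-singular, i.e. invertible, for a unique solution)
beta = np.linalg.solve(X.T @ X, X.T @ y)
intercept, slope = beta
print(intercept, slope)  # close to 1 and 2
```

This is exactly what `lm()` computes under the hood (in a more numerically careful way).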
Mathematical Assumptions
Generally speaking, five assumptions need to be satisfied for a model to be valid. This chapter walks through what they are and how to check them, in the simplest possible words.
1) Linearity
E[Y], the expected value of the dependent variable, should be a straight-line function of each independent variable, holding the others fixed. Draw a scatter plot of the residuals against each independent variable to investigate whether linearity holds; a relationship may turn out to be linear, quadratic, or even quintic.
2) No or Little Multicollinearity
Multicollinearity refers to a situation in which two or more independent variables in a multiple regression model are highly linearly related. This often reduces the power of the model to identify independent variables that are statistically significant. Before building a model, draw a correlation matrix and, if needed, manipulate the dataset by eliminating, adding, or replacing variables. This process is not recommended when the number of independent variables or the size of the dataset is large.
After building a model, compute the Variance Inflation Factor (VIF), which quantifies the severity of multicollinearity in an ordinary least squares regression analysis. The details will be provided later in this article.
3) Multivariate Normality
The residuals should follow a normal distribution with mean zero and constant variance. Q-Q plots are the easiest and most popular diagnostic; besides, normality can be checked through the skewness and kurtosis of the distribution.
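As a rough numeric companion to the Q-Q plot, skewness and excess kurtosis can be computed directly from the residuals; values near zero are consistent with normality. A minimal sketch (in Python for illustration, with made-up residuals):

```python
def skewness_and_kurtosis(residuals):
    """Return (sample skewness, excess kurtosis) of a list of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n   # second central moment
    m3 = sum((r - mean) ** 3 for r in residuals) / n   # third central moment
    m4 = sum((r - mean) ** 4 for r in residuals) / n   # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0   # a normal distribution has kurtosis 3
    return skew, excess_kurt

# A perfectly symmetric set of residuals: skewness is exactly 0
skew, kurt = skewness_and_kurtosis([-2.0, -1.0, 0.0, 1.0, 2.0])
print(skew, kurt)
```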
4) Homoscedasticity (a.k.a. Homogeneity of Variance)
The residuals should be spread (relatively) equally along the range of the predictors; heteroscedasticity is the opposite condition. Whilst least squares gives equal weight to all observations, the cases with larger disturbances have more "pull" than other observations when heteroscedasticity is present. The contaminated standard errors it produces can lead to incorrect conclusions in significance tests and confidence intervals. To check, draw a scatter plot of the fitted values against the square root of the standardised residuals.
5) No Autocorrelation
Linear regression demands little or no autocorrelation in the residuals; that is, Y|X = xi and Y|X = xj, where i ≠ j, should not be correlated. Otherwise the problem we need to solve becomes a time-series one. The stock market, where the current price is not independent of the previous prices, is an example of autocorrelation. The Durbin-Watson test can be used to examine the independence of the observations, but it will not be covered in depth here, as I think it is time-consuming and less important than the others.
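For completeness, the Durbin-Watson statistic itself is easy to compute: d = Σ(e_t − e_{t−1})² / Σ e_t², with values near 2 suggesting no autocorrelation. A minimal sketch (Python for illustration, with made-up residuals):

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: ~2 means no autocorrelation,
    below 2 suggests positive, above 2 negative autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Alternating residuals are strongly negatively autocorrelated, so d is near 4
d = durbin_watson([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
print(d)
```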
Model Building in R
I have used a dataset which contains the details of 2,201 flights. The descriptions of each variable are given below.
3.1) Datasets
schedtime : the scheduled time of departure (using the 24-hour clock)
carrier : the two-letter code indicating which airline operated the flight
deptime : the actual departure time
dest : the three-letter code indicating the destination airport
distance : the length of the flight (in miles)
flightnumber : the flight number
origin : the three-letter code indicating the departure airport
weather : the weather of the day (0 = normal, 1 = severe)
dayweek : the day of the week (1 = Monday, ..., 7 = Sunday)
daymonth : the day of the month

v1_data <- read.csv("….csv", header = TRUE, sep = ",")  # filename truncated in the original
head(v1_data, 3); tail(v1_data, 3)
str(v1_data)
summary(v1_data)
sapply(v1_data, function(x) length(unique(x)))

3.2) Relationships among variables

# Correlation
install.packages("…")  # package name truncated in the original
…Correlation(v1_data[, -c(1, 2, 4, 7)], hist = TRUE, title = "Correlations among variables")
corr <- round(cor(v1_data[, -c(1, 2, 4, 7)]), 3); corr

(Numeric variables only, rounded to three decimal places.)

3.3) Multiple Linear Regression Model
Note that the variable weather is stored as 'Numeric' whilst it has binary outputs. It should be turned into 'Factor', just like the other categorical variables such as 'carrier', 'dest', and 'origin'. This adjustment allows R (by default) to identify the statistically significant categories within the variable. The variable flightnumber is left as it is, to avoid having too many categories in a single variable.

# Linear Regression Model
lm_eg_1 <- lm(deptime ~ schedtime + carrier + dest + distance + flightnumber + origin + factor(weather) + dayweek + daymonth, data = v1_data)
summary(lm_eg_1)

(R flags which categories are statistically significant; no such indication is given when the data type is numeric.)

3.4) Model Interpretation
The statistical figures below are what we are looking for.

R² and Adjusted R²
=> 0.97 is a good indication for this model, although the definition of 'good' depends on the circumstances; 0.70 is a fairly good point to start with.
=> R² increases with the number of independent variables, so always take the Adjusted R² into consideration.
P-value (Pr(>|t|))
=> should be less than 0.05 (or another chosen significance level).
=> schedtime is the primary factor, and we want to tune the coefficients of the other variables to make the model better.
=> dayweek and daymonth seem to be less important.

(In the summary output, b_i stands for beta_i.)

F-statistic
=> F = 'variation between sample means' / 'variation within the samples' is one of the rare test statistics for which the value is not expected to be less than 0.
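The overall F-statistic can also be recovered from R² alone via F = (R²/k) / ((1 − R²)/(n − k − 1)), where k is the number of predictors and n the number of observations. A sketch in Python; the numbers below are illustrative, not output from the flight model:

```python
def f_statistic(r_squared, n, k):
    """Overall F-statistic of a regression from its R^2,
    sample size n, and number of predictors k."""
    return (r_squared / k) / ((1.0 - r_squared) / (n - k - 1))

# Illustrative numbers only (not the flight model's actual output)
f = f_statistic(0.97, n=2201, k=8)
print(round(f, 1))
```

A high R² on a large sample produces an enormous F, which is why the F-test almost always rejects the null that all coefficients are zero in such cases.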
3.5) Model Diagnosis
VIF = 1/(1 − R_j²), where R_j² comes from regressing the j-th predictor on the remaining predictors. VIF ≤ 4 is usually accepted as a sign of no multicollinearity, but different standards apply across industries and projects. Financial organisations such as banks may require VIF ≤ 2, as every shot they fire costs millions. You can find more details in this wonderful blog about the necessity of fixing multicollinearity.
# Multicollinearity Check
install.packages("car")
library(car)
vif(lm_eg_1)
lm_eg_2 <- lm(deptime ~ schedtime + carrier + dest + flightnumber + origin + factor(weather) + dayweek + daymonth, data = v1_data)
vif(lm_eg_2)
summary(lm_eg_2)$r.squared
lm_eg_3 <- lm(deptime ~ schedtime + carrier + dest + origin + factor(weather) + dayweek + daymonth, data = v1_data)
vif(lm_eg_3)
summary(lm_eg_3)$r.squared

(There are various packages for VIF.) Based on the calculations above, regression model #3 is chosen for diagnosis.
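VIF can also be computed by hand straight from its definition: regress each predictor on the remaining ones and plug the resulting R² into 1/(1 − R²). A sketch in Python with made-up data (the article's own check uses the car package in R):

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress it on the remaining columns
    (plus an intercept) and return 1 / (1 - R^2)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return out

# Made-up data: x2 is nearly a copy of x1, x3 is unrelated
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # highly collinear with x1
x3 = rng.normal(size=200)
v = vif(np.column_stack([x1, x2, x3]))
print(v)  # VIFs for x1 and x2 are large; x3 stays near 1
```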
# Chosen model diagnosis
par(mfrow = c(2, 2))
plot(lm_eg_3)

Diagnosing the model by visualisation:
Plot 1. Residuals vs Fitted => Linearity ✔
Plot 2. Normal Q-Q => Normality ✔
Plot 3. Scale-Location => Homoscedasticity ✔
Plot 4. Cook's Distance => Clean Data ✘

It is clear that there are outliers. However, since not all outliers are influential, writing a few more lines of code is worthwhile.
# Cook's Distance (influential observations)
par(mfrow = c(1, 2))
cutoff <- 4 / (nrow(v1_data) - length(lm_eg_3$coefficients) - 2)
plot(lm_eg_3, which = 4, cook.levels = cutoff)
influencePlot(lm_eg_3, id.method = "identify", main = "Influence Plot", sub = "Circle size is proportional to Cook's Distance")

Observations 1271 and 1607 stand out as highly influential data points. Bear in mind that removing data points is not always the best option; an influential value can instead be replaced by the mean or the median.
# Remove influential data points
v2_data <- v1_data[-c(1271, 1607), ]
lm_eg_4 <- lm(deptime ~ schedtime + carrier + dest + origin + factor(weather) + dayweek + daymonth, data = v2_data)

3.5.1) Model Diagnosis (Supplement)
The Box-Cox transformation is employed to check whether the line of the Q-Q plot is really straight and, if it is not, to calculate the most appropriate power for transforming the data to meet the normality condition. The diagnostic plots above already showed a fairly straight Q-Q plot, but we will see if there is any room for improvement. Use the boxcox() function in R to find the best form of transformation.
# Box-Cox Transformation
library(MASS)
boxcox(lm_eg_4, plotit = TRUE)
boxcox(lm_eg_4, plotit = TRUE, lambda = seq(0.2, 2, by = 0.1))  # upper limit assumed; seq() needs a 'to'
boxcox(lm_eg_4, plotit = TRUE, lambda = seq(0.1, 2, by = 0.1))  # upper limit assumed here too
lambda <- boxcox(lm_eg_4, plotit = TRUE, lambda = seq(0.1, 2, by = 0.1))$x
lik <- boxcox(lm_eg_4, plotit = TRUE, lambda = seq(0.1, 2, by = 0.1))$y
bc <- cbind(lambda, lik)
sorted_bc <- bc[order(-lik), ]
head(sorted_bc, n = 10)

lm_eg_4_cox <- lm((((deptime ^ 0.9646465) - 1) / 0.9646465) ~ schedtime + carrier + dest + origin + factor(weather) + dayweek + daymonth, data = v2_data)

par(mfrow = c(1, 2))
qqnorm(resid(lm_eg_4), main = "Original"); qqline(resid(lm_eg_4), col = "red", lwd = 1)
qqnorm(resid(lm_eg_4_cox), main = "Transformed"); qqline(resid(lm_eg_4_cox), col = "red", lwd = 1)

# Another way of visualising the normality of the model
qqPlot(rstandard(lm_eg_4), main = "Original")
qqPlot(rstandard(lm_eg_4_cox), main = "Transformed")

# Statistical test to check the normality of the model
shapiro.test(resid(lm_eg_4_cox))

It is clear that the model had already satisfied the normality condition, and the transformation did not really make an improvement. (If you wonder what the W statistic means, please refer to this.)
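For reference, the Box-Cox transformation applied above with λ ≈ 0.9646465 is simply y → (y^λ − 1)/λ, with the natural logarithm as the limiting case when λ → 0. A minimal sketch (Python for illustration):

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value y for parameter lambda."""
    if lam == 0:
        return math.log(y)            # the limiting case as lambda -> 0
    return (y ** lam - 1.0) / lam

print(box_cox(100.0, 1.0))   # lambda = 1 just shifts the value: 99.0
print(box_cox(100.0, 0.0))   # natural log of 100
```

boxcox() in MASS evaluates the profile log-likelihood of this transform over a grid of λ values and reports where it peaks, which is exactly what the sorted (lambda, lik) table above shows.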
Model Improvement
The ultimate model, in my opinion, should contain the minimum number of variables whilst its performance stays at a certain level. The model development therefore concentrates on removing independent variables.
4.1) Model Development
We first reduce the complexity of the model, and will employ the idea of hypothesis testing to examine the performance of the new model. The null hypothesis is written as H0: the nested (simpler) model is adequate, i.e. the removed variables do not significantly improve the fit.
Akaike Information Criterion
=> The smaller the value, the better. We will compare the AICs of the models to find the best one.
=> The stepwise method can be used for variable selection. Forward selection, backward elimination, and bidirectional elimination are the three main approaches.
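For a linear model with Gaussian errors, AIC can be written (up to an additive constant) as n·ln(RSS/n) + 2k, where k counts the fitted parameters. A sketch in Python with made-up numbers, showing how a small drop in RSS may not justify an extra parameter:

```python
import math

def aic(rss, n, k):
    """AIC of a Gaussian linear model, up to an additive constant:
    n * ln(RSS / n) + 2 * k, where k is the number of fitted parameters."""
    return n * math.log(rss / n) + 2 * k

# Made-up numbers: adding one parameter shrinks the RSS only slightly
full = aic(rss=1000.0, n=100, k=6)
reduced = aic(rss=1010.0, n=100, k=5)
print(reduced < full)  # the smaller AIC wins: here, the reduced model
```

This trade-off (fit versus parameter count) is what stepAIC() evaluates at every step.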
# Backward stepwise selection
stepAIC(lm_eg_4_cox, direction = "backward", steps = 10)

ANOVA (Chi-squared test)
=> It tests whether a reduction in the residual sum of squares is statistically significant or not. In fact, running ANOVA to compare two models amounts to running a Chi-square test.
# Nested Model
lm_eg_4_nested <- lm((((deptime ^ 0.9646465) - 1) / 0.9646465) ~ schedtime + carrier + origin + factor(weather), data = v2_data)

# ANOVA
anova(lm_eg_4_cox, lm_eg_4_nested)

4.2) More Selection
There is a diversity of methods for model selection. One would select a model with a larger adjusted R², a smaller Mallows' Cp, a smaller BIC, and a smaller RSS. Cross-validation would be the more machine-learning way, but it is not the topic of interest here.
# R^2, Adjusted R^2, Mallows' Cp, BIC, RSS
install.packages("leaps")
library(leaps)
all <- regsubsets(deptime ~ schedtime + carrier + dest + distance + flightnumber + origin + factor(weather) + dayweek + daymonth, data = v1_data)
info <- summary(all)
round(cbind(R_sq = info$rsq, Adj_R_sq = info$adjr2, Mallows_Cp = info$cp, BIC = info$bic, RSS = info$rss), 3)

(The first column of the output indicates the number of variables in the model.)

4.3) Make a prediction using the final model
I have not split the dataset into training and test sets, as most of this article has been devoted to the mathematical assumptions and statistical diagnosis.

# Prediction
predictions = predict.  # (truncated in the original)
Preview of Logistic Regression
There may be cases where the dependent variable is binary, such as pass/fail, win/lose, alive/dead, or healthy/sick, and ordinary linear regression is not the best tool; logistic regression is the go-to method for such binary classification problems. In the next chapter of this series, more complex mathematics and new terminology will be introduced, such as the Indicator Variable and the Logistic Function. So if you enjoyed reading this one, please keep your eyes out for my next article.
— END —.