Tips & Tricks in Multiple Linear RegressionGathered methods to analyse data, diagnose models and visualize resultsJason YipBlockedUnblockFollowFollowingFeb 24This analysis was a project which I decided to undertake for the Regression Analysis module in school.

I have learnt and gathered several methods you can use in R to take your depth of analysis further.

As usual, I always learn the most discovering on my own.

DataResponse variable: Chance.

of.

AdmitPredictors: GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA, ResearchLink to csvLibrarieslibrary(dplyr);library(ggplot2);library(GGally);library(vioplot);library(corpcor);library(ppcor);library(mctest);library(ggfortify);library(lmtest);library(MASS);library(car);library(DAAG);library(jtools);library(relaimpo);Descriptive statssummary(df)Distribution plotspar(mfrow=c(4, 2))colnames = names(df)for(name in colnames) { vioplot(df[name], horizontal=TRUE, col='gold', lineCol='gold', lty=0, colMed='floralwhite', yaxt='n',rectCol='dodgerblue4') title(main=name)}There is no extreme skew for the variables.

this makes the confidence intervals for estimating parameters for our predictors and estimating the mean response more meaningful.

Check for 1) Linearity between DV and each of the IVs 2) Multicollinearity between IVsggpairs(df, progress=FALSE)From the last row, we can observe that most of the IVs seem to have a linear relationship with our response variable except for the binary variable Research.

Therefore the assumption for linearity between DV and each of IVs hold.

The pairwise correlation for all variables are fairly high.

This seems to violate the Multiple Linear Regression’s assumption of No Multicollinearity.

Partial Correlation coefficientsTo account for confounding effects from the other predictors.

pcorr = as.

data.

frame(cor2pcor(cov(df)))names(pcorr) = names(df)rownames(pcorr) = names(df)pcorr = format(pcorr, digits=1)print.

data.

frame(pcorr)The partial correlation coefficients suggest otherwise, that there is less multicollinearity with only GRE.

Score & TOEFL.

Score having a value > 0.

4.

Partial correlation between CGPA and our response variable Chance.

of.

Admit is fairly high but it does not violate the “No Multicollinearity between its IVs assumption” of MLR.

Using Individual Multicollinearity Diagnostics Measuresimcdiag(df[,1:7],df$Chance.

of.

Admit)All the predictors have a VIF (=1/(1-R²)) value of <5 which indicates that the multicollinearity is not so problematic.

Fitting MLRfit = lm(Chance.

of.

Admit ~ .

, data=df)summary(fit)Fit: Chance.

of.

Admit = -1.

28 + 0.

00186(GRE.

Score) + 0.

00278(TOEFL.

Score) + 0.

00594(University.

Rating) + 0.

00159(SOP) + 0.

0169(LOR) + 0.

118(CGPA) + 0.

0243(Research) (3s.

f.

)This indicates that on average, a unit increase in GRE.

Score/TOEFL.

Score/University.

Rating/SOP/LOR/CGPA/Research increases Chance.

of.

Admit by 0.

00186/0.

00278/0.

00594/0.

00159/0.

0169/0.

118/0.

0243, while holding all other variables constant.

The p-value for the F-statistic is <2.

2e-16, indicating that we can reject the null hypothesis that the intercept-only model is the same fit as the MLR model even at alpha=0.

001.

Therefore, the MLR model is highly statistically significant at the 0.

01 significance level.

The Adjusted R-squared: 0.

8194 is high which suggests that the model is a good fit.

The coefficients for GRE.

Score, TOEFL.

Score, LOR, CGPA, Research are statistically significant at alpha=0.

01 where the respective pvalues < 0.

01 as we reject the null that their coeffs is 0 at the 0.

01 significance level.

The coefficients for University.

Rating (0.

118) and SOP (0.

728263) are >0.

01 and we fail to reject the null that their coeffs is 0 at the 0.

01 significance level.

Model diagnosticsautoplot(fit)(1) Residuals vs FittedThe blue line (average value of the residuals at each value of fitted value) is nearly flat.

This shows that there is no clear non-linear trend to the residuals.

The residuals appear to be randomly spread out, but it converges when near the higher fitted values.

This appears to be a decrease in variance and it violates the Homoscedasticity assumption of the MLR.

bptest(fit)Using the Breusch-Pagan test, we can reject the null hypothesis at the 0.

05 significance level that variance of the residuals is constant and infer that heteroscedasticity is present.

Therefore, this makes our coefficient estimates less precise and increases the likelihood that the estimates are further from the true population value.

(2) Normal Q-Q (Quantile-Quantile Plot)The residuals seem to deviate a lot from the diagonal line in the lower tail.

The distribution of residuals is skewed to the left.

This suggests that the assumption of Normality in the Residuals by the MLR model is violated.

Transforming response variable using Box-Cox Power transformation to make it normal and handle heteroscedasticitybc = boxcox(Chance.

of.

Admit ~ .

, data=df);The procedure identifies an appropriate exponent (Lambda = l) to use to transform data into a “normal shape.

The Lambda value indicates the power to which all data should be raised and it is suggested to use lambda=2.

lambda = bc$x[which.

max(bc$y)]powerTransform <- function(y, lambda1, lambda2 = NULL, method = "boxcox") { boxcoxTrans <- function(x, lam1, lam2 = NULL) { # if we set lambda2 to zero, it becomes the one parameter transformation lam2 <- ifelse(is.

null(lam2), 0, lam2) if (lam1 == 0L) { log(y + lam2) } else { (((y + lam2)^lam1) – 1) / lam1 } } switch(method , boxcox = boxcoxTrans(y, lambda1, lambda2) , tukey = y^lambda1 )}# re-run with transformationbcfit <- lm(powerTransform(Chance.

of.

Admit, lambda) ~ .

, data=df)summary(bcfit)The Adjusted R-squared increased from 0.

8194 to 0.

8471 while the predictors remain significant.

However, this model is less interpretable and we want our models to be parsimonious as possible.

We will explore more models later on.

(3) Residuals vs Leveragecooksd <- cooks.

distance(fit)sample_size <- nrow(df)plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance")abline(h = 4/sample_size, col="red")text(x=1:length(cooksd)+1, y=cooksd, labels=ifelse(cooksd>4/sample_size, names(cooksd),""), col="red")This helps us to find influential outliers.

They are points above the dashed line which are not approximated well by the model (has high residual) and significantly influences model fit (has high leverage).

By considering Cook’s D > 4/sample size criterion, we identify influential outliers to remove.

Re-fit MLR with outliers removedinfluential = as.

numeric(names(cooksd)[(cooksd > (4/sample_size))])df2 = df[-influential, ]fit2 = lm(Chance.

of.

Admit ~ .

, data=df2)summary(fit2)By removing the highly influential outliers, we refitted the model on the filtered data and the Adjusted R-squared increased to 0.

8194 to 0.

8916 without introducing complexity to the model.

Fitting the model using a function of the response variablefit3 = lm(exp(Chance.

of.

Admit) ~ .

, data=df2)summary(fit3)By regressing the exponent of our response on the predictors, we got an increase in Adjusted R-squared from 0.

8916 to 0.

9023 while the predictors still remain significnant.

Accounting for interactions by adding Interaction termfit4 = lm(exp(Chance.

of.

Admit) ~ GRE.

Score*University.

Rating+TOEFL.

Score+Research+SOP+LOR+CGPA, data=df2)summary(fit4)Interaction arises as the relationship between Chance.

of.

Admit and the IVs: GRE.

Score and University.

Rating is affected by the interaction between the GRE.

Score & University.

Rating.

This makes it hard to predict the consequence of changing the value of GRE.

Score & University.

Rating without controlling for this interaction.

The model shows a significant interaction between GRE.

Score & University.

Rating as the p-value=0.

000799 < 0.

001 and is significant at the 0.

001 significance level.

Comparing nested models with ANOVAanova(fit3, fit4)The first order model is nested within the interaction model.

By using ANOVA to compare the simpler first order model vs the more complex model with interaction term, the p-value=0.

0007995 is <0.

001.

The null hypothesis that the reduced simpler model is adequate is rejected at the 0.

001 significance level.

Therefore, the complex model did significantly improve the fit over the simpler model.

Drop insignificant predictor SOPfit5 = lm(exp(Chance.

of.

Admit) ~ GRE.

Score*University.

Rating+TOEFL.

Score+Research+LOR+CGPA, data=df2)summary(fit5)Previously, SOP was insignificant at the 0.

05 significance level and even after removing it, the model’s Adjusted R-squared is still 0.

904.

Variable selection using stepwise model selection by AICstep <- stepAIC(fit5, direction="both")A model with fewer parameters is to be preferred to one with more.

AIC considers both the fit of the model and the number of parameters used.

Having more parameters result in penalty.

AIC helps to balance over- and under-fitting.

The stepwise model comparison iteratively adds/removes variables one at a time and compares the AIC.

The lowest AIC is selected for the final model.

step$anovaIn our case, there no further addition or removal of variables required by AIC.

Relative feature importancecalc.

relimp(fit5,type="lmg", rela=TRUE)Relative importance is measured by an algorithm by Lindemann, Merenda and Gold (lmg; 1980) which decomposes total R-squared and observe the increase in R-squared by adding the predictors sequentially.

The order of adding predictors matters and therefore, the algorithm takes the average of the R-squared across all orderings.

Relative importance is measured by an algorithm by Lindemann, Merenda and Gold (lmg; 1980) which decomposes total R-squared and observe the increase in R-squared by adding the predictors sequentially.

The order of adding predictors matters and therefore, the algorithm takes the average of the R-squared across all orderings.

The features are ranked in this order with highest relative importance first: GRE.

Score, CGPA, University.

Rating, TOEFL.

Score, LOR, Research and GRE.

Score*University.

Rating.

K-Fold cross-validation results on final modelcv_new = CVlm(data=df2, fit5, m=3, printit=FALSE)attr(cv_new, "ms")[1] 0.

007749426Each of the k-fold model’s prediction accuracy isn’t varying too much for any one particular sample, and the lines of best fit from the k-folds don’t vary too much with respect the the slope and level.

The average mean square error of the predictions for 3 portions is 0.

00775.

The value is low and represents a good accuracy result.

95% CIs for every IV’s estimatesexport_summs(fit5, error_format = "[{conf.

low}, {conf.

high}]", digits=5)plot_summs(fit5)Individual CI plotseffect_plot(fit4, pred = CGPA, interval = TRUE, plot.

points = TRUE)I hope this has helped improve your analysis one way or another.

Please do not take any of it as a perfect example or as entirely correct and accurate as I am still learning as well.

This has certainly liven up my otherwise dull module ????.

Link to notebookDiscuss further with me on LinkedIn or via jasonyip184@gmail.

com!.