Interpret R Linear/Multiple Regression output (lm output point by point), also with Python

Getting and cleaning the data, and then wrangling it, is almost 60–70% of any data science or machine learning assignment.

Know your data

    library(alr3)
    ## Loading required package: car
    library(corrplot)
    data(water)   ## load the data
    head(water)   ## view the data

      Year APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
    1 1948  9.13  3.58    3.91  4.10  7.43    6.47  54235
    2 1949  5.28  4.82    5.20  7.55 11.11   10.26  67567
    3 1950  4.20  3.77    3.67  9.52 12.20   11.35  66161
    4 1951  4.60  4.46    3.93 11.14 15.15   11.13  68094
    5 1952  7.15  4.99    4.88 16.34 20.05   22.81 107080
    6 1953  9.70  5.65    4.91  8.88  8.15    7.41  67594

    filter.water <- water[,-1]   ## Remove unwanted year

    # Visualize the data
    library(GGally)
    ggpairs(filter.water)   ## It's multivariable regression

LM magic begins, thanks to R

The model is like yi = b0 + b1xi1 + b2xi2 + … + bpxip + ei for i = 1, 2, …, n.
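The equation can be made concrete with numbers from this article. The sketch below plugs the first row of the data (1948) into the model, using the coefficient estimates that R reports in the summary output further down. The result is the fitted value, and the gap to the observed BSAAM is the residual ei. The variable names are my own, and because the coefficients are rounded as printed, the result is only approximate.

```python
# Coefficient estimates copied from the R summary output (rounded as printed)
intercept = 15944.67
coefficients = {
    "APMAM": -12.77,
    "APSAB": -664.41,
    "APSLAKE": 2270.68,
    "OPBPC": 69.70,
    "OPRC": 1916.45,
    "OPSLAKE": 2211.58,
}

# First row of the water data (year 1948), as shown by head(water)
row_1948 = {
    "APMAM": 9.13,
    "APSAB": 3.58,
    "APSLAKE": 3.91,
    "OPBPC": 4.10,
    "OPRC": 7.43,
    "OPSLAKE": 6.47,
}
observed_bsaam = 54235

# yi = b0 + b1*xi1 + ... + bp*xip + ei, so the fitted value is everything except ei
fitted = intercept + sum(coefficients[name] * row_1948[name] for name in coefficients)
residual = observed_bsaam - fitted  # ei: observed minus fitted

print(f"fitted = {fitted:.2f}, residual = {residual:.2f}")
```

The residual for this one row falls well inside the range that the Residuals block of the summary reports for the whole data set.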

In this model, y = BSAAM and x1 … xp are all the other variables.

    mlr <- lm(BSAAM ~ ., data = filter.water)
    summary(mlr)

    # Output
    Call:
    lm(formula = BSAAM ~ ., data = filter.water)

    Residuals:
       Min     1Q Median     3Q    Max
    -12690  -4936  -1424   4173  18542

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 15944.67    4099.80   3.889 0.000416 ***
    APMAM         -12.77     708.89  -0.018 0.985725
    APSAB        -664.41    1522.89  -0.436 0.665237
    APSLAKE      2270.68    1341.29   1.693 0.099112 .
    OPBPC          69.70     461.69   0.151 0.880839
    OPRC         1916.45     641.36   2.988 0.005031 **
    OPSLAKE      2211.58     752.69   2.938 0.005729 **
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 7557 on 36 degrees of freedom
    Multiple R-squared: 0.9123
    F-statistic: 73.82 on 6 and 36 DF,  p-value: < 2.2e-16

Output Explained

Residuals

A residual is the difference between the observed value of the dependent variable (y) and the value the model predicts for it. This block summarizes the residuals by their minimum, first quartile, median, third quartile and maximum. It is usually not the main focus of the analysis.

Coefficients - Intercept

Besides the predictors, the table has one more row, ‘(Intercept)’. The intercept is the expected value of y when all the predictors are 0. It is not much used in normal cases, because a point where every predictor is 0 is rarely of practical interest.

    #             Estimate Std. Error t value Pr(>|t|)
    # (Intercept) 15944.67    4099.80   3.889 0.000416 ***

Coefficient - Estimate

The estimate is the expected change in y for a one-unit increase in that x, holding the other predictors fixed. In this case, a one-unit increase in OPSLAKE goes with an expected increase of 2211.58 units in BSAAM.

Coefficient - Std. Error

The standard deviation of an estimate is called the standard error.

The standard error of the coefficient measures how precisely the model estimates the coefficient’s unknown value.

The standard error of the coefficient is always positive.

A low standard error is helpful for our analysis, and it is also used to build confidence intervals.

Coefficient - t value

t value = estimate / std error. A t value that is large in absolute terms is helpful for our analysis, as it indicates we could reject the null hypothesis that the coefficient is 0. It is used to calculate the p value.

Coefficient - Pr(>|t|)

This is the individual p value for each parameter, used to decide whether to reject the null hypothesis that the coefficient is 0, i.e. that x has no linear relationship with y once the other predictors are in the model.

A lower p value gives more grounds to reject the null hypothesis.

If we misread the p value, both kinds of error come into the picture: false positives (Type I errors) and false negatives (Type II errors).
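To see how the t value and p value in the coefficient table fit together, here is a minimal Python sketch for the OPSLAKE row. It uses only the standard library, so the t distribution is integrated numerically with Simpson's rule. The helper names `t_pdf` and `two_sided_p` are my own, not part of any library. The inputs are the rounded numbers printed by R, so the last digits may differ slightly from R's.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_value, df, steps=2000):
    """Two-sided p value: probability mass outside [-|t|, |t|].

    The area between 0 and |t| is computed with Simpson's rule
    (steps must be even) and doubled by symmetry.
    """
    upper = abs(t_value)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 1 - 2 * area_0_to_t

# OPSLAKE row of the coefficient table, and the residual degrees of freedom
estimate, std_error, resid_df = 2211.58, 752.69, 36

t_value = estimate / std_error          # t value = estimate / std error
p_value = two_sided_p(t_value, resid_df)

print(f"t = {t_value:.3f}, p = {p_value:.6f}")
```

In everyday work one would more likely call `scipy.stats.t.sf` than integrate by hand. The hand-rolled version is only meant to show that Pr(>|t|) is the tail area of a t distribution with the residual degrees of freedom.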

The asterisks beside the p value mark its significance level: the lower the p value, the higher the significance.

    # Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error

    Residual standard error: 7557 on 36 degrees of freedom

In plain terms, this is the typical size of the model's error: how far, on average, the observed values fall from the model's predictions, in the units of y.

The degrees of freedom are the number of data points minus the number of estimated parameters. In this case we have 43 data points and 7 estimated coefficients (6 predictors plus the intercept), so 43 - 7 = 36 degrees of freedom.

Multiple R-squared and Adjusted R-squared

    Multiple R-squared: 0.9123

It is always between 0 and 1, and higher values are better. It is the proportion of variation in the response variable that is explained by the explanatory variables, so it measures how well the model explains the data. It never decreases when we add more variables, even irrelevant ones, so on its own it cannot tell us when to stop adding them.

That is why we also look at the adjusted value. Adjusted R-squared penalizes the number of predictors, so it rises only when a new variable improves the model more than would be expected by chance.

F-statistic

    F-statistic: 73.82 on 6 and 36 DF

This tests the relationship between the predictors as a group and the response. A higher value gives more reason to reject the null hypothesis that all coefficients except the intercept are 0. It reflects the significance of the overall model, not of any specific parameter. DF stands for degrees of freedom.

p-value

    p-value: < 2.2e-16

This is the overall p value based on the F-statistic. Normally a p value below 0.05 indicates that the overall model is significant.

So the Python

I am using the OLS (ordinary least squares) approach from statsmodels, but the same fit can also be produced using SciPy.

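    import pandas as pd
    from statsmodels.formula.api import ols

    # Assumption: the water data has been exported from R to a CSV file;
    # "water.csv" is a hypothetical file name, not one from the original post.
    df = pd.read_csv("water.csv")

    # Fit BSAAM on the same six predictors as the R model
    mlr = ols("BSAAM ~ OPSLAKE + OPRC + OPBPC + APSLAKE + APSAB + APMAM", data=df).fit()
    print(mlr.summary())

Most of the parameters match the R output, and the rest can be used for further research work :)

All the descriptions are based on my general understanding. Please let me know if something is wrong; your feedback is highly welcome.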
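One last cross-check needs no libraries. The degrees of freedom, R-squared, adjusted R-squared and the F-statistic are tied together by simple formulas, so any of them can be recomputed from the others. The sketch below uses the article's own count of 43 data points, the 6 predictors, and the F-statistic of 73.82. The variable names are my own.

```python
n_obs = 43          # data points in the water data, as counted in the article
n_predictors = 6    # APMAM, APSAB, APSLAKE, OPBPC, OPRC, OPSLAKE
f_stat = 73.82      # F-statistic from the R summary (rounded as printed)

# Residual degrees of freedom: data points minus estimated coefficients
# (one per predictor, plus one for the intercept)
resid_df = n_obs - (n_predictors + 1)

# F = (R2 / p) / ((1 - R2) / resid_df), solved for R2
r_squared = f_stat * n_predictors / (f_stat * n_predictors + resid_df)

# Adjusted R2 penalizes the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / resid_df

print(resid_df, round(r_squared, 4), round(adj_r_squared, 4))
```

With these inputs the adjusted value rounds to 0.9123, while the plain R-squared comes out a little higher. That suggests the 0.9123 quoted in the R output may be the adjusted R-squared. This is an inference from the arithmetic, not something the printed output states.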