Regression Analysis: Generalised Linear Model (Part II of III)
Sung Kim, Jan 10

This article requires basic knowledge of linear regression and is a prerequisite for Logistic Regression, the final chapter of the series.
The terms dependent and independent variable will be used interchangeably with response and explanatory variable.
Introduction
There are many relationships that cannot be adequately summarised by an ordinary linear equation, for two major reasons: (i) the dependent variable may be discrete rather than continuous and normally distributed; (ii) the dependent variable may not depend linearly on the explanatory variables.
Knowing a bit more than linear regression deepens one's ability to handle datasets in which a categorical response variable depends on categorical and/or numerical explanatory variables.
1) Definition
The Generalised Linear Model (GLM) of McCullagh and Nelder (1983; 2nd edition 1989) is a flexible generalisation of ordinary linear regression that allows for response variables with error distributions other than the normal distribution.
Each outcome yi is assumed to be generated from one of a large range of probability distributions, including the normal, binomial, Poisson, and gamma distributions, with mean μi that depends on a (possibly nonlinear) function of xiᵀβ.
The terminology may cause confusion, but the word 'linear' in GLM refers to linearity in the parameters β, not linearity of the model in its explanatory variables. In fact, linear regression is just the special case in which that linearity carries through to the mean of the response.
2) Preliminaries
A few terms are worth defining before looking at the three components of the GLM.
A distribution is a member of the exponential family if its probability density function (or probability mass function, if discrete) can be written in the form

f(y; θ, φ) = exp{ [yθ − b(θ)] / a(φ) + c(y, φ) }

The parameter θ is called the natural or canonical parameter. The parameter φ is usually assumed known; if it is unknown, it is often called the nuisance (or dispersion) parameter.
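As a concrete check (a sketch, not from the article), the Poisson distribution fits this form with θ = log λ, b(θ) = e^θ, a(φ) = 1 and c(y, φ) = −log y!; the snippet below verifies the decomposition against SciPy's own log-pmf:

```python
# Sketch: the Poisson pmf rewritten in the exponential-family form
# f(y; theta, phi) = exp{ [y*theta - b(theta)]/a(phi) + c(y, phi) }.
import numpy as np
from scipy.special import gammaln
from scipy.stats import poisson

lam = 3.5                        # Poisson mean (hypothetical value)
theta = np.log(lam)              # canonical (natural) parameter
b = np.exp(theta)                # b(theta) = e^theta = lambda
# a(phi) = 1 for the Poisson; c(y, phi) = -log(y!)

y = np.arange(0, 10)
logpmf_expfam = y * theta - b - gammaln(y + 1)
logpmf_scipy = poisson.logpmf(y, lam)
```

The two vectors agree, confirming that the Poisson is a one-parameter exponential-family member with dispersion fixed at 1.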
Maximum likelihood is a method of estimating the parameters of a statistical model from observed data.
The class of exponential dispersion models (EDMs), which plays an important role in GLMs, is a set of probability distributions generalising the natural exponential family.
3) Assumptions for GLM
- The data y1, y2, …, yn are independently distributed.
- The homogeneity of variance does not need to be satisfied. In fact, it is often not even possible given the model structure, and overdispersion (observed variance larger than what the model assumes) may be present.
- Errors need to be independent, but not normally distributed.
- The parameters are estimated by maximum likelihood estimation (MLE) rather than ordinary least squares (OLS), so inference relies on large-sample approximations.
- Goodness-of-fit measures rely on sufficiently large samples; a common heuristic is that no more than 20% of the expected cell counts should be less than 5.
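Overdispersion can be illustrated with a small simulation (hypothetical numbers, not from the article): Poisson counts have variance equal to their mean, while negative-binomial counts with the same mean show variance well above it.

```python
# Sketch: overdispersion means Var(Y) > E[Y] for count data, violating
# the Poisson assumption Var(Y) = E[Y].
import numpy as np

rng = np.random.default_rng(0)
pois = rng.poisson(lam=4.0, size=100_000)
# Negative binomial with mean n(1-p)/p = 4 and variance mean/p = 12:
negb = rng.negative_binomial(n=2, p=2/6, size=100_000)

pois_ratio = pois.var() / pois.mean()   # close to 1: no overdispersion
negb_ratio = negb.var() / negb.mean()   # close to 3: overdispersed
```

A variance-to-mean ratio well above 1 in count data is a signal that a plain Poisson GLM understates the uncertainty.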
Mathematics
The GLM consists of three components:
- Random component: specifies the conditional distribution of the response variable Y, given the values of the explanatory variables Xs.
- Systematic component: specifies the linear combination into which the explanatory variables Xs enter, with β estimated in various ways.
- Link function: specifies the link between the random and systematic components.
1) Random Component
This component indicates what type of regression analysis would be best utilised.
For example, linear regression is the most commonly used method of analysis when the outcome Y is continuous, unbounded, and likely to follow the normal distribution. Likewise, the binomial distribution would be used for the number of heads out of N coin tosses, and the Poisson distribution for the number of events occurring in a given time interval.
Many of the most important distributions in statistics can be expressed in the following common 'linear-exponential' form as members of the exponential family:

f(y; θ, φ) = exp{ [yθ − b(θ)] / a(φ) + c(y, φ) }
2) Systematic Component (a.k.a. Structural Component)
The linear predictor η ('eta') is the quantity that incorporates information about the independent variables into the model through the unknown parameters β, combined with the matrix of independent variables X:

η = Xβ

The xij are pre-specified functions of the independent variables and may therefore include quantitative independent variables, transformations of them, polynomial regressors, dummy regressors, interactions, and so on. The parameters β are typically estimated by maximum likelihood, maximum quasi-likelihood, or Bayesian techniques; more details will be covered in the next chapter of the series.
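A minimal sketch of the linear predictor, using a hypothetical design matrix with a quantitative regressor, a dummy regressor, and their interaction (all values below are made up for illustration):

```python
# Sketch: the linear predictor eta = X @ beta for hypothetical data.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])           # quantitative variable
d = np.array([0.0, 1.0, 1.0])            # dummy (0/1) variable
X = np.column_stack([np.ones(3), x1, d, x1 * d])  # intercept, x1, d, interaction
beta = np.array([0.5, 1.0, -0.2, 0.3])   # hypothetical coefficients

eta = X @ beta                           # one linear-predictor value per row
```

Each row of X yields one value of η, which the link function then maps to the mean of the response.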
3) Link Function
A smooth and invertible link function g(·) transforms the expectation of the response variable into the linear predictor:

g(μi) = ηi = xiᵀβ

Because g is invertible, this can be written using the inverse function g⁻¹(·), called the mean function:

μi = g⁻¹(ηi)

There always exists a well-defined canonical link function, derived from the exponential form of the response's density function. However, in some cases it makes sense to match the domain of the link function to the range of the distribution's mean, or to use a non-canonical link function for algorithmic purposes. Which link function to choose depends on the random component, as mentioned above.
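As a sketch, the logit link (the canonical link for the binomial family) and its inverse mean function can be written as follows; the assertion checks the invertibility property g⁻¹(g(μ)) = μ:

```python
# Sketch: the logit link and its inverse (the mean function).
# g maps a mean in (0, 1) onto the whole real line.
import numpy as np

def g(mu):
    """Logit link: g(mu) = log(mu / (1 - mu))."""
    return np.log(mu / (1 - mu))

def g_inv(eta):
    """Mean function (inverse logit): g_inv(eta) = 1 / (1 + e^{-eta})."""
    return 1 / (1 + np.exp(-eta))

mu = np.array([0.1, 0.5, 0.9])
eta = g(mu)
recovered = g_inv(eta)   # equals mu, by invertibility
```

The same pattern holds for any smooth, invertible link: the log link pairs with exp for the Poisson, and the identity link is its own inverse for the normal.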
Examples
So far we know that, in general, the mean μ of the distribution depends on the independent variables X through

E[Y] = μ = g⁻¹(Xβ)

where E[Y] is the expected value of Y. The variance is written as a function V of the mean:

Var(Y) = V(μ)

V may follow from the exponential-family distribution, but it may simply be that the variance is a function of the predicted value.
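The variance functions V(μ) for the families mentioned above can be tabulated as a sketch (the binomial entry assumes μ is a proportion; the dispersion factor is omitted):

```python
# Sketch: variance functions V(mu) for common GLM families.
import numpy as np

variance_functions = {
    "gaussian": lambda mu: np.ones_like(mu),  # constant variance
    "poisson":  lambda mu: mu,                # variance equals mean
    "binomial": lambda mu: mu * (1 - mu),     # mu is a proportion in (0, 1)
    "gamma":    lambda mu: mu ** 2,           # constant coefficient of variation
}

mu = np.array([0.2, 0.5])
binom_var = variance_functions["binomial"](mu)
```

This mean-variance relationship is exactly what lets GLMs drop the constant-variance assumption of ordinary linear regression.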
1) Mathematical Implementation
The source of the examples below is provided in the references.
2) Computational Implementation

# Binomial
glm_m1 <- glm(y ~ x1 + x2 + ... + xn, family = "binomial", data = data_set1)

# Gaussian
glm_m2 <- glm(y ~ x1 + x2 + ... + xn, family = "gaussian", data = data_set2)

# Gamma (note: the family function is Gamma, with a capital G)
glm_m3 <- glm(y ~ x1 + x2 + ... + xn, family = Gamma, data = data_set3)

# etc.

The help() function in R will provide further details: see help(glm) for other modelling options and help(family) for the allowable link functions for each family.
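For readers without R, here is a rough Python sketch of what glm() does internally for the Poisson family: iteratively reweighted least squares (IRLS) with the log link, run on simulated data with hypothetical coefficient values (nothing below comes from the article):

```python
# Sketch: fitting a Poisson GLM (log link) by iteratively reweighted
# least squares (IRLS), the algorithm glm() uses under the hood.
import numpy as np

rng = np.random.default_rng(42)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])     # intercept + one regressor
beta_true = np.array([0.5, 0.3])         # hypothetical true coefficients
y = rng.poisson(np.exp(X @ beta_true))   # simulated counts

beta_hat = np.zeros(2)
for _ in range(25):
    eta = X @ beta_hat                   # linear predictor
    mu = np.exp(eta)                     # mean function (inverse log link)
    w = mu                               # IRLS weights: (dmu/deta)^2 / V(mu) = mu
    z = eta + (y - mu) / mu              # working response
    # Weighted least-squares step: solve (X'WX) beta = X'Wz
    beta_hat = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
```

After a few iterations beta_hat settles close to the simulating coefficients, which is the same fixed point glm() converges to.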
Conclusion
In summary, the GLM generalises linear regression by allowing the linear model to be related to the response variable via a link function, and by allowing the variance of each measurement to be a function of its predicted value. This yields a range of statistical models, including linear regression, logistic regression, and Poisson regression.
However, real situations are far more complicated. Consider a series of coin tosses that all land heads: maximum likelihood then estimates the probability of heads as exactly 1, an unrealistic result under a Bernoulli model. Here the GLM result would not be accurate, and a Bayesian approach may need to be applied to solve the problem.
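A sketch of that coin-toss example with hypothetical numbers: ten heads in ten tosses drive the MLE of the heads probability to exactly 1, while a uniform Beta(1, 1) prior yields a less extreme posterior mean.

```python
# Sketch: all-heads data. MLE gives p = 1 exactly; a Beta(1, 1) prior
# gives a Beta(1 + heads, 1 + tails) posterior whose mean stays below 1.
heads, n = 10, 10

p_mle = heads / n                    # maximum likelihood estimate: 1.0
a, b = 1 + heads, 1 + (n - heads)    # posterior Beta(a, b) parameters
p_bayes = a / (a + b)                # posterior mean: 11/12
```

The posterior mean shrinks the estimate away from the boundary, which is exactly the regularising behaviour the conclusion alludes to.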
1) Pros and Cons of GLMs
- We do not need to transform the response Y to have a normal distribution.
- The choice of link is separate from the choice of random component, so we have more flexibility in modelling.
- If the link produces additive effects, we do not need constant variance.
- The models are fitted via maximum likelihood estimation, so the estimators have optimal large-sample properties.
- All the inference tools and model checking that we will discuss for log-linear and logistic regression models apply to other GLMs too: e.g. Wald and likelihood ratio tests, deviance, residuals, confidence intervals, overdispersion.
- There is often a single procedure in a software package that captures all the models listed above, e.g. PROC GENMOD in SAS or glm() in R, with options to vary the three components.
But there are some limitations of GLMs too:
- The systematic component is restricted to a linear predictor, i.e. a function linear in the parameters.
- Responses must be independent.
2) Next
Areas that have been missed here, and will be covered next, are:
- Estimation of the model parameters.
- Prediction with confidence intervals.
3) Source of Knowledge
https://www. … pdf (a bloody good reference)