Analytics and Visualization of Big Data: Generalized Linear Models (GLM)

Generalized Linear Models (GLMs) are a popular family of regression models that are also widely used for classification (logistic regression, for example). In a GLM, the response variable Y is assumed to follow a distribution in the exponential family. A GLM has three parts: a stochastic component, a systematic component, and a link function that relates the two. The familiar Normal linear regression model is the simplest special case:

(y_i | x_i) ~ N(x_i'β, σ^2), with E(y_i | x_i) = x_i'β

and Var(y_i | x_i) = σ^2,

where x_i'β (the systematic component) is the linear combination of the predictor vector x_i and the coefficient vector β, and σ^2 is the variance of the Normal distribution assumed for the response (the stochastic component). Now let’s take a more detailed look at each of these components and the link function.
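To make the Normal special case concrete, here is a minimal sketch that simulates data from the model above with a single predictor and recovers the coefficients by least squares. The function names, the intercept alpha, and the parameter values (alpha = 1, beta = 2, sigma = 0.5) are assumptions chosen only for illustration; they do not come from the references.

```python
import random

def simulate_normal_linear(n, alpha, beta, sigma, seed=0):
    """Draw (x_i, y_i) pairs with y_i | x_i ~ N(alpha + beta * x_i, sigma^2)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [rng.gauss(alpha + beta * x, sigma) for x in xs]
    return xs, ys

def fit_simple_ols(xs, ys):
    """Closed-form least-squares estimates for an intercept and one slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

# Hypothetical true values: alpha = 1.0, beta = 2.0, sigma = 0.5.
xs, ys = simulate_normal_linear(n=2000, alpha=1.0, beta=2.0, sigma=0.5)
alpha_hat, beta_hat = fit_simple_ols(xs, ys)
print(alpha_hat, beta_hat)
```

With a couple of thousand observations the estimates should land close to the values used in the simulation, which is all the sketch is meant to show.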

Stochastic Component

The stochastic component identifies the response variable Y = (y_1, …, y_n) and assumes a probability distribution for it. When Y is a continuous variable, it is usually assumed to follow a Normal distribution. In fact, a GLM can use any distribution in the exponential family, a broad class that includes the Normal distribution along with others such as the Binomial and the Poisson.
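For readers who want to see what membership in the exponential family means, one common way of writing the density is sketched below. The symbols θ (natural parameter), φ (dispersion parameter), and the functions a, b, c are standard textbook notation and are an assumption here, since the post itself does not define them.

```latex
% Exponential-family density in a commonly used GLM form
f(y;\theta,\phi) = \exp\!\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi) \right\}

% The Normal distribution N(\mu,\sigma^2) fits this form with
\theta = \mu, \qquad
b(\theta) = \frac{\theta^2}{2}, \qquad
a(\phi) = \sigma^2, \qquad
c(y,\phi) = -\frac{y^2}{2\sigma^2} - \frac{1}{2}\log\!\left(2\pi\sigma^2\right)

% which gives the mean and variance used earlier
E(Y) = b'(\theta) = \mu, \qquad
\mathrm{Var}(Y) = b''(\theta)\,a(\phi) = \sigma^2
```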

Systematic Component

The systematic component identifies the predictor variables x' = (x_1, …, x_n). It consists of a linear combination of these variables (or of functions of them), called the linear predictor:

α + x_1 β_1 + ⋯ + x_n β_n

What is modeled is the expected value of the response variable, E(Y) = μ. We want to see how μ varies as a function of the values of the predictor variables x_i.
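As a small illustration, the linear predictor can be computed as in the sketch below. The function name linear_predictor and the example numbers are hypothetical and serve only to show the arithmetic.

```python
def linear_predictor(alpha, betas, xs):
    """Return alpha + x_1 * beta_1 + ... + x_n * beta_n."""
    if len(betas) != len(xs):
        raise ValueError("betas and xs must have the same length")
    return alpha + sum(b * x for b, x in zip(betas, xs))

# Hypothetical example: intercept 1.0, two predictors.
eta = linear_predictor(alpha=1.0, betas=[2.0, -1.0], xs=[3.0, 4.0])
print(eta)  # 1.0 + 3.0 * 2.0 + 4.0 * (-1.0) = 3.0
```

In the Normal linear model this value is itself the modeled mean μ; in other GLMs it is related to μ through a transformation.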

Link Function

The link function identifies the relationship (link) between the expected value of the stochastic component, E(Y) = μ, and the systematic component, α + x_1 β_1 + ⋯ + x_n β_n. The link function is denoted by g(μ). It is a monotone function: as the systematic part gets larger, μ either always gets larger or always gets smaller. A link is needed because the relationship between μ and the linear predictor may be non-linear. So the general model for a GLM is

g(μ) = α + x_1 β_1 + ⋯ + x_n β_n.

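Some common links are the identity link, g(μ) = μ, used in ordinary regression, ANOVA, and ANCOVA; the log link, g(μ) = log(μ), usually used when Y is nonnegative, as with counts; and the logit link, g(μ) = log(μ/(1-μ)), usually used when μ is a probability between 0 and 1. In each case the link equals the natural parameter of the corresponding distribution (Normal, Poisson, and Binomial, respectively), which makes it the canonical link for that distribution.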
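The three links above and their inverses can be written out directly. The sketch below is a minimal illustration using only the Python standard library; the names LINKS and mean_from_predictor are hypothetical and not taken from any GLM package.

```python
import math

# Each entry holds (g, g_inverse): g maps the mean mu to the scale of the
# linear predictor eta, and g_inverse maps eta back to mu.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (lambda mu: math.log(mu), lambda eta: math.exp(eta)),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
}

def mean_from_predictor(link_name, eta):
    """Recover mu from a linear predictor value eta using the inverse link."""
    _, inverse = LINKS[link_name]
    return inverse(eta)

# Show how mu responds as the linear predictor grows from -2 to 2.
for name in LINKS:
    mus = [mean_from_predictor(name, eta) for eta in (-2.0, 0.0, 2.0)]
    print(name, mus)
```

In every case μ increases as the linear predictor increases, which is the monotonicity described earlier. The inverse log link always returns a positive mean and the inverse logit always returns a value strictly between 0 and 1, which is why these links suit counts and probabilities.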

Tuesday, March 26, 2013
