This is part of a series of posts which introduce and discuss the implications of a general framework for thinking about statistical modelling. This framework is most clearly expressed in King, Tomz, and Wittenberg (2000).
Part 3: How to express a linear model as a generalised linear model
In the last part, we introduced two types of generalised linear model, each with a different transformation g(.) for the systematic component of the model: the logit transformation and the identity transformation. This post will show how this framework is implemented in practice in R.
In R, there’s the lm function for linear models, and the glm function for generalised linear models.
I’ve argued previously that the standard linear regression is just a specific type of generalised linear model, one that makes use of an identity transformation I(.) for its systematic component g(.). Let’s now demonstrate that by producing the same model specification using both lm and glm.
We can start by being painfully unimaginative and picking one of R’s standard datasets.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
So, let’s restrict the iris dataset to flowers with Petal.Length greater than 2.5, and run a linear regression on this subset:
iris_ss <- iris |> filter(Petal.Length > 2.5)
We can produce the regression using lm as follows:
mod_lm <- lm(Sepal.Length ~ Petal.Length, data = iris_ss)
And we can use the summary function (which checks the class of mod_lm and invokes summary.lm implicitly) to get the following:
summary(mod_lm)
Call:
lm(formula = Sepal.Length ~ Petal.Length, data = iris_ss)
Residuals:
Min 1Q Median 3Q Max
-1.09194 -0.26570 0.00761 0.21902 0.87502
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.99871 0.22593 13.27 <2e-16 ***
Petal.Length 0.66516 0.04542 14.64 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3731 on 98 degrees of freedom
Multiple R-squared: 0.6864, Adjusted R-squared: 0.6832
F-statistic: 214.5 on 1 and 98 DF, p-value: < 2.2e-16
Woohoo! Three stars next to the Petal.Length coefficient! Definitely publishable!
We can do the same using glm:
mod_glm <- glm(Sepal.Length ~ Petal.Length, data = iris_ss)
And we can use the summary function for this model too. In this case, summary invokes summary.glm, because the class of mod_glm includes "glm".
summary(mod_glm)
Call:
glm(formula = Sepal.Length ~ Petal.Length, data = iris_ss)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.99871 0.22593 13.27 <2e-16 ***
Petal.Length 0.66516 0.04542 14.64 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.1391962)
Null deviance: 43.496 on 99 degrees of freedom
Residual deviance: 13.641 on 98 degrees of freedom
AIC: 90.58
Number of Fisher Scoring iterations: 2
So, the coefficients are exactly the same. But there’s also some additional information in the summary, including the type of ‘family’ used. Why is this?
If we look at the help for glm we can see that, by default, the family argument is set to gaussian.
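We can also check this default directly from glm’s formal arguments, as a quick sketch using base R’s formals:

```r
# Inspect the default value of glm's family argument
formals(glm)$family
#> gaussian
```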
And if we delve a bit further into the help file, in the details about the family argument, it links to the family help page. There we can see that each family has a default link argument; for the gaussian family, this link is the identity function.
We can also see that, for both the binomial and quasibinomial families, the default link is logit, whose inverse maps the linear predictor onto the 0-1 probability scale, as shown in the last post.
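We can verify these defaults by calling the family constructors directly, and sketch how the logit link and its inverse behave using base R’s qlogis and plogis:

```r
# Default links for two of the families
gaussian()$link   # "identity"
binomial()$link   # "logit"

# The logit link maps a probability to the linear predictor scale;
# its inverse maps the linear predictor back onto (0, 1)
eta <- qlogis(0.75)   # log(0.75 / 0.25), about 1.099
plogis(eta)           # 0.75
```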
So, by leaving the family argument at its default, the Gaussian family is selected; and by leaving the Gaussian family’s link at its default, the identity link is selected.
We can confirm this by setting the family and link explicitly, and showing that we get the same results:
mod_glm2 <- glm(Sepal.Length ~ Petal.Length, family = gaussian(link = "identity"), data = iris_ss)
summary(mod_glm2)
Call:
glm(formula = Sepal.Length ~ Petal.Length, family = gaussian(link = "identity"),
data = iris_ss)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.99871 0.22593 13.27 <2e-16 ***
Petal.Length 0.66516 0.04542 14.64 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.1391962)
Null deviance: 43.496 on 99 degrees of freedom
Residual deviance: 13.641 on 98 degrees of freedom
AIC: 90.58
Number of Fisher Scoring iterations: 2
It’s the same!
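To be thorough, we can check programmatically that the two fits agree. This is a quick sketch that refits both models on the same subset so it stands alone:

```r
# Fit the same specification with lm() and glm() and compare coefficients
iris_ss <- subset(iris, Petal.Length > 2.5)
mod_lm  <- lm(Sepal.Length ~ Petal.Length, data = iris_ss)
mod_glm <- glm(Sepal.Length ~ Petal.Length, data = iris_ss)
all.equal(coef(mod_lm), coef(mod_glm))   # TRUE
```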
How do these terms used in the glm function, family and link, relate to the general framework in King, Tomz, and Wittenberg (2000)?
family is the stochastic component, f(.)
link is the systematic component, g(.)
They’re different terms, but it’s the same broad framework.
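A fitted glm object keeps both pieces, so we can read them back off the model itself, as a small sketch:

```r
# The fitted model stores its family (stochastic component, f)
# and its link (systematic transformation, g)
mod <- glm(Sepal.Length ~ Petal.Length, data = iris)
mod$family$family   # "gaussian"
mod$family$link     # "identity"
```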
Linear models are just one type of generalised linear model!
Coming up
In the next part of this series, we will delve into the differences between linear regression models and logistic regression models, with a focus on how to get meaningful effect estimates from both types of model.
References
King, Gary, Michael Tomz, and Jason Wittenberg. 2000. “Making the Most of Statistical Analyses: Improving Interpretation and Presentation.” American Journal of Political Science 44: 341–355. http://gking.harvard.edu/files/abs/making-abs.shtml.