GLM Tutorial: Choose the Link Function for Counts

✔ Correct Choice!

The log link is the canonical link for the Poisson distribution. Pairing the two gives Poisson regression, the standard model for count data.

The log link ensures predictions are always positive, because its inverse is the exponential function:

        $\ln(\mu) = \eta \quad \Rightarrow \quad \mu = e^\eta > 0$
        
        Since $e^x > 0$ for all real $x$, our predicted counts can never be negative!
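The positivity guarantee can be checked numerically. This is a minimal sketch assuming NumPy is available; the values of the linear predictor `eta` are made up for illustration and do not come from the bike rental data.

```python
import numpy as np

# Hypothetical linear-predictor values, including strongly negative ones.
eta = np.array([-10.0, -2.5, 0.0, 1.0, 6.0])

# Inverse of the log link: mu = exp(eta).
mu = np.exp(eta)

print(mu)
print(bool(np.all(mu > 0)))  # every predicted mean is strictly positive
```

However negative the linear predictor becomes, the predicted mean only approaches zero and never crosses it.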

Why log dominates for counts:

Rate ratio interpretation: $e^\beta$ gives the multiplicative effect on the rate
Canonical link: Natural pairing with Poisson distribution in GLM theory
Guarantees positivity: Can't predict -50 bike rentals
Multiplicative effects: Many count processes have multiplicative relationships
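The rate ratio interpretation can be illustrated with a short calculation. The sketch below assumes NumPy and uses hypothetical coefficients `beta0` and `beta1` (not fitted to any data) in the model $\ln(\mu) = \beta_0 + \beta_1 x$.

```python
import numpy as np

# Hypothetical coefficients for log(mu) = beta0 + beta1 * x (illustration only).
beta0, beta1 = 3.0, 0.05


def predicted_mean(x):
    """Inverse log link: mu = exp(beta0 + beta1 * x)."""
    return np.exp(beta0 + beta1 * x)


rate_ratio = np.exp(beta1)

# The ratio mu(x + 1) / mu(x) is the same at every x and equals exp(beta1).
for x in (0.0, 10.0, 25.0):
    ratio = predicted_mean(x + 1) / predicted_mean(x)
    print(x, ratio, rate_ratio)
```

With these assumed values, $e^{0.05} \approx 1.051$, so each one-unit increase in $x$ multiplies the expected count by about 1.051 (roughly a 5% increase), regardless of the starting value of $x$.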

⚠ Valid, But Not Preferred Today

The square root link is mathematically valid for count data: since $\mu = \eta^2 \geq 0$, its predictions can never be negative. However, it's rarely used today.

Historical Context

Pre-GLM era (before the 1970s): The square root was used as a "variance-stabilizing transformation" for count data.

Why sqrt was used: For Poisson data, $\text{Var}(Y) = \mu$, so the variance increases with the mean. Taking the square root approximately stabilizes it: $\text{Var}(\sqrt{Y}) \approx 1/4$ when $\mu$ is not too small, whatever the value of $\mu$.

Why log won: GLM theory showed that the log link is the "canonical" link for Poisson, and that the mean-variance relationship is handled by the model itself, so the response no longer needs to be transformed. Plus, log-link coefficients have a cleaner interpretation (rate ratios).
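The variance-stabilizing property of the square root can be checked with a small simulation. This is a rough sketch assuming NumPy; the means, sample size, and seed are arbitrary choices, not values from the tutorial. By the delta method, $\text{Var}(\sqrt{Y}) \approx \left(\frac{1}{2\sqrt{\mu}}\right)^2 \mu = \frac{1}{4}$, so the second number printed for each mean should stay near 0.25 while the first grows with $\mu$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

# For each (arbitrary) mean, store the sample variance of Y and of sqrt(Y).
results = {}
for mu in (20, 100, 500):
    y = rng.poisson(lam=mu, size=200_000)
    results[mu] = (y.var(), np.sqrt(y).var())
    print(mu, results[mu])
```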

        The interpretation problem:
        
        With sqrt link: $\sqrt{\mu} = \beta_0 + \beta_1 X$
        
        What does $\beta_1$ mean? It's the change in $\sqrt{\mu}$ per unit change in $X$, which is hard to translate into a statement about the count itself.
        
        With log link: $e^{\beta_1}$ is the rate ratio - much more interpretable.
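The contrast between the two links can be made concrete with a short comparison. This sketch assumes NumPy and uses the same hypothetical coefficients under both links purely for illustration. Under the square root link, neither the difference nor the ratio of predicted means is constant across $X$; under the log link, the ratio is always $e^{\beta_1}$.

```python
import numpy as np

# Hypothetical coefficients, reused for both links for illustration only.
beta0, beta1 = 2.0, 0.5


def mean_sqrt_link(x):
    """Inverse square root link: mu = (beta0 + beta1 * x) ** 2."""
    return (beta0 + beta1 * x) ** 2


def mean_log_link(x):
    """Inverse log link: mu = exp(beta0 + beta1 * x)."""
    return np.exp(beta0 + beta1 * x)


# Effect of a one-unit increase in x, evaluated at several starting points.
for x in (0.0, 4.0, 10.0):
    diff_sqrt = mean_sqrt_link(x + 1) - mean_sqrt_link(x)
    ratio_sqrt = mean_sqrt_link(x + 1) / mean_sqrt_link(x)
    ratio_log = mean_log_link(x + 1) / mean_log_link(x)
    print(x, diff_sqrt, ratio_sqrt, ratio_log)
```

The square root columns change from row to row, so no single number summarizes the effect of $X$ on $\mu$. The log column is the same in every row.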

For this tutorial, use the log link - it's the modern standard with better interpretation.

❌ Can Predict Negative Counts

The identity link doesn't constrain predictions to be positive:

        $\mu = \beta_0 + \beta_1 \cdot \text{Temp} + \beta_2 \cdot \text{Humidity} + \ldots$
      

The problem: Linear combinations can produce any real number:

Prediction: -247 bike rentals (impossible!)
Counts must be 0, 1, 2, 3, ... - never negative

The identity link is appropriate for continuous responses that can be any real number (like temperature or weight change), but not for counts.
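A negative prediction is easy to produce with an identity link. The sketch below assumes NumPy; the coefficients and the two example days are invented for illustration and are not estimates from the bike rental data.

```python
import numpy as np

# Hypothetical coefficients for mu = beta0 + beta1 * Temp + beta2 * Humidity.
beta = np.array([500.0, 120.0, -30.0])

# Each row is [1, Temp, Humidity]; the second row is a cold, humid day.
X = np.array([
    [1.0, 25.0, 40.0],
    [1.0, -5.0, 90.0],
])

# Identity link: the predicted mean is the linear predictor itself.
mu_identity = X @ beta
print(mu_identity)
```

With these assumed numbers the first day gets a plausible prediction of 2300 rentals, while the second gets -2800, which no count can equal.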

❌ Wrong Domain

The logit link is designed for probabilities, not counts:

        $\text{logit}(\mu) = \ln\left(\frac{\mu}{1-\mu}\right)$
        
        This requires $0 < \mu < 1$ - it only makes sense for probabilities!

The problems:

Bike rental counts can be 22, 4548, 8714... not between 0 and 1
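$\ln\left(\frac{4548}{1-4548}\right)$ is undefined: the denominator is negative, so the ratio is negative, and the logarithm of a negative number does not exist over the reals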
Logit is for binary outcomes (Tutorial 2), not counts

Remember: Tutorial 2 used logit for heart disease (yes/no). This tutorial has counts (0, 1, 2, ..., thousands).
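The domain problem shows up immediately if the logit is evaluated at a count. This is a minimal sketch using only the Python standard library; the helper `logit` is defined here for illustration and is not part of the tutorial.

```python
import math


def logit(mu):
    """Logit link: ln(mu / (1 - mu)), defined only for 0 < mu < 1."""
    return math.log(mu / (1.0 - mu))


print(logit(0.25))  # a valid probability works

# A count such as 4548 makes the ratio negative, so the logarithm fails.
try:
    logit(4548)
    count_ok = True
except ValueError:
    count_ok = False

print(count_ok)
```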

Tutorial 3: The Link Function for Count Data

