Tutorial 3: Advanced - The Mathematics of Poisson Regression

Deriving the log-likelihood and understanding the optimization


Under the Hood

This section derives the Poisson log-likelihood from first principles and shows why the IRLS algorithm works. This is the mathematical foundation behind the glm() function.

1. The Poisson Distribution

The Poisson distribution models the probability of observing $y$ events when events occur at a constant rate $\mu$:

$$P(Y = y \mid \mu) = \frac{\mu^y e^{-\mu}}{y!} \quad \text{for } y = 0, 1, 2, \ldots$$
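The pmf is easy to evaluate directly; here is a minimal numeric sketch (hypothetical rate $\mu = 3$, not part of any glm() machinery):

```python
import math

def poisson_pmf(y, mu):
    """P(Y = y | mu) = mu^y * exp(-mu) / y!"""
    return mu**y * math.exp(-mu) / math.factorial(y)

# Probabilities for mu = 3 over y = 0..20 should sum to ~1,
# and the (truncated) mean should be close to mu.
probs = [poisson_pmf(y, 3.0) for y in range(21)]
mean = sum(y * p for y, p in zip(range(21), probs))
```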

Key properties:

  - Mean: $E[Y] = \mu$
  - Variance: $\text{Var}(Y) = \mu$ (the variance equals the mean, a property called equidispersion)
  - Support: non-negative integers $y = 0, 1, 2, \ldots$

2. Building the Likelihood

For a Single Observation

The likelihood of observing count $y_i$ given rate $\mu_i$ is:

$$L(\mu_i \mid y_i) = \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}$$

For All n Observations

Assuming independence, the total likelihood is the product:

$$L(\boldsymbol{\mu} \mid \mathbf{y}) = \prod_{i=1}^{n} \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}$$

3. The Log-Likelihood

Taking the logarithm converts products to sums (much easier to work with):

Step 1: Apply log to the likelihood
$$\ell(\boldsymbol{\mu}) = \ln L = \sum_{i=1}^{n} \ln\left(\frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}\right)$$
Step 2: Expand using log rules
$$\ell(\boldsymbol{\mu}) = \sum_{i=1}^{n} \left[ y_i \ln(\mu_i) - \mu_i - \ln(y_i!) \right]$$
Step 3: Drop the $\ln(y_i!)$ term, which does not depend on $\boldsymbol{\beta}$ and so cannot affect the optimization
$$\ell(\boldsymbol{\mu}) = \sum_{i=1}^{n} \left[ y_i \ln(\mu_i) - \mu_i \right] + \text{const}$$

The Poisson Log-Likelihood

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \ln(\mu_i) - \mu_i \right]$$ where $\mu_i = e^{\mathbf{x}_i^T \boldsymbol{\beta}}$ (via the log link)
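A concrete check that dropping the constant is harmless (toy counts, hypothetical $\mu$ values): the $\sum_i \ln(y_i!)$ term shifts the log-likelihood by the same amount for every candidate $\boldsymbol{\mu}$, so it never changes which one maximizes $\ell$:

```python
import math

def loglik_full(mu, y):
    # y_i ln(mu_i) - mu_i - ln(y_i!)
    return sum(yi * math.log(mi) - mi - math.log(math.factorial(yi))
               for yi, mi in zip(y, mu))

def loglik_kernel(mu, y):
    # same, with the ln(y_i!) constant dropped
    return sum(yi * math.log(mi) - mi for yi, mi in zip(y, mu))

y = [2, 0, 5, 3]                                   # hypothetical counts
mu_a, mu_b = [1.5, 1.0, 4.0, 2.5], [2.0, 0.5, 5.0, 3.0]

# Differences between candidate mu vectors agree exactly,
# so both versions rank candidates identically.
diff_full = loglik_full(mu_a, y) - loglik_full(mu_b, y)
diff_kernel = loglik_kernel(mu_a, y) - loglik_kernel(mu_b, y)
```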

4. Connecting to the Linear Predictor

With the log link, we have:

$$\ln(\mu_i) = \eta_i = \mathbf{x}_i^T \boldsymbol{\beta} = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$

Therefore:

$$\mu_i = e^{\eta_i} = e^{\mathbf{x}_i^T \boldsymbol{\beta}}$$

Substituting into the log-likelihood:

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} - e^{\mathbf{x}_i^T \boldsymbol{\beta}} \right]$$

5. Finding the Maximum: Score and Hessian

The Score (First Derivative)

To find the MLE, we take the derivative with respect to $\boldsymbol{\beta}$ and set to zero:

$$\frac{\partial \ell}{\partial \boldsymbol{\beta}} = \sum_{i=1}^{n} \mathbf{x}_i \left( y_i - \mu_i \right) = \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu})$$

This is the score equation. Setting it to zero gives us the maximum likelihood equations.
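The analytic score can be verified against a finite-difference gradient of the log-likelihood; a small sketch on simulated data (hypothetical, not the tutorial's dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
beta = np.array([0.5, -0.3])
y = rng.poisson(np.exp(X @ beta))

def loglik(b):
    # l(beta) = sum_i [ y_i * x_i' b - exp(x_i' b) ], constants dropped
    eta = X @ b
    return float(y @ eta - np.exp(eta).sum())

mu = np.exp(X @ beta)
score = X.T @ (y - mu)                      # analytic: X^T (y - mu)

eps = 1e-6                                  # central finite differences
grad = np.array([(loglik(beta + eps * e) - loglik(beta - eps * e)) / (2 * eps)
                 for e in np.eye(2)])
```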

The Hessian (Second Derivative)

The second derivative (curvature) is:

$$\frac{\partial^2 \ell}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^T} = -\sum_{i=1}^{n} \mu_i \mathbf{x}_i \mathbf{x}_i^T = -\mathbf{X}^T \mathbf{W} \mathbf{X}$$

where $\mathbf{W} = \text{diag}(\mu_1, \mu_2, \ldots, \mu_n)$ is the weight matrix.

The IRLS Weights for Poisson

For Poisson regression, the weights are simply the fitted values: $W_{ii} = \mu_i$. Compare this to logistic regression where $W_{ii} = p_i(1-p_i)$.
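The Hessian identity is also easy to check numerically; a toy sketch (simulated design matrix, hypothetical coefficients) that additionally confirms $-\mathbf{H} = \mathbf{X}^T \mathbf{W} \mathbf{X}$ is positive definite, so the log-likelihood is concave:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(40), rng.normal(size=40)])
beta = np.array([0.2, 0.4])
mu = np.exp(X @ beta)

H = -(X.T * mu) @ X                  # rows of X scaled by mu_i, no explicit diag
H_explicit = -X.T @ np.diag(mu) @ X  # same thing, written out with W = diag(mu)

# Eigenvalues of -H = X^T W X should all be positive (concave log-likelihood)
eigvals = np.linalg.eigvalsh(-H)
```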

6. The IRLS Algorithm

Newton-Raphson optimization gives the update:

$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu})$$

This can be rewritten as a weighted least squares problem:

$$\boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{z}$$

where $\mathbf{z}$ is the "working response":

$$z_i = \eta_i + \frac{y_i - \mu_i}{\mu_i}$$
To see the equivalence, note that $z_i - \eta_i = (y_i - \mu_i)/\mu_i$, so $\mathbf{y} - \boldsymbol{\mu} = \mathbf{W}(\mathbf{z} - \mathbf{X}\boldsymbol{\beta}^{(t)})$. Substituting this into the Newton update and expanding recovers $(\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{z}$.

The Algorithm

  1. Initialize $\boldsymbol{\beta}^{(0)}$ (often zeros, giving $\mu_i = 1$ for all $i$)
  2. Compute $\boldsymbol{\mu}^{(t)} = e^{\mathbf{X}\boldsymbol{\beta}^{(t)}}$
  3. Compute weights: $\mathbf{W}^{(t)} = \text{diag}(\boldsymbol{\mu}^{(t)})$
  4. Compute working response: $z_i^{(t)} = \eta_i^{(t)} + (y_i - \mu_i^{(t)})/\mu_i^{(t)}$
  5. Update: $\boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)}$
  6. Repeat until $\|\boldsymbol{\beta}^{(t+1)} - \boldsymbol{\beta}^{(t)}\| < \epsilon$
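The six steps above can be sketched in a few lines of NumPy (simulated data, a teaching sketch rather than the glm() implementation):

```python
import numpy as np

def irls_poisson(X, y, tol=1e-8, max_iter=50):
    """IRLS for the Poisson / log-link case, following steps 1-6 above."""
    beta = np.zeros(X.shape[1])                    # step 1: mu_i = 1 for all i
    for _ in range(max_iter):
        eta = X @ beta                             # linear predictor
        mu = np.exp(eta)                           # step 2: fitted means
        w = mu                                     # step 3: Poisson weights W_ii = mu_i
        z = eta + (y - mu) / mu                    # step 4: working response
        XtW = X.T * w                              # X^T W without forming diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # step 5: weighted LS
        if np.linalg.norm(beta_new - beta) < tol:      # step 6: convergence
            return beta_new
        beta = beta_new
    return beta

# Simulated data (hypothetical, not the bike rental data)
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.poisson(np.exp(X @ np.array([1.0, 0.5])))
beta_hat = irls_poisson(X, y)
```

At convergence the score $\mathbf{X}^T(\mathbf{y} - \boldsymbol{\mu})$ is essentially zero, which is exactly the maximum likelihood condition from section 5.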

7. Comparison Across GLM Families

IRLS Components by GLM Family

| Family | Link | $\mu_i$ | Weights $W_{ii}$ | Variance |
|---|---|---|---|---|
| Gaussian | Identity | $\eta_i$ | $1$ (constant) | $\sigma^2$ (constant) |
| Binomial | Logit | $\frac{1}{1+e^{-\eta_i}}$ | $p_i(1-p_i)$ | $p_i(1-p_i)$ |
| Poisson | Log | $e^{\eta_i}$ | $\mu_i$ | $\mu_i$ |
| Gamma | Log | $e^{\eta_i}$ | $1$ (constant) | $\mu_i^2 / \phi$ |

Note: For Gamma with log link, the IRLS weight is $(d\mu/d\eta)^2 / V(\mu) = \mu^2 / \mu^2 = 1$. With the canonical inverse link $g(\mu)=1/\mu$, the weight would be $\mu_i^2$.

Notice: The algorithm structure is identical across all GLMs! Only the link function (how $\mu$ relates to $\eta$) and the weights change.

8. Deviance: Measuring Model Fit

The deviance compares the fitted model to a "saturated" model (one parameter per observation):

$$D = 2 \sum_{i=1}^{n} \left[ y_i \ln\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i - \hat{\mu}_i) \right]$$

For Poisson data without overdispersion, the residual deviance should approximately equal the degrees of freedom. Our bike rental model:

$$\frac{D}{df} = \frac{380,005}{725} = 524.1 \gg 1$$

This extreme ratio indicates overdispersion: the variance in our data far exceeds the variance-equals-mean assumption of the Poisson model.
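For data that genuinely are Poisson, a quick simulation shows $D/df$ landing near 1. A sketch using an intercept-only model (where the MLE of $\mu$ is simply the sample mean), on hypothetical simulated counts:

```python
import numpy as np

# Simulated genuinely-Poisson counts (hypothetical, not the bike rentals)
rng = np.random.default_rng(7)
y = rng.poisson(4.0, size=300)

# Intercept-only model: the MLE of mu is the sample mean
mu_hat = y.mean()

# Poisson deviance; the y_i ln(y_i / mu_hat) term is taken as 0 when y_i = 0
with np.errstate(divide="ignore", invalid="ignore"):
    term = np.where(y > 0, y * np.log(y / mu_hat), 0.0)
D = 2.0 * np.sum(term - (y - mu_hat))
ratio = D / (len(y) - 1)        # df = n - p, with p = 1 parameter here
```

For well-specified Poisson data this ratio hovers near 1, in sharp contrast to the 524.1 observed for the bike rental model.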

Tutorial 3 Complete!

You've learned Poisson regression for count data. Key takeaways:

  - The log link gives $\ln(\mu_i) = \mathbf{x}_i^T \boldsymbol{\beta}$, so $\mu_i = e^{\mathbf{x}_i^T \boldsymbol{\beta}}$
  - The MLE solves the score equation $\mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu}) = \mathbf{0}$
  - IRLS solves it as repeated weighted least squares with weights $W_{ii} = \mu_i$
  - Residual deviance far above its degrees of freedom signals overdispersion

What's Next? Tutorial 4 shows how to handle overdispersion using the Negative Binomial distribution, which adds a parameter to allow variance > mean.