Tutorial 3: Advanced - The Mathematics of Poisson Regression

Deriving the log-likelihood and understanding the optimization


Under the Hood

This section derives the Poisson log-likelihood from first principles and shows why the IRLS algorithm works. This is the mathematical foundation behind the glm() function.

1. The Poisson Distribution

The Poisson distribution models the probability of observing $y$ events when events occur at a constant rate $\mu$:

$$P(Y = y \mid \mu) = \frac{\mu^y e^{-\mu}}{y!} \quad \text{for } y = 0, 1, 2, \ldots$$
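The pmf is easy to evaluate directly; here is a minimal numeric sketch (hypothetical rate $\mu = 3$, not part of any glm() machinery):

```python
import math

def poisson_pmf(y, mu):
    """P(Y = y | mu) = mu^y * exp(-mu) / y!"""
    return mu**y * math.exp(-mu) / math.factorial(y)

# Probabilities for mu = 3 over y = 0..20 should sum to ~1,
# and the (truncated) mean should be close to mu.
probs = [poisson_pmf(y, 3.0) for y in range(21)]
mean = sum(y * p for y, p in zip(range(21), probs))
```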

Key properties:

  - Mean: $E[Y] = \mu$
  - Variance: $\text{Var}(Y) = \mu$ (the variance equals the mean, a property called equidispersion)
  - Support: non-negative integers $y = 0, 1, 2, \ldots$

2. Building the Likelihood

For a Single Observation

The likelihood of observing count $y_i$ given rate $\mu_i$ is:

$$L(\mu_i \mid y_i) = \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}$$

For All n Observations

Assuming independence, the total likelihood is the product:

$$L(\boldsymbol{\mu} \mid \mathbf{y}) = \prod_{i=1}^{n} \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}$$

3. The Log-Likelihood

Taking the logarithm converts products to sums (much easier to work with):

Step 1: Apply log to the likelihood
$$\ell(\boldsymbol{\mu}) = \ln L = \sum_{i=1}^{n} \ln\left(\frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}\right)$$
Step 2: Expand using log rules
$$\ell(\boldsymbol{\mu}) = \sum_{i=1}^{n} \left[ y_i \ln(\mu_i) - \mu_i - \ln(y_i!) \right]$$
Step 3: Drop the $\ln(y_i!)$ term, which does not depend on $\boldsymbol{\beta}$ and so cannot affect the optimization
$$\ell(\boldsymbol{\mu}) = \sum_{i=1}^{n} \left[ y_i \ln(\mu_i) - \mu_i \right] + \text{const}$$

The Poisson Log-Likelihood

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \ln(\mu_i) - \mu_i \right]$$ where $\mu_i = e^{\mathbf{x}_i^T \boldsymbol{\beta}}$ (via the log link)
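A concrete check that dropping the constant is harmless (toy counts, hypothetical $\mu$ values): the $\sum_i \ln(y_i!)$ term shifts the log-likelihood by the same amount for every candidate $\boldsymbol{\mu}$, so it never changes which one maximizes $\ell$:

```python
import math

def loglik_full(mu, y):
    # y_i ln(mu_i) - mu_i - ln(y_i!)
    return sum(yi * math.log(mi) - mi - math.log(math.factorial(yi))
               for yi, mi in zip(y, mu))

def loglik_kernel(mu, y):
    # same, with the ln(y_i!) constant dropped
    return sum(yi * math.log(mi) - mi for yi, mi in zip(y, mu))

y = [2, 0, 5, 3]                                   # hypothetical counts
mu_a, mu_b = [1.5, 1.0, 4.0, 2.5], [2.0, 0.5, 5.0, 3.0]

# Differences between candidate mu vectors agree exactly,
# so both versions rank candidates identically.
diff_full = loglik_full(mu_a, y) - loglik_full(mu_b, y)
diff_kernel = loglik_kernel(mu_a, y) - loglik_kernel(mu_b, y)
```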

4. Connecting to the Linear Predictor

With the log link, we have:

$$\ln(\mu_i) = \eta_i = \mathbf{x}_i^T \boldsymbol{\beta} = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$

Therefore:

$$\mu_i = e^{\eta_i} = e^{\mathbf{x}_i^T \boldsymbol{\beta}}$$

Substituting into the log-likelihood:

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} - e^{\mathbf{x}_i^T \boldsymbol{\beta}} \right]$$

5. Finding the Maximum: Score and Hessian

The Score (First Derivative)

To find the MLE, we take the derivative with respect to $\boldsymbol{\beta}$ and set to zero:

$$\frac{\partial \ell}{\partial \boldsymbol{\beta}} = \sum_{i=1}^{n} \mathbf{x}_i \left( y_i - \mu_i \right) = \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu})$$

This is the score equation. Setting it to zero gives us the maximum likelihood equations.
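The analytic score can be verified against a finite-difference gradient of the log-likelihood; a small sketch on simulated data (hypothetical, not the tutorial's dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
beta = np.array([0.5, -0.3])
y = rng.poisson(np.exp(X @ beta))

def loglik(b):
    # l(beta) = sum_i [ y_i * x_i' b - exp(x_i' b) ], constants dropped
    eta = X @ b
    return float(y @ eta - np.exp(eta).sum())

mu = np.exp(X @ beta)
score = X.T @ (y - mu)                      # analytic: X^T (y - mu)

eps = 1e-6                                  # central finite differences
grad = np.array([(loglik(beta + eps * e) - loglik(beta - eps * e)) / (2 * eps)
                 for e in np.eye(2)])
```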

The Hessian (Second Derivative)

The second derivative (curvature) is:

$$\frac{\partial^2 \ell}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^T} = -\sum_{i=1}^{n} \mu_i \mathbf{x}_i \mathbf{x}_i^T = -\mathbf{X}^T \mathbf{W} \mathbf{X}$$

where $\mathbf{W} = \text{diag}(\mu_1, \mu_2, \ldots, \mu_n)$ is the weight matrix.

The IRLS Weights for Poisson

For Poisson regression, the weights are simply the fitted values: $W_{ii} = \mu_i$. Compare this to logistic regression where $W_{ii} = p_i(1-p_i)$.
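The Hessian identity is also easy to check numerically; a toy sketch (simulated design matrix, hypothetical coefficients) that additionally confirms $-\mathbf{H} = \mathbf{X}^T \mathbf{W} \mathbf{X}$ is positive definite, so the log-likelihood is concave:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(40), rng.normal(size=40)])
beta = np.array([0.2, 0.4])
mu = np.exp(X @ beta)

H = -(X.T * mu) @ X                  # rows of X scaled by mu_i, no explicit diag
H_explicit = -X.T @ np.diag(mu) @ X  # same thing, written out with W = diag(mu)

# Eigenvalues of -H = X^T W X should all be positive (concave log-likelihood)
eigvals = np.linalg.eigvalsh(-H)
```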

6. The IRLS Algorithm

Newton-Raphson optimization gives the update:

$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu})$$

This can be rewritten as a weighted least squares problem:

$$\boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{z}$$

where $\mathbf{z}$ is the "working response":

$$z_i = \eta_i + \frac{y_i - \mu_i}{\mu_i}$$
To see the equivalence, note that $z_i - \eta_i = (y_i - \mu_i)/\mu_i$, so $\mathbf{y} - \boldsymbol{\mu} = \mathbf{W}(\mathbf{z} - \mathbf{X}\boldsymbol{\beta}^{(t)})$. Substituting this into the Newton update and expanding recovers $(\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{z}$.

The Algorithm

  1. Initialize $\boldsymbol{\beta}^{(0)}$ (often zeros, giving $\mu_i = 1$ for all $i$)
  2. Compute $\boldsymbol{\mu}^{(t)} = e^{\mathbf{X}\boldsymbol{\beta}^{(t)}}$
  3. Compute weights: $\mathbf{W}^{(t)} = \text{diag}(\boldsymbol{\mu}^{(t)})$
  4. Compute working response: $z_i^{(t)} = \eta_i^{(t)} + (y_i - \mu_i^{(t)})/\mu_i^{(t)}$
  5. Update: $\boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)}$
  6. Repeat until $\|\boldsymbol{\beta}^{(t+1)} - \boldsymbol{\beta}^{(t)}\| < \epsilon$
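The six steps above can be sketched in a few lines of NumPy (simulated data, a teaching sketch rather than the glm() implementation):

```python
import numpy as np

def irls_poisson(X, y, tol=1e-8, max_iter=50):
    """IRLS for the Poisson / log-link case, following steps 1-6 above."""
    beta = np.zeros(X.shape[1])                    # step 1: mu_i = 1 for all i
    for _ in range(max_iter):
        eta = X @ beta                             # linear predictor
        mu = np.exp(eta)                           # step 2: fitted means
        w = mu                                     # step 3: Poisson weights W_ii = mu_i
        z = eta + (y - mu) / mu                    # step 4: working response
        XtW = X.T * w                              # X^T W without forming diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # step 5: weighted LS
        if np.linalg.norm(beta_new - beta) < tol:      # step 6: convergence
            return beta_new
        beta = beta_new
    return beta

# Simulated data (hypothetical, not the bike rental data)
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.poisson(np.exp(X @ np.array([1.0, 0.5])))
beta_hat = irls_poisson(X, y)
```

At convergence the score $\mathbf{X}^T(\mathbf{y} - \boldsymbol{\mu})$ is essentially zero, which is exactly the maximum likelihood condition from section 5.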

7. Comparison Across GLM Families

IRLS Components by GLM Family

| Family | Link | $\mu_i$ | Weights $W_{ii}$ | Variance |
|---|---|---|---|---|
| Gaussian | Identity | $\eta_i$ | $1$ (constant) | $\sigma^2$ (constant) |
| Binomial | Logit | $\frac{1}{1+e^{-\eta_i}}$ | $p_i(1-p_i)$ | $p_i(1-p_i)$ |
| Poisson | Log | $e^{\eta_i}$ | $\mu_i$ | $\mu_i$ |
| Gamma | Log | $e^{\eta_i}$ | $1$ (constant) | $\mu_i^2 / \phi$ |

Note: For Gamma with log link, the IRLS weight is $(d\mu/d\eta)^2 / V(\mu) = \mu^2 / \mu^2 = 1$. With the canonical inverse link $g(\mu)=1/\mu$, the weight would be $\mu_i^2$.

Notice: The algorithm structure is identical across all GLMs! Only the link function (how $\mu$ relates to $\eta$) and the weights change.

8. Deviance: Measuring Model Fit

The deviance compares the fitted model to a "saturated" model (one parameter per observation):

$$D = 2 \sum_{i=1}^{n} \left[ y_i \ln\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i - \hat{\mu}_i) \right]$$

For Poisson data without overdispersion, the residual deviance should approximately equal the degrees of freedom. Our bike rental model:

$$\frac{D}{df} = \frac{380,005}{725} = 524.1 \gg 1$$

This extreme ratio indicates overdispersion: the variance in our data far exceeds the variance-equals-mean assumption of the Poisson model.
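For data that genuinely are Poisson, a quick simulation shows $D/df$ landing near 1. A sketch using an intercept-only model (where the MLE of $\mu$ is simply the sample mean), on hypothetical simulated counts:

```python
import numpy as np

# Simulated genuinely-Poisson counts (hypothetical, not the bike rentals)
rng = np.random.default_rng(7)
y = rng.poisson(4.0, size=300)

# Intercept-only model: the MLE of mu is the sample mean
mu_hat = y.mean()

# Poisson deviance; the y_i ln(y_i / mu_hat) term is taken as 0 when y_i = 0
with np.errstate(divide="ignore", invalid="ignore"):
    term = np.where(y > 0, y * np.log(y / mu_hat), 0.0)
D = 2.0 * np.sum(term - (y - mu_hat))
ratio = D / (len(y) - 1)        # df = n - p, with p = 1 parameter here
```

For well-specified Poisson data this ratio hovers near 1, in sharp contrast to the 524.1 observed for the bike rental model.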

Tutorial 3 Complete!

You've learned Poisson regression for count data. Key takeaways:

  - The log link gives $\ln(\mu_i) = \mathbf{x}_i^T \boldsymbol{\beta}$, so $\mu_i = e^{\mathbf{x}_i^T \boldsymbol{\beta}}$
  - The MLE solves the score equation $\mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu}) = \mathbf{0}$
  - IRLS solves it as repeated weighted least squares with weights $W_{ii} = \mu_i$
  - Residual deviance far above its degrees of freedom signals overdispersion

What's Next? Tutorial 4 shows how to handle overdispersion using the Negative Binomial distribution, which adds a parameter to allow variance > mean.