Under the Hood
This section derives the Poisson log-likelihood from first principles and shows why the IRLS
algorithm works. This is the mathematical foundation behind the glm() function.
1. The Poisson Distribution
The Poisson distribution models the probability of observing $y$ events when events occur
at a constant rate $\mu$:
$$P(Y = y \mid \mu) = \frac{\mu^y e^{-\mu}}{y!} \quad \text{for } y = 0, 1, 2, \ldots$$
Key properties:
- $E[Y] = \mu$ (the mean is $\mu$)
- $\text{Var}(Y) = \mu$ (variance equals the mean: the defining Poisson property)
- $Y$ takes only non-negative integer values
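The mean-equals-variance property is easy to check by simulation. A quick sketch with NumPy (the rate $\mu = 4$ and sample size are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 4.0                              # arbitrary rate for the demo
y = rng.poisson(mu, size=200_000)     # draw Poisson(4) counts

print(y.min())                        # counts are non-negative integers
print(y.mean(), y.var())              # both should be close to mu = 4
```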
2. Building the Likelihood
For a Single Observation
The likelihood of observing count $y_i$ given rate $\mu_i$ is:
$$L(\mu_i \mid y_i) = \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}$$
For All n Observations
Assuming independence, the total likelihood is the product:
$$L(\boldsymbol{\mu} \mid \mathbf{y}) = \prod_{i=1}^{n} \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}$$
3. The Log-Likelihood
Taking the logarithm converts products to sums (much easier to work with):
Step 1: Apply log to the likelihood
$$\ell(\boldsymbol{\mu}) = \ln L = \sum_{i=1}^{n} \ln\left(\frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}\right)$$
Step 2: Expand using log rules
$$\ell(\boldsymbol{\mu}) = \sum_{i=1}^{n} \left[ y_i \ln(\mu_i) - \mu_i - \ln(y_i!) \right]$$
Step 3: Isolate the constant term $\sum_i \ln(y_i!)$ (it doesn't depend on $\mu$, so it doesn't affect optimization)
$$\ell(\boldsymbol{\mu}) = \sum_{i=1}^{n} \left[ y_i \ln(\mu_i) - \mu_i \right] + \text{const}$$
The Poisson Log-Likelihood
$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \ln(\mu_i) - \mu_i \right]$$
where $\mu_i = e^{\mathbf{x}_i^T \boldsymbol{\beta}}$ (via the log link)
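We can sanity-check the dropped constant numerically: the kernel $\sum_i [y_i \ln\mu_i - \mu_i]$ differs from the exact log-likelihood by exactly $\sum_i \ln(y_i!)$. A sketch using SciPy's `poisson.logpmf` and `gammaln` (assuming SciPy is available; the rates are arbitrary):

```python
import numpy as np
from scipy.stats import poisson
from scipy.special import gammaln

rng = np.random.default_rng(1)
mu = rng.uniform(0.5, 5.0, size=50)    # arbitrary rates mu_i
y = rng.poisson(mu)                    # counts drawn at those rates

full = poisson.logpmf(y, mu).sum()     # exact log-likelihood
kernel = np.sum(y * np.log(mu) - mu)   # the part we keep
const = gammaln(y + 1).sum()           # sum of ln(y_i!): the dropped constant

print(np.isclose(full, kernel - const))   # True: they differ only by the constant
```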
4. Connecting to the Linear Predictor
With the log link, we have:
$$\ln(\mu_i) = \eta_i = \mathbf{x}_i^T \boldsymbol{\beta} = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$
Therefore:
$$\mu_i = e^{\eta_i} = e^{\mathbf{x}_i^T \boldsymbol{\beta}}$$
Substituting into the log-likelihood:
$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} - e^{\mathbf{x}_i^T \boldsymbol{\beta}} \right]$$
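Because $\eta_i = \ln\mu_i$ under the log link, the two forms of the log-likelihood are numerically identical. A quick check (the design matrix and coefficients here are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # intercept + 1 covariate
beta = np.array([0.5, 0.3])                                # arbitrary coefficients
eta = X @ beta
mu = np.exp(eta)                                           # log link: mu = exp(eta)
y = rng.poisson(mu)

ll_mu = np.sum(y * np.log(mu) - mu)      # mu-parameterization
ll_beta = np.sum(y * eta - np.exp(eta))  # beta-parameterization
print(np.isclose(ll_mu, ll_beta))        # True
```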
5. Finding the Maximum: Score and Hessian
The Score (First Derivative)
To find the MLE, we take the derivative with respect to $\boldsymbol{\beta}$ and set to zero:
$$\frac{\partial \ell}{\partial \boldsymbol{\beta}} = \sum_{i=1}^{n} \mathbf{x}_i \left( y_i - \mu_i \right) = \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu})$$
This is the score equation. Setting it to zero gives us the maximum likelihood equations.
The Hessian (Second Derivative)
The second derivative (curvature) is:
$$\frac{\partial^2 \ell}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^T} = -\sum_{i=1}^{n} \mu_i \mathbf{x}_i \mathbf{x}_i^T = -\mathbf{X}^T \mathbf{W} \mathbf{X}$$
where $\mathbf{W} = \text{diag}(\mu_1, \mu_2, \ldots, \mu_n)$ is the weight matrix.
The IRLS Weights for Poisson
For Poisson regression, the weights are simply the fitted values: $W_{ii} = \mu_i$.
Compare this to logistic regression where $W_{ii} = p_i(1-p_i)$.
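Both derivatives can be verified numerically: the matrix form $\mathbf{X}^T(\mathbf{y}-\boldsymbol{\mu})$ should match a finite-difference gradient of the log-likelihood, and $-\mathbf{X}^T\mathbf{W}\mathbf{X}$ should be negative definite (so the stationary point is a maximum). A sketch on simulated data (all values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
beta = np.array([0.2, -0.4])              # evaluate the derivatives at this point
y = rng.poisson(np.exp(X @ beta))

def loglik(b):
    eta = X @ b
    return np.sum(y * eta - np.exp(eta))

mu = np.exp(X @ beta)
score = X.T @ (y - mu)                    # analytic score: X'(y - mu)
hessian = -(X.T * mu) @ X                 # analytic Hessian: -X'WX with W = diag(mu)

# central finite differences on the log-likelihood
eps = 1e-6
fd = np.array([(loglik(beta + eps * e) - loglik(beta - eps * e)) / (2 * eps)
               for e in np.eye(2)])

print(np.allclose(score, fd, rtol=1e-4, atol=1e-6))  # True
print(np.all(np.linalg.eigvalsh(hessian) < 0))       # True: log-likelihood is concave
```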
6. The IRLS Algorithm
Newton-Raphson optimization gives the update:
$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu})$$
This can be rewritten as a weighted least squares problem:
$$\boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{z}$$
where $\mathbf{z}$ is the "working response":
$$z_i = \eta_i + \frac{y_i - \mu_i}{\mu_i}$$
The Algorithm
1. Initialize $\boldsymbol{\beta}^{(0)}$ (often zeros, giving $\mu_i = 1$ for all $i$)
2. Compute $\boldsymbol{\eta}^{(t)} = \mathbf{X}\boldsymbol{\beta}^{(t)}$ and $\boldsymbol{\mu}^{(t)} = e^{\boldsymbol{\eta}^{(t)}}$
3. Compute weights: $\mathbf{W}^{(t)} = \text{diag}(\boldsymbol{\mu}^{(t)})$
4. Compute the working response: $z_i^{(t)} = \eta_i^{(t)} + (y_i - \mu_i^{(t)})/\mu_i^{(t)}$
5. Update: $\boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)}$
6. Repeat steps 2-5 until $\|\boldsymbol{\beta}^{(t+1)} - \boldsymbol{\beta}^{(t)}\| < \epsilon$
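The whole loop fits in a few lines of NumPy. A minimal sketch (`irls_poisson` and the simulated data are illustrative, not the tutorial's bike-rental model):

```python
import numpy as np

def irls_poisson(X, y, tol=1e-10, max_iter=50):
    """Fit Poisson regression with a log link by IRLS (minimal sketch)."""
    beta = np.zeros(X.shape[1])           # start at beta = 0, so mu_i = 1
    for _ in range(max_iter):
        eta = X @ beta
        mu = np.exp(eta)
        w = mu                            # Poisson IRLS weights: W_ii = mu_i
        z = eta + (y - mu) / mu           # working response
        XtW = X.T * w                     # X'W without forming diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)  # weighted least squares
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta

# simulated data with known coefficients (arbitrary choice)
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(5000), rng.normal(size=5000)])
y = rng.poisson(np.exp(X @ np.array([1.0, 0.5])))

beta_hat = irls_poisson(X, y)
print(beta_hat)                                        # close to the true [1.0, 0.5]
print(np.abs(X.T @ (y - np.exp(X @ beta_hat))).max())  # score near 0 at the MLE
```

Note that the update solves the normal equations directly rather than inverting $\mathbf{X}^T\mathbf{W}\mathbf{X}$, which is both faster and more numerically stable.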
7. Comparison Across GLM Families
IRLS Components by GLM Family
| Family | Link | $\mu_i$ | Weights $W_{ii}$ | Variance |
|---|---|---|---|---|
| Gaussian | Identity | $\eta_i$ | $1$ (constant) | $\sigma^2$ (constant) |
| Binomial | Logit | $\frac{1}{1+e^{-\eta_i}}$ | $p_i(1-p_i)$ | $p_i(1-p_i)$ |
| Poisson | Log | $e^{\eta_i}$ | $\mu_i$ | $\mu_i$ |
| Gamma | Log | $e^{\eta_i}$ | $1$ (constant) | $\mu_i^2 / \phi$ |
Note: For Gamma with log link, the IRLS weight is $(d\mu/d\eta)^2 / V(\mu) = \mu^2 / \mu^2 = 1$.
With the canonical inverse link $g(\mu)=1/\mu$, the weight would be $\mu_i^2$.
Notice: The algorithm structure is identical across all GLMs! Only the link function
(how $\mu$ relates to $\eta$) and the weights change.
8. Deviance: Measuring Model Fit
The deviance compares the fitted model to a "saturated" model (one parameter per observation). For the Poisson family it works out to:
$$D = 2 \sum_{i=1}^{n} \left[ y_i \ln\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i - \hat{\mu}_i) \right]$$
(using the convention $0 \ln 0 = 0$, so an observation with $y_i = 0$ contributes $2\hat{\mu}_i$)
For Poisson data without overdispersion, the residual deviance should approximately equal the
degrees of freedom. Our bike rental model:
$$\frac{D}{df} = \frac{380,005}{725} = 524.1 \gg 1$$
This extreme ratio indicates overdispersion: the variance in our data greatly
exceeds what the Poisson model assumes (variance = mean).
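As a sketch of this diagnostic on simulated data (not the bike rentals): for data that really are Poisson, an intercept-only fit, whose MLE is $\hat{\mu}_i = \bar{y}$, gives $D/df \approx 1$, while an overdispersed gamma-Poisson mixture pushes the ratio well above 1. The mixture parameters below are arbitrary.

```python
import numpy as np

def poisson_deviance(y, mu_hat):
    # convention: a term with y_i = 0 contributes 0 to y*ln(y/mu)
    term = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu_hat), 0.0)
    return 2.0 * np.sum(term - (y - mu_hat))

rng = np.random.default_rng(5)
n = 2000

# well-specified Poisson data: deviance/df should be near 1
y = rng.poisson(3.0, size=n)
ratio = poisson_deviance(y, np.full(n, y.mean())) / (n - 1)
print(ratio)                        # close to 1

# overdispersed gamma-Poisson mixture: deviance/df is well above 1
y_od = rng.poisson(rng.gamma(shape=0.5, scale=6.0, size=n))
ratio_od = poisson_deviance(y_od, np.full(n, y_od.mean())) / (n - 1)
print(ratio_od)                     # much greater than 1
```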
Tutorial 3 Complete!
You've learned Poisson regression for count data. Key takeaways:
- Log link ensures positive predictions and gives rate ratio interpretation
- Poisson distribution assumes Mean = Variance
- exp(coefficient) = multiplicative effect on expected count
- Check for overdispersion: residual deviance / df should be ~1
- Our bike data is severely overdispersed - real-world data often is!
What's Next? Tutorial 4 shows how to handle overdispersion using the
Negative Binomial distribution, which adds a parameter to allow variance > mean.