GLM Tutorial: Poisson Regression Implementation

✓ Systematic

✓ Link

✓ Distribution

✓ Fitting

5 Implementation

Putting It All Together

Now we'll implement our Poisson GLM using real code. Both R and Python have excellent GLM support, and the results will be identical (within rounding).

Our model: $\text{Count} \sim \text{Poisson}(\mu)$ with $\ln(\mu) = \beta_0 + \beta_1 \cdot \text{temp} + \beta_2 \cdot \text{hum} + \beta_3 \cdot \text{windspeed} + \beta_4 \cdot \text{workingday} + \beta_5 \cdot \text{weathersit}$

        1. Load and Explore Data
        # Load the UCI Bike Sharing dataset (daily aggregates)
bike <- read.csv("day.csv")

# Quick look at the data
dim(bike)
# [1] 731  16

summary(bike$cnt)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#      22    3152    4548    4504    5956    8714
      
2. Fit the Poisson GLM# Fit Poisson regression with log link (the default for Poisson)
fit <- glm(cnt ~ temp + hum + windspeed + workingday + weathersit,
           family = poisson(link = "log"),
           data = bike)

summary(fit)
          Call:
glm(formula = cnt ~ temp + hum + windspeed + workingday + weathersit,
    family = poisson(link = "log"), data = bike)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  8.239119   0.003845 2142.57   <2e-16 ***
temp         1.397084   0.003179  439.44   <2e-16 ***
hum         -0.356839   0.005271  -67.69   <2e-16 ***
windspeed   -0.967277   0.008000 -120.91   <2e-16 ***
workingday   0.038946   0.001202   32.39   <2e-16 ***
weathersit  -0.129765   0.001380  -94.02   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 668801  on 730  degrees of freedom
Residual deviance: 380005  on 725  degrees of freedom
AIC: 387415

Number of Fisher Scoring iterations: 4
        

        3. Calculate Rate Ratios
        # For Poisson regression, exp(coefficient) gives the rate ratio
exp(coef(fit))
#  (Intercept)         temp          hum    windspeed   workingday   weathersit
# 3.779e+03    4.043e+00    7.001e-01    3.800e-01    1.040e+00    8.783e-01

# With 95% confidence intervals
exp(confint(fit))
      
4. Check for Overdispersion# Key diagnostic: residual deviance should be close to df
# Ratio >> 1 indicates overdispersion
fit$deviance / fit$df.residual
# [1] 524.1
          # Severe overdispersion detected!
# Residual deviance: 380,005
# Degrees of freedom: 725
# Ratio: 524.1 (should be ~1 for Poisson to be appropriate)
        

        1. Load and Explore Data
        import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np

# Load the UCI Bike Sharing dataset
bike = pd.read_csv("day.csv")

# Quick look at the data
print(bike.shape)
# (731, 16)

print(bike['cnt'].describe())
# count     731.000000
# mean     4504.348837
# std      1937.211452
# min        22.000000
# 25%      3152.000000
# 50%      4548.000000
# 75%      5956.000000
# max      8714.000000
      
2. Fit the Poisson GLM# Fit Poisson regression with log link
fit = smf.glm('cnt ~ temp + hum + windspeed + workingday + weathersit',
              data=bike,
              family=sm.families.Poisson()).fit()

print(fit.summary())
                           Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                    cnt   No. Observations:                  731
Model:                            GLM   Df Residuals:                      725
Model Family:                 Poisson   Df Model:                            5
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -1.9370e+05
Date:                Wed, 11 Dec 2025   Deviance:                   3.8000e+05
Time:                        16:58:06   Pearson chi2:                 3.68e+05
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      8.2391      0.004   2142.559      0.000       8.232       8.247
temp           1.3971      0.003    439.442      0.000       1.391       1.403
hum           -0.3568      0.005    -67.693      0.000      -0.367      -0.347
windspeed     -0.9673      0.008   -120.911      0.000      -0.983      -0.952
workingday     0.0389      0.001     32.390      0.000       0.037       0.041
weathersit    -0.1298      0.001    -94.024      0.000      -0.132      -0.127
==============================================================================
        

        3. Calculate Rate Ratios
        # For Poisson regression, exp(coefficient) gives the rate ratio
print("Rate Ratios:")
print(np.exp(fit.params))
# Intercept     3779.086
# temp             4.043
# hum              0.700
# windspeed        0.380
# workingday       1.040
# weathersit       0.878

# With 95% confidence intervals
print("\nRate Ratios with 95% CI:")
print(np.exp(fit.conf_int()))
      
4. Check for Overdispersion# Key diagnostic: deviance / df_residual should be ~1
overdispersion_ratio = fit.deviance / fit.df_resid
print(f"Overdispersion ratio: {overdispersion_ratio:.1f}")
# Overdispersion ratio: 524.1
          # Severe overdispersion detected!
# Deviance: 380,005
# Degrees of freedom: 725
# Ratio: 524.1 (should be ~1 for Poisson to be appropriate)
        

Interpreting the Results

The coefficients are on the log scale. To interpret them, we exponentiate to get rate ratios - the multiplicative effect on expected count.

Predictor	Coefficient ($\beta$)	Rate Ratio ($e^\beta$)	Interpretation
temp	+1.397	4.04	Full temp range (0 to 1) = 4x more rentals
hum	-0.357	0.70	Full humidity range = 30% fewer rentals
windspeed	-0.967	0.38	Full wind range = 62% fewer rentals
workingday	+0.039	1.04	Working days = 4% more rentals
weathersit	-0.130	0.88	Each worse weather category = 12% fewer rentals

Rate Ratio Interpretation

Temperature effect: $e^{1.397} = 4.04$ means that going from the coldest to warmest temperature (normalized 0-1) multiplies expected rentals by 4. Temperature is the strongest predictor.

Note: These predictors are normalized (0-1 scale), so the coefficients represent the effect across their full range, not a "one unit" change in the original scale.

Overdispersion: A Key Limitation

The model diagnostics reveal severe overdispersion. The Poisson model assumes Mean = Variance, but our data has variance much greater than the mean.

Residual Deviance / df = 380,005 / 725 = 524.1
(Should be close to 1 for Poisson to be appropriate)

What does this mean?

The standard errors are underestimated (too small)
P-values are too optimistic (appear more significant than they are)
Confidence intervals are too narrow
The model structure may be correct, but statistical inference is unreliable

The solution? Use the Negative Binomial distribution (Tutorial 4), which has an extra parameter to account for overdispersion.

Where do the Std. Error columns in these outputs come from? From the curvature of the log-likelihood at its peak — see Standard Errors from Curvature.

← Back to Fitting Continue to Advanced Topics →