← All Tutorials

Tutorial 3: Implementation in R and Python

Fit the Poisson regression model to predict bike rental counts

Systematic
Link
Distribution
Fitting
5 Implementation

Putting It All Together

Now we'll implement our Poisson GLM using real code. Both R and Python have excellent GLM support, and the results will be identical (within rounding).

Our model: $\text{Count} \sim \text{Poisson}(\mu)$ with $\ln(\mu) = \beta_0 + \beta_1 \cdot \text{temp} + \beta_2 \cdot \text{hum} + \beta_3 \cdot \text{windspeed} + \beta_4 \cdot \text{workingday} + \beta_5 \cdot \text{weathersit}$

1. Load and Explore Data

# Load the UCI Bike Sharing dataset (daily aggregates)
bike <- read.csv("day.csv")

# Quick look at the data
dim(bike)
# [1] 731  16

summary(bike$cnt)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#      22    3152    4548    4504    5956    8714

2. Fit the Poisson GLM

# Fit Poisson regression with log link (the default for Poisson)
fit <- glm(cnt ~ temp + hum + windspeed + workingday + weathersit,
           family = poisson(link = "log"),
           data = bike)

summary(fit)
Call:
glm(formula = cnt ~ temp + hum + windspeed + workingday + weathersit,
    family = poisson(link = "log"), data = bike)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  8.239119   0.003845 2142.57   <2e-16 ***
temp         1.397084   0.003179  439.44   <2e-16 ***
hum         -0.356839   0.005271  -67.69   <2e-16 ***
windspeed   -0.967277   0.008000 -120.91   <2e-16 ***
workingday   0.038946   0.001202   32.39   <2e-16 ***
weathersit  -0.129765   0.001380  -94.02   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 668801  on 730  degrees of freedom
Residual deviance: 380005  on 725  degrees of freedom
AIC: 387415

Number of Fisher Scoring iterations: 4

3. Calculate Rate Ratios

# For Poisson regression, exp(coefficient) gives the rate ratio
exp(coef(fit))
#  (Intercept)         temp          hum    windspeed   workingday   weathersit
# 3.779e+03    4.043e+00    7.001e-01    3.800e-01    1.040e+00    8.783e-01

# With 95% confidence intervals
exp(confint(fit))

4. Check for Overdispersion

# Key diagnostic: residual deviance should be close to df
# Ratio >> 1 indicates overdispersion
fit$deviance / fit$df.residual
# [1] 524.1
# Severe overdispersion detected!
# Residual deviance: 380,005
# Degrees of freedom: 725
# Ratio: 524.1 (should be ~1 for Poisson to be appropriate)

1. Load and Explore Data

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np

# Load the UCI Bike Sharing dataset
bike = pd.read_csv("day.csv")

# Quick look at the data
print(bike.shape)
# (731, 16)

print(bike['cnt'].describe())
# count     731.000000
# mean     4504.348837
# std      1937.211452
# min        22.000000
# 25%      3152.000000
# 50%      4548.000000
# 75%      5956.000000
# max      8714.000000

2. Fit the Poisson GLM

# Fit Poisson regression with log link
fit = smf.glm('cnt ~ temp + hum + windspeed + workingday + weathersit',
              data=bike,
              family=sm.families.Poisson()).fit()

print(fit.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                    cnt   No. Observations:                  731
Model:                            GLM   Df Residuals:                      725
Model Family:                 Poisson   Df Model:                            5
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -1.9370e+05
Date:                Wed, 11 Dec 2025   Deviance:                   3.8000e+05
Time:                        16:58:06   Pearson chi2:                 3.68e+05
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      8.2391      0.004   2142.559      0.000       8.232       8.247
temp           1.3971      0.003    439.442      0.000       1.391       1.403
hum           -0.3568      0.005    -67.693      0.000      -0.367      -0.347
windspeed     -0.9673      0.008   -120.911      0.000      -0.983      -0.952
workingday     0.0389      0.001     32.390      0.000       0.037       0.041
weathersit    -0.1298      0.001    -94.024      0.000      -0.132      -0.127
==============================================================================

3. Calculate Rate Ratios

# For Poisson regression, exp(coefficient) gives the rate ratio
print("Rate Ratios:")
print(np.exp(fit.params))
# Intercept     3779.086
# temp             4.043
# hum              0.700
# windspeed        0.380
# workingday       1.040
# weathersit       0.878

# With 95% confidence intervals
print("\nRate Ratios with 95% CI:")
print(np.exp(fit.conf_int()))

4. Check for Overdispersion

# Key diagnostic: deviance / df_residual should be ~1
overdispersion_ratio = fit.deviance / fit.df_resid
print(f"Overdispersion ratio: {overdispersion_ratio:.1f}")
# Overdispersion ratio: 524.1
# Severe overdispersion detected!
# Deviance: 380,005
# Degrees of freedom: 725
# Ratio: 524.1 (should be ~1 for Poisson to be appropriate)

Interpreting the Results

The coefficients are on the log scale. To interpret them, we exponentiate to get rate ratios - the multiplicative effect on expected count.

Predictor Coefficient ($\beta$) Rate Ratio ($e^\beta$) Interpretation
temp +1.397 4.04 Full temp range (0 to 1) = 4x more rentals
hum -0.357 0.70 Full humidity range = 30% fewer rentals
windspeed -0.967 0.38 Full wind range = 62% fewer rentals
workingday +0.039 1.04 Working days = 4% more rentals
weathersit -0.130 0.88 Each worse weather category = 12% fewer rentals

Rate Ratio Interpretation

Temperature effect: $e^{1.397} = 4.04$ means that going from the coldest to warmest temperature (normalized 0-1) multiplies expected rentals by 4. Temperature is the strongest predictor.

Note: These predictors are normalized (0-1 scale), so the coefficients represent the effect across their full range, not a "one unit" change in the original scale.

Overdispersion: A Key Limitation

The model diagnostics reveal severe overdispersion. The Poisson model assumes Mean = Variance, but our data has variance much greater than the mean.

Residual Deviance / df = 380,005 / 725 = 524.1
(Should be close to 1 for Poisson to be appropriate)

What does this mean?

The solution? Use the Negative Binomial distribution (Tutorial 4), which has an extra parameter to account for overdispersion.