Tutorial 3: The Stochastic Component

Choose the distribution for count data

Your model so far

$\text{RentalCount} \sim f(?)$ with $\ln(\mu) = \beta_0 + \beta_1 \cdot \text{Temp} + \ldots$

The distribution f() describes how the observed counts vary around the expected count. This is the stochastic (random) component of the model.

Count Data Has Special Properties

Our response is a count: 0, 1, 2, 3, ... (non-negative integers). Unlike continuous data, counts are discrete and can't be negative. The distribution we choose should respect these constraints.

Choose the Distribution Family

For modelling bike rental counts - non-negative integers (0, 1, 2, ..., thousands) - which distribution family describes this type of data?

Poisson

$y \sim \text{Poisson}(\mu)$

For count data (0, 1, 2, ...). The classic distribution for modelling event counts.

Use when: Counting events, non-negative integers

Variance: equal to the mean ($\mu$)

Negative Binomial

$y \sim \text{NegBin}(\mu, \theta)$

For overdispersed counts. Allows variance to exceed the mean.

Use when: Counts with extra variability

Variance: $\mu + \mu^2/\theta$

Gaussian (Normal)

$y \sim N(\mu, \sigma^2)$

The classic bell curve. Observations scatter symmetrically around the mean.

Use when: Continuous data, symmetric residuals

Variance: constant ($\sigma^2$)

Binomial

$y \sim \text{Binomial}(n, p)$

For counting successes in n trials. Has an upper bound.

Use when: Binary outcomes, bounded counts

Variance: $np(1-p)$

Model Complete: Poisson Regression!

You've specified all three components of your GLM:

$\text{RentalCount} \sim \text{Poisson}(\mu)$

where $\ln(\mu) = \beta_0 + \beta_1 \cdot \text{Temp} + \beta_2 \cdot \text{Hum} + \beta_3 \cdot \text{Wind} + \beta_4 \cdot \text{Work} + \beta_5 \cdot \text{Weather}$

Systematic Component: 6 terms (incl. intercept)
Link Function: Log
Distribution: Poisson

This combination - Poisson distribution + Log link - is Poisson regression, the standard approach for count data. The log is the "canonical" link for the Poisson family.
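
As a rough illustration of how the log link turns the linear predictor into an expected count, the sketch below evaluates $\mu = \exp(\beta_0 + \beta_1 \cdot \text{Temp} + \ldots)$ for one day. The coefficient values and the predictor values are hypothetical, chosen only to make the arithmetic visible; they are not fitted estimates from the bike sharing data.

```python
import math

# Hypothetical coefficients, for illustration only (NOT fitted values).
beta = {
    "Intercept": 7.0,
    "Temp": 0.04,
    "Hum": -0.005,
    "Wind": -0.01,
    "Work": 0.05,
    "Weather": -0.2,
}


def expected_count(day, coefs):
    """Return mu = exp(linear predictor) for one day of predictor values."""
    eta = coefs["Intercept"] + sum(coefs[name] * value for name, value in day.items())
    return math.exp(eta)  # inverse of the log link; always positive


# Hypothetical predictor values for a single day.
day = {"Temp": 20.0, "Hum": 60.0, "Wind": 10.0, "Work": 1, "Weather": 1}
mu = expected_count(day, beta)

# Raising Temp by one unit multiplies mu by exp(beta["Temp"]).
warmer = dict(day, Temp=day["Temp"] + 1)
ratio = expected_count(warmer, beta) / mu

print(f"expected count mu = {mu:.1f}")
print(f"multiplicative effect of +1 Temp = {ratio:.4f}")
```

Because the exponential is always positive, the expected count can never be negative, whatever values the predictors take. On the log scale the effects are additive, so on the count scale they are multiplicative.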

Heads up: Overdispersion

A key Poisson assumption is Mean = Variance. In practice, many real datasets show overdispersion - variance greater than the mean. We'll check for this after fitting; if it is present, Tutorial 4 shows how to handle it with the Negative Binomial distribution.

✔ Correct!

The Poisson distribution is the natural choice for count data.

$P(Y = k) = \frac{\mu^k e^{-\mu}}{k!}$ for $k = 0, 1, 2, \ldots$

Where $\mu$ is the expected count (rate)

Key properties:

  • Support: non-negative integers (0, 1, 2, ...)
  • Mean = Variance = $\mu$

Poisson + Log = Poisson Regression - the classic method for count data.
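
The probability mass function above can be checked numerically. The following sketch, which uses only the Python standard library, evaluates $P(Y = k)$ for a hypothetical expected count of $\mu = 4.5$ and confirms that the probabilities sum to 1 and that the mean and variance both equal $\mu$. The helper name `poisson_pmf` and the value of `mu` are assumptions made for the example.

```python
import math


def poisson_pmf(k, mu):
    """P(Y = k) for Y ~ Poisson(mu), computed on the log scale for stability."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


mu = 4.5            # hypothetical expected count
ks = range(0, 100)  # the tail beyond k = 100 is negligible for mu = 4.5
probs = [poisson_pmf(k, mu) for k in ks]

total = sum(probs)
mean = sum(k * p for k, p in zip(ks, probs))
variance = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))

print(f"sum of probabilities = {total:.6f}")
print(f"mean = {mean:.6f}, variance = {variance:.6f}")
```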

⚠ Excellent Thinking! (But Save It For Tutorial 4)

The Negative Binomial is actually a very good choice for real-world count data, because it handles overdispersion - when variance exceeds the mean.

Negative Binomial variance: $\text{Var}(Y) = \mu + \frac{\mu^2}{\theta}$

The extra parameter $\theta$ allows variance to be larger than the mean.
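
To see how $\theta$ controls the extra variability, this small sketch compares the Poisson variance with the Negative Binomial variance formula above at a hypothetical mean of $\mu = 100$. The function names and the chosen values of $\theta$ are illustrative assumptions.

```python
def poisson_variance(mu):
    """Poisson: variance equals the mean."""
    return mu


def negbin_variance(mu, theta):
    """Negative Binomial: Var(Y) = mu + mu^2 / theta."""
    return mu + mu ** 2 / theta


mu = 100.0  # hypothetical expected count
variances = {theta: negbin_variance(mu, theta) for theta in (1.0, 10.0, 1000.0)}

print(f"Poisson variance: {poisson_variance(mu)}")
for theta, var in variances.items():
    print(f"NegBin variance with theta = {theta}: {var}")
```

Small values of $\theta$ give a variance far above the mean, while large values bring the variance back towards the Poisson value of $\mu$.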

Preview: Tutorial 4

We'll fit the Poisson model first to learn the method, then check for overdispersion. Spoiler: the bike sharing data is overdispersed! Tutorial 4 will show how to upgrade to Negative Binomial.

The diagnostic: if the residual deviance is much larger than the residual degrees of freedom (a ratio well above 1), overdispersion is likely present.
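
A minimal sketch of this diagnostic follows. To stay within the standard library it makes a strong simplifying assumption: the model is intercept-only, so the fitted mean is just the sample mean, rather than the full regression from this tutorial. The Poisson deviance is computed as $D = 2 \sum \left[ y \ln(y/\hat{\mu}) - (y - \hat{\mu}) \right]$, and both sets of counts are made-up values, not the bike sharing data.

```python
import math


def poisson_dispersion_ratio(counts):
    """Residual deviance / residual df for an intercept-only Poisson model."""
    n = len(counts)
    mu_hat = sum(counts) / n  # MLE of mu when the model has only an intercept
    deviance = 0.0
    for y in counts:
        # y * ln(y / mu) is taken as 0 when y == 0 (its limiting value)
        log_term = y * math.log(y / mu_hat) if y > 0 else 0.0
        deviance += 2.0 * (log_term - (y - mu_hat))
    residual_df = n - 1  # one parameter (the intercept) was estimated
    return deviance / residual_df


# Hypothetical counts, for illustration only.
low_spread = [3, 5, 4, 6, 2, 5, 4, 3]
high_spread = [22, 431, 1600, 4500, 8714, 3000, 2100, 5200]

ratio_low = poisson_dispersion_ratio(low_spread)
ratio_high = poisson_dispersion_ratio(high_spread)

print(f"low-spread ratio:  {ratio_low:.2f}")
print(f"high-spread ratio: {ratio_high:.2f}")
```

A ratio far above 1, as for the second set of counts, is the signal that the Poisson variance assumption is too restrictive.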

For now, let's use the Poisson to understand the foundational approach.

❌ Not Appropriate for Count Data

The Gaussian distribution is designed for continuous data that can take any real value, including negative numbers.

Our outcome is a count (0, 1, 2, ...). The Gaussian has several problems:

  • Allows negative values: Can't have -50 bike rentals
  • Continuous: Counts are discrete integers, not 4.7 rentals
  • Constant variance: For counts, variance often increases with the mean

Interestingly, when the expected count is very large, the Poisson distribution is well approximated by a Gaussian with mean and variance both equal to $\mu$. But for proper count modelling, use a discrete distribution.
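
Both points can be illustrated with a short standard-library sketch. For a small and a large hypothetical mean, it matches a Gaussian to the Poisson mean and variance, then reports the probability the Gaussian places on negative values and how closely its density at $\mu$ matches the Poisson probability at $k = \mu$. The means 3 and 1000 are arbitrary choices for the example.

```python
import math
from statistics import NormalDist


def poisson_pmf(k, mu):
    """P(Y = k) for Y ~ Poisson(mu), computed on the log scale for stability."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


results = {}
for mu in (3, 1000):  # hypothetical small and large expected counts
    gauss = NormalDist(mu, math.sqrt(mu))  # same mean and variance as Poisson(mu)
    p_negative = gauss.cdf(0)              # mass the Gaussian puts below zero
    rel_diff = abs(gauss.pdf(mu) - poisson_pmf(mu, mu)) / poisson_pmf(mu, mu)
    results[mu] = (p_negative, rel_diff)
    print(f"mu = {mu}: P(Y < 0) = {p_negative:.4g}, relative difference at mu = {rel_diff:.4g}")
```

With a small mean the Gaussian assigns noticeable probability to impossible negative counts; with a large mean that probability is effectively zero and the two distributions nearly coincide at the centre.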

❌ Not Appropriate Here

The Binomial distribution is for counting successes in a fixed number of trials - it has an upper bound.

Our rental counts can be any non-negative integer with no fixed maximum:

  • Binomial: Count out of n (like 8 heads in 10 coin flips)
  • Our data: Open-ended counts (22 to 8714 rentals per day)
  • There's no "maximum possible rentals" - it's unbounded

Binomial was the right choice for heart disease (yes/no) in Tutorial 2, where each patient is one "trial". But bike rental counts have no fixed upper limit.