Tutorial 3: The Stochastic Component

Choose the distribution for count data

Your model so far

RentalCount ~ f(?) with $\ln(\mu) = \beta_0 + \beta_1 \cdot \text{Temp} + \ldots$

The distribution f() describes how the observed counts vary around the expected count. This is the stochastic (random) component of the model.

Count Data Has Special Properties

Our response is a count: 0, 1, 2, 3, ... (non-negative integers). Unlike continuous data, counts are discrete and can't be negative. The distribution we choose should respect these constraints.

Choose the Distribution Family

For modelling bike rental counts - non-negative integers (0, 1, 2, ..., thousands) - which distribution family describes this type of data?

Poisson

$y \sim \text{Poisson}(\mu)$

For count data (0, 1, 2, ...). The classic distribution for modelling event counts.

Use when: Counting events, non-negative integers

Variance = Mean ($\mu$)

Negative Binomial

$y \sim \text{NegBin}(\mu, \theta)$

For overdispersed counts. Allows variance to exceed the mean.

Use when: Counts with extra variability

Variance = $\mu + \mu^2/\theta$

Gaussian (Normal)

$y \sim N(\mu, \sigma^2)$

The classic bell curve. Observations scatter symmetrically around the mean.

Use when: Continuous data, symmetric residuals

Variance: constant ($\sigma^2$)

Binomial

$y \sim \text{Binomial}(n, p)$

For counting successes in n trials. Has an upper bound.

Use when: Binary outcomes, bounded counts

Variance: $np(1-p)$

Model Complete: Poisson Regression!

You've specified all three components of your GLM:

$\text{RentalCount} \sim \text{Poisson}(\mu)$

where $\ln(\mu) = \beta_0 + \beta_1 \cdot \text{Temp} + \beta_2 \cdot \text{Hum} + \beta_3 \cdot \text{Wind} + \beta_4 \cdot \text{Work} + \beta_5 \cdot \text{Weather}$

Systematic Component

6 terms (incl. intercept)

Link Function

Log

Distribution

Poisson

This combination - Poisson distribution + Log link - is Poisson regression, the standard approach for count data. The log is the "canonical" link for the Poisson family.

Heads up: Overdispersion

A key Poisson assumption is Mean = Variance. In practice, many real datasets show overdispersion - variance greater than the mean. We'll check for this after fitting, and if present, Tutorial 4 shows how to handle it with the Negative Binomial distribution.

✔ Correct!

The Poisson distribution is the natural choice for count data.

$P(Y = k) = \frac{\mu^k e^{-\mu}}{k!}$ for $k = 0, 1, 2, \ldots$

where $\mu$ is the expected count (rate).
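As a quick numerical check of the formula above, the sketch below evaluates the Poisson probabilities with Python's standard library and confirms that they sum to 1 and that the mean and variance both equal $\mu$. The value $\mu = 4$ and the truncation at $k = 59$ are illustrative assumptions, not values from the bike data.

```python
import math

def poisson_pmf(k: int, mu: float) -> float:
    """P(Y = k) for Y ~ Poisson(mu), computed on the log scale for stability."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

mu = 4.0        # illustrative expected count (assumption)
ks = range(60)  # k = 0..59; the probability mass beyond this is negligible for mu = 4
probs = [poisson_pmf(k, mu) for k in ks]

total = sum(probs)
mean = sum(k * p for k, p in zip(ks, probs))
var = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))

print(f"total={total:.6f}  mean={mean:.6f}  variance={var:.6f}")
```

Working on the log scale with `math.lgamma` avoids overflow in $k!$ when the same function is reused for large counts.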

Key properties:

Non-negative integers: Only 0, 1, 2, ... are possible
Mean = Variance = $\mu$: This equality (equidispersion) is the hallmark Poisson property
Canonical link: The log link is the natural pairing in GLM theory
Rate interpretation: $e^\beta$ gives multiplicative effects on the rate
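The rate interpretation can be made concrete with a small sketch. The coefficients below are hypothetical values chosen only for illustration, not estimates from the bike data. The point is that a one-unit increase in a predictor multiplies the expected count by $e^\beta$, whatever the starting value.

```python
import math

# Hypothetical coefficients for illustration only; they are not fitted values.
BETA_0 = 7.0      # intercept on the log scale
BETA_TEMP = 0.04  # change in ln(mu) per one-unit increase in Temp

def expected_count(temp: float) -> float:
    """Expected rentals under ln(mu) = BETA_0 + BETA_TEMP * temp."""
    return math.exp(BETA_0 + BETA_TEMP * temp)

# Raising Temp by one unit multiplies the expected count by exp(BETA_TEMP),
# regardless of the starting temperature.
ratio_low = expected_count(11.0) / expected_count(10.0)
ratio_high = expected_count(31.0) / expected_count(30.0)
rate_ratio = math.exp(BETA_TEMP)

print(f"ratio at 10 -> 11: {ratio_low:.4f}")
print(f"ratio at 30 -> 31: {ratio_high:.4f}")
print(f"exp(beta):         {rate_ratio:.4f}")
```

With this assumed coefficient, $e^{0.04} \approx 1.041$, so each extra unit of Temp raises the expected count by roughly 4.1%.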

Poisson + Log = Poisson Regression - the classic method for count data.

⚠ Excellent Thinking! (But Save It For Tutorial 4)

The Negative Binomial is actually a very good choice for real-world count data, because it handles overdispersion - when variance exceeds the mean.

Negative Binomial variance: $\text{Var}(Y) = \mu + \frac{\mu^2}{\theta}$

The extra parameter $\theta$ allows variance to be larger than the mean.
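To see how much extra variance the $\mu^2/\theta$ term adds, the sketch below tabulates both variance functions for a few means. The value $\theta = 5$ is a hypothetical choice for illustration, not an estimate from the bike data.

```python
def poisson_variance(mu: float) -> float:
    """Poisson variance function: Var(Y) = mu."""
    return mu

def negbin_variance(mu: float, theta: float) -> float:
    """Negative Binomial variance function: Var(Y) = mu + mu^2 / theta."""
    return mu + mu ** 2 / theta

THETA = 5.0  # hypothetical dispersion parameter (assumption)

for mu in (10.0, 100.0, 1000.0):
    print(f"mu={mu:>7.1f}  Poisson var={poisson_variance(mu):>9.1f}  "
          f"NegBin var={negbin_variance(mu, THETA):>10.1f}")
```

The gap widens quickly as $\mu$ grows, and as $\theta$ becomes very large the extra term vanishes, so the Negative Binomial variance approaches the Poisson variance.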

Preview: Tutorial 4

We'll fit the Poisson model first to learn the method, then check for overdispersion. Spoiler: the bike sharing data is overdispersed! Tutorial 4 will show how to upgrade to Negative Binomial.

The diagnostic: if the residual deviance is much larger than the residual degrees of freedom (a ratio well above 1), overdispersion is likely present.
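A minimal sketch of this check, assuming an intercept-only Poisson model so that it runs with the standard library alone. The counts are made up for illustration and are not the bike sharing data. In a real analysis the fitted means would come from the full regression model.

```python
import math

def poisson_deviance(y, mu):
    """Poisson deviance: 2 * sum(y * ln(y / mu) - (y - mu)), taking 0 * ln(0) = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total

# Made-up daily counts for illustration; these are NOT the bike sharing data.
counts = [22, 431, 1204, 3500, 4800, 6100, 7300, 8714]

# Intercept-only Poisson model: the fitted mean is the sample mean.
fitted = [sum(counts) / len(counts)] * len(counts)

residual_df = len(counts) - 1  # n observations minus 1 estimated parameter
deviance = poisson_deviance(counts, fitted)
dispersion_ratio = deviance / residual_df

print(f"deviance={deviance:.1f}  df={residual_df}  ratio={dispersion_ratio:.1f}")
```

A ratio near 1 is consistent with the Poisson assumption. A ratio far above 1, as these made-up counts produce, points to overdispersion.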

For now, let's use the Poisson to understand the foundational approach.

❌ Not Appropriate for Count Data

The Gaussian distribution is designed for continuous data that can take any real value, including negative numbers.

Our outcome is a count (0, 1, 2, ...). The Gaussian has several problems:

Allows negative values: Can't have -50 bike rentals
Continuous: Counts are discrete integers, not 4.7 rentals
Constant variance: For counts, variance often increases with the mean

      

Interestingly, when the expected count is very large, the Poisson distribution is well approximated by a Gaussian with mean and variance both equal to $\mu$. But for proper count modelling, use a discrete distribution.
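The large-count behaviour can be checked numerically. The sketch below sums the absolute difference between the Poisson probabilities and a Gaussian density with matching mean and variance. The two values of $\mu$ and the summation cutoff are illustrative assumptions.

```python
import math

def poisson_pmf(k: int, mu: float) -> float:
    """P(Y = k) for Y ~ Poisson(mu), computed on the log scale."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def normal_pdf(x: float, mean: float, var: float) -> float:
    """Density of a Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def approximation_gap(mu: float) -> float:
    """Sum over k of |Poisson(mu) pmf - Normal(mu, mu) density at k|."""
    upper = int(mu + 12.0 * math.sqrt(mu)) + 1  # cutoff far into the upper tail
    return sum(abs(poisson_pmf(k, mu) - normal_pdf(k, mu, mu))
               for k in range(upper + 1))

for mu in (4.0, 400.0):
    print(f"mu={mu:>6.1f}  gap={approximation_gap(mu):.4f}")
```

The gap shrinks as $\mu$ grows, which is why a Gaussian can look acceptable for very large counts yet fit poorly when counts are small.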

❌ Not Appropriate Here

The Binomial distribution is for counting successes in a fixed number of trials - it has an upper bound.

Our rental counts can be any non-negative integer with no fixed maximum:

Binomial: Count out of n (like 8 heads in 10 coin flips)
Our data: Open-ended counts (22 to 8714 rentals per day)
There's no "maximum possible rentals" - it's unbounded
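The difference in support can be shown directly. In the sketch below the Binomial gives zero probability to any count above n, while the Poisson gives a small but positive probability to every non-negative integer. The Poisson mean of 5 is an illustrative assumption, chosen to match the Binomial mean $np = 5$ of the coin-flip example.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(Y = k) for Y ~ Binomial(n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def poisson_pmf(k: int, mu: float) -> float:
    """P(Y = k) for Y ~ Poisson(mu), computed on the log scale."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

print(f"Binomial: P(8 heads in 10 flips)  = {binomial_pmf(8, 10, 0.5):.6f}")
print(f"Binomial: P(11 heads in 10 flips) = {binomial_pmf(11, 10, 0.5):.6f}")
print(f"Poisson(5): P(Y = 11)             = {poisson_pmf(11, 5.0):.6f}")
```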

      

Binomial was the right choice for heart disease (yes/no) in Tutorial 2, where each patient is one "trial". But bike rental counts have no fixed upper limit.