Tutorial: The Stochastic Component

Choose how observations vary around their expected values

Your model so far

MaxHeartRate ~ f(?) with E[MaxHeartRate] = β₀ + β₁·Age + β₂·ExerciseAngina + β₃·STDepression

The distribution f() describes how individual observations scatter around the expected value. This is the stochastic (random) component of the model.
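
To make the split between the two components concrete, here is a minimal simulation sketch. The coefficient values and the residual standard deviation are hypothetical, chosen only for illustration, and the Gaussian draw is just one possible choice of f().

```python
import random

# Hypothetical coefficients, chosen only for illustration (not fitted values).
BETA = {"intercept": 200.0, "age": -1.0, "ex_angina": -10.0, "st_depression": -5.0}
SIGMA = 15.0  # assumed residual standard deviation in bpm


def expected_max_hr(age, ex_angina, st_depression):
    """Systematic component: the expected value E[y] for one patient."""
    return (BETA["intercept"]
            + BETA["age"] * age
            + BETA["ex_angina"] * ex_angina
            + BETA["st_depression"] * st_depression)


def simulate_max_hr(age, ex_angina, st_depression, rng):
    """Stochastic component: one Gaussian draw scattered around E[y]."""
    mu = expected_max_hr(age, ex_angina, st_depression)
    return rng.gauss(mu, SIGMA)


rng = random.Random(0)
mu = expected_max_hr(age=55, ex_angina=0, st_depression=1.0)
draws = [simulate_max_hr(55, 0, 1.0, rng) for _ in range(5000)]
sample_mean = sum(draws) / len(draws)
print(f"E[y] = {mu:.1f} bpm, mean of simulated draws = {sample_mean:.1f} bpm")
```

Individual draws differ from patient to patient, but their average settles near the expected value given by the systematic component.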

Choose the Distribution Family

For modelling maximum heart rate - a continuous measure typically ranging from 60 to 200 bpm - which distribution family best describes how observations vary?

Gaussian (Normal)

y ~ N(μ, σ²)

The classic bell curve. Observations scatter symmetrically around the mean.

Use when: Continuous data, symmetric residuals, constant variance

Variance: constant (σ²)

Gamma

y ~ Gamma(α, β)

For positive continuous data that's often right-skewed. Variance increases with mean.

Use when: Strictly positive, right-skewed, variance grows with mean

Variance: proportional to μ²

Poisson

y ~ Poisson(λ)

For count data (0, 1, 2, ...). Mean equals variance.

Use when: Counting events, integers only, mean ≈ variance

Variance: equals μ

Inverse Gaussian

y ~ IG(μ, λ)

For highly right-skewed positive continuous data. Variance increases rapidly with mean.

Use when: Positive, highly skewed, variance grows as μ³

Variance: proportional to μ³
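
The variance lines on the four cards can be summarised as variance functions V(μ), with Var(y) = φ·V(μ) for a dispersion parameter φ. The sketch below tabulates them; the dictionary and helper names are illustrative and not taken from any library.

```python
# Variance functions V(mu) for the four families described above.
# Var(y) = phi * V(mu), where phi is a dispersion parameter
# (phi = sigma^2 for the Gaussian, phi = 1 for the Poisson).
VARIANCE_FUNCTIONS = {
    "Gaussian": lambda mu: 1.0,
    "Poisson": lambda mu: mu,
    "Gamma": lambda mu: mu ** 2,
    "Inverse Gaussian": lambda mu: mu ** 3,
}


def variance_growth(family, mu_low, mu_high):
    """Factor by which the variance changes when the mean moves from mu_low to mu_high."""
    v = VARIANCE_FUNCTIONS[family]
    return v(mu_high) / v(mu_low)


for family in VARIANCE_FUNCTIONS:
    ratio = variance_growth(family, 100.0, 200.0)
    print(f"{family:17s} variance changes by a factor of {ratio:g} when the mean doubles")
```

Doubling the mean leaves the Gaussian variance unchanged, while the other three families scale it by 2, 4 and 8 respectively.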

Model Complete!

You've specified all three components of your GLM:

MaxHeartRate ~ Normal(μ, σ²)

where E[MaxHeartRate] = μ = β₀ + β₁·Age + β₂·ExerciseAngina + β₃·STDepression

Systematic Component

4 terms (incl. intercept)

Link Function

Identity

Distribution

Gaussian

With a Gaussian distribution and an identity link, maximum-likelihood fitting gives the same coefficient estimates as ordinary least squares (OLS) linear regression - the foundation of statistical modelling. The GLM framework shows how this familiar model is just one special case of a much broader family.
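
The link between the Gaussian GLM and OLS can be checked numerically. The sketch below uses a single predictor (Age) and a small set of made-up data points to keep the arithmetic visible; it fits the line by least squares and then confirms that moving the coefficients away from that solution lowers the Gaussian log-likelihood.

```python
import math

# Hypothetical toy data: (age in years, max heart rate in bpm).
ages = [35, 42, 50, 58, 63, 70]
max_hr = [182, 175, 160, 158, 149, 138]


def ols_fit(x, y):
    """Closed-form least-squares intercept and slope for one predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def gaussian_loglik(b0, b1, sigma, x, y):
    """Log-likelihood of y_i ~ N(b0 + b1 * x_i, sigma^2)."""
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    n = len(x)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - rss / (2 * sigma ** 2)


b0, b1 = ols_fit(ages, max_hr)
sigma = 5.0  # any fixed sigma > 0 gives the same maximising (b0, b1)
best = gaussian_loglik(b0, b1, sigma, ages, max_hr)

# Nudging either coefficient away from the OLS solution lowers the likelihood.
for db0, db1 in [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert gaussian_loglik(b0 + db0, b1 + db1, sigma, ages, max_hr) < best

print(f"OLS fit: intercept = {b0:.2f}, slope = {b1:.3f}")
```

The log-likelihood depends on the coefficients only through the residual sum of squares, which is why maximising one is the same as minimising the other.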

✔ Good Choice!

The Gaussian (Normal) distribution is a natural first choice for a continuous response like maximum heart rate, whose values tend to scatter roughly symmetrically around the mean.

With the identity link we've chosen, this gives us ordinary linear regression - the most fundamental GLM and the starting point for understanding the framework.

Key assumptions:
Constant variance - spread of residuals doesn't change with predicted values
Normally distributed residuals - errors are symmetric and bell-shaped, equally likely to fall above or below the line
Independence - observations don't influence each other

Caveat: After fitting the model, you should check these assumptions using residual plots. If residuals show patterns (heteroscedasticity, skewness), you might consider a different distribution family.
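
Residual plots are the usual tool, but the same two patterns can also be summarised numerically. The sketch below is one rough way to do that, using simulated stand-ins for a fitted model's output; the helper names and thresholds are illustrative rather than standard diagnostics.

```python
import random
import statistics


def spread_ratio(fitted, residuals):
    """SD of residuals in the upper half of fitted values over the lower half.

    Values near 1 are consistent with constant variance.
    """
    pairs = sorted(zip(fitted, residuals))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return statistics.stdev(high) / statistics.stdev(low)


def skewness(values):
    """Sample skewness; values near 0 are consistent with symmetry."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Simulated stand-ins for a fitted model's output (hypothetical numbers).
rng = random.Random(1)
fitted = [rng.uniform(110, 180) for _ in range(2000)]
well_behaved = [rng.gauss(0, 15) for _ in fitted]
fanning_out = [rng.gauss(0, 0.3 * (f - 100)) for f in fitted]  # spread grows with fit

ratio_ok = spread_ratio(fitted, well_behaved)
ratio_bad = spread_ratio(fitted, fanning_out)
skew_ok = skewness(well_behaved)
print(f"constant-variance residuals: spread ratio {ratio_ok:.2f}, skewness {skew_ok:.2f}")
print(f"heteroscedastic residuals:   spread ratio {ratio_bad:.2f}")
```

A spread ratio well above 1 corresponds to the funnel shape seen in a residuals-versus-fitted plot, and a skewness far from 0 corresponds to a lopsided residual histogram.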

⚠ An Interesting Thought...

The Gamma distribution is designed for strictly positive continuous data, and heart rate is indeed always positive.

However, consider:
Gamma assumes variance increases with the mean - is heart rate variability really higher for people with higher max heart rates?
Gamma is typically used for right-skewed data (like insurance claims, rainfall amounts)
Heart rate tends to be more symmetrically distributed around its mean
We chose the identity link, but Gamma's canonical link is inverse (1/μ)

The Gamma distribution shines when you have data that's bounded at zero and shows increasing spread at higher values. For heart rate data, the Gaussian is usually more appropriate.

For this tutorial, try the Gaussian distribution. In practice, you could fit both and compare using AIC or residual diagnostics!
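
As a rough illustration of an AIC comparison, the sketch below fits intercept-only Gaussian and Gamma models by maximum likelihood to simulated data. This is a simplification: a real comparison would include the predictors and would normally use a GLM library. The simulated samples, function names and search bounds are all assumptions made for the example.

```python
import math
import random


def gaussian_aic(y):
    """AIC of an intercept-only Gaussian model (2 parameters: mean, variance)."""
    n = len(y)
    mu = sum(y) / n
    var = sum((v - mu) ** 2 for v in y) / n  # maximum-likelihood variance
    loglik = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    return 2 * 2 - 2 * loglik


def gamma_aic(y, iterations=100):
    """AIC of an intercept-only Gamma model (2 parameters: mean, shape).

    The MLE of the mean is the sample mean. The remaining log-likelihood is
    concave in the shape, so a ternary search locates its maximum.
    """
    n = len(y)
    mu = sum(y) / n
    sum_log = sum(math.log(v) for v in y)

    def loglik(shape):
        # Density: (shape/mu)^shape * y^(shape-1) * exp(-shape*y/mu) / Gamma(shape)
        return (n * (shape * math.log(shape / mu) - math.lgamma(shape) - shape)
                + (shape - 1) * sum_log)

    lo, hi = 1e-3, 1e4  # assumed search bounds for the shape parameter
    for _ in range(iterations):
        a, b = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if loglik(a) < loglik(b):
            lo = a
        else:
            hi = b
    return 2 * 2 - 2 * loglik((lo + hi) / 2)


rng = random.Random(2)
symmetric = [rng.gauss(150, 25) for _ in range(5000)]         # heart-rate-like
skewed = [rng.gammavariate(2.0, 75.0) for _ in range(5000)]   # right-skewed

for name, sample in [("symmetric", symmetric), ("right-skewed", skewed)]:
    print(f"{name:12s} Gaussian AIC = {gaussian_aic(sample):.1f}, "
          f"Gamma AIC = {gamma_aic(sample):.1f}")
```

Lower AIC indicates the better-supported model, so the symmetric sample should favour the Gaussian and the right-skewed sample should favour the Gamma.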

❌ Not Appropriate Here

The Poisson distribution is specifically for count data - non-negative integers like 0, 1, 2, 3, ...

Maximum heart rate is a continuous measurement that can take fractional values (e.g., 142.5 bpm), whereas a count must be a whole number: you can't observe "2.7 events".

Poisson is the right choice for:
Number of hospital admissions per day
Count of defects in manufacturing
Number of customer complaints per week
Species counts in ecological surveys

A key property of Poisson is that mean = variance. This is rarely true for continuous measurements like heart rate.
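
One quick way to see the mismatch is the variance-to-mean ratio, which is close to 1 for Poisson counts. The readings below are made up for illustration; the sketch also shows that for a measurement the ratio changes with the unit, so "mean = variance" is not a meaningful target.

```python
import statistics


def dispersion_index(values):
    """Sample variance divided by sample mean; about 1 for Poisson data."""
    return statistics.variance(values) / statistics.fmean(values)


# Hypothetical max heart rate readings in bpm (illustrative, not real data).
max_hr = [142.5, 168.0, 155.2, 131.8, 174.3, 160.1, 149.7, 120.4, 181.6, 138.9]

index_bpm = dispersion_index(max_hr)
has_fractions = any(v != int(v) for v in max_hr)

# The same readings expressed in beats per second give a different ratio.
index_bps = dispersion_index([v / 60 for v in max_hr])

print(f"variance / mean in bpm: {index_bpm:.2f}")
print(f"variance / mean in beats per second: {index_bps:.4f}")
print(f"contains non-integer values: {has_fractions}")
```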

❌ Not Appropriate Here

The Inverse Gaussian distribution is a specialised choice for positive continuous data with a very specific variance structure.

It's typically used when:

Data is strictly positive and highly right-skewed
Variance increases rapidly with the mean (as μ³)
Often arises in reliability analysis and time-to-event modelling

Heart rate data doesn't typically show this extreme variance pattern. The distribution of max heart rates across patients is usually much more symmetric than the Inverse Gaussian would imply.

The Inverse Gaussian is useful for specialised applications but isn't a common first choice for biological measurements like heart rate.