
Tutorial 2: The Stochastic Component

Choose the distribution for binary outcomes

Your model so far

HeartDisease ~ f(?) with $\text{logit}(p) = \beta_0 + \beta_1 \cdot \text{Age} + \ldots$

The distribution f() describes the probability of observing each outcome (0 or 1). This is the stochastic (random) component of the model.

Binary Outcomes Are Special

Our response is either 0 (no disease) or 1 (disease). This isn't continuous data that varies around a mean - it's a binary outcome where we model the probability of "success" (disease = 1).

Choose the Distribution Family

For modelling heart disease presence - a binary outcome (0 or 1) - which distribution family describes this type of data?

Binomial

$y \sim \text{Binomial}(n, p)$

For counting successes in n trials. When n=1, this is the Bernoulli distribution for binary outcomes.

Use when: Binary outcomes (yes/no), counts of successes

Variance: $np(1-p)$

Gaussian (Normal)

$y \sim N(\mu, \sigma^2)$

The classic bell curve. Observations scatter symmetrically around the mean.

Use when: Continuous data, symmetric residuals

Variance: constant ($\sigma^2$)

Poisson

$y \sim \text{Poisson}(\lambda)$

For count data (0, 1, 2, ...). Mean equals variance.

Use when: Counting events, integers $\geq 0$

Variance: equals $\mu$

Gamma

$y \sim \text{Gamma}(\alpha, \beta)$

For positive continuous data that's often right-skewed.

Use when: Strictly positive, right-skewed

Variance: proportional to $\mu^2$

Model Complete: Logistic Regression!

You've specified all three components of your GLM:

$\text{HeartDisease} \sim \text{Binomial}(1, p)$

where $\text{logit}(p) = \beta_0 + \beta_1 \cdot \text{Age} + \beta_2 \cdot \text{Sex} + \beta_3 \cdot \text{CP} + \beta_4 \cdot \text{MaxHR} + \beta_5 \cdot \text{STDep}$

Systematic Component: 6 terms (incl. intercept)

Link Function: Logit

Distribution: Binomial

This combination - Binomial distribution + Logit link - is logistic regression, the workhorse of binary classification. The logit is the "canonical" link for the Binomial family, making this a natural and mathematically elegant pairing.
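To make the three components concrete, here is a minimal Python sketch of how they fit together for a single patient. The coefficient values and patient measurements are hypothetical placeholders chosen for illustration, not estimates fitted to the heart disease data, and the name `inv_logit` is our own.

```python
import math
import random

def inv_logit(eta: float) -> float:
    """Map a linear predictor (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients beta_0 ... beta_5 (NOT fitted estimates).
beta = [-1.0, 0.03, 0.8, 0.5, -0.02, 0.6]

# Hypothetical patient: the leading 1.0 multiplies the intercept, followed by
# Age, Sex, CP, MaxHR, STDep in the same order as the model formula.
x = [1.0, 55.0, 1.0, 2.0, 150.0, 1.2]

eta = sum(b * v for b, v in zip(beta, x))  # systematic component (linear predictor)
p = inv_logit(eta)                         # inverse of the logit link

rng = random.Random(0)                     # seeded so the draw is reproducible
y = 1 if rng.random() < p else 0           # stochastic component: one Bernoulli draw

print(f"eta = {eta:.3f}, p = {p:.3f}, simulated y = {y}")
```

The linear predictor can be any real number, the inverse logit squeezes it into a valid probability, and the Bernoulli draw turns that probability into an observed 0 or 1.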

✔ Binomial: Correct!

The Binomial distribution is the natural choice for binary outcomes.

When each observation is a single trial (n=1), the Binomial reduces to the Bernoulli distribution:

$P(Y = y) = p^y (1-p)^{1-y}$ for $y \in \{0, 1\}$

Where $p$ is the probability of "success" (heart disease = 1)
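The reduction from Binomial to Bernoulli can be checked numerically. The sketch below is illustrative only: the function names `bernoulli_pmf` and `binomial_pmf` and the value `p = 0.3` are our own choices, not part of the tutorial's dataset.

```python
from math import comb

def bernoulli_pmf(y: int, p: float) -> float:
    """P(Y = y) = p**y * (1 - p)**(1 - y) for y in {0, 1}."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    return p**y * (1 - p) ** (1 - y)

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(K = k) for k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p = 0.3  # illustrative probability of "success"
for y in (0, 1):
    # With n = 1 the Binomial PMF matches the Bernoulli PMF.
    assert abs(bernoulli_pmf(y, p) - binomial_pmf(y, 1, p)) < 1e-12
    print(y, bernoulli_pmf(y, p))
```

The exponent trick in the formula simply selects $p$ when $y = 1$ and $1 - p$ when $y = 0$.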

Key properties: the mean is $p$ and the variance is $p(1-p)$, which is largest at $p = 0.5$.

Binomial + Logit = Logistic Regression - one of the most widely used methods for binary classification.
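One way to see why the logit pairs so naturally with this distribution is to rewrite the Bernoulli probability $p^y (1-p)^{1-y}$ in exponential-family form. This is a standard rewriting, sketched here for intuition rather than as a full derivation:

$P(Y = y) = p^y (1-p)^{1-y} = \exp\left( y \log\frac{p}{1-p} + \log(1-p) \right)$

The quantity multiplying $y$ is $\log\frac{p}{1-p} = \text{logit}(p)$, so the logit is the natural parameter of the distribution. Setting that parameter equal to the linear predictor is what makes the logit the canonical link.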

❌ Gaussian: Not Appropriate for Binary Data

The Gaussian distribution is designed for continuous data that can take any real value.

Our outcome is binary: only 0 or 1. The Gaussian assumes:

  • Continuous values (not discrete 0/1)
  • Symmetric errors around the mean
  • Constant variance regardless of predicted value

For binary data, the variance changes with p: it's highest at p=0.5 and approaches 0 as p approaches 0 or 1. The Gaussian's constant variance assumption is violated.
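A quick sketch makes the changing variance visible. The probabilities below are arbitrary illustrative values and `bernoulli_variance` is a name we introduce here.

```python
def bernoulli_variance(p: float) -> float:
    """Variance of a single 0/1 outcome with success probability p."""
    return p * (1 - p)

# Arbitrary probabilities spanning the range from near 0 to near 1.
for p in (0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99):
    print(f"p = {p:4.2f}  variance = {bernoulli_variance(p):.4f}")
```

The printed values rise to a maximum of 0.25 at p = 0.5 and shrink toward 0 at either extreme, which no single constant $\sigma^2$ can reproduce.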

Use Gaussian for continuous outcomes like heart rate, blood pressure, or test scores.

❌ Poisson: Not Appropriate Here

The Poisson distribution is for count data: 0, 1, 2, 3, ... with no upper bound.

Our outcome can only be 0 or 1 - it's bounded above at 1. With Poisson:

  • You could predict values like 2, 3, or higher
  • The mean = variance assumption doesn't fit binary data, whose variance $p(1-p)$ is always smaller than its mean $p$
  • It's designed for "how many events?" not "did it happen?"
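The first point above can be sketched numerically. Assuming, purely for illustration, a Poisson mean of 0.5 (the value and the name `poisson_pmf` are our own), the model still assigns noticeable probability to outcomes a binary response can never take:

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(Y = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 0.5  # illustrative mean only
# Probability mass on 2, 3, 4, ... is whatever is left after 0 and 1.
p_two_or_more = 1.0 - poisson_pmf(0, lam) - poisson_pmf(1, lam)
print(f"P(Y >= 2) = {p_two_or_more:.4f}")
```

Under these assumptions roughly 9% of the probability falls on values of 2 or more, which are impossible for a 0/1 outcome.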

Poisson is the right choice when the outcome counts events: non-negative integers with no upper bound.

❌ Gamma: Not Appropriate Here

The Gamma distribution is for positive continuous data - values like 0.5, 1.7, 23.4, etc.

Our outcome is discrete (only 0 or 1), not continuous. Gamma is used for:

  • Insurance claim amounts (always positive, can be any value)
  • Waiting times (duration until an event)
  • Rainfall amounts
  • Income or financial data

The Gamma assumes the variance is proportional to the square of the mean ($\mu^2$), which doesn't apply to binary outcomes, where values can only be 0 or 1 and the variance is $p(1-p)$.