
Tutorial 2: The Stochastic Component

Choose the distribution for binary outcomes


Your model so far

HeartDisease ~ f(?) with $\text{logit}(p) = \beta_0 + \beta_1 \cdot \text{Age} + \ldots$

The distribution f() describes the probability of observing each outcome (0 or 1). This is the stochastic (random) component of the model.

Binary Outcomes Are Special

Our response is either 0 (no disease) or 1 (disease). This is not continuous data that varies around a mean; it is a binary outcome, and we model the probability of "success" (disease = 1).

Choose the Distribution Family

For modelling heart disease presence - a binary outcome (0 or 1) - which distribution family describes this type of data?


Binomial

$y \sim \text{Binomial}(n, p)$

For counting successes in n trials. When n=1, this is the Bernoulli distribution for binary outcomes.

Use when: Binary outcomes (yes/no), counts of successes

Variance: $np(1-p)$

Gaussian (Normal)

$y \sim N(\mu, \sigma^2)$

The classic bell curve. Observations scatter symmetrically around the mean.

Use when: Continuous data, symmetric residuals

Variance: constant ($\sigma^2$)

Poisson

$y \sim \text{Poisson}(\lambda)$

For count data (0, 1, 2, ...). Mean equals variance.

Use when: Counting events, integers $\geq 0$

Variance: equals $\mu$

Gamma

$y \sim \text{Gamma}(\alpha, \beta)$

For positive continuous data that's often right-skewed.

Use when: Strictly positive, right-skewed

Variance: proportional to $\mu^2$
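The variance lines on the four cards can be compared side by side. The sketch below writes each one as a function of the mean; the function names and the default values for `sigma2` and the Gamma `shape` are illustrative assumptions, not part of the tutorial.

```python
# Variance of each candidate family, written as a function of the mean mu.
# Function names and default parameter values are illustrative assumptions.

def binomial_variance(mu, n=1):
    """Binomial(n, p) with mean mu = n * p has variance n * p * (1 - p)."""
    p = mu / n
    return n * p * (1 - p)

def gaussian_variance(mu, sigma2=1.0):
    """Gaussian variance is the constant sigma^2, whatever the mean."""
    return sigma2

def poisson_variance(mu):
    """Poisson variance equals the mean."""
    return mu

def gamma_variance(mu, shape=2.0):
    """Gamma with mean mu and shape alpha has variance mu^2 / alpha."""
    return mu ** 2 / shape

for mu in (0.1, 0.5, 0.9):
    print(
        f"mu={mu}: binomial={binomial_variance(mu):.3f}, "
        f"gaussian={gaussian_variance(mu):.3f}, "
        f"poisson={poisson_variance(mu):.3f}, "
        f"gamma={gamma_variance(mu):.3f}"
    )
```

Only the Binomial variance (with n=1) rises to a peak at a mean of 0.5 and falls back towards 0 at either end, which is the pattern a 0/1 outcome produces.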

Model Complete: Logistic Regression!

You've specified all three components of your GLM:

$\text{HeartDisease} \sim \text{Binomial}(1, p)$

where $\text{logit}(p) = \beta_0 + \beta_1 \cdot \text{Age} + \beta_2 \cdot \text{Sex} + \beta_3 \cdot \text{CP} + \beta_4 \cdot \text{MaxHR} + \beta_5 \cdot \text{STDep}$

Systematic Component

6 terms (incl. intercept)

Link Function

Logit

Distribution

Binomial

This combination - Binomial distribution + Logit link - is logistic regression, the workhorse of binary classification. The logit is the "canonical" link for the Binomial family, making this a natural and mathematically elegant pairing.
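As a rough illustration of how the three components fit together, the sketch below computes the linear predictor, applies the inverse logit, and evaluates the Bernoulli log-likelihood for one patient. The coefficient values, the patient record, and the numeric coding of Sex and CP are all hypothetical, chosen only to make the example run; they are not fitted estimates.

```python
import math

# Hypothetical coefficients for illustration only (not fitted values).
BETA = {
    "Intercept": -5.0,
    "Age": 0.05,
    "Sex": 1.2,
    "CP": 0.6,
    "MaxHR": -0.01,
    "STDep": 0.5,
}

def linear_predictor(patient):
    """Systematic component: beta_0 + sum of beta_j * x_j."""
    eta = BETA["Intercept"]
    for name, value in patient.items():
        eta += BETA[name] * value
    return eta

def inverse_logit(eta):
    """Link function inverted: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def bernoulli_log_likelihood(y, p):
    """Stochastic component: log of p^y * (1 - p)^(1 - y) for y in {0, 1}."""
    return y * math.log(p) + (1 - y) * math.log(1 - p)

# Hypothetical patient; Sex and CP are assumed to be numerically coded.
patient = {"Age": 60, "Sex": 1, "CP": 2, "MaxHR": 140, "STDep": 1.0}

eta = linear_predictor(patient)
p = inverse_logit(eta)
print(f"linear predictor = {eta:.3f}, P(disease) = {p:.3f}")
print(f"log-likelihood if y=1: {bernoulli_log_likelihood(1, p):.3f}")
print(f"log-likelihood if y=0: {bernoulli_log_likelihood(0, p):.3f}")
```

Fitting the model amounts to choosing the coefficients that maximise the sum of these log-likelihood terms over all patients.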


✔ Correct!

The Binomial distribution is the natural choice for binary outcomes.

When each observation is a single trial (n=1), the Binomial reduces to the Bernoulli distribution:

$P(Y = y) = p^y (1-p)^{1-y}$ for $y \in \{0, 1\}$

where $p$ is the probability of "success" (heart disease = 1).

Key properties:

Variance = $p(1-p)$: highest at $p = 0.5$, approaching 0 as $p$ nears 0 or 1
Canonical link: the logit is the canonical link for the Binomial family in GLM theory
Dispersion fixed at 1: unlike the Gaussian GLM, there is no separate variance parameter to estimate

Binomial + Logit = Logistic Regression - the most widely used method for binary classification.
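To see why the logit is called canonical, the Bernoulli probability can be rewritten in exponential-family form. This is a standard derivation, sketched here with $\theta$ and $b(\theta)$ as notation introduced for the purpose:

$$
P(Y = y) = p^y (1-p)^{1-y} = \exp\left\{ y \log\frac{p}{1-p} + \log(1-p) \right\}, \quad y \in \{0, 1\}
$$

The coefficient of $y$ is the natural parameter $\theta = \log\frac{p}{1-p} = \text{logit}(p)$, so equating $\theta$ with the linear predictor gives the logit link. Writing $b(\theta) = -\log(1-p) = \log(1 + e^{\theta})$, the derivatives are $b'(\theta) = p$ and $b''(\theta) = p(1-p)$, which recover the mean and the variance of the Bernoulli distribution.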

❌ Not Appropriate for Binary Data

The Gaussian distribution is designed for continuous data that can take any real value.

Our outcome is binary: only 0 or 1. The Gaussian assumes:

Continuous values (not discrete 0/1)
Symmetric errors around the mean
Constant variance regardless of predicted value

For binary data, the variance changes with p: it's highest at p=0.5 and approaches 0 as p approaches 0 or 1. The Gaussian's constant variance assumption is violated.
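A quick simulation can make the changing variance concrete. The sketch below draws 0/1 outcomes at a few values of p and compares the sample variance with $p(1-p)$; the function name, sample size, and seed are arbitrary choices made for illustration.

```python
import random

def simulate_bernoulli_variance(p, n_draws=20000, seed=0):
    """Draw n_draws 0/1 outcomes with success probability p.

    Returns the sample mean and the (population-style) sample variance.
    """
    rng = random.Random(seed)
    draws = [1 if rng.random() < p else 0 for _ in range(n_draws)]
    mean = sum(draws) / n_draws
    variance = sum((y - mean) ** 2 for y in draws) / n_draws
    return mean, variance

for p in (0.1, 0.5, 0.9):
    mean, variance = simulate_bernoulli_variance(p)
    print(
        f"p={p}: sample mean={mean:.3f}, "
        f"sample variance={variance:.3f}, p(1-p)={p * (1 - p):.3f}"
    )
```

A Gaussian model with a single $\sigma^2$ has no way to represent this dependence of the spread on the mean.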

Use Gaussian for continuous outcomes like heart rate, blood pressure, or test scores.

❌ Not Appropriate Here

The Poisson distribution is for count data: 0, 1, 2, 3, ... with no upper bound.

Our outcome can only be 0 or 1 - it's bounded above at 1. With Poisson:

The model assigns positive probability to values like 2, 3, or higher
Its mean = variance assumption does not fit binary data, whose variance $p(1-p)$ is always smaller than the mean $p$
It's designed for "how many events?" not "did it happen?"

Poisson is the right choice for outcomes like:

Number of hospital admissions per day
Counts of accidents per month
Number of customer complaints
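To illustrate the unbounded-support problem for a 0/1 outcome, the sketch below computes how much probability a Poisson model places on values of 2 or more when its mean sits in the range of a plausible disease probability. The helper names are illustrative assumptions.

```python
import math

def poisson_pmf(k, lam):
    """P(Y = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def prob_above_one(lam):
    """P(Y >= 2): probability mass on outcomes a binary response cannot take."""
    return 1.0 - poisson_pmf(0, lam) - poisson_pmf(1, lam)

for lam in (0.1, 0.5, 0.9):
    print(f"mean={lam}: P(Y >= 2) = {prob_above_one(lam):.4f}")
```

With a mean of 0.5, roughly 9% of the probability lands on outcomes that are impossible for a yes/no response.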

❌ Not Appropriate Here

The Gamma distribution is for positive continuous data - values like 0.5, 1.7, 23.4, etc.

Our outcome is discrete (only 0 or 1), not continuous. Gamma is used for:

Insurance claim amounts (always positive, can be any value)
Waiting times (duration until an event)
Rainfall amounts
Income or financial data

The Gamma assumes the variance grows in proportion to the square of the mean, whereas a binary outcome has variance $p(1-p)$. It is also defined only for strictly positive values, so it cannot represent an outcome of 0.