Tutorial 2: The Link Function for Classification

Choose how to connect your predictors to the probability of heart disease

Your model so far

$P(\text{HeartDisease} = 1) = g^{-1}(\eta)$, where $\eta$ is a linear combination of (1, Age, Sex, ChestPain, MaxHR, STDepression) and the link $g$ is still to be chosen.

The link function g() determines how the linear combination of predictors relates to the probability of heart disease.

The Key Constraint

We're predicting a probability. This means our predictions must be bounded:

$0 \leq P(\text{HeartDisease} = 1) \leq 1$

The link function $g$ maps a probability onto the whole real line, so its inverse $g^{-1}$ must map the unbounded linear predictor $\eta = \beta_0 + \beta_1 X_1 + \ldots$ (which can be any real number) back to a probability between 0 and 1.

Choose the Link Function

For predicting the probability of heart disease (a value between 0 and 1), which link function is most appropriate?

Link Function Selected: Logit

With the logit link, your model equation becomes:

$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \cdot \text{Age} + \beta_2 \cdot \text{Sex} + \beta_3 \cdot \text{CP} + \beta_4 \cdot \text{MaxHR} + \beta_5 \cdot \text{STDep}$
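Solving the equation above for $p$ gives $p = 1 / (1 + e^{-\eta})$, where $\eta$ is the right-hand side. The sketch below evaluates that for one patient. The coefficient values, the patient record, and the numeric coding of Sex and CP are all hypothetical, chosen only to make the arithmetic easy to follow; they are not estimates from any dataset.

```python
import math

# Hypothetical coefficients (illustration only, not fitted values).
beta = {"Intercept": -3.0, "Age": 0.04, "Sex": 1.2, "CP": 0.5, "MaxHR": -0.01, "STDep": 0.6}

# Hypothetical patient; the numeric coding of Sex and CP is an assumption.
patient = {"Intercept": 1.0, "Age": 55.0, "Sex": 1.0, "CP": 2.0, "MaxHR": 150.0, "STDep": 1.0}

# Linear predictor: the right-hand side of the model equation.
eta = sum(beta[name] * patient[name] for name in beta)

# Invert the logit link to recover the probability.
p = 1.0 / (1.0 + math.exp(-eta))

print(f"log-odds (eta) = {eta:.3f}")
print(f"P(HeartDisease = 1) = {p:.3f}")
```

Whatever values the coefficients take, the resulting $p$ always lies strictly between 0 and 1.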

What the coefficients mean

With the logit link, each coefficient is a log-odds ratio:

  • $e^{\beta}$ gives the odds ratio for a one-unit increase in that predictor, holding the other predictors fixed
  • $e^{\beta} > 1$ means increased odds of heart disease
  • $e^{\beta} < 1$ means decreased odds of heart disease
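The odds ratio interpretation can be checked numerically. In this sketch the coefficient `beta` and the baseline values of the linear predictor are hypothetical. For each baseline, the predictor is increased by one unit and the ratio of the odds after and before is compared with $e^{\beta}$.

```python
import math

def inverse_logit(eta: float) -> float:
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1.0 - p)

beta = 0.7                     # hypothetical coefficient of one predictor
baselines = [-2.0, 0.0, 1.5]   # hypothetical log-odds before the one-unit increase

results = []
for eta in baselines:
    p_before = inverse_logit(eta)
    p_after = inverse_logit(eta + beta)   # predictor increased by one unit
    odds_ratio = odds(p_after) / odds(p_before)
    results.append((p_before, p_after, odds_ratio))
    print(f"p: {p_before:.3f} -> {p_after:.3f}, odds ratio = {odds_ratio:.4f}")

print(f"exp(beta) = {math.exp(beta):.4f}")
```

The change in probability depends on the starting point, but the odds ratio is the same at every baseline. That constancy is what makes $e^{\beta}$ a convenient summary.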

✔ Correct Choice!

The logit link is the standard choice for binary classification, giving us logistic regression.

The logit function maps probabilities to "log-odds":

$\text{logit}(p) = \ln\left(\frac{p}{1-p}\right)$

When $p = 0.5$, logit$(p) = 0$
When $p \to 0$, logit$(p) \to -\infty$
When $p \to 1$, logit$(p) \to +\infty$
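These three properties are easy to confirm with a direct implementation. The following is a minimal sketch; the function name `logit` and the probe values are chosen here for illustration.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

# Probabilities approaching 0 and 1 push the log-odds toward -inf and +inf.
probes = [1e-6, 0.25, 0.5, 0.75, 1.0 - 1e-6]
values = [logit(p) for p in probes]
for p, v in zip(probes, values):
    print(f"logit({p}) = {v:.4f}")
```

The output is symmetric around $p = 0.5$: logit$(1 - p) = -$logit$(p)$.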

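Why logit dominates: it gives a direct odds ratio interpretation, and it is the canonical link for the Binomial distribution in the GLM framework.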

⚠ Valid, But Not Preferred Today

The probit link is mathematically valid for binary outcomes: it also maps probabilities to the real line, using the inverse of the standard normal CDF, $\Phi^{-1}(p)$. However, it is less commonly used than logit today.

Historical Context

Pre-1980s: Probit dominated in economics and bioassay (dose-response studies). It was developed by Chester Bliss in the 1930s for analyzing insecticide effectiveness.

Why probit was preferred then: The "latent variable" interpretation was appealing. It assumes an underlying normally distributed quantity, with the binary outcome recording whether that quantity crosses a threshold.

Why logit won: With modern computation, logit's advantages became clear:

  • Direct odds ratio interpretation
  • Simpler mathematics (no need for normal CDF tables)
  • Canonical link for Binomial in GLM framework

Practical note: Probit and logit give very similar predicted probabilities in practice; the coefficients differ mainly by a scale factor:

$\beta_{\text{probit}} \approx \beta_{\text{logit}} \times 0.625$
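One way to see why this rescaling works is to compare the two inverse links directly: the logistic curve evaluated at $x$ and the standard normal CDF evaluated at $0.625 \cdot x$. The sketch below does this on a grid of values using only the standard library. The grid and the helper names are choices made for this illustration.

```python
import math

def logistic_cdf(x: float) -> float:
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))

def normal_cdf(x: float) -> float:
    """Inverse of the probit link (standard normal CDF)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

SCALE = 0.625  # approximate probit-to-logit coefficient ratio from the text

# Compare the two curves on a grid of logit-scale linear predictor values.
grid = [i / 10 for i in range(-60, 61)]
max_gap = max(abs(logistic_cdf(x) - normal_cdf(SCALE * x)) for x in grid)

print(f"largest difference in predicted probability: {max_gap:.4f}")
```

On this grid the two curves stay within a few hundredths of each other. The largest gaps are in the tails, where the logistic distribution is heavier than the normal.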

Probit is still used in economics (tradition) and some dose-response studies.

For this tutorial, try the logit link: it is the modern standard and gives coefficients that are easier to interpret.

❌ Theoretically Wrong (But Sometimes Used)

The identity link doesn't bound predictions. It gives us the "linear probability model" (LPM):

$P(\text{HeartDisease} = 1) = \beta_0 + \beta_1 \cdot \text{Age} + \ldots$

The theoretical problem: Linear combinations can produce any real number, so this model can predict probabilities below 0 or above 1.
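A small sketch shows how quickly this can happen. The intercept and Age coefficient below are hypothetical values picked for illustration, and Age is the only predictor included.

```python
# Hypothetical linear probability model: P = beta0 + beta_age * Age
beta0 = -0.40      # hypothetical intercept
beta_age = 0.02    # hypothetical change in probability per year of age

ages = [15, 30, 50, 65, 80]
preds = {age: beta0 + beta_age * age for age in ages}

# Ages whose "probability" falls outside the valid range [0, 1].
out_of_bounds = [age for age, p in preds.items() if p < 0.0 or p > 1.0]

for age, p in preds.items():
    flag = "  <-- not a valid probability" if age in out_of_bounds else ""
    print(f"Age {age}: predicted P = {p:.2f}{flag}")
```

With these values the youngest patient gets a negative "probability" and the oldest gets one above 1, while the ages in between look reasonable.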

In Practice...

Despite being theoretically incorrect, the LPM is still used, especially in econometrics:

  • Easy interpretation: Each coefficient is the change in probability for a one-unit increase in the predictor (multiply by 100 for percentage points)
  • Often "good enough": When predicted probabilities stay near 0.5, out-of-bounds predictions are rare
  • Causal inference: Some economists prefer LPM for its simplicity in causal analysis

However, for serious prediction or when probabilities near 0 or 1 matter, logit is preferred.

For this tutorial, use the logit link to learn the proper GLM approach.

❌ Not Appropriate for Probabilities

The log link ensures predictions are positive, but doesn't bound them to be less than 1:

$\ln(p) = \beta_0 + \beta_1 \cdot \text{Age} + \ldots$

$p = e^{\beta_0 + \beta_1 \cdot \text{Age} + \ldots}$

The problem: The exponential function can produce values greater than 1. Whenever $\beta_0 + \beta_1 \cdot \text{Age} + \ldots > 0$, the predicted $p$ exceeds 1.
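The sketch below makes this concrete with a single predictor. The coefficients are hypothetical, chosen so that the linear predictor changes sign within a plausible age range.

```python
import math

# Hypothetical log-link model: ln(p) = beta0 + beta_age * Age
beta0 = -1.5       # hypothetical intercept
beta_age = 0.03    # hypothetical coefficient for Age

ages = [30, 70]
preds = {age: math.exp(beta0 + beta_age * age) for age in ages}

for age, p in preds.items():
    note = "  <-- exceeds 1" if p > 1.0 else ""
    print(f"Age {age}: predicted p = {p:.3f}{note}")
```

Every prediction is positive, as the log link guarantees, but nothing stops the value at the older age from exceeding 1.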

The log link is the right choice when the response must be positive (like counts), but probabilities need a link that bounds to (0, 1).