The connection between walking uphill and fitting a statistical model
In the optimisation pages, you watched algorithms climb a terrain — gradient ascent following the steepest slope, Newton-Raphson using curvature to leap toward the peak, simulated annealing randomly wandering to escape local hills.
That terrain wasn't just a convenient visual metaphor. In statistics, the terrain is the log-likelihood surface. Every point on the map represents a candidate set of model parameters, and the elevation at that point is how well those parameters explain the observed data.
| Terrain Concept | Statistical Concept |
|---|---|
| Map coordinates (x, y) | Parameter values ($\beta_0, \beta_1, \ldots$) |
| Elevation at a point | Log-likelihood $\ell(\beta | \text{data})$ |
| Peak / summit | Maximum likelihood estimate (MLE) |
| Slope (gradient) | Score function $\nabla \ell(\beta)$ |
| Curvature (Hessian) | Observed information matrix $-\nabla^2 \ell(\beta)$ |
| Confidence ellipse at peak | Approximate confidence region for $\beta$ |
| Sharp peak vs broad plateau | Small vs large standard errors |
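The table's correspondence can be made concrete. Below is a minimal sketch (toy data and function names of my own invention, not from any particular library) that computes the three terrain quantities for a Poisson regression with a log link: the elevation (log-likelihood), the slope (score), and the curvature (observed information):

```python
import numpy as np

# Hypothetical toy data: counts y observed at covariate values x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 4.0, 9.0])

def log_lik(beta):
    """Elevation: Poisson log-likelihood at beta = (b0, b1)."""
    eta = beta[0] + beta[1] * x      # linear predictor
    mu = np.exp(eta)                 # canonical (log) link
    # The log(y!) constant is dropped; it does not depend on beta.
    return np.sum(y * eta - mu)

def score(beta):
    """Slope: gradient of the log-likelihood (the score function)."""
    mu = np.exp(beta[0] + beta[1] * x)
    return np.array([np.sum(y - mu), np.sum((y - mu) * x)])

def observed_info(beta):
    """Curvature: observed information matrix, the negative Hessian."""
    mu = np.exp(beta[0] + beta[1] * x)
    return np.array([[np.sum(mu),     np.sum(mu * x)],
                     [np.sum(mu * x), np.sum(mu * x**2)]])
```

A quick sanity check: a central finite difference of `log_lik` should match `score`, confirming that the slope really is the derivative of the elevation.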
*(Figure: three surfaces, left to right.)* Left: a smooth Gaussian-bump terrain. Middle: the log-likelihood parameterised directly in the mean $\mu$ (no link function). Right: the same log-likelihood after the canonical link transformation. The link reshapes the awkward middle surface into a well-behaved hill like the terrain on the left.
Without a link function, the log-likelihood surface can be badly shaped: cliffs where parameters hit constraints ($\mu > 0$ for counts, $0 < p < 1$ for probabilities), asymmetric curvature, and steep drop-offs. Optimisation algorithms struggle on such surfaces.
Each GLM family belongs to the exponential family, which has a natural parameter $\theta$. The canonical link sets $\eta = \theta$, giving the log-likelihood a common structure:
$$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \theta_i - b(\theta_i) \right] + \text{const}$$
where $\theta_i = \eta_i = \beta_0 + \beta_1 x_i$. This is always concave in $\beta$ — a smooth hill with a single peak, exactly like the terrain metaphor. The different families only change $b(\theta)$:
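The concavity claim can be checked numerically. The sketch below (hypothetical Bernoulli data) evaluates the binomial log-likelihood $\sum_i [y_i \theta_i - b(\theta_i)]$ with $b(\theta) = \log(1 + e^\theta)$ along a one-dimensional slice and confirms that the second differences are never positive, i.e. a single smooth peak:

```python
import numpy as np

# Hypothetical Bernoulli data under the canonical (logit) link.
x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def ell(b0, b1):
    theta = b0 + b1 * x              # theta_i = eta_i under the canonical link
    b = np.log1p(np.exp(theta))      # b(theta) for the binomial family
    return np.sum(y * theta - b)

# Concavity along a 1-D slice through the surface: second differences <= 0.
b1_grid = np.linspace(-3, 3, 101)
vals = np.array([ell(0.2, b1) for b1 in b1_grid])
print(np.all(np.diff(vals, 2) <= 0))  # True: the slice is concave
```

Any one-dimensional slice of a concave surface is itself concave, so this check would pass for every choice of slice, not just this one.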
| Family | Canonical Link | $b(\theta)$ | Surface Shape |
|---|---|---|---|
| Gaussian | Identity: $\eta = \mu$ | $\theta^2/2$ | Exact quadratic bowl |
| Binomial | Logit: $\eta = \log\frac{p}{1-p}$ | $\log(1 + e^\theta)$ | Smooth concave hill |
| Poisson | Log: $\eta = \log\mu$ | $e^\theta$ | Smooth concave hill |
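One reason $b(\theta)$ is the only moving part: a standard exponential-family identity (not derived here) says that $b'(\theta)$ equals the mean of $y$. A quick numerical sketch verifies this for the three $b(\theta)$ entries in the table:

```python
import numpy as np

def num_deriv(b, theta, eps=1e-6):
    """Central-difference derivative of b at theta."""
    return (b(theta + eps) - b(theta - eps)) / (2 * eps)

theta = 0.7
# Gaussian: b'(theta) = theta, which is mu under the identity link.
assert abs(num_deriv(lambda t: t**2 / 2, theta) - theta) < 1e-6
# Binomial: b'(theta) = e^theta / (1 + e^theta), the success probability p.
p = np.exp(theta) / (1 + np.exp(theta))
assert abs(num_deriv(lambda t: np.log1p(np.exp(t)), theta) - p) < 1e-6
# Poisson: b'(theta) = e^theta, the mean count mu.
assert abs(num_deriv(np.exp, theta) - np.exp(theta)) < 1e-6
```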
If we parameterise Poisson regression as $\mu = \beta_0 + \beta_1 x$ (identity link), the surface has a cliff where $\mu$ approaches zero: the log-likelihood plummets to $-\infty$. For logistic regression with $p = \beta_0 + \beta_1 x$ (a linear probability model), the surface has walls at $p = 0$ and $p = 1$. These constraints make the surface hard to optimise, and the terrain metaphor breaks down.
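The Poisson cliff is easy to see numerically. A toy sketch (illustrative data and coefficients, not from the source) evaluating the identity-link log-likelihood:

```python
import numpy as np

# Hypothetical toy count data.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 4.0, 9.0])

def ell_identity(b0, b1):
    """Poisson log-likelihood with mu modelled directly (identity link)."""
    mu = b0 + b1 * x
    if np.any(mu <= 0):
        return -np.inf               # the cliff: mu must stay positive
    return np.sum(y * np.log(mu) - mu)

print(ell_identity(1.0, 2.0))   # finite elevation: all mu_i > 0
print(ell_identity(1.0, -1.0))  # -inf: mu hits zero inside the data range
```

Under the log link, by contrast, every $(\beta_0, \beta_1)$ maps to a valid $\mu > 0$, so no such cliff exists anywhere on the surface.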
Apply log: $\log\mu = \beta_0 + \beta_1 x$ for Poisson. Apply logit: $\log\frac{p}{1-p} = \beta_0 + \beta_1 x$ for logistic. Now the parameters are unconstrained, the surface is concave, and gradient ascent or Newton-Raphson will find the peak reliably — just like climbing a hill on a map.
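To illustrate the reliable climb, here is a minimal Newton-Raphson sketch on the log-link Poisson surface (toy data as above; a teaching sketch, not production fitting code):

```python
import numpy as np

# Hypothetical toy count data.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 4.0, 9.0])

def fit_poisson(tol=1e-10, max_iter=50):
    """Newton-Raphson on the concave log-link Poisson log-likelihood."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.array([np.log(y.mean()), 0.0])   # start at the flat model
    for _ in range(max_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)                 # slope of the surface
        info = X.T @ (mu[:, None] * X)         # curvature (observed information)
        step = np.linalg.solve(info, score)    # Newton step toward the peak
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

beta_hat = fit_poisson()   # converges in a handful of iterations
```

At the returned `beta_hat` the score is (numerically) zero: the climber has reached the single peak, which is exactly the MLE.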
Every tutorial's "fitting" page is about climbing a log-likelihood surface. Every tutorial's "advanced" page derives the shape of that surface from the probability model. And every optimisation visualisation demonstrates the algorithms that do the climbing.