Four parameters: intercept (β₀) and three slopes (β₁, β₂, β₃)
Adjust four sliders simultaneously. Notice how hard it is to find the optimal combination!
| Parameters | Likelihood Space | Can We Visualize? | Typical Use Case |
|---|---|---|---|
| 1 (mean only) | 2D (1 param + likelihood) | Yes — a curve | Simple estimation |
| 2 (line fit) | 3D (2 params + likelihood) | Yes — a surface | Simple linear regression |
| 3 (multiple reg.) | 4D (3 params + likelihood) | Barely — color as 4th dim | Multiple regression |
| 4 (this page) | 5D (4 params + likelihood) | No — must use projections | Multiple regression |
| 10–50 | 11D–51D | No | Typical statistical GLMs |
| 100–1,000 | 101D–1,001D | No | Large surveys, genomics |
| 10⁷–10⁸ | Tens–hundreds of millions D | No | Image classifiers, small NNs |
| 10⁹–10¹² | Billions–trillions D | No | Large language models (GPT, Claude) |
Note: Statistical models (GLMs, mixed models) typically have tens to hundreds of parameters, carefully chosen based on domain knowledge. Machine learning and AI models range from millions (image classifiers) to trillions (large language models like GPT-4). The fundamental optimization principles are the same — follow the gradient — but specialized algorithms (SGD, Adam) that work on mini-batches are needed at scale.
Whether we have 2 parameters or 200, the gradient still points uphill toward higher likelihood, the Hessian still captures the curvature, and Newton-Raphson still converges in a few steps. The algorithm is dimension-agnostic.
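To make that concrete, here is a minimal sketch (not from the original text) of Newton-Raphson written once and applied to a four-parameter Gaussian regression with known noise variance; the same function would run unchanged with 200 parameters. The simulated data and the gradient/Hessian formulas are the standard least-squares ones.

```python
# A minimal sketch: Newton-Raphson maximisation of a log-likelihood,
# written so the identical code runs for 2, 4, or 200 parameters.
import numpy as np

def newton_raphson(grad, hess, theta0, tol=1e-8, max_iter=50):
    """Maximise a log-likelihood given functions for its gradient and Hessian."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:               # gradient ~ 0: we are at the MLE
            break
        H = hess(theta)
        theta = theta - np.linalg.solve(H, g)     # Newton step: theta - H^{-1} g
    return theta

# Example: intercept plus three slopes, sigma assumed known (= 1),
# so the MLE is the least-squares solution.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])   # design matrix
beta_true = np.array([1.0, 0.5, -2.0, 3.0])
y = X @ beta_true + rng.normal(size=100)

grad = lambda b: X.T @ (y - X @ b)     # d logL / d beta
hess = lambda b: -X.T @ X              # matrix of second derivatives (constant here)

beta_hat = newton_raphson(grad, hess, np.zeros(4))
print(beta_hat)   # close to beta_true; nothing in newton_raphson knows the dimension is 4
```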
Consider what the algorithms are doing in 4D: the gradient is now a vector of four partial derivatives (one per parameter), the Hessian is a 4 × 4 matrix of second derivatives, and each step updates all four parameters at once.
We can't "see" the 4D likelihood surface, but we can prove the algorithm found the maximum: the gradient is zero at the MLE, and the Hessian is negative definite. Maths gives us certainty where visualisation fails.
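Continuing the sketch above (reusing its `grad`, `hess`, and `beta_hat`), a short numerical check makes that certainty tangible: the gradient vanishes at the fitted values, and every eigenvalue of the Hessian is negative.

```python
# Proof by calculus rather than by eye: at beta_hat the gradient is zero
# and the Hessian is negative definite (all eigenvalues negative).
g_at_mle = grad(beta_hat)
H_at_mle = hess(beta_hat)
eigenvalues = np.linalg.eigvalsh(H_at_mle)    # symmetric Hessian -> real eigenvalues

print(np.allclose(g_at_mle, 0, atol=1e-6))    # True: no uphill direction remains
print(np.all(eigenvalues < 0))                # True: curvature bends down in every direction
```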
Statistical models (like GLMs) typically have tens to hundreds of parameters, each carefully chosen to represent a meaningful effect. A logistic regression predicting disease risk might have 20–50 predictors, each with a clear interpretation.
Machine learning and AI models scale to much higher dimensions: image classifiers and small neural networks have on the order of 10⁷–10⁸ parameters, and large language models like GPT and Claude have 10⁹–10¹².
At these scales, exact methods like Newton-Raphson become impractical (inverting a billion×billion matrix is impossible). Instead, AI training uses stochastic gradient descent (SGD) and variants like Adam — computing approximate gradients on small batches of data and taking many small steps. The core principle is the same: follow the gradient toward better parameters.
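As an illustration, here is a schematic mini-batch SGD loop for the same kind of linear model; the batch size, learning rate, and epoch count are arbitrary choices for the sketch, not values from the text.

```python
# Schematic mini-batch stochastic gradient descent (SGD):
# noisy gradients from small random batches, many small uphill steps.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 3))])   # intercept + 3 predictors
y = X @ np.array([1.0, 0.5, -2.0, 3.0]) + rng.normal(size=1000)

beta = np.zeros(4)        # start from an arbitrary guess
learning_rate = 0.01      # illustrative, untuned values
batch_size = 32

for epoch in range(50):
    order = rng.permutation(len(y))                     # shuffle each epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        g = Xb.T @ (yb - Xb @ beta) / len(idx)          # noisy gradient from one mini-batch
        beta += learning_rate * g                       # one small step uphill

print(beta)   # close to the least-squares MLE, with no Hessian ever formed or inverted
```

Each step uses only 32 of the 1,000 observations, which is why the same recipe still works when the full gradient, let alone the Hessian, is too expensive to compute in one go.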