Four parameters: intercept (β₀) and three slopes (β₁, β₂, β₃)
Adjust four sliders simultaneously. Notice how hard it is to find the optimal combination!
| Parameters | Likelihood Space | Can We Visualize? | Typical Use Case |
|---|---|---|---|
| 1 (mean only) | 2D (1 param + likelihood) | Yes — a curve | Simple estimation |
| 2 (line fit) | 3D (2 params + likelihood) | Yes — a surface | Simple linear regression |
| 3 (multiple reg.) | 4D (3 params + likelihood) | Barely — color as 4th dim | Multiple regression |
| 4 (this page) | 5D (4 params + likelihood) | No — must use projections | Multiple regression |
| 10–50 | 11D–51D | No | Typical statistical GLMs |
| 100–1,000 | 101D–1,001D | No | Large surveys, genomics |
| 10⁷–10⁸ | Tens–hundreds of millions D | No | Image classifiers, small NNs |
| 10⁹–10¹² | Billions–trillions D | No | Large language models (GPT, Claude) |
Note: Statistical models (GLMs, mixed models) typically have tens to hundreds of parameters, carefully chosen based on domain knowledge. Machine learning and AI models range from millions (image classifiers) to trillions (large language models like GPT-4). The fundamental optimization principles are the same — follow the gradient — but specialized algorithms (SGD, Adam) that work on mini-batches are needed at scale.
Whether we have 2 parameters or 200, the gradient still points uphill toward higher likelihood, the Hessian still captures the curvature, and Newton-Raphson still converges in a few steps. The algorithm is dimension-agnostic.
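To make that concrete, here is a minimal sketch (not from the original text) of Newton-Raphson written once and applied to a four-parameter Gaussian regression with known noise variance; the same function would run unchanged with 200 parameters. The simulated data and the gradient/Hessian formulas are the standard least-squares ones.

```python
# A minimal sketch: Newton-Raphson maximisation of a log-likelihood,
# written so the identical code runs for 2, 4, or 200 parameters.
import numpy as np

def newton_raphson(grad, hess, theta0, tol=1e-8, max_iter=50):
    """Maximise a log-likelihood given functions for its gradient and Hessian."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:               # gradient ~ 0: we are at the MLE
            break
        H = hess(theta)
        theta = theta - np.linalg.solve(H, g)     # Newton step: theta - H^{-1} g
    return theta

# Example: intercept plus three slopes, sigma assumed known (= 1),
# so the MLE is the least-squares solution.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])   # design matrix
beta_true = np.array([1.0, 0.5, -2.0, 3.0])
y = X @ beta_true + rng.normal(size=100)

grad = lambda b: X.T @ (y - X @ b)     # d logL / d beta
hess = lambda b: -X.T @ X              # matrix of second derivatives (constant here)

beta_hat = newton_raphson(grad, hess, np.zeros(4))
print(beta_hat)   # close to beta_true; nothing in newton_raphson knows the dimension is 4
```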
Consider what the algorithms are doing in 4D: the gradient is now a vector of four partial derivatives (one per parameter), the Hessian is a 4 × 4 matrix of second derivatives, and each step updates all four parameters at once.
We can't "see" the 4D likelihood surface, but we can prove the algorithm found the maximum: the gradient is zero at the MLE, and the Hessian is negative definite. Maths gives us certainty where visualisation fails.
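Continuing the sketch above (reusing its `grad`, `hess`, and `beta_hat`), a short numerical check makes that certainty tangible: the gradient vanishes at the fitted values, and every eigenvalue of the Hessian is negative.

```python
# Proof by calculus rather than by eye: at beta_hat the gradient is zero
# and the Hessian is negative definite (all eigenvalues negative).
g_at_mle = grad(beta_hat)
H_at_mle = hess(beta_hat)
eigenvalues = np.linalg.eigvalsh(H_at_mle)    # symmetric Hessian -> real eigenvalues

print(np.allclose(g_at_mle, 0, atol=1e-6))    # True: no uphill direction remains
print(np.all(eigenvalues < 0))                # True: curvature bends down in every direction
```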
Statistical models (like GLMs) typically have tens to hundreds of parameters, each carefully chosen to represent a meaningful effect. A logistic regression predicting disease risk might have 20–50 predictors, each with a clear interpretation.
Machine learning and AI models scale to much higher dimensions: image classifiers and small neural networks have on the order of 10⁷–10⁸ parameters, and large language models like GPT and Claude have 10⁹–10¹².
At these scales, exact methods like Newton-Raphson become impractical (inverting a billion×billion matrix is impossible). Instead, AI training uses stochastic gradient descent (SGD) and variants like Adam — computing approximate gradients on small batches of data and taking many small steps. The core principle is the same: follow the gradient toward better parameters.
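As an illustration, here is a schematic mini-batch SGD loop for the same kind of linear model; the batch size, learning rate, and epoch count are arbitrary choices for the sketch, not values from the text.

```python
# Schematic mini-batch stochastic gradient descent (SGD):
# noisy gradients from small random batches, many small uphill steps.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 3))])   # intercept + 3 predictors
y = X @ np.array([1.0, 0.5, -2.0, 3.0]) + rng.normal(size=1000)

beta = np.zeros(4)        # start from an arbitrary guess
learning_rate = 0.01      # illustrative, untuned values
batch_size = 32

for epoch in range(50):
    order = rng.permutation(len(y))                     # shuffle each epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        g = Xb.T @ (yb - Xb @ beta) / len(idx)          # noisy gradient from one mini-batch
        beta += learning_rate * g                       # one small step uphill

print(beta)   # close to the least-squares MLE, with no Hessian ever formed or inverted
```

Each step uses only 32 of the 1,000 observations, which is why the same recipe still works when the full gradient, let alone the Hessian, is too expensive to compute in one go.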