Understanding the mathematics behind GLM fitting
Maximum Likelihood Estimation (MLE) finds the parameter values that make the observed data most probable. For a Gaussian GLM with identity link, we model:

$$y_i = \mathbf{x}_i^\top \boldsymbol{\beta} + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)$$
In matrix notation, with $\mathbf{X}$ as our $n \times p$ design matrix (including the intercept column) and $\mathbf{y}$ as our $n \times 1$ response vector:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$$
The probability density for a single observation is:

$$f(y_i \mid \mathbf{x}_i; \boldsymbol{\beta}, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2}{2\sigma^2}\right)$$
Assuming independence, the likelihood for all $n$ observations is the product:

$$L(\boldsymbol{\beta}, \sigma^2) = \prod_{i=1}^{n} f(y_i \mid \mathbf{x}_i; \boldsymbol{\beta}, \sigma^2)$$
Working with products is unwieldy, so we take the natural logarithm. Since log is monotonic, maximising the log-likelihood gives the same estimates as maximising the likelihood:

$$\ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2$$
For fixed $\sigma^2$, maximising the log-likelihood is equivalent to minimising the sum of squared residuals, which is exactly what OLS does! This is why OLS and MLE give identical coefficient estimates for Gaussian linear models.
To find the MLE, we differentiate the log-likelihood with respect to $\boldsymbol{\beta}$ and set the result equal to zero. This derivative is called the score function:

$$\frac{\partial \ell}{\partial \boldsymbol{\beta}} = \frac{1}{\sigma^2}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{0} \quad\Rightarrow\quad \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$
Similarly, differentiating with respect to $\sigma^2$ and solving:

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \mathbf{x}_i^\top\hat{\boldsymbol{\beta}}\right)^2$$
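The closed-form estimator from the score equation can be checked numerically. Here is a minimal sketch using synthetic data (standing in for the heart data used below): solving the normal equations directly must agree with a least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data standing in for the chapter's heart-disease example
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([150.0, -1.0, 5.0])
y = X @ beta_true + rng.normal(scale=10.0, size=n)

# Closed-form MLE from the score equations: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against a least-squares solver -- the two must agree
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_hat, beta_ls))  # True
```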
The MLE for $\sigma^2$ divides by $n$, not $(n-p)$ as in the unbiased OLS estimator.
R's glm() reports the "dispersion parameter" using $(n-p)$, which is why you may see slight differences.
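The $n$ versus $(n-p)$ distinction is easy to see numerically. A sketch with synthetic data (any regression fit would do): both estimators share the same residual sum of squares and differ only in the denominator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression fit (hypothetical data, not the heart dataset)
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ beta_hat) ** 2)

sigma2_mle = rss / n             # what maximising the likelihood gives
sigma2_unbiased = rss / (n - p)  # what R's glm()/summary.lm() report

print(sigma2_mle < sigma2_unbiased)  # True: the MLE is always smaller
```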
Let's implement MLE using numerical optimisation. We'll define the negative log-likelihood (since optimisers typically minimise) and hand it to a general-purpose optimiser.
# Load and prepare data
heart <- read.csv("heart.csv", header = FALSE)
names(heart) <- c("age", "sex", "cp", "trestbps", "chol",
"fbs", "restecg", "thalach", "exang",
"oldpeak", "slope", "ca", "thal", "num")
# Prepare design matrix X and response y
X <- cbind(1, heart$age, heart$exang, heart$oldpeak)
y <- heart$thalach
n <- length(y)
# Define negative log-likelihood function
neg_loglik <- function(params, X, y) {
p <- ncol(X)
beta <- params[1:p]
sigma2 <- exp(params[p + 1]) # exp() ensures sigma2 > 0
mu <- X %*% beta
resid <- y - mu
n <- length(y)
# Negative log-likelihood
nll <- (n/2) * log(2 * pi * sigma2) +
sum(resid^2) / (2 * sigma2)
return(nll)
}
# Initial values (using OLS estimates as starting point)
beta_init <- coef(lm(y ~ X[,2:4]))
sigma2_init <- var(y)
params_init <- c(beta_init, log(sigma2_init))
# Optimise using optim()
result <- optim(
par = params_init,
fn = neg_loglik,
X = X,
y = y,
method = "BFGS",
hessian = TRUE
)
# Extract estimates
beta_mle <- result$par[1:4]
sigma2_mle <- exp(result$par[5])
# Standard errors from Hessian
se <- sqrt(diag(solve(result$hessian)))[1:4]
# Display results
cat("=== MLE from optim() ===\n")
cat("Coefficients:\n")
print(data.frame(
Estimate = round(beta_mle, 4),
Std.Error = round(se, 4),
row.names = c("Intercept", "age", "exang", "oldpeak")
))
cat("\nSigma^2 (MLE):", round(sigma2_mle, 4), "\n")
cat("Log-likelihood:", round(-result$value, 2), "\n")
# Compare with glm()
cat("\n=== Comparison with glm() ===\n")
glm_fit <- glm(thalach ~ age + exang + oldpeak,
data = heart, family = gaussian())
print(coef(glm_fit))
import numpy as np
import pandas as pd
from scipy.optimize import minimize
import statsmodels.formula.api as smf
import statsmodels.api as sm
# Load and prepare data
col_names = ['age', 'sex', 'cp', 'trestbps', 'chol',
'fbs', 'restecg', 'thalach', 'exang',
'oldpeak', 'slope', 'ca', 'thal', 'num']
heart = pd.read_csv("heart.csv", names=col_names)
# Prepare design matrix X and response y
X = np.column_stack([
np.ones(len(heart)),
heart['age'],
heart['exang'],
heart['oldpeak']
])
y = heart['thalach'].values
n = len(y)
def neg_loglik(params, X, y):
"""Negative log-likelihood for Gaussian GLM"""
p = X.shape[1]
beta = params[:p]
sigma2 = np.exp(params[p]) # exp() ensures sigma2 > 0
mu = X @ beta
resid = y - mu
n = len(y)
# Negative log-likelihood
nll = (n/2) * np.log(2 * np.pi * sigma2) + \
np.sum(resid**2) / (2 * sigma2)
return nll
# Initial values
beta_init = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_init = np.var(y)
params_init = np.append(beta_init, np.log(sigma2_init))
# Optimise using scipy.optimize.minimize
result = minimize(
neg_loglik,
params_init,
args=(X, y),
method='BFGS'
)
# Extract estimates
beta_mle = result.x[:4]
sigma2_mle = np.exp(result.x[4])
# Standard errors from inverse Hessian
se = np.sqrt(np.diag(result.hess_inv))[:4]
# Display results
print("=== MLE from scipy.optimize ===")
print("Coefficients:")
params_df = pd.DataFrame({
'Estimate': np.round(beta_mle, 4),
'Std.Error': np.round(se, 4)
}, index=['Intercept', 'age', 'exang', 'oldpeak'])
print(params_df)
print(f"\nSigma^2 (MLE): {sigma2_mle:.4f}")
print(f"Log-likelihood: {-result.fun:.2f}")
# Compare with statsmodels
print("\n=== Comparison with statsmodels GLM ===")
glm_fit = smf.glm(
formula="thalach ~ age + exang + oldpeak",
data=heart,
family=sm.families.Gaussian()
).fit()
print(glm_fit.params)
Note: Standard errors from BFGS's approximate inverse Hessian may differ slightly from R's numerical Hessian. For more precise SEs, compute the Hessian of the negative log-likelihood numerically (e.g. with numdifftools.Hessian) and invert it; scipy.optimize.approx_fprime returns only the gradient.
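For the Gaussian model specifically, no numerical Hessian is needed at all: the information matrix for $\boldsymbol{\beta}$ is $\mathbf{X}^\top\mathbf{X}/\sigma^2$, so $\widehat{\mathrm{Cov}}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2(\mathbf{X}^\top\mathbf{X})^{-1}$ gives the standard errors in closed form. A sketch with synthetic data (the chapter's heart design matrix would slot in the same way):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with the same shape as the chapter's design matrix
n, p = 500, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([150.0, -1.0, -15.0, -5.0]) + rng.normal(scale=20.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_mle = np.mean((y - X @ beta_hat) ** 2)

# Cov(beta_hat) = sigma^2 * (X'X)^{-1}; SEs are the diagonal's square roots
cov_beta = sigma2_mle * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov_beta))
print(se)
```

This is exactly the quantity the numerical Hessian approximates, which is why the BFGS-based SEs land close to, but not exactly on, R's values.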
The coefficient estimates from our manual MLE implementation match the built-in functions to high precision. Any small differences arise from:
| Source of Difference | Explanation |
|---|---|
| Standard errors | Built-in functions may use observed vs expected information matrix, or different finite-sample corrections |
| Dispersion estimate | MLE uses $n$ in the denominator; R's glm() uses $(n-p)$ for the unbiased estimate |
| Convergence tolerance | Different optimisers use different convergence criteria |
| Numerical precision | Floating-point arithmetic can introduce tiny differences |
Understanding the log-likelihood and its optimisation helps you: (1) debug when models don't converge, (2) extend GLMs to custom distributions, (3) understand model comparison via likelihood ratio tests and AIC, (4) appreciate what the computer is actually doing behind "fit model".
optim() and scipy.optimize.minimize() can fit any model for which you can write down the likelihood.
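To illustrate that generality, here is a sketch fitting a Poisson regression with a log link by minimising its negative log-likelihood, using synthetic data (the variable names and simulated coefficients are illustrative, not from the chapter's dataset). The same pattern as above applies: write the NLL, hand it to the optimiser, and read off the AIC from the optimum.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(7)

# Synthetic Poisson-regression data with a log link (illustrative only)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta_true))

def neg_loglik(beta, X, y):
    """Negative Poisson log-likelihood with log link: log(mu_i) = x_i'beta."""
    eta = X @ beta
    # log f(y; mu) = y*log(mu) - mu - log(y!)
    return -np.sum(y * eta - np.exp(eta) - gammaln(y + 1))

result = minimize(neg_loglik, np.zeros(2), args=(X, y), method="BFGS")
beta_mle = result.x
aic = 2 * len(beta_mle) + 2 * result.fun  # AIC = 2k - 2*loglik
print(beta_mle, aic)
```

Nothing in the optimiser knows this is a Poisson model; swapping in a different density only means rewriting `neg_loglik`.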