Going Further: Log-Likelihood & Maximum Likelihood Estimation (Advanced)

Understanding the mathematics behind GLM fitting

1. The Likelihood Function

Maximum Likelihood Estimation (MLE) finds the parameter values that make the observed data most probable. For a Gaussian GLM with identity link, we model:

$$y_i \sim \text{Normal}(\mu_i, \sigma^2) \quad \text{where} \quad \mu_i = \mathbf{x}_i^T \boldsymbol{\beta}$$

In matrix notation, with $\mathbf{X}$ as our $n \times p$ design matrix (including intercept column) and $\mathbf{y}$ as our $n \times 1$ response vector:

$$\boldsymbol{\mu} = \mathbf{X}\boldsymbol{\beta}$$

The probability density for a single observation is:

$$f(y_i | \mu_i, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{(y_i - \mu_i)^2}{2\sigma^2}\right)$$

Assuming independence, the likelihood for all $n$ observations is the product:

$$L(\boldsymbol{\beta}, \sigma^2 | \mathbf{y}) = \prod_{i=1}^{n} f(y_i | \mu_i, \sigma^2)$$

2. Log-Likelihood for Gaussian GLM

Working with products is unwieldy, so we take the natural logarithm. Since log is monotonic, maximising the log-likelihood gives the same estimates as maximising the likelihood.
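The switch to logs is numerical as well as algebraic: a product of many densities, each well below 1, underflows double precision quickly. A minimal sketch on simulated data (not the heart data; the sample and parameters here are illustrative):

```python
import numpy as np
from scipy.stats import norm

# Illustrative simulated sample: n draws from an assumed Normal(150, 20^2) model
rng = np.random.default_rng(0)
n = 1000
y = rng.normal(loc=150, scale=20, size=n)

# The raw likelihood is a product of n densities, each well below 1,
# so it underflows to 0.0 in double precision for moderate n
L = np.prod(norm.pdf(y, loc=150, scale=20))

# The log-likelihood sums log-densities instead and stays well-scaled
ell = np.sum(norm.logpdf(y, loc=150, scale=20))

print(L)    # 0.0 (underflow)
print(ell)  # a large negative but finite number
```

This is why every implementation below works with the log-likelihood rather than the likelihood itself.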

1. Take the log of the product, which becomes a sum of logs:

$$\ell(\boldsymbol{\beta}, \sigma^2) = \log L(\boldsymbol{\beta}, \sigma^2 | \mathbf{y}) = \sum_{i=1}^{n} \log f(y_i | \mu_i, \sigma^2)$$

2. Substitute the normal density and apply log rules:

$$\ell = \sum_{i=1}^{n} \left[ -\frac{1}{2} \log(2\pi\sigma^2) - \frac{(y_i - \mu_i)^2}{2\sigma^2} \right]$$

3. Separate the terms; the sum of squared residuals appears:

$$\ell = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu_i)^2$$

4. Final form in matrix notation, where the sum of squared residuals is $(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$:

$$\ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
Key Insight

For fixed $\sigma^2$, maximising the log-likelihood is equivalent to minimising the sum of squared residuals - exactly what OLS does! This is why OLS and MLE give identical coefficient estimates for Gaussian linear models.
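This equivalence is easy to check numerically. A quick sketch on simulated data (not the heart data): minimise the sum of squared residuals directly, then minimise the negative log-likelihood with $\sigma^2$ held at an arbitrary fixed value, and compare the two coefficient vectors.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated example: minimising SSR and maximising the log-likelihood
# in beta recover the same coefficients
rng = np.random.default_rng(3)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([5.0, 2.0]) + rng.normal(size=n)

# What OLS does: minimise the sum of squared residuals
ssr = lambda b: np.sum((y - X @ b) ** 2)
beta_ols = minimize(ssr, np.zeros(2)).x

# What MLE does (for any fixed sigma^2): minimise the negative log-likelihood
sigma2 = 4.0  # arbitrary fixed value; the minimiser in beta does not depend on it
nll = lambda b: (n / 2) * np.log(2 * np.pi * sigma2) + ssr(b) / (2 * sigma2)
beta_mle = minimize(nll, np.zeros(2)).x

print(np.allclose(beta_ols, beta_mle, atol=1e-4))  # True
```

The constant term $-\frac{n}{2}\log(2\pi\sigma^2)$ shifts the objective but not its minimiser, which is the algebraic content of the key insight above.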

3. Deriving the Score Equations

To find the MLE, we differentiate the log-likelihood with respect to $\boldsymbol{\beta}$ and set equal to zero. The derivative of the log-likelihood is called the score function.

Score for $\boldsymbol{\beta}$

1. Only the residual term depends on $\boldsymbol{\beta}$:

$$\frac{\partial \ell}{\partial \boldsymbol{\beta}} = \frac{\partial}{\partial \boldsymbol{\beta}} \left[ -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right]$$

2. Apply the matrix calculus identity $\frac{\partial}{\partial \boldsymbol{\beta}}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}) = -2\mathbf{X}^T(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})$:

$$= -\frac{1}{2\sigma^2} \times (-2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}))$$

3. Simplify:

$$= \frac{1}{\sigma^2} \mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$

Setting the Score to Zero

4. Set the score equal to zero; the $\frac{1}{\sigma^2}$ factor cancels:

$$\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{0}$$

5. Rearrange to obtain the normal equations, familiar from OLS:

$$\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}$$

6. Multiply both sides by $(\mathbf{X}^T\mathbf{X})^{-1}$, which exists when $\mathbf{X}$ has full column rank, to get the closed-form solution:

$$\boldsymbol{\beta}_{\text{MLE}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
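The normal equations can be solved in a couple of lines. A sketch on simulated data (not the heart data), using `np.linalg.solve` rather than an explicit matrix inverse, and verifying that the score vanishes at the solution:

```python
import numpy as np

# Simulated example of solving the normal equations directly
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

# Solve X'X beta = X'y; solve() is more numerically stable than forming the inverse
beta_mle = np.linalg.solve(X.T @ X, X.T @ y)

# At the MLE the score vanishes: X'(y - X beta) = 0
score = X.T @ (y - X @ beta_mle)
print(np.allclose(score, 0, atol=1e-6))  # True
```

In practice, QR or Cholesky factorisations (as used by `lm()` and `lstsq`) are preferred over the normal equations for ill-conditioned design matrices, but the algebra is the same.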

MLE for $\sigma^2$

Similarly, differentiating with respect to $\sigma^2$ and solving:

$$\sigma^2_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T\boldsymbol{\beta}_{\text{MLE}})^2 = \frac{\text{RSS}}{n}$$
Note on Degrees of Freedom

The MLE for $\sigma^2$ divides by $n$, not $(n-p)$ as in the unbiased OLS estimator. R's glm() reports the "dispersion parameter" using $(n-p)$, which is why you may see slight differences.
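The difference between the two denominators is easy to see numerically. A small sketch on simulated data (not the heart data):

```python
import numpy as np

# Simulated example comparing the two dispersion estimates
rng = np.random.default_rng(2)
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=3.0, size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
rss = np.sum((y - X @ beta) ** 2)

sigma2_mle = rss / n             # MLE: biased downward in finite samples
sigma2_unbiased = rss / (n - p)  # the OLS / glm() dispersion estimate

print(sigma2_mle < sigma2_unbiased)  # True: the MLE is always the smaller value
```

With $n = 100$ and $p = 4$ the two differ by about 4%; the gap shrinks as $n$ grows, which is why the MLE is still consistent.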

4. Implementation from Scratch

Let's implement MLE using numerical optimisation. We'll define the negative log-likelihood (since optimisers typically minimise) and use general-purpose optimisation.

# Load and prepare data
heart <- read.csv("heart.csv", header = FALSE)
names(heart) <- c("age", "sex", "cp", "trestbps", "chol",
                "fbs", "restecg", "thalach", "exang",
                "oldpeak", "slope", "ca", "thal", "num")

# Prepare design matrix X and response y
X <- cbind(1, heart$age, heart$exang, heart$oldpeak)
y <- heart$thalach
n <- length(y)

# Define negative log-likelihood function
neg_loglik <- function(params, X, y) {
  p <- ncol(X)
  beta <- params[1:p]
  sigma2 <- exp(params[p + 1])  # exp() ensures sigma2 > 0

  mu <- X %*% beta
  resid <- y - mu
  n <- length(y)

  # Negative log-likelihood
  nll <- (n/2) * log(2 * pi * sigma2) +
         sum(resid^2) / (2 * sigma2)

  return(nll)
}

# Initial values (using OLS estimates as starting point)
beta_init <- coef(lm(y ~ X[,2:4]))
sigma2_init <- var(y)
params_init <- c(beta_init, log(sigma2_init))

# Optimise using optim()
result <- optim(
  par = params_init,
  fn = neg_loglik,
  X = X,
  y = y,
  method = "BFGS",
  hessian = TRUE
)

# Extract estimates
beta_mle <- result$par[1:4]
sigma2_mle <- exp(result$par[5])

# Standard errors from Hessian
se <- sqrt(diag(solve(result$hessian)))[1:4]

# Display results
cat("=== MLE from optim() ===\n")
cat("Coefficients:\n")
print(data.frame(
  Estimate = round(beta_mle, 4),
  Std.Error = round(se, 4),
  row.names = c("Intercept", "age", "exang", "oldpeak")
))
cat("\nSigma^2 (MLE):", round(sigma2_mle, 4), "\n")
cat("Log-likelihood:", round(-result$value, 2), "\n")

# Compare with glm()
cat("\n=== Comparison with glm() ===\n")
glm_fit <- glm(thalach ~ age + exang + oldpeak,
               data = heart, family = gaussian())
print(coef(glm_fit))

Expected Output

=== MLE from optim() ===
Coefficients:
          Estimate Std.Error
Intercept 203.3660    6.7011
age        -0.8298    0.1238
exang     -14.2542    2.4357
oldpeak    -3.7805    1.0024

Sigma^2 (MLE): 362.1373
Log-likelihood: -1322.58

=== Comparison with glm() ===
(Intercept)         age       exang     oldpeak
203.3659562  -0.8297581 -14.2541607  -3.7805194
import numpy as np
import pandas as pd
from scipy.optimize import minimize
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Load and prepare data
col_names = ['age', 'sex', 'cp', 'trestbps', 'chol',
             'fbs', 'restecg', 'thalach', 'exang',
             'oldpeak', 'slope', 'ca', 'thal', 'num']
heart = pd.read_csv("heart.csv", names=col_names)

# Prepare design matrix X and response y
X = np.column_stack([
    np.ones(len(heart)),
    heart['age'],
    heart['exang'],
    heart['oldpeak']
])
y = heart['thalach'].values
n = len(y)

def neg_loglik(params, X, y):
    """Negative log-likelihood for Gaussian GLM"""
    p = X.shape[1]
    beta = params[:p]
    sigma2 = np.exp(params[p])  # exp() ensures sigma2 > 0

    mu = X @ beta
    resid = y - mu
    n = len(y)

    # Negative log-likelihood
    nll = (n/2) * np.log(2 * np.pi * sigma2) + \
          np.sum(resid**2) / (2 * sigma2)

    return nll

# Initial values
beta_init = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_init = np.var(y)
params_init = np.append(beta_init, np.log(sigma2_init))

# Optimise using scipy.optimize.minimize
result = minimize(
    neg_loglik,
    params_init,
    args=(X, y),
    method='BFGS'
)

# Extract estimates
beta_mle = result.x[:4]
sigma2_mle = np.exp(result.x[4])

# Standard errors from inverse Hessian
se = np.sqrt(np.diag(result.hess_inv))[:4]

# Display results
print("=== MLE from scipy.optimize ===")
print("Coefficients:")
params_df = pd.DataFrame({
    'Estimate': np.round(beta_mle, 4),
    'Std.Error': np.round(se, 4)
}, index=['Intercept', 'age', 'exang', 'oldpeak'])
print(params_df)
print(f"\nSigma^2 (MLE): {sigma2_mle:.4f}")
print(f"Log-likelihood: {-result.fun:.2f}")

# Compare with statsmodels
print("\n=== Comparison with statsmodels GLM ===")
glm_fit = smf.glm(
    formula="thalach ~ age + exang + oldpeak",
    data=heart,
    family=sm.families.Gaussian()
).fit()
print(glm_fit.params)

Expected Output

=== MLE from scipy.optimize ===
Coefficients:
           Estimate  Std.Error
Intercept  203.3660     6.7011
age         -0.8298     0.1238
exang      -14.2542     2.4357
oldpeak     -3.7805     1.0024

Sigma^2 (MLE): 362.1373
Log-likelihood: -1322.58

=== Comparison with statsmodels GLM ===
Intercept    203.365956
age           -0.829758
exang        -14.254161
oldpeak       -3.780519
dtype: float64

Note: Standard errors from BFGS's approximate inverse Hessian may differ slightly from R's numerical Hessian. For exact SEs, use scipy.optimize.approx_fprime or numdifftools.
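A central-difference Hessian is straightforward to write by hand. A minimal sketch, where `numerical_hessian` is a hypothetical helper (not a library function), verified here on a quadratic with a known Hessian:

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-5):
    """Central-difference Hessian of scalar f at x (hypothetical helper)."""
    k = len(x)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei = np.zeros(k); ei[i] = eps
            ej = np.zeros(k); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

# Sanity check on a known quadratic: f(b) = b0^2 + 3*b1^2 has Hessian diag(2, 6)
f = lambda b: b[0] ** 2 + 3 * b[1] ** 2
H = numerical_hessian(f, np.array([1.0, -2.0]))
print(np.round(H, 3))  # approximately [[2, 0], [0, 6]]

# For standard errors, evaluate this at the minimiser of the negative
# log-likelihood and take sqrt(diag(inv(H)))
```

Evaluating this at `result.x` on the tutorial's `neg_loglik` would reproduce SEs close to R's `optim(..., hessian = TRUE)` output, since both use finite differences of the same objective.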

5. Comparing Results

The coefficient estimates from our manual MLE implementation agree with the built-in functions to several decimal places. Any remaining small differences arise from:

| Source of Difference | Explanation |
| --- | --- |
| Standard errors | Built-in functions may use the observed vs expected information matrix, or different finite-sample corrections |
| Dispersion estimate | MLE uses $n$ in the denominator; R's glm() uses $(n-p)$ for the unbiased estimate |
| Convergence tolerance | Different optimisers use different convergence criteria |
| Numerical precision | Floating-point arithmetic can introduce tiny differences |
Why Learn This?

Understanding the log-likelihood and its optimisation helps you: (1) debug when models don't converge, (2) extend GLMs to custom distributions, (3) understand model comparison via likelihood ratio tests and AIC, (4) appreciate what the computer is actually doing behind "fit model".

Key Takeaways