Going Further: Binomial Log-Likelihood & IRLS (Advanced)

Understanding the mathematics behind logistic regression fitting

1. The Binomial Likelihood

For binary outcomes (0 or 1), each observation follows a Bernoulli distribution (special case of Binomial with n=1):

$$y_i \sim \text{Bernoulli}(p_i) \quad \text{where} \quad \text{logit}(p_i) = \mathbf{x}_i^T \boldsymbol{\beta}$$

The probability mass function for a single observation is:

$$P(Y_i = y_i) = p_i^{y_i} (1-p_i)^{1-y_i}$$

Where $p_i$ is linked to our linear predictor via the logit function:

$$p_i = \frac{1}{1 + e^{-\mathbf{x}_i^T \boldsymbol{\beta}}} = \frac{e^{\mathbf{x}_i^T \boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}}$$
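As a quick sanity check (a standalone NumPy sketch with made-up linear predictors, not the tutorial's data), the two expressions for $p_i$ agree, and applying the logit recovers the linear predictor:

```python
import numpy as np

eta = np.array([-2.0, 0.0, 0.5, 3.0])  # example linear predictors x_i^T beta

p1 = 1 / (1 + np.exp(-eta))             # first form
p2 = np.exp(eta) / (1 + np.exp(eta))    # second form

assert np.allclose(p1, p2)
# logit(p) = log(p / (1 - p)) inverts the logistic, recovering eta
assert np.allclose(np.log(p1 / (1 - p1)), eta)
```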

Assuming independence, the likelihood for all n observations is:

$$L(\boldsymbol{\beta} | \mathbf{y}) = \prod_{i=1}^{n} p_i^{y_i} (1-p_i)^{1-y_i}$$

2. Log-Likelihood for Logistic Regression

Taking the natural logarithm transforms the product into a sum:

1. Write the log-likelihood as a sum over observations:

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1-y_i) \log(1-p_i) \right]$$

2. Rearrange using log rules:

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \log\left(\frac{p_i}{1-p_i}\right) + \log(1-p_i) \right]$$

3. Substitute $\log(p_i/(1-p_i)) = \mathbf{x}_i^T\boldsymbol{\beta}$ and $\log(1-p_i) = -\log(1+e^{\eta_i})$ with $\eta_i = \mathbf{x}_i^T\boldsymbol{\beta}$:

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \, \mathbf{x}_i^T \boldsymbol{\beta} - \log(1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}) \right]$$

4. Final form in matrix notation:

$$\ell(\boldsymbol{\beta}) = \mathbf{y}^T \mathbf{X}\boldsymbol{\beta} - \sum_{i=1}^{n} \log(1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}})$$

Negated, this is exactly the binary cross-entropy loss.
Connection to Cross-Entropy

The negative log-likelihood for logistic regression is exactly the binary cross-entropy loss used in machine learning. Minimizing cross-entropy = maximizing likelihood!
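The sum form (step 1) and the matrix form (step 4) can be checked numerically. A minimal NumPy sketch on simulated data (the design matrix, coefficients, and seed here are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
beta = np.array([0.3, -1.0, 0.5])
eta = X @ beta
p = 1 / (1 + np.exp(-eta))
y = rng.binomial(1, p)  # simulate Bernoulli outcomes

# Step 1: sum over observations
ll_sum = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Step 4: matrix form y^T X beta - sum(log(1 + e^eta))
ll_mat = y @ X @ beta - np.sum(np.log1p(np.exp(eta)))

assert np.isclose(ll_sum, ll_mat)

# The (mean) binary cross-entropy loss is the negated, averaged log-likelihood
bce = -ll_sum / len(y)
```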

3. Why No Closed-Form Solution?

The score equations (gradient set to zero) are:

$$\frac{\partial \ell}{\partial \boldsymbol{\beta}} = \sum_{i=1}^{n} (y_i - p_i) \mathbf{x}_i = \mathbf{X}^T(\mathbf{y} - \mathbf{p}) = \mathbf{0}$$

The problem is that $\mathbf{p}$ depends on $\boldsymbol{\beta}$ through the logistic function:

$$p_i = \frac{1}{1 + e^{-\mathbf{x}_i^T \boldsymbol{\beta}}}$$

This creates a non-linear system of equations: we cannot algebraically isolate $\boldsymbol{\beta}$ on one side. Compare with the Gaussian GLM:

Gaussian GLM (Tutorial 1):
- Score: $\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{0}$
- Linear in $\boldsymbol{\beta}$, so we solve directly: $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$

Logistic regression (this tutorial):
- Score: $\mathbf{X}^T(\mathbf{y} - \mathbf{p}(\boldsymbol{\beta})) = \mathbf{0}$
- Non-linear in $\boldsymbol{\beta}$, so we must iterate; no closed-form solution exists
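To see the contrast concretely, here is a small NumPy sketch (on simulated Gaussian data with arbitrary true coefficients) showing that the Gaussian score equations are solved in a single linear solve, matching ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=100)

# Gaussian GLM: the score X^T(y - X beta) = 0 is linear in beta,
# so one solve of the normal equations suffices
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Identical to the least-squares solution
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

For the logistic score $\mathbf{X}^T(\mathbf{y} - \mathbf{p}(\boldsymbol{\beta})) = \mathbf{0}$ no such single solve exists, which is exactly why IRLS iterates.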

4. IRLS Algorithm Explained

Iteratively Reweighted Least Squares (IRLS) is the standard method for fitting GLMs. For logistic regression, it approximates the non-linear problem as a sequence of weighted linear regressions.

The Algorithm

1
Initialize: $\boldsymbol{\beta}^{(0)} = \mathbf{0}$
Start with all coefficients at zero.
2
Compute $\eta_i^{(t)} = \mathbf{x}_i^T \boldsymbol{\beta}^{(t)}$ and $p_i^{(t)} = \frac{1}{1+e^{-\eta_i^{(t)}}}$
Calculate current linear predictor and probabilities.
3
Compute weights: $w_i^{(t)} = p_i^{(t)}(1 - p_i^{(t)})$
Weights are the variance of Bernoulli at current $p_i$.
4
Compute working response: $z_i^{(t)} = \eta_i^{(t)} + \frac{y_i - p_i^{(t)}}{w_i^{(t)}}$
Linearize around current estimate.
5
Update: $\boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)}$
Weighted least squares! $\mathbf{W} = \text{diag}(w_1, \ldots, w_n)$
6
Repeat steps 2-5 until $\|\boldsymbol{\beta}^{(t+1)} - \boldsymbol{\beta}^{(t)}\| < \epsilon$
Typically converges in 3-7 iterations for well-behaved data.
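Step 5 really is an ordinary weighted least-squares fit. A one-iteration NumPy sketch on simulated data (arbitrary coefficients and seed) shows the update computed two equivalent ways, via the square-root-weight trick and via the normal equations:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([0.5, 1.0, -1.0])))))

beta = np.zeros(X.shape[1])           # step 1: initialize at zero
eta = X @ beta                        # step 2: linear predictor
mu = 1 / (1 + np.exp(-eta))           #         and probabilities
w = mu * (1 - mu)                     # step 3: Bernoulli variance weights
z = eta + (y - mu) / w                # step 4: working response

# Step 5 as plain least squares: scale rows of X and z by sqrt(w)
sw = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], z * sw, rcond=None)

# Identical to the normal-equations form (X^T W X)^{-1} X^T W z
beta_ne = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
assert np.allclose(beta_wls, beta_ne)
```

The square-root-weight form is numerically preferable in practice because it avoids forming $\mathbf{X}^T\mathbf{W}\mathbf{X}$ explicitly.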
Why "Fisher Scoring"?

IRLS for GLMs is equivalent to Newton-Raphson using the expected information matrix (Fisher information) instead of the observed Hessian. This guarantees positive definiteness and is called Fisher scoring. For canonical links like logit, observed = expected.
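The "observed = expected" claim for the logit link can be verified directly: the Hessian of the negative log-likelihood equals $\mathbf{X}^T\mathbf{W}\mathbf{X}$ exactly. A sketch on simulated data (arbitrary design, labels, and evaluation point), comparing the analytic Fisher information against a finite-difference Hessian:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(80), rng.normal(size=(80, 2))])
y = rng.binomial(1, 0.5, size=80)
beta = np.array([0.2, -0.4, 0.1])  # arbitrary evaluation point

def grad_negll(b):
    """Gradient of the negative log-likelihood: -X^T (y - p)."""
    p = 1 / (1 + np.exp(-(X @ b)))
    return -X.T @ (y - p)

# Expected information (= X^T W X) at beta
p = 1 / (1 + np.exp(-(X @ beta)))
W = p * (1 - p)
fisher = X.T @ (W[:, None] * X)

# Observed Hessian via central finite differences on the gradient
eps = 1e-6
H = np.column_stack([
    (grad_negll(beta + eps * e) - grad_negll(beta - eps * e)) / (2 * eps)
    for e in np.eye(len(beta))
])

assert np.allclose(H, fisher, atol=1e-4)  # observed = expected for logit
```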

5. Implementation from Scratch

Let's implement logistic regression fitting using both numerical optimization and IRLS.

# Load and prepare data
heart <- read.csv(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",
  header = FALSE
)
names(heart) <- c("age", "sex", "cp", "trestbps", "chol",
                "fbs", "restecg", "thalach", "exang",
                "oldpeak", "slope", "ca", "thal", "target")
heart$target <- ifelse(heart$target > 0, 1, 0)

# Prepare design matrix and response
X <- cbind(1, heart$age, heart$sex, heart$cp,
           heart$thalach, heart$oldpeak)
y <- heart$target

# Logistic (inverse logit) function
logistic <- function(eta) 1 / (1 + exp(-eta))

# Negative log-likelihood
neg_loglik <- function(beta, X, y) {
  eta <- X %*% beta
  p <- logistic(eta)
  # Clip probabilities for numerical stability
  p <- pmax(pmin(p, 1 - 1e-10), 1e-10)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

# IRLS implementation
irls_logistic <- function(X, y, tol = 1e-8, max_iter = 25) {
  n <- nrow(X)
  p <- ncol(X)
  beta <- rep(0, p)  # Initialize at zero

  for (iter in 1:max_iter) {
    eta <- X %*% beta
    mu <- logistic(eta)

    # Weights = variance of Bernoulli
    w <- as.vector(mu * (1 - mu))
    w <- pmax(w, 1e-10)  # Prevent division by zero

    # Working response
    z <- eta + (y - mu) / w

    # Weighted least squares update
    W <- diag(w)
    XtWX <- t(X) %*% W %*% X
    XtWz <- t(X) %*% W %*% z
    beta_new <- solve(XtWX, XtWz)

    # Check convergence against the previous iterate, then accept the update
    converged <- max(abs(beta_new - beta)) < tol
    beta <- as.vector(beta_new)
    if (converged) {
      cat("Converged in", iter, "iterations\n")
      break
    }
  }

  # Return coefficients and Fisher information
  list(beta = beta, fisher_info = XtWX, iterations = iter)
}

# Fit using our IRLS
cat("=== IRLS from scratch ===\n")
result <- irls_logistic(X, y)
se <- sqrt(diag(solve(result$fisher_info)))
print(data.frame(
  Estimate = round(result$beta, 4),
  Std.Error = round(se, 4),
  row.names = c("Intercept", "age", "sex", "cp", "thalach", "oldpeak")
))

# Compare with glm()
cat("\n=== Comparison with glm() ===\n")
glm_fit <- glm(target ~ age + sex + cp + thalach + oldpeak,
               data = heart, family = binomial())
print(coef(glm_fit))

Expected Output

=== IRLS from scratch ===
Converged in 5 iterations
          Estimate Std.Error
Intercept  -3.1655    2.0253
age         0.0359    0.0188
sex         1.6745    0.3506
cp          0.8963    0.1702
thalach    -0.0247    0.0080
oldpeak     0.6829    0.1528

=== Comparison with glm() ===
(Intercept)         age         sex          cp     thalach     oldpeak
-3.16551839  0.03593889  1.67450111  0.89626503 -0.02466255  0.68288325
import numpy as np
import pandas as pd
from scipy.optimize import minimize
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Load and prepare data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
cols = ["age", "sex", "cp", "trestbps", "chol", "fbs",
        "restecg", "thalach", "exang", "oldpeak",
        "slope", "ca", "thal", "target"]
heart = pd.read_csv(url, header=None, names=cols)
heart['target'] = (heart['target'] > 0).astype(int)

# Prepare design matrix and response
X = np.column_stack([
    np.ones(len(heart)),
    heart['age'], heart['sex'], heart['cp'],
    heart['thalach'], heart['oldpeak']
])
y = heart['target'].values

def logistic(eta):
    """Logistic (inverse logit) function"""
    return 1 / (1 + np.exp(-np.clip(eta, -500, 500)))

def irls_logistic(X, y, tol=1e-8, max_iter=25):
    """IRLS for logistic regression"""
    n, p = X.shape
    beta = np.zeros(p)  # Initialize at zero

    for iteration in range(1, max_iter + 1):
        eta = X @ beta
        mu = logistic(eta)

        # Weights = variance of Bernoulli
        w = mu * (1 - mu)
        w = np.maximum(w, 1e-10)  # Prevent division by zero

        # Working response
        z = eta + (y - mu) / w

        # Weighted least squares update
        W = np.diag(w)
        XtWX = X.T @ W @ X
        XtWz = X.T @ W @ z
        beta_new = np.linalg.solve(XtWX, XtWz)

        # Check convergence against the previous iterate, then accept the update
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            print(f"Converged in {iteration} iterations")
            break

    return {'beta': beta, 'fisher_info': XtWX, 'iterations': iteration}

# Fit using our IRLS
print("=== IRLS from scratch ===")
result = irls_logistic(X, y)
se = np.sqrt(np.diag(np.linalg.inv(result['fisher_info'])))

params_df = pd.DataFrame({
    'Estimate': np.round(result['beta'], 4),
    'Std.Error': np.round(se, 4)
}, index=['Intercept', 'age', 'sex', 'cp', 'thalach', 'oldpeak'])
print(params_df)

# Compare with statsmodels
print("\n=== Comparison with statsmodels ===")
glm_fit = smf.glm(
    formula="target ~ age + sex + cp + thalach + oldpeak",
    data=heart,
    family=sm.families.Binomial()
).fit()
print(glm_fit.params)

Expected Output

=== IRLS from scratch ===
Converged in 5 iterations
           Estimate  Std.Error
Intercept   -3.1655     2.0253
age          0.0359     0.0188
sex          1.6745     0.3506
cp           0.8963     0.1702
thalach     -0.0247     0.0080
oldpeak      0.6829     0.1528

=== Comparison with statsmodels ===
Intercept   -3.165518
age          0.035939
sex          1.674501
cp           0.896265
thalach     -0.024663
oldpeak      0.682883
dtype: float64
Key Takeaways

1. Binomial log-likelihood is the binary cross-entropy loss (negated)

2. No closed-form solution because $p_i$ depends non-linearly on $\boldsymbol{\beta}$

3. IRLS converts the problem into a sequence of weighted least squares problems

4. Fisher scoring = Newton-Raphson with expected Hessian; converges in ~5 iterations