Understanding the mathematics behind logistic regression fitting
For binary outcomes (0 or 1), each observation follows a Bernoulli distribution (a special case of the Binomial with $n = 1$).
The probability mass function for a single observation is:

$$P(Y_i = y_i) = p_i^{y_i}(1 - p_i)^{1 - y_i}, \qquad y_i \in \{0, 1\}$$
Where $p_i$ is linked to our linear predictor via the logit function:

$$\log\left(\frac{p_i}{1 - p_i}\right) = \mathbf{x}_i^T\boldsymbol{\beta}$$
Assuming independence, the likelihood for all $n$ observations is:

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}$$
Taking the natural logarithm transforms the product into a sum:

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$
The negative log-likelihood for logistic regression is exactly the binary cross-entropy loss used in machine learning. Minimizing cross-entropy = maximizing likelihood!
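This equivalence is easy to verify numerically. The sketch below uses made-up probabilities and labels (not the tutorial's dataset) to show that the negated Bernoulli log-likelihood and the binary cross-entropy formula produce the same number:

```python
import numpy as np

# Hypothetical predicted probabilities and 0/1 labels (illustrative only)
p = np.array([0.9, 0.2, 0.7, 0.4])
y = np.array([1, 0, 1, 1])

# Negated Bernoulli log-likelihood: -sum_i log P(Y_i = y_i)
nll = -np.sum(np.log(np.where(y == 1, p, 1 - p)))

# Binary cross-entropy, summed over observations
bce = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(nll, bce)  # identical up to floating point
```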
The score equations (gradient set to zero) are:

$$\frac{\partial \ell}{\partial \boldsymbol{\beta}} = \mathbf{X}^T(\mathbf{y} - \mathbf{p}) = \mathbf{0}$$
The problem is that $\mathbf{p}$ depends on $\boldsymbol{\beta}$ through the logistic function:

$$p_i = \frac{1}{1 + e^{-\mathbf{x}_i^T\boldsymbol{\beta}}}$$
This creates a non-linear system of equations. We cannot algebraically isolate $\boldsymbol{\beta}$ on one side. Compare to Gaussian GLM where:
| Gaussian GLM (Tutorial 1) | Logistic Regression (This Tutorial) |
|---|---|
| Score: $\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{0}$ | Score: $\mathbf{X}^T(\mathbf{y} - \mathbf{p}(\boldsymbol{\beta})) = \mathbf{0}$ |
| Linear in $\boldsymbol{\beta}$ → solve directly | Non-linear in $\boldsymbol{\beta}$ → iterate |
| Solution: $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ | No closed-form solution |
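The left column of the table can be demonstrated in a few lines. In this NumPy sketch (synthetic data, illustrative only), the Gaussian normal equations are solved in a single step, and plugging the solution back in drives the score to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=50)

# Closed-form Gaussian solution: (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The score X^T (y - X beta) vanishes at the solution -- no iteration needed
score = X.T @ (y - X @ beta_hat)
print(np.max(np.abs(score)))  # ~0 up to floating point
```

For the logistic score $\mathbf{X}^T(\mathbf{y} - \mathbf{p}(\boldsymbol{\beta}))$ no such one-step solve exists, which is why the right column requires iteration.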
Iteratively Reweighted Least Squares (IRLS) is the standard method for fitting GLMs. For logistic regression, it approximates the non-linear problem as a sequence of weighted linear regressions.
IRLS for GLMs is equivalent to Newton-Raphson using the expected information matrix (Fisher information) instead of the observed Hessian. This guarantees positive definiteness and is called Fisher scoring. For canonical links like logit, observed = expected.
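Concretely, one IRLS / Fisher scoring step for the logit link can be written as (standard notation, consistent with the score equation above, with $\boldsymbol{\mu}^{(t)} = \mathbf{p}(\boldsymbol{\beta}^{(t)})$ and $\boldsymbol{\eta}^{(t)} = \mathbf{X}\boldsymbol{\beta}^{(t)}$):

$$\mathbf{W}^{(t)} = \mathrm{diag}\left(\mu_i^{(t)}(1 - \mu_i^{(t)})\right), \qquad \mathbf{z}^{(t)} = \boldsymbol{\eta}^{(t)} + (\mathbf{W}^{(t)})^{-1}(\mathbf{y} - \boldsymbol{\mu}^{(t)})$$

$$\boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{z}^{(t)}$$

Each update is exactly a weighted least squares fit of the working response $\mathbf{z}^{(t)}$ on $\mathbf{X}$.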
Let's implement logistic regression fitting using both numerical optimization and IRLS.
```r
# Load and prepare data
heart <- read.csv(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",
  header = FALSE
)
names(heart) <- c("age", "sex", "cp", "trestbps", "chol",
                  "fbs", "restecg", "thalach", "exang",
                  "oldpeak", "slope", "ca", "thal", "target")
heart$target <- ifelse(heart$target > 0, 1, 0)

# Prepare design matrix and response
X <- cbind(1, heart$age, heart$sex, heart$cp,
           heart$thalach, heart$oldpeak)
y <- heart$target

# Logistic (inverse logit) function
logistic <- function(eta) 1 / (1 + exp(-eta))

# Negative log-likelihood (binary cross-entropy)
neg_loglik <- function(beta, X, y) {
  eta <- X %*% beta
  p <- logistic(eta)
  # Clip probabilities for numerical stability
  p <- pmax(pmin(p, 1 - 1e-10), 1e-10)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

# IRLS implementation
irls_logistic <- function(X, y, tol = 1e-8, max_iter = 25) {
  n <- nrow(X)
  p <- ncol(X)
  beta <- rep(0, p)  # Initialize at zero
  for (iter in 1:max_iter) {
    eta <- X %*% beta
    mu <- logistic(eta)
    # Weights = Bernoulli variance mu * (1 - mu)
    w <- as.vector(mu * (1 - mu))
    w <- pmax(w, 1e-10)  # Guard against zero weights
    # Working response
    z <- eta + (y - mu) / w
    # Weighted least squares update
    W <- diag(w)
    XtWX <- t(X) %*% W %*% X
    XtWz <- t(X) %*% W %*% z
    beta_new <- as.vector(solve(XtWX, XtWz))
    # Check convergence, keeping the latest iterate
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) {
      cat("Converged in", iter, "iterations\n")
      break
    }
  }
  # Return coefficients and Fisher information
  list(beta = beta, fisher_info = XtWX, iterations = iter)
}

# Fit using our IRLS
cat("=== IRLS from scratch ===\n")
result <- irls_logistic(X, y)
se <- sqrt(diag(solve(result$fisher_info)))
print(data.frame(
  Estimate = round(result$beta, 4),
  Std.Error = round(se, 4),
  row.names = c("Intercept", "age", "sex", "cp", "thalach", "oldpeak")
))

# Compare with glm()
cat("\n=== Comparison with glm() ===\n")
glm_fit <- glm(target ~ age + sex + cp + thalach + oldpeak,
               data = heart, family = binomial())
print(coef(glm_fit))
```
```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Load and prepare data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
cols = ["age", "sex", "cp", "trestbps", "chol", "fbs",
        "restecg", "thalach", "exang", "oldpeak",
        "slope", "ca", "thal", "target"]
heart = pd.read_csv(url, header=None, names=cols)
heart['target'] = (heart['target'] > 0).astype(int)

# Prepare design matrix and response
X = np.column_stack([
    np.ones(len(heart)),
    heart['age'], heart['sex'], heart['cp'],
    heart['thalach'], heart['oldpeak']
])
y = heart['target'].values


def logistic(eta):
    """Logistic (inverse logit) function."""
    return 1 / (1 + np.exp(-np.clip(eta, -500, 500)))


def irls_logistic(X, y, tol=1e-8, max_iter=25):
    """IRLS for logistic regression."""
    n, p = X.shape
    beta = np.zeros(p)  # Initialize at zero
    for iteration in range(1, max_iter + 1):
        eta = X @ beta
        mu = logistic(eta)
        # Weights = Bernoulli variance mu * (1 - mu)
        w = mu * (1 - mu)
        w = np.maximum(w, 1e-10)  # Guard against zero weights
        # Working response
        z = eta + (y - mu) / w
        # Weighted least squares update
        W = np.diag(w)
        XtWX = X.T @ W @ X
        XtWz = X.T @ W @ z
        beta_new = np.linalg.solve(XtWX, XtWz)
        # Check convergence, keeping the latest iterate
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            print(f"Converged in {iteration} iterations")
            break
    return {'beta': beta, 'fisher_info': XtWX, 'iterations': iteration}


# Fit using our IRLS
print("=== IRLS from scratch ===")
result = irls_logistic(X, y)
se = np.sqrt(np.diag(np.linalg.inv(result['fisher_info'])))
params_df = pd.DataFrame({
    'Estimate': np.round(result['beta'], 4),
    'Std.Error': np.round(se, 4)
}, index=['Intercept', 'age', 'sex', 'cp', 'thalach', 'oldpeak'])
print(params_df)

# Compare with statsmodels
print("\n=== Comparison with statsmodels ===")
glm_fit = smf.glm(
    formula="target ~ age + sex + cp + thalach + oldpeak",
    data=heart,
    family=sm.families.Binomial()
).fit()
print(glm_fit.params)
```
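The negative log-likelihood can also be minimized directly with a general-purpose optimizer, which is the "numerical optimization" route. This standalone sketch uses synthetic data (so it runs without downloading the heart dataset) and supplies the score $-\mathbf{X}^T(\mathbf{y} - \mathbf{p})$ as the analytic gradient:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic logistic-regression data with known coefficients
rng = np.random.default_rng(42)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.0, -1.5])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta_true))))

def neg_loglik(beta, X, y):
    prob = 1 / (1 + np.exp(-np.clip(X @ beta, -500, 500)))
    prob = np.clip(prob, 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))

def neg_score(beta, X, y):
    # Gradient of the negative log-likelihood: -X^T (y - p)
    prob = 1 / (1 + np.exp(-(X @ beta)))
    return -(X.T @ (y - prob))

opt = minimize(neg_loglik, x0=np.zeros(3), args=(X, y),
               jac=neg_score, method="BFGS")
print(opt.x)  # close to beta_true
```

With enough data the BFGS minimizer and IRLS land on the same MLE; IRLS is preferred for GLMs because it also delivers the Fisher information (and hence standard errors) as a byproduct.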
1. Binomial log-likelihood is the binary cross-entropy loss (negated)
2. No closed-form solution because $p_i$ depends non-linearly on $\boldsymbol{\beta}$
3. IRLS converts the problem into a sequence of weighted least squares problems
4. Fisher scoring = Newton-Raphson with the expected Hessian; for logistic regression it typically converges in around 5 iterations