
Tutorial 2: Implementation

Fit your logistic regression model in R and Python


Tutorial Complete!

You've successfully specified a logistic regression model. Here's how to fit it in practice.

Your Complete Model

$\text{HeartDisease} \sim \text{Binomial}(1, p)$ where $\text{logit}(p) = \beta_0 + \beta_1 \cdot \text{Age} + \beta_2 \cdot \text{Sex} + \beta_3 \cdot \text{CP} + \beta_4 \cdot \text{MaxHR} + \beta_5 \cdot \text{STDep}$
Response: Heart Disease (0/1)
Link: Logit
Family: Binomial
Method: IRLS (5 iterations)
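The logit link means the model's linear predictor lives on the log-odds scale; the inverse logit (sigmoid) maps it back to a probability. A minimal, framework-free sketch of that mapping:

```python
import math

def inv_logit(eta):
    """Inverse logit (sigmoid): map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# A linear predictor of 0 means even odds: p = 0.5
print(inv_logit(0.0))            # 0.5
# Large positive log-odds push p toward 1
print(round(inv_logit(4.0), 3))  # 0.982
```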

R implementation

1 Load the data

# Load the UCI Heart Disease dataset
heart <- read.csv(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",
  header = FALSE
)
names(heart) <- c("age", "sex", "cp", "trestbps", "chol", "fbs",
                  "restecg", "thalach", "exang", "oldpeak",
                  "slope", "ca", "thal", "target")

# Binarise target: 0 = no disease, 1 = disease
heart$target <- ifelse(heart$target > 0, 1, 0)

# Preview the data
head(heart)

2 Fit the GLM

# Fit logistic regression using glm()
model <- glm(
  target ~ age + sex + cp + thalach + oldpeak,
  data = heart,
  family = binomial(link = "logit")
)
Note: family = binomial(link = "logit") specifies logistic regression. The logit link is the default for binomial, so you could also write just family = binomial.

3 View results

# Summary of fitted model
summary(model)

# Extract coefficients (log-odds scale)
coef(model)

# Calculate odds ratios
exp(coef(model))

# Confidence intervals for odds ratios
exp(confint(model))

4 Assess model fit

# Deviance and AIC
model$deviance   # Residual deviance
model$null.deviance  # Null deviance
AIC(model)

# Pseudo R-squared (McFadden)
1 - (model$deviance / model$null.deviance)

# Confusion matrix (at threshold 0.5)
predicted <- ifelse(predict(model, type = "response") > 0.5, 1, 0)
table(Actual = heart$target, Predicted = predicted)

Expected Output

Call:
glm(formula = target ~ age + sex + cp + thalach + oldpeak,
    family = binomial(link = "logit"), data = heart)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.165518   2.025288  -1.563  0.11805
age          0.035939   0.018825   1.909  0.05624 .
sex          1.674501   0.350569   4.777 1.78e-06 ***
cp           0.896265   0.170199   5.266 1.39e-07 ***
thalach     -0.024663   0.007989  -3.087  0.00202 **
oldpeak      0.682883   0.152781   4.470 7.83e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 417.98  on 302  degrees of freedom
Residual deviance: 273.13  on 297  degrees of freedom
AIC: 285.13

Number of Fisher Scoring iterations: 5

Python implementation

1 Import libraries and load data

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load the UCI Heart Disease dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
cols = ["age", "sex", "cp", "trestbps", "chol", "fbs",
        "restecg", "thalach", "exang", "oldpeak",
        "slope", "ca", "thal", "target"]
heart = pd.read_csv(url, header=None, names=cols)

# Binarise target: 0 = no disease, 1 = disease
heart['target'] = (heart['target'] > 0).astype(int)

# Preview the data
print(heart.head())

2 Fit the GLM

# Fit logistic regression using formula interface
model = smf.glm(
    formula="target ~ age + sex + cp + thalach + oldpeak",
    data=heart,
    family=sm.families.Binomial()
).fit()
Note: Binomial() uses the logit link by default. You could explicitly specify Binomial(link=sm.families.links.Logit()).
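To see what the fitted model actually computes, you can reproduce a single prediction by hand. The sketch below uses the (rounded) coefficient estimates shown in the expected output further down; the patient profile (age 55, male, cp = 4, max HR 150, ST depression 1.0) is an arbitrary illustration, not a row from the dataset:

```python
import math

# Coefficient estimates from the fitted model's summary (rounded)
coefs = {"Intercept": -3.1655, "age": 0.0359, "sex": 1.6745,
         "cp": 0.8963, "thalach": -0.0247, "oldpeak": 0.6829}

# Hypothetical patient: 55-year-old male, cp = 4, max HR 150, ST depression 1.0
x = {"age": 55, "sex": 1, "cp": 4, "thalach": 150, "oldpeak": 1.0}

# Linear predictor (log-odds), then inverse logit to get a probability
eta = coefs["Intercept"] + sum(coefs[k] * v for k, v in x.items())
p = 1.0 / (1.0 + math.exp(-eta))
print(f"log-odds = {eta:.3f}, p = {p:.3f}")  # p is roughly 0.74 here
```

This is exactly what `model.predict()` does for every row: compute the linear predictor from the coefficients, then pass it through the inverse link.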

3 View results

# Summary of fitted model
print(model.summary())

# Extract coefficients (log-odds scale)
print(model.params)

# Calculate odds ratios
print(np.exp(model.params))

# Confidence intervals for odds ratios
print(np.exp(model.conf_int()))
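Exponentiating a coefficient turns it into an odds ratio: the multiplicative change in the odds of disease for a one-unit increase in that predictor, holding the others constant. As a worked example using the sex coefficient from the expected output below (an illustration of the arithmetic, not a substitute for running the model):

```python
import math

beta_sex = 1.6745              # log-odds coefficient for sex (male = 1)
odds_ratio = math.exp(beta_sex)
print(f"OR for sex: {odds_ratio:.2f}")  # about 5.34
# An odds ratio around 5.3 means males have roughly 5.3 times the odds
# of heart disease as females, with the other predictors held fixed.
```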

4 Assess model fit

# Deviance and AIC
print(f"Deviance: {model.deviance:.2f}")
print(f"AIC: {model.aic:.2f}")

# Pseudo R-squared (McFadden)
pseudo_r2 = 1 - (model.deviance / model.null_deviance)
print(f"Pseudo R²: {pseudo_r2:.4f}")

# Confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
predicted = (model.predict() > 0.5).astype(int)
print(confusion_matrix(heart['target'], predicted))
print(classification_report(heart['target'], predicted))
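From a confusion matrix you can also compute the headline metrics by hand. The counts below are a hypothetical 2x2 matrix for a 303-row dataset, purely to show the arithmetic; substitute the counts your own run produces:

```python
import numpy as np

# Hypothetical confusion matrix (rows: actual, cols: predicted)
cm = np.array([[140,  24],   # actual 0: true negatives, false positives
               [ 25, 114]])  # actual 1: false negatives, true positives
tn, fp, fn, tp = cm.ravel()

accuracy    = (tp + tn) / cm.sum()  # overall agreement
sensitivity = tp / (tp + fn)        # recall for the disease class
specificity = tn / (tn + fp)        # recall for the no-disease class
print(f"accuracy={accuracy:.3f}, sensitivity={sensitivity:.3f}, "
      f"specificity={specificity:.3f}")
```

Sensitivity and specificity move in opposite directions as you change the 0.5 threshold, which is why a single accuracy number rarely tells the whole story.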

Expected Output

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                 target   No. Observations:                  303
Model:                            GLM   Df Residuals:                      297
Model Family:                Binomial   Df Model:                            5
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -136.57
Date:                Thu, 11 Dec 2025   Deviance:                       273.13
Time:                        16:37:21   Pearson chi2:                     309.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.3800
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -3.1655      2.025     -1.563      0.118      -7.135       0.804
age            0.0359      0.019      1.909      0.056      -0.001       0.073
sex            1.6745      0.351      4.777      0.000       0.987       2.362
cp             0.8963      0.170      5.266      0.000       0.563       1.230
thalach       -0.0247      0.008     -3.087      0.002      -0.040      -0.009
oldpeak        0.6829      0.153      4.470      0.000       0.383       0.982
==============================================================================

What's Next?