Part Twelve: Honest Predictions the slightly less easy way

Categories: statistics, R

Author: Jon Minton

Published: February 4, 2024

Aim

The last post ended by showing how the predict function can be used to produce point estimates and uncertainty intervals for expected values and predicted values from a model based on a toothsome dataset. In this post we will start with that model and look at other information that can be recovered from it: information that allows the effects of joint parameter uncertainty to be propagated through to prediction.

Recap of core concepts

Back in part 8 we stated that the cloud of uncertainty in model parameters, which results from having a limited number of observations in the data, can be represented as:

\[ \tilde{\theta} \sim MVN(\mu = \dot{\theta}, \sigma^2 = \Sigma) \]

where MVN means the multivariate normal distribution, which needs the two quantities \(\dot{\theta}\) and \(\Sigma\) as parameters.

Previously we showed how to extract (estimates of) these two quantities from optim(), where the first quantity, \(\dot{\theta}\), was taken from the converged parameter point estimate slot par, and the second quantity, \(\Sigma\), was derived from the hessian slot.
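As a quick refresher, that recipe looked something like the following sketch (the names start_values, negll, X and y here are placeholders standing in for the objects used in that post, rather than copies of the original code):

Code
# Sketch only: fit by minimising a negative log-likelihood, asking optim()
# to also return the hessian at the solution found
fit <- optim(par = start_values, fn = negll, X = X, y = y, hessian = TRUE)

theta_dot <- fit$par             # point estimates
Sigma_hat <- solve(fit$hessian)  # variance-covariance matrix (inverted hessian)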

But we don’t need to use optim() directly in order to recover these quantities. Instead we can get them from the standard model objects produced by either lm() or glm(). Let’s check this out…

Building our model

Let’s load the data and rebuild the model we arrived at previously:

Code
library(tidyverse)
df <- ToothGrowth |> tibble()
df
# A tibble: 60 × 3
     len supp   dose
   <dbl> <fct> <dbl>
 1   4.2 VC      0.5
 2  11.5 VC      0.5
 3   7.3 VC      0.5
 4   5.8 VC      0.5
 5   6.4 VC      0.5
 6  10   VC      0.5
 7  11.2 VC      0.5
 8  11.2 VC      0.5
 9   5.2 VC      0.5
10   7   VC      0.5
# ℹ 50 more rows
Code
best_model <- lm(len ~ log(dose) * supp, data = df)

summary(best_model)

Call:
lm(formula = len ~ log(dose) * supp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.5433 -2.4921 -0.5033  2.7117  7.8567 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       20.6633     0.6791  30.425  < 2e-16 ***
log(dose)          9.2549     1.2000   7.712  2.3e-10 ***
suppVC            -3.7000     0.9605  -3.852 0.000303 ***
log(dose):suppVC   3.8448     1.6971   2.266 0.027366 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.72 on 56 degrees of freedom
Multiple R-squared:  0.7755,    Adjusted R-squared:  0.7635 
F-statistic:  64.5 on 3 and 56 DF,  p-value: < 2.2e-16

Let’s now look at some convenience functions, beyond summary(), that work with lm() and glm() objects and recover the quantities the MVN representation of the uncertainty cloud requires.

Extracting quantities for modelling uncertainty

Firstly, for the point estimates \(\dot{\theta}\), we can use the coefficients() function:

Code
coef <- coefficients(best_model)

coef
     (Intercept)        log(dose)           suppVC log(dose):suppVC 
       20.663333         9.254889        -3.700000         3.844782 

And for the variance-covariance matrix, which represents joint uncertainty about the above estimates, we can use the vcov() function:

Code
Sig <- vcov(best_model)

Sig
                   (Intercept)     log(dose)        suppVC log(dose):suppVC
(Intercept)       4.612422e-01 -8.768056e-17 -4.612422e-01    -7.224251e-17
log(dose)        -8.768056e-17  1.440023e+00  1.753611e-16    -1.440023e+00
suppVC           -4.612422e-01  1.753611e-16  9.224843e-01     1.748938e-16
log(dose):suppVC -7.224251e-17 -1.440023e+00  1.748938e-16     2.880045e+00

Finally, we can extract the point estimate for stochastic variation in the model, i.e. the variation the model assumes would remain even if parameter uncertainty were eliminated, using the sigma() function:

Code
sig <- sigma(best_model)

sig
[1] 3.719847
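As a quick sanity check (my addition, not part of the original exposition): for a linear model, sigma() is just the residual standard error reported in the summary() output above, which we can reproduce manually:

Code
# residual standard error: sqrt(RSS / residual degrees of freedom)
sqrt(sum(residuals(best_model)^2) / df.residual(best_model))
[1] 3.719847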

We now have three quantities, coef, Sig and sig (note the upper and lower case s in the last two). These provide something almost, but not exactly, equivalent to the contents of par and what we derived from hessian when using optim() previously. The section below explains this distinction in more detail.

Back to the weeds (potentially skippable)

Recall the ‘grandmother formulae’, from King, Tomz, and Wittenberg (2000), which the first few posts in this series started with:

Stochastic Component

\[ Y_i \sim f(\theta_i, \alpha) \]

Systematic Component

\[ \theta_i = g(X_i, \beta) \]

For standard linear regression this becomes:

Stochastic Component

\[ Y_i \sim Norm(\theta_i, \sigma^2) \]

Systematic Component

\[ \theta_i =X_i \beta \]

Our main parameters are \(\theta_i\), which combine our predictors \(X_i\) and our model parameters \(\beta\). Of these two components, we know the data - they are what they are - but we are merely estimating our model parameters \(\beta\). So any estimation uncertainty in this part of the equation results from \(\beta\) alone.

Our ancillary parameter is \(\sigma^2\). This is our estimate of how much fundamental variation there is in how the data (the response variables \(Y\)) are drawn from the stochastic data generating process.

When we used optim() directly, we estimated \(\sigma^2\) along with the \(\beta\) parameters, via the parameter \(\eta\), defined such that \(\sigma^2 = e^{\eta}\), which allows optim() to search over an unbounded real number range. If there are k \(\beta\) parameters, optim()’s par vector therefore contained k + 1 values, with the last value being the point estimate for \(\eta\). Similarly, the variance-covariance matrix derivable from optim()’s hessian slot was (k + 1) by (k + 1) rather than k by k, with the last row, column, and diagonal element measuring the covariance between \(\eta\) and the \(\beta\) elements, and the variance of \(\eta\) itself.
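To make that k + 1 structure concrete, here is a sketch of what such a log-likelihood function might look like for standard linear regression (illustrative only; the function name is mine rather than taken from the earlier posts):

Code
# The par vector has k beta elements plus one final eta element
negll_linreg <- function(par, X, y) {
    k <- ncol(X)
    beta <- par[1:k]
    eta <- par[k + 1]       # eta = log(sigma^2), so optim() can search all reals
    sigma <- sqrt(exp(eta))
    -sum(dnorm(y, mean = X %*% beta, sd = sigma, log = TRUE))
}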

By contrast, coefficients(best_model) returns a vector of length k, the number of \(\beta\) parameters being estimated, and vcov(best_model) returns a k by k matrix.

This means coefficients(model), vcov(model) and sigma(model) return one fewer type of information about model parameters than was potentially recoverable from optim()’s par and hessian slots: namely, uncertainty about the true value of the ancillary parameter \(\sigma^2\). The following table summarises this difference:

| Information type | via optim | via lm and glm |
|---|---|---|
| Main parameters: point | first k elements of par | coefficients() function |
| Main parameters: uncertainty | first k rows and columns of hessian | vcov() function |
| Ancillary parameters: point | (k+1)th through to last element of par | sigma() function or equivalent for glm() |
| Ancillary parameters: uncertainty | last rows and columns of hessian (after rows and columns k) | not available |

So long as capturing uncertainty about the fundamental variability in the stochastic part of the model isn’t critical to our predictions, the omission of a measure of uncertainty in the ancillary parameters \(\alpha\) is likely a price worth paying for the additional convenience of being able to use the model objects directly. However, we should be aware that, whereas with optim() we potentially have both \(\tilde{\beta}\) and \(\tilde{\alpha}\) to represent model uncertainty, when using the three convenience functions coefficients(), vcov() and sigma() we technically ‘only’ have \(\tilde{\beta}\) and \(\dot{\alpha}\) (i.e. point estimates alone for the ancillary parameters).

With the above caveat in mind, let’s now look at using the results of coefficients(), vcov() and sigma() to generate (mostly) honest representations of expected values, predicted values, and first differences.

Model predictions

As covered in part two, we can use the mvrnorm function from the MASS package to create \(\tilde{\beta}\), our parameter estimates with uncertainty:

Parameter simulation

Code
beta_tilde <- MASS::mvrnorm(
    n = 10000, 
    mu = coef, 
    Sigma = Sig
)

head(beta_tilde)
     (Intercept) log(dose)    suppVC log(dose):suppVC
[1,]    21.06791  7.645689 -3.819112         6.625200
[2,]    21.48901  7.676168 -3.516544         4.243647
[3,]    19.14443  9.214789 -1.195671         3.862845
[4,]    20.92187 10.205717 -3.823071         2.742209
[5,]    20.37284  9.212964 -2.789648         3.912550
[6,]    20.44124  9.132118 -3.280281         1.993097

Let’s first look at each of these parameters individually:

Code
beta_tilde |> 
    as_tibble() |>
    pivot_longer(everything(), names_to = "coefficient", values_to = "value") |> 
    ggplot(aes(x = value)) + 
    facet_grid(coefficient ~ .) + 
    geom_histogram(bins = 100) + 
    geom_vline(xintercept = 0)

Now let’s look at a couple of coefficients jointly, to see how they’re correlated. Firstly the association between the intercept and the log dosage:

Code
beta_tilde |> 
    as_tibble() |>
    ggplot(aes(x = `(Intercept)`, y = `log(dose)`)) + 
    geom_density_2d_filled() + 
    coord_equal()

Here the covariance between the two parameters appears very low. Now let’s look at how log dosage and Vitamin C supplement factor are associated:

Code
beta_tilde |> 
    as_tibble() |>
    ggplot(aes(x = `log(dose)`, y = suppVC)) + 
    geom_density_2d_filled() + 
    coord_equal()

Again, the covariance appears low. Finally, the association between log dose and the interaction term:

Code
beta_tilde |> 
    as_tibble() |>
    ggplot(aes(x = `log(dose)`, y = `log(dose):suppVC`)) + 
    geom_density_2d_filled() + 
    coord_equal()

Here we have a much stronger negative covariance between the two coefficients. Let’s look at the variance-covariance extracted from the model previously to confirm this:

Code
knitr::kable(Sig)
|                  | (Intercept) | log(dose) | suppVC     | log(dose):suppVC |
|------------------|------------:|----------:|-----------:|-----------------:|
| (Intercept)      |   0.4612422 |  0.000000 | -0.4612422 |         0.000000 |
| log(dose)        |   0.0000000 |  1.440023 |  0.0000000 |        -1.440023 |
| suppVC           |  -0.4612422 |  0.000000 |  0.9224843 |         0.000000 |
| log(dose):suppVC |   0.0000000 | -1.440023 |  0.0000000 |         2.880045 |

Here we can see that the covariance between the intercept and log dose is effectively zero, as is the covariance between the intercept and the interaction term, and between log(dose) and the suppVC factor. However, there is a negative covariance between log dose and the interaction term, i.e. what we plotted above, and also between the intercept and the suppVC factor. For completeness, let’s look at this last association, which we expect to show a negative relationship:

Code
beta_tilde |> 
    as_tibble() |>
    ggplot(aes(x = `(Intercept)`, y = suppVC)) + 
    geom_density_2d_filled() + 
    coord_equal()

Yes it is! The parameter estimates follow the covariance structure provided by Sig, as we would expect.
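We can also confirm this numerically rather than visually (a small addition to the original walkthrough): the empirical covariance matrix of the draws should be close to Sig.

Code
# Differences between the covariance of the simulated draws and the target
# matrix should be near zero, and shrink further as n increases
round(cov(beta_tilde) - Sig, 2)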

Expected values

Let’s say we are initially interested in the expected values for a dosage of 1.25mg, with the OJ (rather than VC) supplement:

Code
# first element is 1 due to intercept
predictor <- c(1, log(1.25), 0, 0) 

predictions_ev <- apply(
    beta_tilde, 
    1, 
    FUN = function(this_beta) predictor %*% this_beta
)

head(predictions_ev)
[1] 22.77400 23.20190 21.20065 23.19921 22.42865 22.47902
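As an aside (my addition, not part of the original recipe): the apply() call above is just a row-by-row dot product, so the same expected values can be produced in a single matrix multiplication, which is both shorter and faster:

Code
# Equivalent to the apply() loop: multiply the whole draw matrix
# (10,000 x 4) by the predictor vector (length 4) in one step
predictions_ev_alt <- as.vector(beta_tilde %*% predictor)
head(predictions_ev_alt)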

Let’s now get a 95% credible interval:

Code
quantile(predictions_ev, probs = c(0.025, 0.500, 0.975))
    2.5%      50%    97.5% 
21.31125 22.73087 24.18298 

So, the 95% interval for the expected value is between 21.31 and 24.18, with a middle (median) estimate of 22.73.1 Let’s check this against estimates from the predict() function:

Code
predict(best_model, newdata = data.frame(dose = 1.25, supp = "OJ"), interval = 'confidence')
      fit      lwr      upr
1 22.7285 21.26607 24.19093

The expected values using the predict function give a 95% confidence interval of 21.27 to 24.19, with a point estimate of 22.73. These are not identical, as the methods employed are not identical,2 but they are hopefully similar enough to demonstrate they are attempts at getting at the same quantities of interest.

Predicted values

Predicted values also include the inherent stochastic variation from the ancillary parameters \(\alpha\), which for linear regression is \(\sigma^2\). We can simply add this variation onto the expected values above to produce predicted values:

Code
n <- length(predictions_ev)

shoogliness <- rnorm(n = n, mean = 0, sd = sig)

predictions_pv <- predictions_ev + shoogliness


head(predictions_pv)
[1] 19.19537 19.87588 20.12654 28.13522 26.16638 22.19669

Let’s get the 95% interval from the above using quantile

Code
quantile(predictions_pv, probs = c(0.025, 0.5000, 0.975))
    2.5%      50%    97.5% 
15.17997 22.73280 30.16406 

As expected, the interval is now much wider, running from 15.18 to 30.16. The central estimate should, in theory, with an infinite number of simulations, be the same as for the expected values; because of random variation, however, it will never be exactly the same to an arbitrary number of decimal places. In this case, the middle estimate is 22.733, close to but not identical to the central estimate of 22.731 from the expected values distribution. The number of simulations can always be increased to produce greater precision if needed.
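For example (a sketch, my addition rather than the original post’s), rerunning the pipeline with ten times as many draws should shrink the Monte Carlo error in the quantile estimates, at the cost of a little more computation:

Code
# More draws -> smaller Monte Carlo error in the estimated quantiles
beta_tilde_big <- MASS::mvrnorm(n = 100000, mu = coef, Sigma = Sig)
ev_big <- as.vector(beta_tilde_big %*% predictor)
pv_big <- ev_big + rnorm(length(ev_big), mean = 0, sd = sig)
quantile(pv_big, probs = c(0.025, 0.500, 0.975))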

Let’s now compare this with the prediction interval produced by the predict function:

Code
predict(best_model, newdata = data.frame(dose = 1.25, supp = "OJ"), interval = 'prediction')
      fit      lwr     upr
1 22.7285 15.13461 30.3224

Again, the interval estimates are not exactly the same, but they are very similar.

First differences

It’s in the production of estimates of first differences - this, compared to that, holding all else constant - that the simulation approach shines for producing estimates with credible uncertainty. In our case, let’s say we are interested in asking:

What is the expected effect of using the VC supplement, rather than the OJ supplement, where the dose is 1.25mg?

So, the first difference is from switching from OJ to VC, holding the other factor constant.

We can answer this question by using the same selection of \(\tilde{\beta}\) draws, but passing two different scenarios:

Code
#scenario 0: supplement is OJ
predictor_x0 <- c(1, log(1.25), 0, 0) 

#scenario 1: supplement is VC
predictor_x1 <- c(1, log(1.25), 1, 1 * log(1.25)) 


predictions_ev_x0 <- apply(
    beta_tilde, 
    1, 
    FUN = function(this_beta) predictor_x0 %*% this_beta
)

predictions_ev_x1 <- apply(
    beta_tilde, 
    1, 
    FUN = function(this_beta) predictor_x1 %*% this_beta
)

predictions_df <- 
    tibble(
        x0 = predictions_ev_x0,
        x1 = predictions_ev_x1
    ) |>
    mutate(
        fd = x1 - x0
    )

predictions_df
# A tibble: 10,000 × 3
      x0    x1     fd
   <dbl> <dbl>  <dbl>
 1  22.8  20.4 -2.34 
 2  23.2  20.6 -2.57 
 3  21.2  20.9 -0.334
 4  23.2  20.0 -3.21 
 5  22.4  20.5 -1.92 
 6  22.5  19.6 -2.84 
 7  22.1  20.0 -2.17 
 8  23.4  20.3 -3.17 
 9  21.7  19.9 -1.78 
10  21.8  20.0 -1.85 
# ℹ 9,990 more rows

Let’s look at the distribution of both scenarios individually:

Code
predictions_df |>
    pivot_longer(everything(), names_to = "scenario", values_to = "estimate") |>
    filter(scenario != "fd") |>
    ggplot(aes(x = estimate)) + 
    facet_wrap(~scenario, ncol = 1) + 
    geom_histogram(bins = 100)

And the distribution of the pairwise differences between them:

Code
predictions_df |>
    pivot_longer(everything(), names_to = "scenario", values_to = "estimate") |>
    filter(scenario == "fd") |>
    ggplot(aes(x = estimate)) + 
    geom_histogram(bins = 100) + 
    geom_vline(xintercept = 0)

It’s this last distribution which shows our first differences, i.e. our answer, hedged with an appropriate dose of uncertainty, to the specific question shown above. We can get a 95% interval of the first difference as follows:

Code
predictions_df |>
    pivot_longer(everything(), names_to = "scenario", values_to = "estimate") |>
    filter(scenario == "fd") |> 
    pull('estimate') |>
    quantile(probs = c(0.025, 0.500, 0.975))
      2.5%        50%      97.5% 
-4.8919856 -2.8561481 -0.8011861 

So, 95% of estimates of the first difference are between -4.89 and -0.80, with the middle of this distribution (on this occasion) being -2.86.

Unlike with the expected values and predicted values, the predict() function does not return first differences with honest uncertainty in this way. What we have above is something new.
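To make the recipe easier to reuse, here is one way it could be packaged up (a sketch of my own, not from the original series; the function name and arguments are invented for illustration), with model.matrix() building the predictor rows from scenario data frames so we don’t have to write out dummy codings and interaction terms by hand:

Code
# Simulate first differences in expected values between two scenarios,
# each supplied as a single-row data frame of predictor values
simulate_first_difference <- function(model, scenario_0, scenario_1, n_sims = 10000) {
    # Draw from the estimated parameter uncertainty cloud
    beta_tilde <- MASS::mvrnorm(n = n_sims, mu = coefficients(model), Sigma = vcov(model))
    # Build a design row for each scenario, respecting the model's factor levels
    tt <- delete.response(terms(model))
    X0 <- model.matrix(tt, data = scenario_0, xlev = model$xlevels)
    X1 <- model.matrix(tt, data = scenario_1, xlev = model$xlevels)
    # Expected values under each scenario, differenced pairwise
    as.vector(beta_tilde %*% t(X1)) - as.vector(beta_tilde %*% t(X0))
}

# Should reproduce the OJ-to-VC comparison above, up to simulation noise
fd <- simulate_first_difference(
    best_model,
    scenario_0 = data.frame(dose = 1.25, supp = "OJ"),
    scenario_1 = data.frame(dose = 1.25, supp = "VC")
)
quantile(fd, probs = c(0.025, 0.500, 0.975))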

Summary

In this post we’ve finally combined all the learning we’ve developed over the 11 previous posts to answer three specific ‘what if?’ questions: one on expected values, one on predicted values, and one on first differences. These are what King, Tomz, and Wittenberg (2000) refer to as quantities of interest, and I hope you agree these are more organic and reasonable types of question to ask of data and statistical models than simply looking at coefficients and p-values and reporting which ones are ‘statistically significant’.

If you’ve been able to follow everything in these posts, and can generalise the approach shown above to other types of statistical model, then congratulations! You’ve learned the framework for answering meaningful questions using statistical models which is at the heart of one of the toughest methods courses for social scientists offered by one of the most prestigious universities in the world.3

Coming up

The next post uses the same dataset and model we’ve developed and applied, but shows how it can be implemented using a Bayesian rather than Frequentist modelling approach. In some ways it’s very familiar, but in others it introduces a completely new paradigm for how models are fitted and run.

References

King, Gary, Michael Tomz, and Jason Wittenberg. 2000. “Making the Most of Statistical Analyses: Improving Interpretation and Presentation.” American Journal of Political Science 44: 341–355. http://gking.harvard.edu/files/abs/making-abs.shtml.

Footnotes

  1. These are the values produced the first time I ran the simulation. They are likely to be a little different each time, so may not match, to the number of decimal places reported, the output when I next render this document. These estimates are approximations.↩︎

  2. Because the simulation approach relies on random numbers, the draws will never be the same unless the same random number seed is set using set.seed(). However, with more simulations, set via the n parameter of mvrnorm, the distributions of estimates should become ever closer to each other.↩︎

  3. I took this course via the Harvard extension school while doing my PhD in York quite a few years ago. I took it as a non-credit option - as what’s the value of a fraction of a degree when I had two already? - but was told by the tutors that I’d completed it to a ‘Grade A-’ level. So, not perfect, but good enough…↩︎