Part Seven: Feeling Uncertain

statistics
R
Author

Jon Minton

Published

January 4, 2024

Aim

In the previous post we managed to use numerical optimisation, with the optim() function, to produce good \(\beta\) estimates for a linear regression model fit to some toy data. In this post, we will explore how the optim() function can be used to produce estimates of uncertainty about these \(\beta\) coefficients, and how these relate to the measures of uncertainty presented in the standard lm and glm summary functions.

Prereqs

As before, we’ll be using the same toy dataset, and same log likelihood function, as in the last two posts in this series. Let’s create these again:

Code
llNormal <- function(pars, y, X){
    # The first ncol(X) elements of pars are the beta coefficients
    beta <- pars[1:ncol(X)]
    # The final element is eta = log(sigma^2), so sigma^2 is always positive
    sigma2 <- exp(pars[ncol(X)+1])
    # Log likelihood for the Normal model, dropping the constant term
    -1/2 * (sum(log(sigma2) + (y - (X%*%beta))^2 / sigma2))
}
Code
# set a seed so runs are identical
set.seed(7)
# create a main predictor variable vector: -3 to 5 in increments of 1
x <- (-3):5
# Record the number of observations in x
N <- length(x)
# Create a response variable with variability
y <- 2.5 + 1.4 * x  + rnorm(N, mean = 0, sd = 0.5)

# bind x into a two column matrix whose first column is a vector of 1s (for the intercept)

X <- cbind(rep(1, N), x)
# Clean up names
colnames(X) <- NULL

Let’s also run and save our parameter estimates, produced both ‘the hard way’ (using optim) and ‘the easier way’ (using glm):

Code
optim_results <-  optim(
    # par contains our initial guesses for the three parameters to estimate
    par = c(0, 0, 0), 

    # by default, most optim algorithms search for a minimum (lowest point) rather than a maximum 
    # (highest point). So, I'm passing a function to call which simply inverts the log likelihood by 
    # multiplying what it returns by -1
    fn = function(par, y, X) {-llNormal(par, y, X)}, 

    # in addition to the par vector, our function also needs the observed output (y)
    # and the observed predictors (X). These have to be specified as additional arguments.
    y = y, X = X
    )

optim_results
$par
[1]  2.460571  1.375421 -1.336209

$value
[1] -1.51397

$counts
function gradient 
     216       NA 

$convergence
[1] 0

$message
NULL
Code
pars_optim <- optim_results$par

names(pars_optim) <- c("beta0", "beta1", "eta")

pars_optim
    beta0     beta1       eta 
 2.460571  1.375421 -1.336209 
Code
library(tidyverse)
df <- tibble(x = x, y = y)
mod_glm <- glm(y ~ x, data = df, family = gaussian(link="identity"))
summary(mod_glm)

Call:
glm(formula = y ~ x, family = gaussian(link = "identity"), data = df)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.46067    0.20778   11.84 6.95e-06 ***
x            1.37542    0.07504   18.33 3.56e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.3378601)

    Null deviance: 115.873  on 8  degrees of freedom
Residual deviance:   2.365  on 7  degrees of freedom
AIC: 19.513

Number of Fisher Scoring iterations: 2

Both optim and the summary of mod_glm report \(\{\beta_0 = 2.46, \beta_1 = 1.38\}\), so both approaches appear to arrive at the same point on the log likelihood surface.

However, note that the glm summary reports not just the estimates themselves (in the Estimate column of coefficients), but also standard errors (the Std. Error column) and derived quantities (t value, Pr(>|t|), and the damnable stars at the very right of the table). How can these measures of uncertainty about the true value of the \(\beta\) coefficients be derived from optim?

Barefoot and Blind: A weird analogy for a complicated idea

Imagine optim, your hill-finding robo-chauffeur, has taken you to the top of a likelihood surface. Then it leaves you there…

… and you’re blind, and have no shoes. (You also have an uncanny sense of your orientation, whether north-south, east-west, or some other angle.)

So, you know you’re at the top of the hill, but you can’t see what the landscape around you looks like. However, you still want to get a sense of this landscape, and how it varies around the spot you’re standing on.

What do you do?

If you’re playing along with this weird thought experiment, one approach would be to use your feet as depth sensors. You make sure you never stray from where you started, always keeping one foot planted on this initial spot (which you understand to be the highest point on the landscape). Then you use your other foot to work out how much further down the surface is as you venture away from that highest point in different directions.

Say you keep your left foot planted on the highest point, and make sure your right foot is always positioned (say) 10cm horizontally from your left foot. Initially your two feet are arranged east-west; let’s call this 0 degrees. When you put your right foot down, you notice it needs to travel 2cm further down to reach terra firma relative to your left foot.

2cm at 0 degrees. You’ll remember that.

Now you rotate yourself 45 degrees, and repeat the same right foot drop. This time it needs to travel 3cm down relative to your left foot.

3cm at 45 degrees. You remember that too.

Now you rotate another 45 degrees, so you’re in a north-south orientation, and place your right foot down; this time it falls 5cm relative to your left foot.

2cm at 0 degrees; 3cm at 45 degrees; 5cm at 90 degrees.

Now with this information, you try to construct the landscape you’re on top of with your mind’s eye, making the assumption that the way it has to have curved from the peak you’re on to lead to the drops you’ve observed is consistent all around you; i.e. that there’s only one hill, you’re on top of it, and it’s smoothly curved in all directions.

If you could further entertain the idea that your feet are infinitely small, and the gap between feet is also infinitely small (rather than the 10cm above), then you have the intuition behind this scary-looking but very important formula from King (1998) (p. 89):

\[ \widehat{V(\hat{\theta})} = - \frac{1}{n}\left[\frac{\partial^2 \ln L(\tilde{\theta}|y)}{\partial \tilde{\theta}\, \partial \tilde{\theta}'}\right]^{-1}_{\tilde{\theta} = \hat{\theta}} \]

What this is saying, in something closer to humanese, is something like:

Our best estimate of the amount of uncertainty we have in our estimates is a function of how much the likelihood surface curves at the highest point on the surface. (It also gets less uncertain, the more observations we have).
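To make this a little more concrete, here is a minimal one-parameter sketch (not part of the post series itself: the simulated data z, the known sd of 2, and the finite-difference step h are arbitrary choices for illustration). For the mean of a Normal distribution with known standard deviation, minus the inverse of the numerically estimated second derivative of the log likelihood at its peak recovers the familiar \(\sigma^2/n\) variance of the sample mean.

Code
# One-parameter illustration: log likelihood of a Normal mean, with sd assumed known
set.seed(7)
z <- rnorm(25, mean = 10, sd = 2)

ll_mu <- function(mu) sum(dnorm(z, mean = mu, sd = 2, log = TRUE))

# The peak of this log likelihood is at the sample mean
mu_hat <- mean(z)

# Central finite-difference estimate of the second derivative (the curvature) at the peak
h <- 1e-4
curvature <- (ll_mu(mu_hat + h) - 2 * ll_mu(mu_hat) + ll_mu(mu_hat - h)) / h^2

# Minus the inverse of the curvature ...
-1 / curvature
# ... should be very close to the theoretical variance of the sample mean, sigma^2 / n
2^2 / 25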

Information and uncertainty

Amongst the various bells, whistles and decals in the previous formula is the superscript \((.)^{-1}\). This means invert, which for a single value means \(\frac{1}{.}\), but for a matrix means something conceptually similar though technically more involved (the matrix inverse).
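As a quick aside (the 2×2 matrix below is just a made-up example), base R’s solve() function performs matrix inversion when given a single matrix: multiplying a matrix by its inverse gives the identity matrix, much as multiplying a number by its reciprocal gives 1.

Code
# A made-up 2x2 matrix
M <- matrix(c(2, 1, 1, 3), nrow = 2)
# solve() with a single matrix argument returns the matrix inverse
M_inv <- solve(M)
# Their product is (numerically) the 2x2 identity matrix
M %*% M_inv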

And what’s being inverted in the last formula? A horrible-looking expression, \([\frac{\partial^2 \ln L(\tilde{\theta}|y)}{\partial \tilde{\theta}\, \partial \tilde{\theta}'}]_{\tilde{\theta} = \hat{\theta}}\), which is basically an answer to the question: how curvy is the log likelihood surface at its peak?

Within King (1998) (p.89, eq. 4.18), this expression (or rather the negative of the term) is defined as \(I(\hat{\theta} | y)\), where \(I(.)\) stands for information.

So, the algebra is saying

Uncertainty is inversely related to information

Or perhaps even more intuitively

The more information we have, the less uncertain we are

Of course this makes sense. If you ask someone “How long will this task take?”, and they say “Between one hour and one month”, they likely have less information about how long the task will actually take than if they had said “Between two and a half and three hours”. More generally:

  • Shallow gradients mean wide uncertainty intervals mean low information
  • Sharp gradients mean narrow uncertainty intervals mean high information

This is, fundamentally, what the blind and barefoot person in the previous analogy is trying to achieve: by feeling out the local curvature around the highest point, they are trying to work out how much information they have about different pieces of the model. The curvature along any one dimension of the surface (equivalent to the 0 and 90 degree explorations) indicates how much information there is about any single coefficient, and the curvature along the equivalent of a 45 degree plane gives a measure of how associated any two coefficients tend to be.
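If you want to act out the barefoot exploration on the toy example above, here is a minimal sketch using the objects already created (the step size of 0.1 and the direction labels are arbitrary choices for illustration): starting from the peak that optim found, step a fixed distance in different directions in \((\beta_0, \beta_1)\) space, holding the third parameter fixed, and record how far the log likelihood drops.

Code
# Height of the log likelihood at the peak optim found
peak <- llNormal(pars_optim, y, X)
step <- 0.1

# How far does the log likelihood fall when we step a fixed distance from the
# peak in a given (beta0, beta1) direction? (eta, the third parameter, is held fixed)
drop_at <- function(direction) {
    shifted <- pars_optim + c(direction * step, 0)
    peak - llNormal(shifted, y, X)
}

c(
    beta0_only = drop_at(c(1, 0)),            # the '0 degrees' direction
    both       = drop_at(c(1, 1) / sqrt(2)),  # the '45 degrees' direction
    beta1_only = drop_at(c(0, 1))             # the '90 degrees' direction
)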

With these many analogies and equations spinning in our heads, let’s now see how these concepts can be applied in practice.

Optimal uncertainty

Having reminded myself of the particular options for optim that are typically used to report parameter uncertainty, let’s run the following:

Code
fuller_optim_output <- optim(
    par = c(0, 0, 0), 
    fn = llNormal,
    method = "BFGS",
    control = list(fnscale = -1),
    hessian = TRUE,
    y = y, 
    X = X
)

fuller_optim_output
$par
[1]  2.460675  1.375424 -1.336438

$value
[1] 1.51397

$counts
function gradient 
      80       36 

$convergence
[1] 0

$message
NULL

$hessian
              [,1]          [,2]          [,3]
[1,] -3.424917e+01 -3.424917e+01  2.727840e-05
[2,] -3.424917e+01 -2.625770e+02 -2.818257e-05
[3,]  2.727840e-05 -2.818257e-05 -4.500002e+00

We have used a slightly different algorithm (‘BFGS’), and a different way of specifying the function to search over (using fnscale = -1 to invert the likelihood), but we have the same par estimates as before: \(\beta = \{\beta_0 = 2.46, \beta_1 = 1.38\}\). So the changes we’ve made to the optim arguments haven’t changed what it estimates.
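As a quick sanity check (just printing rounded values, no new estimation), we can confirm the two optim runs agree on the \(\beta\) estimates:

Code
# beta estimates from the first run (minimising -llNormal) ...
round(optim_results$par[1:2], 3)
# ... and from the second run (maximising llNormal via fnscale = -1)
round(fuller_optim_output$par[1:2], 3)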

One new argument we’ve set in optim is hessian = TRUE. Hessian is a kind of coarse fabric made from vegetable waste, typically woven in a criss-crossing, grid-like pattern. Hessian matrices are matrices of second derivatives, as described in the Wikipedia article.1 If you can bear to look back at the really complex expression above, for calculating the curvature around a point on a surface, you’ll notice it too is built from second derivatives.

None of this is a coincidence. The hessian component of the optim output above contains what we need.
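If you want to confirm that this Hessian really is just a matrix of numerical second derivatives of llNormal, one option (assuming the numDeriv package is installed; it isn’t otherwise used in this series) is to compute the Hessian directly at the estimated parameters and compare it with what optim returned:

Code
# Numerical Hessian of the log likelihood at the parameter estimates,
# computed independently of optim
numDeriv::hessian(llNormal, fuller_optim_output$par, y = y, X = X)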

Code
hess <- fuller_optim_output$hessian
hess
              [,1]          [,2]          [,3]
[1,] -3.424917e+01 -3.424917e+01  2.727840e-05
[2,] -3.424917e+01 -2.625770e+02 -2.818257e-05
[3,]  2.727840e-05 -2.818257e-05 -4.500002e+00

You might notice that the Hessian matrix is square, with as many columns as rows. And, that the number of columns (or rows) is equal to the number of parameters we have estimated, i.e. three in this case.

You might also notice that the values are symmetrical about the diagonal running from the top left to the bottom right.

Again, this is no accident.

Remember that variation is inversely related to information, and that \((.)^{-1}\) is the inversion operator applied to \(I(.)\), the information matrix. Well, the negative of this Hessian is (pretty much) \(I(.)\). So let’s see what happens when we invert it (using the solve() function):

Code
inv_hess <- solve(-hess)
inv_hess
              [,1]          [,2]          [,3]
[1,]  3.357745e-02 -4.379668e-03  2.309709e-07
[2,] -4.379668e-03  4.379668e-03 -5.397790e-08
[3,]  2.309709e-07 -5.397790e-08  2.222221e-01

As with hess, inv_hess is symmetric around the top-left to bottom-right diagonal. For example, the value on row 2 and column 1 is the same as on row 1, column 2.

We’re mainly interested in the first two columns and rows, as these contain the values most directly comparable with the glm summary output:

Code
inv_hess_betas <- inv_hess[1:2, 1:2]

inv_hess_betas
             [,1]         [,2]
[1,]  0.033577455 -0.004379668
[2,] -0.004379668  0.004379668

The elements of the above matrix provide estimates of the variance of each single parameter \(\beta_j\) (on the diagonal), and of the covariance between any two parameters, here \(\{\beta_0, \beta_1\}\) (off the diagonal). In this example:

\[ \begin{bmatrix} var(\beta_0) & cov(\beta_0, \beta_1) \\ cov(\beta_1, \beta_0) & var(\beta_1) \end{bmatrix} \]
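As an aside, the off-diagonal covariance is often easier to read once rescaled to a correlation, which base R’s cov2cor() function does (this isn’t needed for the standard errors that follow; it just makes the off-diagonal term more interpretable):

Code
# Rescale the covariance matrix to a correlation matrix:
# the off-diagonal term is the estimated correlation between beta0 and beta1
cov2cor(inv_hess_betas)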

Because the on-diagonal terms are the variances for single parameters, it can be useful to take their square roots to get estimates of the standard errors:

Code
sqrt(diag(inv_hess_betas))
[1] 0.18324152 0.06617906

Compare these with the Std. Error column in the following:

Code
summary(mod_glm)

Call:
glm(formula = y ~ x, family = gaussian(link = "identity"), data = df)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.46067    0.20778   11.84 6.95e-06 ***
x            1.37542    0.07504   18.33 3.56e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.3378601)

    Null deviance: 115.873  on 8  degrees of freedom
Residual deviance:   2.365  on 7  degrees of freedom
AIC: 19.513

Number of Fisher Scoring iterations: 2

The estimates from the Hessian in optim, of \(\{0.18, 0.07\}\), are not exactly the same as the \(\{0.21, 0.08\}\) reported for mod_glm; the methods employed are not identical. But they are hopefully close enough to demonstrate that both approaches are quantifying much the same uncertainty about the same quantities.
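For a more direct like-for-like comparison, the vcov() function returns the variance-covariance matrix that glm estimated for mod_glm, and the square roots of its diagonal reproduce the Std. Error column above:

Code
# The variance-covariance matrix of the glm coefficient estimates ...
vcov(mod_glm)
# ... and the corresponding standard errors, as reported by summary(mod_glm)
sqrt(diag(vcov(mod_glm)))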

Summary

This is probably the most difficult single section so far. Don’t worry: it’s likely to get easier from here on in.

Coming up

The next part of the series goes into more detail about how numerical optimisation works.

References

King, Gary. 1998. Unifying Political Methodology: The Likelihood Theory of Statistical Inference. University of Michigan Press. http://www.jstor.org/stable/10.3998/mpub.23784.

Footnotes

  1. Though I had assumed Hessian matrices are called Hessian matrices because they sort-of resemble the criss-crossing grids of Hessian bags, they’re actually named after Otto Hesse, who proposed them.↩︎