Part Seven: Feeling Uncertain

statistics
R
Author

Jon Minton

Published

January 4, 2024

Aim

In the previous post we managed to use numerical optimisation, with the optim() function, to produce good \(\beta\) estimates for a linear regression model fit to some toy data. In this post, we will explore how the optim() function can be used to produce estimates of uncertainty about these \(\beta\) coefficients, and how these relate to the measures of uncertainty presented in the standard lm and glm summary functions.

Prereqs

As before, we’ll be using the same toy dataset, and same log likelihood function, as in the last two posts in this series. Let’s create these again:

Code
llNormal <- function(pars, y, X){
    # The first ncol(X) elements of pars are the beta coefficients
    beta <- pars[1:ncol(X)]
    # The final element is eta = log(sigma^2), so sigma^2 is always positive
    sigma2 <- exp(pars[ncol(X)+1])
    # Log likelihood for the Normal model, dropping the constant term
    -1/2 * (sum(log(sigma2) + (y - (X%*%beta))^2 / sigma2))
}
Code
# set a seed so runs are identical
set.seed(7)
# create a main predictor variable vector: -3 to 5 in increments of 1
x <- (-3):5
# Record the number of observations in x
N <- length(x)
# Create a response variable with variability
y <- 2.5 + 1.4 * x  + rnorm(N, mean = 0, sd = 0.5)

# bind x into a two column matrix whose first column is a vector of 1s (for the intercept)

X <- cbind(rep(1, N), x)
# Clean up names
colnames(X) <- NULL

Let’s also run and save our parameter estimates, produced both ‘the hard way’ (using optim) and ‘the easier way’ (using glm):

Code
optim_results <-  optim(
    # par contains our initial guesses for the three parameters to estimate
    par = c(0, 0, 0), 

    # by default, most optim algorithms search for a minimum (lowest point) rather than a maximum 
    # (highest point). So, I'm passing a function to call which simply inverts the log likelihood by 
    # multiplying what it returns by -1
    fn = function(par, y, X) {-llNormal(par, y, X)}, 

    # in addition to the par vector, our function also needs the observed output (y)
    # and the observed predictors (X). These have to be specified as additional arguments.
    y = y, X = X
    )

optim_results
$par
[1]  2.460571  1.375421 -1.336209

$value
[1] -1.51397

$counts
function gradient 
     216       NA 

$convergence
[1] 0

$message
NULL
Code
pars_optim <- optim_results$par

names(pars_optim) <- c("beta0", "beta1", "eta")

pars_optim
    beta0     beta1       eta 
 2.460571  1.375421 -1.336209 
Code
library(tidyverse)
df <- tibble(x = x, y = y)
mod_glm <- glm(y ~ x, data = df, family = gaussian(link="identity"))
summary(mod_glm)

Call:
glm(formula = y ~ x, family = gaussian(link = "identity"), data = df)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.46067    0.20778   11.84 6.95e-06 ***
x            1.37542    0.07504   18.33 3.56e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.3378601)

    Null deviance: 115.873  on 8  degrees of freedom
Residual deviance:   2.365  on 7  degrees of freedom
AIC: 19.513

Number of Fisher Scoring iterations: 2

Both optim and the summary of mod_glm report \(\{\beta_0 = 2.46, \beta_1 = 1.38\}\), so both approaches appear to arrive at the same point on the log likelihood surface.

However, note that the glm summary reports not just the estimates themselves (in the Estimate column of coefficients), but also standard errors (the Std. Error column) and derived quantities (t value, Pr(>|t|), and the damnable stars at the very right of the table). How can these measures of uncertainty about the true value of the \(\beta\) coefficients be derived from optim?

Barefoot and Blind: A weird analogy for a complicated idea

Imagine optim, your hill-finding robo-chauffeur, has taken you to the top of a likelihood surface. Then it leaves you there…

… and you’re blind, and have no shoes. (You also have an uncanny sense of your orientation, whether north-south, east-west, or some other angle.)

So, you know you’re at the top of the hill, but you can’t see what the landscape around you looks like. However, you still want to get a sense of this landscape, and how it varies around the spot you’re standing on.

What do you do?

If you’re playing along with this weird thought experiment, one approach would be to use your feet as depth sensors. You make sure you never stray from where you started, always keeping one foot planted on this initial spot (which you understand to be the highest point on the landscape). Then you use your other foot to work out how much further down the surface is as you venture away from that highest point in different directions.

Say you keep your left foot planted on the highest point, and make sure your right foot is always positioned (say) 10cm horizontally from your left foot. Initially your two feet are arranged east-west; let’s call this 0 degrees. When you put your right foot down, you notice it needs to travel 2cm further down to reach terra firma relative to your left foot.

2cm at 0 degrees. You’ll remember that.

Now you rotate yourself 45 degrees, and repeat the same right foot drop. This time it needs to travel 3cm down relative to your left foot.

3cm at 45 degrees. You remember that too.

Now you rotate another 45 degrees, so you’re in a north-south orientation, and place your right foot down; this time it falls 5cm relative to your left foot.

2cm at 0 degrees; 3cm at 45 degrees; 5cm at 90 degrees.

Now with this information, you try to construct the landscape you’re on top of with your mind’s eye, making the assumption that the way it has to have curved from the peak you’re on to lead to the drops you’ve observed is consistent all around you; i.e. that there’s only one hill, you’re on top of it, and it’s smoothly curved in all directions.

If you could further entertain the idea that your feet are infinitely small, and the gap between feet is also infinitely small (rather than the 10cm above), then you have the intuition behind this scary-looking but very important formula from King (1998) (p. 89):

\[ \widehat{V(\hat{\theta})} = - \frac{1}{n}\left[\frac{\partial^2 \ln L(\tilde{\theta}|y)}{\partial \tilde{\theta}\, \partial \tilde{\theta}'}\right]^{-1}_{\tilde{\theta} = \hat{\theta}} \]

What this is saying, in something closer to humanese, is something like:

Our best estimate of the amount of uncertainty we have in our estimates is a function of how much the likelihood surface curves at the highest point on the surface. (It also gets less uncertain, the more observations we have).
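To make this a little more concrete, here is a minimal one-parameter sketch (not part of the post series itself: the simulated data z, the known sd of 2, and the finite-difference step h are arbitrary choices for illustration). For the mean of a Normal distribution with known standard deviation, minus the inverse of the numerically estimated second derivative of the log likelihood at its peak recovers the familiar \(\sigma^2/n\) variance of the sample mean.

Code
# One-parameter illustration: log likelihood of a Normal mean, with sd assumed known
set.seed(7)
z <- rnorm(25, mean = 10, sd = 2)

ll_mu <- function(mu) sum(dnorm(z, mean = mu, sd = 2, log = TRUE))

# The peak of this log likelihood is at the sample mean
mu_hat <- mean(z)

# Central finite-difference estimate of the second derivative (the curvature) at the peak
h <- 1e-4
curvature <- (ll_mu(mu_hat + h) - 2 * ll_mu(mu_hat) + ll_mu(mu_hat - h)) / h^2

# Minus the inverse of the curvature ...
-1 / curvature
# ... should be very close to the theoretical variance of the sample mean, sigma^2 / n
2^2 / 25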

Information and uncertainty

Amongst the various bells, whistles and decals in the previous formula is the superscript \((.)^{-1}\). This means invert, which for a single value means \(\frac{1}{.}\), but for a matrix means something conceptually similar though technically more involved (the matrix inverse).
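As a quick aside (the 2×2 matrix below is just a made-up example), base R’s solve() function performs matrix inversion when given a single matrix: multiplying a matrix by its inverse gives the identity matrix, much as multiplying a number by its reciprocal gives 1.

Code
# A made-up 2x2 matrix
M <- matrix(c(2, 1, 1, 3), nrow = 2)
# solve() with a single matrix argument returns the matrix inverse
M_inv <- solve(M)
# Their product is (numerically) the 2x2 identity matrix
M %*% M_inv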

And what’s being inverted in the last formula? A horrible-looking expression, \([\frac{\partial^2 \ln L(\tilde{\theta}|y)}{\partial \tilde{\theta}\, \partial \tilde{\theta}'}]_{\tilde{\theta} = \hat{\theta}}\), which is basically an answer to the question: how curvy is the log likelihood surface at its peak?

Within King (1998) (p.89, eq. 4.18), this expression (or rather the negative of the term) is defined as \(I(\hat{\theta} | y)\), where \(I(.)\) stands for information.

So, the algebra is saying

Uncertainty is inversely related to information

Or perhaps even more intuitively

The more information we have, the less uncertain we are

Of course this makes sense. If you ask someone “How long will this task take?”, and they say “Between one hour and one month”, they likely have less information about how long the task will actually take than if they had said “Between two and a half and three hours”. More generally:

  • Shallow gradients mean wide uncertainty intervals mean low information
  • Sharp gradients mean narrow uncertainty intervals mean high information

This is, fundamentally, what the blind and barefoot person in the previous analogy is trying to achieve: by feeling out the local curvature around the highest point, they are trying to work out how much information they have about different pieces of the model. The curvature along any one dimension of the surface (equivalent to the 0 and 90 degree explorations) indicates how much information there is about any single coefficient, and the curvature along the equivalent of a 45 degree plane gives a measure of how associated any two coefficients tend to be.
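If you want to act out the barefoot exploration on the toy example above, here is a minimal sketch using the objects already created (the step size of 0.1 and the direction labels are arbitrary choices for illustration): starting from the peak that optim found, step a fixed distance in different directions in \((\beta_0, \beta_1)\) space, holding the third parameter fixed, and record how far the log likelihood drops.

Code
# Height of the log likelihood at the peak optim found
peak <- llNormal(pars_optim, y, X)
step <- 0.1

# How far does the log likelihood fall when we step a fixed distance from the
# peak in a given (beta0, beta1) direction? (eta, the third parameter, is held fixed)
drop_at <- function(direction) {
    shifted <- pars_optim + c(direction * step, 0)
    peak - llNormal(shifted, y, X)
}

c(
    beta0_only = drop_at(c(1, 0)),            # the '0 degrees' direction
    both       = drop_at(c(1, 1) / sqrt(2)),  # the '45 degrees' direction
    beta1_only = drop_at(c(0, 1))             # the '90 degrees' direction
)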

With these many analogies and equations spinning in our heads, let’s now see how these concepts can be applied in practice.

Optimal uncertainty

Having reminded myself of the particular options for optim that are typically used to report parameter uncertainty, let’s run the following:

Code
fuller_optim_output <- optim(
    par = c(0, 0, 0), 
    fn = llNormal,
    method = "BFGS",
    control = list(fnscale = -1),
    hessian = TRUE,
    y = y, 
    X = X
)

fuller_optim_output
$par
[1]  2.460675  1.375424 -1.336438

$value
[1] 1.51397

$counts
function gradient 
      80       36 

$convergence
[1] 0

$message
NULL

$hessian
              [,1]          [,2]          [,3]
[1,] -3.424917e+01 -3.424917e+01  2.727840e-05
[2,] -3.424917e+01 -2.625770e+02 -2.818257e-05
[3,]  2.727840e-05 -2.818257e-05 -4.500002e+00

We have used a slightly different algorithm (‘BFGS’), and a different way of specifying the function to search over (using fnscale = -1 to invert the likelihood), but we have the same par estimates as before: \(\beta = \{\beta_0 = 2.46, \beta_1 = 1.38\}\). So the changes we’ve made to the optim arguments haven’t changed what it estimates.
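As a quick sanity check (just printing rounded values, no new estimation), we can confirm the two optim runs agree on the \(\beta\) estimates:

Code
# beta estimates from the first run (minimising -llNormal) ...
round(optim_results$par[1:2], 3)
# ... and from the second run (maximising llNormal via fnscale = -1)
round(fuller_optim_output$par[1:2], 3)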

One new argument we’ve set in optim is hessian = TRUE. Hessian is a kind of coarse fabric made from vegetable waste, typically woven in a criss-crossing, grid-like pattern. Hessian matrices are matrices of second derivatives, as described in the Wikipedia article.1 If you can bear to look back at the really complex expression above, for calculating the curvature around a point on a surface, you’ll notice it too is built from second derivatives.

None of this is a coincidence. The hessian component of the optim output above contains what we need.
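If you want to confirm that this Hessian really is just a matrix of numerical second derivatives of llNormal, one option (assuming the numDeriv package is installed; it isn’t otherwise used in this series) is to compute the Hessian directly at the estimated parameters and compare it with what optim returned:

Code
# Numerical Hessian of the log likelihood at the parameter estimates,
# computed independently of optim
numDeriv::hessian(llNormal, fuller_optim_output$par, y = y, X = X)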

Code
hess <- fuller_optim_output$hessian
hess
              [,1]          [,2]          [,3]
[1,] -3.424917e+01 -3.424917e+01  2.727840e-05
[2,] -3.424917e+01 -2.625770e+02 -2.818257e-05
[3,]  2.727840e-05 -2.818257e-05 -4.500002e+00

You might notice that the Hessian matrix is square, with as many columns as rows. And, that the number of columns (or rows) is equal to the number of parameters we have estimated, i.e. three in this case.

You might also notice that the values are symmetrical about the diagonal running from the top left to the bottom right.

Again, this is no accident.

Remember that variation is inversely related to information, and that \((.)^{-1}\) is the inversion operator applied to \(I(.)\), the information matrix. Well, the negative of this Hessian is (pretty much) \(I(.)\). So let’s see what happens when we invert it (using the solve() function):

Code
inv_hess <- solve(-hess)
inv_hess
              [,1]          [,2]          [,3]
[1,]  3.357745e-02 -4.379668e-03  2.309709e-07
[2,] -4.379668e-03  4.379668e-03 -5.397790e-08
[3,]  2.309709e-07 -5.397790e-08  2.222221e-01

As with hess, inv_hess is symmetric around the top-left to bottom-right diagonal. For example, the value on row 2 and column 1 is the same as on row 1, column 2.

We’re mainly interested in the first two columns and rows, as these contain the values most directly comparable with the glm summary output:

Code
inv_hess_betas <- inv_hess[1:2, 1:2]

inv_hess_betas
             [,1]         [,2]
[1,]  0.033577455 -0.004379668
[2,] -0.004379668  0.004379668

The elements of the above matrix provide estimates of the variance of each single parameter \(\beta_j\) (on the diagonal), and of the covariance between any two parameters, here \(\{\beta_0, \beta_1\}\) (off the diagonal). In this example:

\[ \begin{bmatrix} var(\beta_0) & cov(\beta_0, \beta_1) \\ cov(\beta_1, \beta_0) & var(\beta_1) \end{bmatrix} \]
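As an aside, the off-diagonal covariance is often easier to read once rescaled to a correlation, which base R’s cov2cor() function does (this isn’t needed for the standard errors that follow; it just makes the off-diagonal term more interpretable):

Code
# Rescale the covariance matrix to a correlation matrix:
# the off-diagonal term is the estimated correlation between beta0 and beta1
cov2cor(inv_hess_betas)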

Because the on-diagonal terms are the variances for single parameters, it can be useful to take their square roots to get estimates of the standard errors:

Code
sqrt(diag(inv_hess_betas))
[1] 0.18324152 0.06617906

Compare these with the Std. Error column in the following:

Code
summary(mod_glm)

Call:
glm(formula = y ~ x, family = gaussian(link = "identity"), data = df)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.46067    0.20778   11.84 6.95e-06 ***
x            1.37542    0.07504   18.33 3.56e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.3378601)

    Null deviance: 115.873  on 8  degrees of freedom
Residual deviance:   2.365  on 7  degrees of freedom
AIC: 19.513

Number of Fisher Scoring iterations: 2

The estimates from the Hessian in optim, of \(\{0.18, 0.07\}\), are not exactly the same as the \(\{0.21, 0.08\}\) reported for mod_glm; the methods employed are not identical. But they are hopefully close enough to demonstrate that both approaches are quantifying much the same uncertainty about the same quantities.
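For a more direct like-for-like comparison, the vcov() function returns the variance-covariance matrix that glm estimated for mod_glm, and the square roots of its diagonal reproduce the Std. Error column above:

Code
# The variance-covariance matrix of the glm coefficient estimates ...
vcov(mod_glm)
# ... and the corresponding standard errors, as reported by summary(mod_glm)
sqrt(diag(vcov(mod_glm)))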

Summary

This is probably the most difficult single section so far. Don’t worry: it’s likely to get easier from here on in.

Coming up

The next part of the series goes into more detail about how numerical optimisation works.

References

King, Gary. 1998. Unifying Political Methodology: The Likelihood Theory of Statistical Inference. University of Michigan Press. http://www.jstor.org/stable/10.3998/mpub.23784.

Footnotes

  1. Though I had assumed Hessian matrices are called Hessian matrices because they sort-of resemble the criss-crossing grids of Hessian bags, they’re actually named after Otto Hesse, who proposed them.↩︎