In the last post we reached the end of a winding journey. This post will show how Bayesian approaches to model fitting, rather than the frequentist approaches more commonly used, can reach the intended destination of this journey more quickly, despite being a bit more conceptually challenging to start with.
The first post in the series set out to do two things:
Reintroduce statistical models via a generalised model formula, comprising a systematic component and a stochastic component.
Reintroduce the fitting of statistical models from the perspective of algorithmic optimisation, in which the gap between what the model predicts and what’s observed is minimised in some way.
The rest of the first section of the series - posts two, three and four - added more context to the first post, and introduced the concept of using models for prediction, along with the types of quantities of interest they can predict. The first section ended with post four, which illustrated some of the complexities of getting meaningful effect estimates - the overall effect of one specific predictor variable on the outcome being predicted - for model structures other than standard linear regression.
The second section - covering posts five to ten - delved into a lot more detail about how statistical models are fit. It introduced the concept of likelihood as a means of deciding what the target of a statistical optimisation algorithm should be. It also showed - in sometimes excruciating detail - how to perform numeric optimisation based on likelihood in order to extract not just the best set of model parameters, but estimates of the joint uncertainty in those parameters. It's this joint uncertainty - estimated via the Hessian from the optim() function - that allowed parameter uncertainty to be propagated through specific 'what-if?' questions, i.e. specific configurations of predictor variables passed to the model. The result is honest answers to those 'what-if?' questions: a range of answers rather than a single answer, showing how uncertainty in parameter estimates leads to uncertainty in the answers the model provides.
The third section - posts 10-12 - completed the journey, showing how many of the concepts and ideas learned through considerable effort in sections one and (especially) two allow more intelligent and effective use of standard statistical model outputs - produced using R’s lm() and glm() functions - for honest prediction.
This post extends the third section to show why the kind of honest prediction we managed to produce within the frequentist modelling framework used by lm() and glm() is, in fact, easier to produce using Bayesian models.
On marbles and jumping beans
Post five introduced Bayes’ Rule and the Likelihood axiom. It pointed out that, at heart, Bayes’ Rule is a way of expressing that given this in terms of this given that; and that Likelihood is also a claim about how that given this relates to this given that. More specifically, the claim of Likelihood is:
The likelihood of the model given the data is proportional to the probability of the data given the model.
There are two aspects to the model: firstly its structure; secondly its parameters. The structure includes the type of statistical model - whether it is a standard linear regression, negative binomial regression, logistic regression, Poisson regression model and so on - and also the specific types of columns from the dataset selected as either predictor variables (\(X\)) or response variables (\(Y\)). It is only after both the higher level structure of the model family, and the lower level structure of the data inputs (what’s being regressed on what?) have been decided that the Likelihood theory is used.
And how is Likelihood theory used? Well, it defines a landscape over which an algorithm searches. This landscape has as many dimensions as there are parameters to fit. Where there are just two parameters, \(\beta_0\) and \(\beta_1\), to fit, we can visualise this landscape using something like a contour plot, with \(\beta_0\) as latitude, \(\beta_1\) as longitude, and the likelihood at each position as its elevation or depth. Each possible joint value \(\beta = \{\beta_0, \beta_1\}\) which the algorithm might wish to propose leads to a different long-lat coordinate over the surface, and each coordinate has a different elevation or depth. Although we can't see beyond three dimensions (latitude, longitude, and elevation/depth), mathematics has no problem extending the concept of multidimensional space into far more dimensions than we can see or meaningfully comprehend. If a model has ten parameters to fit, for example, the likelihood search space really is ten dimensional, and so on.
Notice I used elevation and depth interchangeably in the description above. This is because it really doesn't matter whether an optimisation algorithm is trying to find the greatest elevation over a surface or the greatest depth. The aim of maximum likelihood estimation is to find the configuration of parameters that maximises the likelihood, i.e. finds the top of the surface. However, we saw that when passing the likelihood function to optim() we often inverted it by multiplying it by -1. This is because the optimisation algorithms themselves seek to minimise the objective function they're passed, not maximise it. Multiplying the likelihood function by -1 makes what we want compatible with what the algorithms do: finding the greatest depth over a surface, rather than the highest elevation.
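To make the inversion concrete, here's a minimal sketch of the kind of set-up used with optim() in earlier posts. The data, starting values and function name below are invented purely for illustration.
Code
# A sketch of maximum likelihood via minimisation: the log-likelihood is
# multiplied by -1 so that optim(), which minimises, effectively maximises it.
set.seed(7)
x <- runif(50)
y <- 2 + 3 * x + rnorm(50, sd = 0.5)   # simulated data, for illustration only

negative_log_likelihood <- function(par, x, y) {
  beta0 <- par[1]
  beta1 <- par[2]
  sigma <- exp(par[3])   # keep sigma positive by estimating it on the log scale
  # the -1 turns the 'highest elevation' problem into a 'greatest depth' problem
  -sum(dnorm(y, mean = beta0 + beta1 * x, sd = sigma, log = TRUE))
}

fit <- optim(c(0, 0, 0), negative_log_likelihood, x = x, y = y, hessian = TRUE)
fit$par       # beta0, beta1 and log(sigma) at the lowest point of the inverted surface
fit$hessian   # curvature around that point: the 'flicking the marble' step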
To make this all a bit less abstract let’s develop the intuition of an algorithm that seeks to minimise a function by way of a(nother) weird little story:
Imagine there is a landscape made out of transparent perspex. It’s not just transparent, it’s invisible to the naked eye. And you want to know where the lowest point of this surface is. All you have to do this is a magical leaking marble. The marble is just like any other marble, except every few moments, at regular intervals (say every tenth of a second), it dribbles out a white dye that you can see. And this dye sticks on and stains the otherwise invisible landscape whose lowest point you wish to find.
Now, you drop the marble somewhere on the surface. You see the first point it hits on the surface - a white blob appears. The second blob appears some distance away from the first; the third slightly less far from the second than the second was from the first. After a few seconds, a trail of white spots is visible, the first few of which form something like a straight line, each consecutive point slightly closer to the previous one. A second or two later, the rumbling sounds of the marble rolling over the surface cease; the marble has clearly run out of momentum. And as you look at the trail of dots it has generated, and is still generating, you see it keeps highlighting the same point on the otherwise invisible surface, again and again.
Previously I used the analogy of a magical robo-chauffeur, taking you to the top of a landscape. But the falling marble is probably a closer analogy to how many of optim()'s algorithms actually work. Using gravity and its shape alone, it finds the lowest point on the surface, and with its magical leaking dye, it tells you where this lowest point is.
The marble has now ‘told’ you where the lowest point on the invisible surface is. However, you also want to know more about the shape of the depression it’s in. You want to know if it’s a steep depression or a shallow one. And you want to know if it’s equally steep or shallow in every direction, or steeper in some directions than others.
So you now have to do a bit more work. You move your hand to just above the marble, and with your forefinger ‘flick’ it in a particular direction (say east-west): you see it move in the direction you flick it briefly, before rolling back towards (and beyond, and then towards) the depression point. As it does so, it leaks dye onto the surface, revealing a bit more about the landscape’s steepness or shallowness in this dimension. Then you do the same, but along a different dimension (say, north-south). After you’ve done this enough times, you are left with a collection of dyed points on the part of the surface closest to its deepest depression. The spacing and shape of these points tells you something about the nature of the depression and the part of the landscape it’s surrounding.
Notice in this analogy you had to do extra work to get the marble to reveal more information about the surface. By default, the marble tells you the specific location of the depression, but not what the surface is like around this point. Instead, you need to intervene twice: firstly by dropping the marble onto the surface; secondly by flicking it around once it’s reached the lowest point on the surface.
Now, let’s imagine swapping out our magical leaking marble for something even weirder: a magical leaking jumping bean.
The magical jumping bean does two things: it leaks and it jumps. (Okay, it does three things: when it leaks it also sticks to the surface it’s dyeing.) When the bean is first dropped onto the surface, it marks the location it lands on. Then, it jumps up and across in a random direction. After jumping, it drops onto another part of the surface, marks it, and the process starts again. Jumping, sticking, marking; jumping, sticking, marking; jumping, sticking, marking… potentially forever.
Because of the effect of gravity, though the jumping bean jumps in a random direction, after a few jump-stick-mark steps it’s still, like the marble, very likely to move towards the depression. However, unlike the marble, even when it gets towards the lowest point in the depression, it’s not going to just rest there. The magical jumping bean is never at rest. It’s forever jump-stick-marking, jump-stick-marking.
However, once the magical bean has moved towards the depression, though it keeps moving, it’s likely never to move too far from the depression. Instead, it’s likely to bounce around the depression. And as it does so, it drops ever more marks on the surface, which keep showing what the surface looks like around the depression in ever more detail.
So, because of the behaviour of the jumping bean, you only have to act on it once, by choosing where to drop it, rather than twice as with the marble: first choosing where to drop it, then flicking it around once it’s reached the lowest point on the surface.
So what?
In the analogies above, the marble is to frequentist statistics as the jumping bean is to Bayesian statistics. A technical distinction between the marble and the jumping bean is that the marble converges towards a point (meaning it reaches a point of rest on the surface) whereas the jumping bean converges towards a distribution (meaning it never rests).
It’s Bayesian statistics’ 1 property of converging to a distribution rather than a point that makes the converged posterior distribution of parameter estimates Bayesian models produce ideal for the kind of honest prediction so much of this blog series has been focused on.
Let’s now do some Bayesian modelling to compare…
Bayesian modelling: now significantly less terrifying than it used to be
There are a lot of packages and approaches for building Bayesian models. In fact there are whole statistical programming languages - like JAGS, BUGS 2 and Stan - dedicated to precisely describing every assumption the statistician wants to make about how a Bayesian model should be built. For more complicated and bespoke models these are ideal.
However, there are also an increasing number of Bayesian modelling packages that abstract away some of the assumptions and complexity apparent in the above specialised Bayesian modelling languages, and allow Bayesian versions of the kinds of model we’re already familiar with to be specified using formula interfaces almost identical to what we’ve already worked with. Let’s look at one of them, rstanarm, which allows us to use Stan, a full Bayesian statistical programming language, without quite as much thinking and set-up being required on our part.
Let’s try to use this to build a Bayesian equivalent of the hamster tooth model we worked on in the last couple of posts.
Data Preparation and Frequentist modelling
Let’s start by getting the dataset and building the frequentist version of the model we’re already familiar with:
# df is assumed to be the hamster tooth dataset (ToothGrowth in base R) used in earlier posts
df <- ToothGrowth
best_model_frequentist <- lm(len ~ log(dose) * supp, data = df)
summary(best_model_frequentist)
Call:
lm(formula = len ~ log(dose) * supp, data = df)
Residuals:
Min 1Q Median 3Q Max
-7.5433 -2.4921 -0.5033 2.7117 7.8567
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.6633 0.6791 30.425 < 2e-16 ***
log(dose) 9.2549 1.2000 7.712 2.3e-10 ***
suppVC -3.7000 0.9605 -3.852 0.000303 ***
log(dose):suppVC 3.8448 1.6971 2.266 0.027366 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.72 on 56 degrees of freedom
Multiple R-squared: 0.7755, Adjusted R-squared: 0.7635
F-statistic: 64.5 on 3 and 56 DF, p-value: < 2.2e-16
Building the Bayesian equivalent
Now how would we build a Bayesian equivalent of this? Firstly let’s load (and if necessary install3) rstanarm.
Code
library(rstanarm)
Whereas for the frequentist model we used the function lm(), rstanarm has what looks like a broadly equivalent function, stan_lm(). However, as I’ve just discovered, it’s actually more straightforward to use stan_glm() instead:
Code
best_model_bayesian <- stan_glm(len ~ log(dose) * supp, data = df)
summary(best_model_bayesian)
Model Info:
function: stan_glm
family: gaussian [identity]
formula: len ~ log(dose) * supp
algorithm: sampling
sample: 4000 (posterior sample size)
priors: see help('prior_summary')
observations: 60
predictors: 4
Estimates:
mean sd 10% 50% 90%
(Intercept) 20.6 0.7 19.8 20.6 21.5
log(dose) 9.3 1.2 7.7 9.3 10.8
suppVC -3.7 1.0 -5.0 -3.7 -2.4
log(dose):suppVC 3.8 1.7 1.7 3.8 6.0
sigma 3.8 0.4 3.4 3.8 4.3
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 18.8 0.7 17.9 18.8 19.7
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 2675
log(dose) 0.0 1.0 2514
suppVC 0.0 1.0 3076
log(dose):suppVC 0.0 1.0 2439
sigma 0.0 1.0 3294
mean_PPD 0.0 1.0 3700
log-posterior 0.0 1.0 1767
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
Some parts of the summary for the Bayesian model look fairly familiar compared with the frequentist model summary; other bits look a lot more exotic. We’ll skip over a detailed discussion of these outputs for now, though it is worth comparing the estimates section of the summary directly above, from the Bayesian approach, with the frequentist model summary produced earlier.
The frequentist model had point estimates of \(\{20.7, 9.3, -3.7, 3.8\}\) to one decimal place. The analogous section of the Bayesian model summary is the mean column of the estimates section. These are reported to fewer decimal places by default - Bayesians are often more mindful of spurious precision - and come out at \(\{20.6, 9.3, -3.7, 3.8\}\): effectively the same, with the intercept differing only through rounding and the simulation noise inherent in sampling-based estimation.
Note also the Bayesian model reports an estimate for an additional parameter, sigma. This should be expected if we followed along with some of the examples using optim() for linear regression: the likelihood function required the ancillary parameters (referred to as \(\alpha\) in the ‘mother model’ which this series started with, and part of the stochastic component \(f(.)\)) be estimated as well as the primary model parameters (referred to as \(\beta\) in the ‘mother model’, and part of the systematic component \(g(.)\)). The Bayesian model’s coefficients (Intercept), log(dose), suppVC and the interaction term log(dose):suppVC are all part of \(\beta\), whereas the sigma parameter is part of \(\alpha\). The Bayesian model has just been more explicit about exactly which parameters it’s estimated from the data.
For the \(\beta\) parameters, the Std. Error column in the Frequentist model summary is broadly comparable with the sd column in the Bayesian model summary. These values are \(\{0.7, 1.2, 1.0, 1.7\}\) in the Frequentist model and \(\{0.7, 1.2, 1.0, 1.7\}\) in the Bayesian model summary, i.e. they’re the same to the degree of precision offered in the Bayesian summary.
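As a quick sanity check - this snippet isn't from the original analysis, and the column labels are just illustrative - the two sets of estimates can be put side by side, using as.matrix() to extract the 4,000 posterior draws from the stanreg object:
Code
# Frequentist point estimates and standard errors alongside Bayesian
# posterior means and standard deviations (a sketch, not original output)
draws <- as.matrix(best_model_bayesian)             # 4000 draws x 5 parameters
beta_names <- names(coef(best_model_frequentist))   # the four beta parameters

round(cbind(
  freq_est   = coef(best_model_frequentist),
  freq_se    = summary(best_model_frequentist)$coefficients[, "Std. Error"],
  bayes_mean = colMeans(draws)[beta_names],
  bayes_sd   = apply(draws, 2, sd)[beta_names]
), 2)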
But let’s get to the crux of the argument: with Bayesian models honest predictions are easier.
And they are: with the posterior_predict() function we pass what we want to predict on through the newdata argument, much as we did with the predict() function for frequentist models.
Scenario modelling
Let’s recall the scenarios we looked at previously:
predicted and expected values: length when dosage is 1.25mg and supplement is OJ
first differences: the difference between the OJ and VC supplements when dosage is 1.25mg
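Something along the following lines sets up the first scenario and draws from the posterior predictive distribution (the data frame name predictors is just illustrative):
Code
# First scenario: dose of 1.25mg with the OJ supplement
predictors <- data.frame(
  dose = 1.25,
  supp = factor("OJ", levels = c("OJ", "VC"))
)

# 4000 posterior predictive draws for this single scenario
predictions <- posterior_predict(best_model_bayesian, newdata = predictors)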
By default posterior_predict() returns a matrix, which in this case has 4000 rows and just a single column. Let’s do a little work on this and visualise the distribution of estimates it produces:
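A rough version of that work, assuming the predictions matrix above, is:
Code
dim(predictions)                                            # 4000 rows, 1 column
quantile(predictions[, 1], probs = c(0.025, 0.5, 0.975))    # 95% interval and median
hist(predictions[, 1], breaks = 50)                         # rough version of the histogram described below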
The darker-shaded parts of the histogram show the 95% uncertainty interval, and the blue vertical line the median estimate. This 95% interval range is 15.14 to 30.31.
Remember we previously estimated both the expected values and the predicted values for this condition. Our 95% range for the expected values was 20.27 to 24.19 (or thereabouts), whereas our 95% range for the predicted values was (by design) wider, at 15.34 to 30.11. The 95% uncertainty interval above is therefore of predicted values, which include fundamental variation due to the ancillary parameter \(\sigma\), rather than expected values, which result from parameter uncertainty alone.
There are a couple of other functions in rstanarm we can look at: predictive_error() and predictive_interval().
First, here’s predictive_interval(). It is a convenience function that takes the posterior distribution generated previously, predictions, and returns an uncertainty interval:
Code
predictive_interval(predictions)
5% 95%
1 16.25049 29.08077
We can see by default the intervals returned are from 5% to 95%, i.e. are the 90% intervals rather than the 95% intervals considered previously. We can change the intervals requested with the prob argument:
Code
predictive_interval(predictions, prob = 0.95)
2.5% 97.5%
1 15.13993 30.30754
As expected, this returns an interval closer to (but not identical to) the interval estimated earlier using the quantile() function.
Let’s see if we can also use the model object itself, passing newdata directly to predictive_interval():
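Something like this (reusing the illustrative predictors data frame from above):
Code
# Passing the fitted model and newdata straight to predictive_interval()
predictive_interval(best_model_bayesian, newdata = predictors, prob = 0.95)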
Yes, this approach works too. The values aren’t identical, most likely because passing the model and newdata to predictive_interval() generates a fresh set of draws from the posterior predictive distribution, rather than reusing the predictions matrix we summarised with quantile().
For producing expected values we can use the function posterior_epred:
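A sketch of that, again for the 1.25mg/OJ scenario (the object name expectations is illustrative):
Code
# Expected values: posterior_epred() leaves out the draw-level noise from sigma,
# so the spread reflects parameter uncertainty only
expectations <- posterior_epred(best_model_bayesian, newdata = predictors)
quantile(expectations[, 1], probs = c(0.025, 0.5, 0.975))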
For comparison, the expected value 95% interval we obtained from the Frequentist model was 21.3 to 24.2 when drawing from the quasi-posterior distribution, and 22.7 to 24.2 when using the predict() function with the interval argument set to "confidence".
Now, finally, let’s see if we can produce first differences: the estimated effect of using VC rather than OJ as a supplement when the dose is 1.25mg.
The newdata argument to posterior_predict() now has two rows, one for the OJ supplement scenario and the other for the VC supplement scenario. And the predictions matrix returned by posterior_predict() now has two columns: one for each scenario (row) in predictors_fd. We can look at the distribution of both of these columns, as well as the row-wise comparisons between columns, which will give our distribution of first differences for the predicted values:
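A sketch of that workflow (predictors_fd holds the two scenarios; the names predictions_fd and fd_predicted are illustrative):
Code
# Two scenarios differing only in supplement, at a dose of 1.25mg
predictors_fd <- data.frame(
  dose = c(1.25, 1.25),
  supp = factor(c("OJ", "VC"), levels = c("OJ", "VC"))
)

# 4000 posterior predictive draws per scenario: a 4000 x 2 matrix
predictions_fd <- posterior_predict(best_model_bayesian, newdata = predictors_fd)

# first differences in predicted values: VC minus OJ, one per draw
fd_predicted <- predictions_fd[, 2] - predictions_fd[, 1]
quantile(fd_predicted, probs = c(0.025, 0.5, 0.975))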
To reiterate, these are predicted values for the two scenarios, not the expected values shown in the first differences section of post 12. This explains why there is greater overlap between the two distributions. Let’s visualise and calculate the first differences in predicted values:
We can see that the average of the distribution is below 0, but as we are looking at predicted values the spread of the distribution is much wider. Let’s get 95% intervals:
The 95% intervals for first differences in predicted values is from -13.6 to +7.9, with the median estimate at -3.0. As expected, the median is similar to the equivalent value from using expected values (-2.9) but the range is wider.
Now let’s use posterior_epred to produce estimates of first differences in expected values, which will be more directly comparable to our first differences estimates in part 12:
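Roughly as follows, swapping posterior_predict() for posterior_epred() (names again illustrative):
Code
# first differences in expected values: parameter uncertainty only
expectations_fd <- posterior_epred(best_model_bayesian, newdata = predictors_fd)
fd_expected <- expectations_fd[, 2] - expectations_fd[, 1]
quantile(fd_expected, probs = c(0.025, 0.5, 0.975))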
This time, as the stochastic variation related to the \(\sigma\) term has been removed, the distributions of the expected values are more distinct, with less overlap. Let’s visualise and compare the first differences of the expected values:
We now have a 95% interval for the first difference in expected values of -4.9 to -0.7. By contrast, the equivalent range estimated using the Frequentist model in part 12 was -4.8 to -0.8. So, although they’re not identical, they do seem to be very similar.
Summing up
Up until now we’ve been using Frequentist approaches to modelling. However the simulation approach required to produce honest uncertainty depends on ‘tricking’ Frequentist models into producing something like the converged posterior distributions which, in Bayesian modelling approaches, come ‘for free’ from the way in which Bayesian frameworks estimate model parameters.
Although Bayesian models are generally more technically and computationally demanding than Frequentist models, we have shown the following:
That packages like rstanarm abstract away some of the challenges of building Bayesian models from scratch;
That the posterior distributions from Bayesian models yield estimates of expected values, predicted values, and first differences - our substantive quantities of interest - similar to those produced previously from Frequentist models;
That, for these quantities of interest, the posterior distributions Bayesian models generate make them more straightforward, not less, to produce than they were with Frequentist methods.
Thanks for reading, and congratulations on getting this far through the series.
Footnotes
Or perhaps more accurately Bayesian statistical model estimation rather than Bayesian statistics more generally? Bayes’ Rule can be usefully applied to interpret results derived from frequentist models. But the term Bayesian Modelling generally implies that Bayes’ Rule is used as part of the model parameter estimation process, in which a prior distribution is updated according to some algorithm, and the posterior distribution produced then forms the prior distribution at the next step in the estimation. The specific algorithm that works as the ‘jumping bean’ is usually something like Hamiltonian Monte Carlo (HMC), and the general simulation framework in which a posterior distribution generated from applying Bayes’ Rule is repeatedly fed back into the Bayes’ Rule equation as the prior distribution is known as Markov Chain Monte Carlo (MCMC).