Jon Minton
February 22, 2024
This is the second post in a short mini-series on causal inference. The previous post provided a non-technical introduction to the core challenge of causal inference, namely that the counterfactual is always unobserved, meaning at least half of the data required to really know the causal effect of something is always missing. In the previous post, different historians made different assumptions about what the counterfactual would have looked like - what would have happened if something that did happen, hadn’t happened - and based on this came to very different judgements about the effect that Henry Dundas, an 18th century Scottish politician, had on the transatlantic slave trade.
This post is more technical, aiming to show: how awkward phrases like “What would have happened if something that did happen, hadn’t happened” are expressed algebraically; how the core problem of causal inference is expressed in this framework; why the Platinum Standard - estimating causal effects for individuals - is technically impossible to achieve; and why randomised controlled trials (RCTs) provide the Gold Standard for trying to estimate these effects for populations.
The first stage when using a statistical model is to take a big rectangle of data, \(D\), and split the columns of the data into two types:

- the predictor variables, \(X\), which the model uses to make predictions; and
- the response variables, \(y\), which the model tries to predict.
With the predictor variables and the response variables defined, the challenge of model fitting is then to find some combination of model parameters \(\theta\) that minimises in some way the gap between the observed response values \(y\), and the predicted response values from the model \(Y\).
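To make this concrete, here is a minimal sketch in R, using made-up data and assuming a linear model with squared-error loss: `lm()` finds the parameters \(\theta\) (here an intercept and a slope) that minimise the gap between the observed and predicted responses.

```r
# A minimal sketch with made-up data: fitting a linear model by least squares
set.seed(42)
d <- data.frame(x = rnorm(100))
d$y <- 2 + 3 * d$x + rnorm(100)

# lm() picks the theta (intercept and slope) that minimises the squared gap
# between the observed responses and the model's predicted responses
mod <- lm(y ~ x, data = d)
coef(mod)              # the fitted parameters, theta
sum(residuals(mod)^2)  # the minimised gap: the residual sum of squares
```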
The first point to note is that, from the perspective of the model, it does not matter which variable or variables from \(D\) we choose to put in the predictor side \(X\) or the response side \(y\). Even if we put a variable from the future in as a predictor of something in the past, the optimisation algorithms will still work in exactly the same way, working to minimise the gap between observed and predicted responses. The only problem is such a model would make no sense from a causal perspective.
The model also does not ‘care’ about how we think about and go about defining any of the variables that go into the predictor side of the equation, \(X\). But again, we do. In particular, when thinking about causality it can be immensely helpful to imagine splitting the predictor columns up into some conceptually different types. This will be helpful for thinking about causal inference using some algebra.
In some previous expressions of the data, \(D\), we used the subscript \(i\) to indicate the rows of the data which go into the model. Each of these rows is, by convention, a different observation. So, instead of saying the purpose of the model is to predict \(y\) from \(X\), it is more precise to say it is to predict \(y_i\) from \(X_i\), for all \(i\) in the data (i.e. all rows in \(D\)).
Now let’s do some predictor variable fission and say, for our purposes, that:
\[ X_i = \{X_i^*, Z_i\} \]
Here \(Z_i\) is an assignment variable, and takes either a value of 1, meaning ‘is assigned’, or 0, meaning ‘is not assigned’. The variable \(X_i^*\), by contrast, means ‘all other predictor variables’.
For individual observations \(D_i\) where \(Z_i = 1\), the individual is exposed to (or treated with) something. And for individual observations \(D_i\) where \(Z_i = 0\), the individual is not exposed to (or not treated with) that thing.
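To make this split concrete, here is a minimal sketch in R. The assignments and outcomes match the example table further below; the age column is made up purely for illustration, standing in for \(X_i^*\).

```r
# A sketch: the predictor columns of D split into the assignment variable z
# and the remaining predictors (here just a hypothetical 'age' column)
D <- data.frame(
  individual = 1:6,
  age        = c(34, 51, 29, 62, 45, 38),       # hypothetical X*_i
  z          = c(1, 1, 0, 1, 0, 0),             # assignment variable Z_i
  y          = c(4.8, 3.7, 2.3, 3.1, 3.4, 2.9)  # observed outcome y_i
)

x_star <- D[, "age", drop = FALSE]  # all other predictors, X*
z      <- D$z                       # assignment, Z
```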
The causal effect of assignment, or treatment, for any individual observation is:
\[ TE_i = y_i|(X_i^*, Z = 1) - y_i| (X_i^*, Z = 0) \]
The fundamental problem of causal inference, however, is that for any individual observation \(i\), one of the two parts of this expression is always missing. If an individual \(i\) had been assigned, then \(y_i|(X_i^*, Z=1)\) is observed, but \(y_i|(X_i^*, Z=0)\) is unobserved. By contrast, if an individual \(i\) had not been assigned, then \(y_i|(X_i^*, Z=0)\) is observed, but \(y_i|(X_i^*, Z=1)\) is unobserved.
Another way to think about this is as a table, where the treatment effect for an individual involves comparing the outcomes reported in two columns of the same row, but where, in every row, the cell in one of these two columns is always missing:
| individual | outcome if treated | outcome if not treated | treatment effect |
|---|---|---|---|
| 1 | 4.8 | ?? | ?? |
| 2 | 3.7 | ?? | ?? |
| 3 | ?? | 2.3 | ?? |
| 4 | 3.1 | ?? | ?? |
| 5 | ?? | 3.4 | ?? |
| 6 | ?? | 2.9 | ?? |
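As a minimal sketch, the same table can be written as an R data frame, with `NA` marking the potential outcome that is never observed for each individual.

```r
# Potential outcomes: each individual has one observed and one missing outcome
po <- data.frame(
  individual     = 1:6,
  y_if_treated   = c(4.8, 3.7, NA, 3.1, NA, NA),
  y_if_untreated = c(NA,  NA, 2.3, NA, 3.4, 2.9)
)

# Because one potential outcome is always missing, the individual
# treatment effect can never be calculated from observed data alone:
po$treatment_effect <- po$y_if_treated - po$y_if_untreated  # all NA
```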
The Platinum Standard of causal effect estimation would therefore be if the missing cells in the outcome columns could be accurately filled in, allowing the treatment effect for each individual to be calculated.
However, this isn’t possible. It’s social science fiction, as we can’t split the universe and compare parallel realities: one in which what happened didn’t happen, and the other in which what didn’t happen happened.
So, what can be done?
There’s one thing you might be tempted to do with the kind of data shown in the table above: compare the average outcome in the treated group with the average outcome in the untreated group, i.e.:
\[ ATE = E(y | Z = 1) - E(y | Z = 0) \]
Let’s do this with the example above:
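```r
# Average outcome among the treated (Z = 1) and untreated (Z = 0) groups
e_y_z1 <- mean(c(4.8, 3.7, 3.1))
e_y_z0 <- mean(c(2.3, 3.4, 2.9))

# And the difference?
e_y_z1 - e_y_z0
```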
[1] 1
In this example, the difference in the averages between the two groups is 1.0. 1 Based on this, we might imagine that the first individual, who was treated, would have had a score of 3.8 rather than 4.8 if they had not been treated, and that the third individual, who was not treated, would have received a score of 3.3 rather than 2.3 if they had been treated.
So, what’s the problem with just comparing the averages in this way? Potentially, nothing. But potentially, a lot. It depends on the data and the problem. More specifically, it depends on the relationship between the assignment variable, \(Z\), and the other characteristics of the individual, which include, but are not usually entirely captured by, the known additional characteristics \(X_i^*\).
Let’s give a specific example: What if I were to tell you that the outcomes \(y_i\) were waiting times at public toilets/bathrooms, and the assignment variable, \(Z\), takes the value 1 if the individual has been assigned to a facility containing urinals, and 0 if the individual has been assigned to a facility containing no urinals? Would it be right to infer that the difference in the averages is the average causal effect of urinals in public toilets/bathrooms?
I’d suggest not, because there are characteristics of the individual which govern assignment to bathroom type. What this means is that \(Z_i\) and \(X_i^*\) are coupled, or related to each other in some way. So, any difference in the average outcome between those assigned to (or ‘treated with’) urinals and those who weren’t could be due to the urinals themselves; or it could be due to other ways in which ‘the treated’ and ‘the untreated’ differ from each other systematically. We may be able to observe a difference, and to report that it’s statistically significant. But we don’t know how much, if any, of that difference is due to the exposure or treatment of primary interest to us, and how much is due to other ways in which the ‘treated’ and ‘untreated’ groups differ.
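To see how this kind of coupling can mislead, here is a minimal sketch simulation in R. The numbers and the logistic assignment rule are made up purely for illustration: the true treatment effect is zero, but because assignment depends on a characteristic that also drives the outcome, the naive difference in group averages is far from zero.

```r
# A sketch: confounded assignment biases the naive comparison of group means
set.seed(123)
n <- 10000
w <- rnorm(n)                        # a characteristic of the individual
z <- rbinom(n, 1, plogis(2 * w))     # assignment is coupled to w
y <- 1 + 0 * z + 1.5 * w + rnorm(n)  # the true effect of z on y is zero

# Naive 'ATE': difference in average outcomes between the two groups
mean(y[z == 1]) - mean(y[z == 0])    # well above zero, despite no true effect
```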
So, we need some way of breaking the link between \(Z\) and \(X^*\). How do we do this?
The clue’s in the subheading. Randomised Controlled Trials (RCTs) are known as the Gold Standard for scientific evaluation of effects for a reason, and the reason is this: they’re explicitly designed to break the link between \(Z\) and \(X^*\). And not just \(X^*\), but any unobserved or unincluded characteristics of the individuals, \(W^*\), which might also otherwise influence assignment or selection to \(Z\), but which we either couldn’t measure or didn’t choose to include.
The key idea of an RCT is that assignment to either a treated or untreated group, or to any additional arms of the trial, has nothing to do with the characteristics of any individual in the trial. Instead, the allocation is random, determined by a figurative (or, historically, occasionally literal) coin toss. 2
What this random assignment means is that assignment \(Z\) should be unrelated to the known characteristics \(X^*\), as well as unknown characteristics \(W^*\). The technical term for this (if I remember correctly) is that assignment is orthogonal to other characteristics, represented algebraically as \(Z \perp X^*\) and \(Z \perp W^*\).
This doesn’t mean that, for any particular trial, there will be zero correlation between \(Z\) and other characteristics. Nor does it mean that the characteristics of participants will be the same across trial arms. Because of random variation there are always going to be differences between arms in any specific RCT. However, because we know the mechanism used to allocate participants to treated or non-treated groups (or more generally to trial arms), we also know that the expected difference in characteristics across many RCTs is zero. Along with the increased number of observations, this is the reason why, in principle, a meta-analysis of methodologically identical RCTs should offer even greater precision about the causal effect of a treatment than relying on a single RCT. 3
A key point to note is that, when analysing a properly conducted RCT to estimate a treatment effect, the ATE formula shown above, which is naive and likely to be biased when working with observational data, is likely to produce an unbiased estimate of the treatment effect. Because the trial design is sophisticated in the way it breaks the link between \(Z\) and everything else, the statistical analysis does not have to be sophisticated.
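Continuing the sketch simulation from above, but now with assignment made by a (simulated) coin toss rather than by the individual’s characteristics: the correlation between assignment and the characteristic is close to zero, and the naive difference in group averages lands close to the true effect.

```r
# A sketch: random assignment breaks the link between z and w
set.seed(123)
n <- 10000
w <- rnorm(n)                        # the same characteristic as before
z <- rbinom(n, 1, 0.5)               # random assignment: a coin toss
y <- 1 + 0 * z + 1.5 * w + rnorm(n)  # the true effect of z on y is still zero

cor(z, w)                            # close to, but not exactly, zero
mean(y[z == 1]) - mean(y[z == 0])    # close to the true effect of zero
```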
The flip side of this, however, is that when the data are observational, and it would be naive (as with the urinals and waiting times example) to assume that \(Z\) is unlinked to everything else known (\(X^*\)) and unknown (\(W^*\)), then more careful and bespoke statistical modelling approaches are likely to be required to recover unbiased causal effects. Such modelling approaches need to be mindful of both the Platinum and Gold Standards presented above, and rely on modelling and other assumptions to try to simulate what the treatment effects would be if these unobtainable (Platinum) and unobtained (Gold) standards had been obtained.
The next post will start to delve into some of these approaches.
This is pure fluke. I didn’t choose the values to get a difference of exactly 1, but there we go…↩︎
In the gold-plated gold standard of the double-blind RCT, not even the people running the trial and interacting with participants would be aware of which treatment a participant has been assigned. They would simply be given a participant ID, find a pack containing the participant’s treatment, and give this pack to the participant. Only a statistician, who has access to a random number cypher, would know which participants are assigned to which treatment, and they might not know until the trial has concluded. The idea of all of these layers of secrecy in assignment is to reduce the possibility that those running the experiment could intentionally or unintentionally inform participants about which treatment they’re receiving, and so create expectations in participants about the effectiveness or otherwise of the treatments, which could have an additional effect on the outcomes.↩︎
In practice, issues like methodological variation, and publication bias, mean that meta-analyses of RCTs are unlikely to provide as accurate and unbiased an estimate of treatment effect as we would hope for.↩︎