Causal inference course note - Week 1
This is my note for the “A Crash Course in Causality: Inferring Causal Effects from Observational Data” course by Jason A. Roy on Coursera.
- Week 1. Welcome and Introduction to Causal Effects
- Week 2. Confounding and Directed Acyclic Graphs (DAGs)
- Week 3. Matching and Propensity Scores
- Week 4. Inverse Probability of Treatment Weighting (IPTW)
- Week 5. Instrumental Variables Methods
Table of Contents
- Week 1. Welcome and Introduction to Causal Effects
Week 1. Welcome and Introduction to Causal Effects
A brief history
Statisticians started working on causal modeling as far back as the 1920s (Wright 1921; Neyman 1923). It became its own area of statistical research around the 1970s.
Some highlights:
- Re-introduction of potential outcomes; Rubin causal model (Rubin 1974).
- Causal diagrams (Greenland and Robins 1986; Pearl 2000).
- Propensity scores (Rosenbaum and Rubin 1983).
- Time-dependent confounding (Robins 1986; Robins 1997).
- Optimal dynamic treatment strategies (Murphy 2003; Robins 2004).
Going forward
In this course, we will primarily focus on causal inference from observational studies and natural experiments. It is important to remember that:
- Causal inference requires making some untestable assumptions (“causal assumptions”).
- Cochran (1972) concludes: “… observational studies are an interesting and challenging field which demands a good deal of humility, since we can claim only to be groping toward the truth.”
Potential outcomes and counterfactuals
Potential outcomes
Suppose we are interested in the causal effect of some treatment $A$ on some outcome $Y$.
Example:
- $A = 1$ if received influenza vaccine; $A=0$ otherwise.
- $Y = 1$ if develop cardiovascular disease within 2 years; $Y = 0$ otherwise.
Potential outcomes are the outcomes we would see under each possible treatment option. $Y^a$ is the outcome that would be observed if treatment was set to $A = a$. When the treatment is binary, each person has potential outcomes $Y^0$, $Y^1$.
Example 1. Suppose treatment is influenza vaccine and the outcome is the time until the individual gets the flu. In this case potential outcomes are:
- $Y^1$: time until the individual would get the flu if they received the flu vaccine.
- $Y^0$: time until the individual would get the flu if they did not receive the flu vaccine.
Example 2. Suppose the treatment is regional ($A = 1$) versus general ($A = 0$) anesthesia for hip fracture surgery. The outcome ($Y$) is major pulmonary complications.
- $Y^1$: 1 if major pulmonary complications, 0 otherwise, if given regional anesthesia.
- $Y^0$: 1 if major pulmonary complications, 0 otherwise, if given general anesthesia.
Counterfactuals
Counterfactual outcomes are ones that would have been observed, had the treatment been different.
If my treatment was $A = 1$, then my counterfactual outcome is $Y^0$. If my treatment was $A = 0$, then my counterfactual outcome is $Y^1$.
Example: Did influenza vaccine prevent me from getting the flu?
- What actually happened:
- I got the vaccine and did not get sick.
- My actual exposure was $A = 1$.
- My observed outcome was $Y = Y^1$.
- What would have happened (contrary to fact):
- Had I not gotten the vaccine, would I have gotten sick?
- My counterfactual exposure is $A = 0$.
- My counterfactual outcome is $Y^0$.
Before the treatment decision is made, any outcome is a potential outcome: $Y^0$ and $Y^1$. After the study, there is an observed outcome $Y = Y^A$ and a counterfactual outcome $Y^{1 - A}$.
Counterfactual outcomes $Y^0$, $Y^1$ are typically assumed to be the same as potential outcomes $Y^0$, $Y^1$.
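To make the bookkeeping concrete, here is a minimal Python sketch (mine, not from the course) of how observed and counterfactual outcomes relate under a binary treatment; the simulated values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Hypothetical potential outcomes for n individuals. In reality we never
# observe both columns for the same person; only a simulation can show both.
y0 = rng.integers(0, 2, size=n)  # outcome that would occur under A = 0
y1 = rng.integers(0, 2, size=n)  # outcome that would occur under A = 1
a = rng.integers(0, 2, size=n)   # treatment actually received

# Observed outcome: Y = Y^A, equivalently Y = A*Y1 + (1 - A)*Y0.
y = np.where(a == 1, y1, y0)
print(np.column_stack([a, y0, y1, y]))  # the unused column, Y^(1-A), is the counterfactual
```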
Hypothetical interventions
Intervention
It is cleanest to think of causal effects of interventions or actions, in other words, causal effects of variables that can be manipulated. Holland (1986): “No causation without manipulation.” Causal effects of (hypothetical) interventions are generally well defined.
Example: Outcome if prescribed drug A vs. outcome if prescribed drug B.
One version of treatment
It is common to assume there are no hidden versions of treatment.
Example: If we were interested in the causal effect of body mass index (BMI) on health outcomes, we have a problem because:
- There are many potential ways in which one could achieve a BMI of a particular value. These different ways might also be associated with different outcomes.
- Weight is not directly manipulable. It is better to think of causal effects of interventions that aim at manipulating weight.
Immutable variables
It is also less clear what a causal effect of an immutable variable would mean.
Example: Causal effect of race, gender, or age?
When we think of potential outcome $Y^a$, we imagine that we could, hypothetically, set treatment to $A = a$ and then see an outcome. With immutable variables this is not as well defined.
Manipulable vs. not manipulable
| No direct intervention | Manipulable with intervention |
| --- | --- |
| Race | Name on resume |
| Obesity | Bariatric surgery |
| Socioeconomic status | Gift of money |
Causal effects
For the remainder of the course, we will primarily focus on treatments/exposures that could be thought of as interventions.
- Treatments that we can imagine being randomized (manipulated) in a hypothetical trial (even if they may be difficult to perform due to ethical reasons, etc. in reality).
Note: There are, of course, causal effects of variables like age, race, gender, and obesity, but they do not fit as cleanly in the potential outcomes framework.
We focus on causal effects of hypothetical interventions because:
- Their meaning is well-defined.
- They are potentially actionable.
In general, $A$ has a causal effect on $Y$ if $Y^1$ differs from $Y^0$.
Example:
- $Y$: headache gone one hour from now (Yes = 1, No = 0).
- $A$: take ibuprofen ($A = 1$) or not ($A = 0$).
This is not proper causal reasoning: “I took ibuprofen and my headache is gone, therefore the medicine worked.” This statement is equivalent to $Y^1 = 1$. What would have happened had you not taken ibuprofen, i.e., what is $Y^0$?
There is only a causal effect if $Y^1 \neq Y^0$.
Fundamental problem of causal inference
The fundamental problem of causal inference is that we can only observe one potential outcome for each person. However, with certain assumptions, we can estimate population level (average) causal effects.
- Hopeless: What would have happened to me had I not taken ibuprofen? (unit level or individual level causal effect)
- Possible: What would the rate of headache remission be if everyone took ibuprofen when they had a headache vs. if no one did? (population level causal effect)
Causal effects
Average causal effect
Given a population of interest, consider the following hypothetical worlds:
- World 1: Everyone gets $A = 0$ $\implies$ compute mean($Y$).
- World 2: Everyone gets $A = 1$ $\implies$ compute mean($Y$).
The average causal effect is the difference between mean($Y$) in World 2 and mean($Y$) in World 1. Formally, we can write this as
\[\mathbb{E}(Y^1 - Y^0)\]
- Average value of $Y$ if everyone was treated with $A = 1$ minus the average value of $Y$ if everyone was treated with $A = 0$.
- If $Y$ is binary, this is a risk difference.
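As a sanity check, the average causal effect is easy to compute in a simulation where, unlike in real data, both potential outcomes are known. A minimal sketch; the 0.20 and 0.10 risks are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Simulated population in which we (as the simulator) know both potential outcomes.
y0 = rng.binomial(1, 0.20, size=n)  # risk of the outcome under A = 0 (assumed)
y1 = rng.binomial(1, 0.10, size=n)  # risk of the outcome under A = 1 (assumed)

# E(Y^1 - Y^0); since Y is binary, this is a risk difference.
print(f"average causal effect: {(y1 - y0).mean():.3f}")  # approximately -0.10
```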
Example 1: Regional ($A = 1$) versus general ($A = 0$) anesthesia for hip fracture surgery on risk of major pulmonary complications.
- Suppose $\mathbb{E}(Y^1 - Y^0) = -0.1$
- Probability of major pulmonary complications is lower by 0.1 if given regional anesthesia compared with general anesthesia.
- If 1000 people were going to have hip fracture surgery, we would expect 100 fewer people to have pulmonary complications under regional anesthesia compared with general anesthesia.
Example 2: Treatment is thiazide diuretic ($A = 1$) or no treatment ($A = 0$) among hypertensive patients. Outcome $Y$ is systolic blood pressure.
- Suppose $\mathbb{E}(Y^1 - Y^0) = -20$ mmHg.
- If the population of hypertensive patients took thiazide diuretics, their average systolic blood pressure would be 20 mmHg lower than if they did not take anti-hypertensive medication.
Conditioning on vs. setting treatment
In general,
\[\mathbb{E}(Y^1 - Y^0) \neq \mathbb{E}(Y \vert A = 1) - \mathbb{E}(Y \vert A = 0).\]
Why? $\mathbb{E}(Y \vert A = 1)$ is the expected value of $Y$ given $A = 1$. This restricts to the subpopulation of people who actually had $A = 1$, who might differ from the whole population in important ways. For example, people at higher risk for the flu might be more likely to choose to get a flu shot.
- $\mathbb{E}(Y \vert A = 1)$: Mean of $Y$ among people with $A = 1$.
- $\mathbb{E}(Y^1)$: Mean of $Y$ if the whole population was treated with $A = 1$.
$\mathbb{E}(Y \vert A = 1) - \mathbb{E}(Y \vert A = 0)$ is generally not a causal effect, because it is comparing two different populations of people. On the other hand, $\mathbb{E}(Y^1 - Y^0)$ is a causal effect, because it is comparing what would happen if the same people were treated with $A = 1$ vs. if the same people were treated with $A = 0$.
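The inequality is easy to see in a simulation. In the sketch below (all probabilities invented for illustration), treatment has no effect at all, i.e. $Y^1 = Y^0$ for everyone, yet the naive contrast is far from zero because high-risk people are more likely to be treated.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# x = 1 marks high-risk individuals, who are also more likely to get treated.
x = rng.binomial(1, 0.5, size=n)
a = rng.binomial(1, np.where(x == 1, 0.8, 0.2))

# Potential outcomes: treatment does nothing, but risk depends on x.
y0 = rng.binomial(1, np.where(x == 1, 0.30, 0.10))
y1 = y0.copy()                 # null effect: Y^1 = Y^0 for everyone
y = np.where(a == 1, y1, y0)   # observed outcome, Y = Y^A (consistency)

naive = y[a == 1].mean() - y[a == 0].mean()  # E(Y | A = 1) - E(Y | A = 0)
causal = (y1 - y0).mean()                    # E(Y^1 - Y^0)
print(f"conditioning on treatment: {naive:.3f}")   # about 0.12 (confounding)
print(f"setting treatment:         {causal:.3f}")  # exactly 0.000
```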
Other causal effects
We might be interested in other causal effects, depending on the particulars of our study, our research question, or even the data available to us. Some examples:
- $\mathbb{E}(Y^1) / \mathbb{E}(Y^0)$: Causal relative risk.
- $\mathbb{E}(Y^1 - Y^0 \vert A = 1)$: Causal effect of treatment on the treated. We may be interested in how well treatment works among treated people.
- $\mathbb{E}(Y^1 - Y^0 \vert V = v)$: Average causal effect in the subpopulation with covariate $V = v$.
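With simulated potential outcomes in hand, each of these estimands is a one-liner; in real data they would require the causal assumptions discussed below. The covariate $V$ and all probabilities here are invented.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

v = rng.binomial(1, 0.4, size=n)       # binary covariate V (illustrative)
y0 = rng.binomial(1, 0.20 + 0.10 * v)  # untreated risk, higher when V = 1
y1 = rng.binomial(1, 0.10 + 0.10 * v)  # treated risk (assumed values)
a = rng.binomial(1, 0.30 + 0.40 * v)   # treatment is more common when V = 1

print("causal relative risk:    ", y1.mean() / y0.mean())     # E(Y^1) / E(Y^0)
print("effect on the treated:   ", (y1 - y0)[a == 1].mean())  # E(Y^1 - Y^0 | A = 1)
print("effect in V = 1 stratum: ", (y1 - y0)[v == 1].mean())  # E(Y^1 - Y^0 | V = 1)
```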
Challenge
We only observe one treatment and one outcome for each person (fundamental problem of causal inference).
- How do we use observed data to link observed outcomes to potential outcomes?
- What assumptions are necessary to estimate causal effects from observed data?
Causal assumptions
Identifiability
Identifiability of causal effects requires making some untestable assumptions. These are generally called causal assumptions. The most common are:
- Stable Unit Treatment Value Assumption (SUTVA)
- Consistency
- Ignorability
- Positivity
Assumptions will be about the observed data: $Y$, $A$, and a set of pre-treatment covariates $X$.
SUTVA
The Stable Unit Treatment Value Assumption (SUTVA) really involves two assumptions.
- No interference.
- Units do not interfere with each other, i.e. treatment assignment of one unit does not affect the outcome of another unit.
- Interference is also called spillover or contagion.
- One version of treatment.
SUTVA allows us to write the potential outcome for the $i$-th person in terms of only that person’s treatment.
Consistency
The consistency assumption: The potential outcome $Y^a$ under treatment $A = a$ is equal to the observed outcome if the actual treatment received is $A = a$. In other words,
\[Y = Y^a \text{ if } A = a, \text{ for all } a.\]
Ignorability
The ignorability assumption, also called the “no unmeasured confounders” assumption: given pre-treatment covariates $X$, treatment assignment is independent of the potential outcomes. Formally,
\[Y^0, Y^1 \perp\!\!\!\perp A ~\vert~ X.\]
Among people with the same values of $X$, we can think of treatment $A$ as being “randomly” assigned. Here, “random” just means being independent of the potential outcomes.
Toy example:
- $X$ is a single variable (age) that can take values “younger” or “older”.
- Older people are more likely to get treatment $A = 1$.
- Older people are also more likely to have the outcome (hip fracture), regardless of treatment.
Here, $Y^0$ and $Y^1$ are not (marginally) independent of $A$. However, within levels of $X$, treatment might be randomly assigned, in which case the ignorability assumption would hold.
Positivity
The positivity assumption:
\[P(A = a \vert X = x) > 0 \text{ for all } a \text{ and } x.\]
In the binary treatment setting, this essentially states that, for every set of values for $X$, treatment assignment was not deterministic. If, for some values of $X$, treatment was deterministic, then we would have no observed values of $Y$ for one of the treatment groups for those values of $X$.
Variability in treatment assignment is important for identification.
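In practice, one quick diagnostic is to tabulate empirical treatment probabilities within each stratum of $X$ and look for zeros. A minimal pandas sketch with made-up toy data:

```python
import pandas as pd

# Toy observed data: every stratum of X should contain both treatment groups;
# an empirical probability of 0 flags a positivity violation.
df = pd.DataFrame({
    "X": ["younger", "younger", "younger", "older", "older", "older"],
    "A": [0, 0, 0, 0, 1, 1],
})

print(pd.crosstab(df["X"], df["A"], normalize="index"))
# Here P(A = 1 | X = younger) = 0: no treated younger subjects were observed.
```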
Observed data and potential outcomes
We can put these assumptions together to identify causal effects from observed data. Note that $\mathbb{E}(Y \vert A = a, X = x)$ involves only observed data.
\[\begin{aligned} \mathbb{E}(Y \vert A = a, X = x) &= \mathbb{E}(Y^a \vert A = a, X = x) \text{ by consistency,} \\ &= \mathbb{E}(Y^a \vert X = x) \text{ by ignorability.} \end{aligned}\]
If we want a marginal causal effect, we can compute $\mathbb{E}(Y^a)$ by averaging over $X$.
Stratification / standardization
Conditioning and marginalizing
Previously we saw that under certain causal assumptions,
\[\mathbb{E}(Y \vert A = a, X = x) = \mathbb{E}(Y^a \vert X = x).\]
If we want a marginal causal effect, we can average over the distribution of $X$. For example, if $X$ is a single categorical variable, then
\[\mathbb{E}(Y^a) = \sum_{x} \mathbb{E}(Y \vert A = a, X = x) P(X = x).\]
This is known as standardization.
Standardization
Standardization involves stratifying and then averaging. We can obtain a treatment effect within each stratum and then pool across strata, weighting by the probability (size) of each stratum. From data, you could estimate a treatment effect by computing means under each treatment within each stratum, and then pooling across strata, as in the sketch below.
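A compact sketch of this stratify-then-average recipe, assuming a pandas DataFrame `df` with columns `Y`, `A`, and `X` (the function name and layout are mine, not the course’s):

```python
import pandas as pd

def standardized_mean(df: pd.DataFrame, a) -> float:
    """Estimate E(Y^a) = sum_x E(Y | A = a, X = x) * P(X = x)."""
    p_x = df["X"].value_counts(normalize=True)  # P(X = x)
    # E(Y | A = a, X = x), reindexed so strata with no A = a units show as NaN.
    stratum_means = df[df["A"] == a].groupby("X")["Y"].mean().reindex(p_x.index)
    # skipna=False makes a positivity violation surface as NaN, not a wrong number.
    return (stratum_means * p_x).sum(skipna=False)

# Risk difference via standardization:
# ace = standardized_mean(df, 1) - standardized_mean(df, 0)
```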
Standardization Example
Consider a study comparing two diabetes treatments, where we have new initiators of saxagliptin vs. sitagliptin.
- Outcome: Major Adverse Cardiac Event (MACE).
- Challenge:
- Saxagliptin users were more likely to have had past use of some other oral antidiabetic (OAD) drug.
- Patients with past use of OAD drugs are at higher risk for MACE.
- Main idea:
- Compute rate of MACE for saxagliptin and sitagliptin initiators in two subpopulations:
- patients who have had no prior OAD use.
- patients who have had prior OAD use.
- Then take weighted average, where weights are based on proportion of people in each subpopulation.
- This is a causal effect if, within levels of the prior OAD use variable, treatment can be thought of as randomized (i.e. ignorability given prior OAD use).
Raw data (unstratified):
|  | MACE=yes | MACE=no | Total |
| --- | --- | --- | --- |
| Saxa=yes | 350 | 3650 | 4000 |
| Saxa=no | 500 | 6500 | 7000 |
| Total | 750 | 10250 | 11000 |
- $P(\text{MACE} \vert \text{Saxa=yes}) = 350/4000 = 0.088$
- $P(\text{MACE} \vert \text{Saxa=no}) = 500/7000 = 0.071$
But this does not say anything about the causal effect. When we stratify the data by prior OAD use, we have the following tables.
Prior OAD use = no:

|  | MACE=yes | MACE=no | Total |
| --- | --- | --- | --- |
| Saxa=yes | 50 | 950 | 1000 |
| Saxa=no | 200 | 3800 | 4000 |
| Total | 250 | 4750 | 5000 |

Prior OAD use = yes:

|  | MACE=yes | MACE=no | Total |
| --- | --- | --- | --- |
| Saxa=yes | 300 | 2700 | 3000 |
| Saxa=no | 300 | 2700 | 3000 |
| Total | 600 | 5400 | 6000 |
Observe that saxagliptin users are more likely to have prior OAD use, and that people with prior OAD use are at higher risk for MACE (regardless of treatment).
- For Prior OAD use = no:
- $P(\text{MACE} \vert \text{Saxa=yes}) = 50/1000 = 0.05$
- $P(\text{MACE} \vert \text{Saxa=no}) = 200/4000 = 0.05$
- For Prior OAD use = yes:
- $P(\text{MACE} \vert \text{Saxa=yes}) = 300/3000 = 0.10$
- $P(\text{MACE} \vert \text{Saxa=no}) = 300/3000 = 0.10$
Within each stratum, we observe no difference in effectiveness between the two drugs.
Mean potential outcome for saxagliptin is:
\[\begin{aligned} \mathbb{E}(Y^{\text{saxa}}) &= \frac{300}{3000} \cdot \frac{6000}{11000} + \frac{50}{1000} \cdot \frac{5000}{11000} \\ &= 0.077 \end{aligned}\]
Mean potential outcome for sitagliptin is:
\[\begin{aligned} \mathbb{E}(Y^{\text{sita}}) &= \frac{300}{3000} \cdot \frac{6000}{11000} + \frac{200}{4000} \cdot \frac{5000}{11000} \\ &= 0.077 \end{aligned}\]
Hence, the average causal effect is zero.
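The same arithmetic in a few lines of Python, reading the stratum sizes and stratum-specific risks directly off the tables above:

```python
# P(X = x): proportion of the 11,000 patients in each prior-OAD-use stratum.
p_x = {"no": 5_000 / 11_000, "yes": 6_000 / 11_000}

# E(Y | A = a, X = x): stratum-specific MACE risks from the 2x2 tables.
risk_saxa = {"no": 50 / 1_000, "yes": 300 / 3_000}
risk_sita = {"no": 200 / 4_000, "yes": 300 / 3_000}

e_saxa = sum(risk_saxa[x] * p_x[x] for x in p_x)  # E(Y^saxa)
e_sita = sum(risk_sita[x] * p_x[x] for x in p_x)  # E(Y^sita)
print(f"E(Y^saxa) = {e_saxa:.3f}, E(Y^sita) = {e_sita:.3f}")  # 0.077 each
print(f"average causal effect = {e_saxa - e_sita:.3f}")       # 0.000
```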
Problem with standardization
Typically, there will be many $X$ variables needed to achieve ignorability. In this case stratification would lead to many empty cells. For example, if you stratify on age and blood pressure, there will be many combinations of age and blood pressure for which you have no data. Thus we need alternatives to standardization.
In the coming weeks, we will explore several popular methods for estimating causal effects: matching, inverse probability of treatment weighting, propensity score methods, etc.