Causal inference course note - Week 1

These are my notes for the “A Crash Course in Causality: Inferring Causal Effects from Observational Data” course by Jason A. Roy on Coursera.





Week 1. Welcome and Introduction to Causal Effects

A brief history

Statisticians started working on causal modeling as far back as the 1920s (Wright 1921; Neyman 1923). It has been a distinct area of statistical research since about the 1970s.

Some highlights:

Going forward

In this course, we will primarily focus on causal inference from observational studies and natural experiments. It is important to remember that:

Potential outcomes and counterfactuals

Potential outcomes

Suppose we are interested in the causal effect of some treatment $A$ on some outcome $Y$.

Example:

Potential outcomes are the outcomes we would see under each possible treatment option. $Y^a$ is the outcome that would be observed if treatment were set to $A = a$. When the treatment is binary, each person has two potential outcomes: $Y^0$ and $Y^1$.

Example 1. Suppose the treatment is influenza vaccine and the outcome is the time until the individual gets the flu. In this case the potential outcomes are $Y^1$, the time until flu if vaccinated, and $Y^0$, the time until flu if not vaccinated.

Example 2. Suppose the treatment is regional ($A = 1$) versus general ($A = 0$) anesthesia for hip fracture surgery. The outcome ($Y$) is major pulmonary complications.

Counterfactuals

Counterfactual outcomes are ones that would have been observed, had the treatment been different.

If my treatment was $A = 1$, then my counterfactual outcome is $Y^0$. If my treatment was $A = 0$, then my counterfactual outcome is $Y^1$.

Example: Did influenza vaccine prevent me from getting the flu?

Before the treatment decision is made, any outcome is a potential outcome: $Y^0$ and $Y^1$. After the study, there is an observed outcome $Y = Y^A$ and a counterfactual outcome $Y^{1 - A}$.

Counterfactual outcomes $Y^0$, $Y^1$ are typically assumed to be the same as potential outcomes $Y^0$, $Y^1$.
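A tiny simulation of my own (the probabilities are arbitrary, not from the lecture) makes the bookkeeping explicit: both potential outcomes exist for every person, but the data only ever reveal $Y = Y^A$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Both potential outcomes "exist" for every person...
y0 = rng.binomial(1, 0.5, size=n)  # Y^0: outcome if untreated
y1 = rng.binomial(1, 0.3, size=n)  # Y^1: outcome if treated

a = rng.binomial(1, 0.5, size=n)   # treatment actually received
y = np.where(a == 1, y1, y0)       # ...but we only observe Y = Y^A

# The counterfactual outcome Y^{1-A} for each person stays hidden.
print("A:", a)
print("Y:", y)
```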

Hypothetical interventions

Intervention

It is cleanest to think of causal effects of interventions or actions, in other words, causal effects of variables that can be manipulated. Holland (1986): “No causation without manipulation.” Causal effects of (hypothetical) interventions are generally well defined.

Example: Outcome if prescribed drug A vs. outcome if prescribed drug B.

One version of treatment

It is common to assume there are no hidden versions of treatment.

Example: If we were interested in the causal effect of body mass index (BMI) on health outcomes, we have a problem because there are many ways to arrive at a given BMI (diet, exercise, surgery, illness), and the outcome could depend on which of these “versions” of the treatment was received.

Immutable variables

It is also less clear what a causal effect of an immutable variable would mean.

Example: Causal effect of race, gender, or age?

When we think of potential outcome $Y^a$, we imagine that we could, hypothetically, set treatment to $A = a$ and then see an outcome. With immutable variables this is not as well defined.

Manipulable vs. not manipulable

No direct intervention    Manipulable with intervention
Race                      Name on resume
Obesity                   Bariatric surgery
Socioeconomic status      Gift of money

Causal effects

For the remainder of the course, we will primarily focus on treatments/exposures that could be thought of as interventions.

Note: There are, of course, causal effects of variables like age, race, gender, and obesity, but they do not fit as cleanly in the potential outcomes framework.

We focus on causal effects of hypothetical interventions because:

In general, $A$ has a causal effect on $Y$ if $Y^1$ differs from $Y^0$.

Example:

This is not proper causal reasoning: “I took ibuprofen and my headache is gone, therefore the medicine worked.” If $Y = 1$ denotes “headache gone,” this statement only says that $Y^1 = 1$. What would have happened had you not taken ibuprofen, i.e., what is $Y^0$?

There is only a causal effect if $Y^1 \neq Y^0$.

Fundamental problem of causal inference

The fundamental problem of causal inference is that we can only observe one potential outcome for each person. However, with certain assumptions, we can estimate population level (average) causal effects.

Causal effects

Average causal effect

Given a population of interest, consider the following hypothetical worlds. In World 1, everyone in the population receives treatment $A = 1$; in World 2, everyone in the same population receives treatment $A = 0$.

Average causal effect is the difference between mean($Y$) in World 1 and mean($Y$) in World 2. Formally we can write this as:

\[\mathbb{E}(Y^1 - Y^0)\]

Example 1: Regional ($A = 1$) versus general ($A = 0$) anesthesia for hip fracture surgery on risk of major pulmonary complications.

Example 2: Treatment is thiazide diuretic ($A = 1$) or no treatment ($A = 0$) among hypertensive patients. Outcome $Y$ is systolic blood pressure.

Conditioning on vs. setting treatment

In general,

\[\mathbb{E}(Y^1 - Y^0) \neq \mathbb{E}(Y \vert A = 1) - \mathbb{E}(Y \vert A = 0)\]

Why? $\mathbb{E}(Y \vert A = 1)$ is the expected value of $Y$ given $A = 1$. This is restricting to the subpopulation of people who actually had $A = 1$. They might differ from the whole population in important ways. For example, people at higher risk for flu might be more likely to choose to get a flu shot.

$\mathbb{E}(Y \vert A = 1) - \mathbb{E}(Y \vert A = 0)$ is generally not a causal effect, because it is comparing two different populations of people. On the other hand, $\mathbb{E}(Y^1 - Y^0)$ is a causal effect, because it is comparing what would happen if the same people were treated with $A = 1$ vs. if the same people were treated with $A = 0$.

Other causal effects

We might be interested in other causal effects, depending on the particulars of our study, our research question, or even the data available to us. Some examples: the causal relative risk $\mathbb{E}(Y^1) / \mathbb{E}(Y^0)$, the causal effect of treatment on the treated $\mathbb{E}(Y^1 - Y^0 \vert A = 1)$, or the average causal effect in a subpopulation $\mathbb{E}(Y^1 - Y^0 \vert V = v)$.

Challenge

We only observe one treatment and one outcome for each person (fundamental problem of causal inference).

Causal assumptions

Identifiability

Identifiability of causal effects requires making some untestable assumptions. These are generally called causal assumptions. The most common are SUTVA, consistency, ignorability, and positivity, each described below.

Assumptions will be about the observed data: $Y$, $A$, and a set of pre-treatment covariates $X$.

SUTVA

The Stable Unit Treatment Value Assumption (SUTVA) really involves two assumptions: (1) no interference, meaning that one subject’s treatment does not affect another subject’s outcome, and (2) there is only one version of each treatment.

SUTVA allows us to write the potential outcome for the $i$-th person in terms of only that person’s treatment, i.e., $Y_i^a$ does not depend on the treatments other people receive.

Consistency

The consistency assumption: The potential outcome $Y^a$ under treatment $A = a$ is equal to the observed outcome if the actual treatment received is $A = a$. In other words,

\[Y = Y^a \text{ if } A = a, \text{ for all } a.\]

Ignorability

The ignorability assumption, also called the “no unmeasured confounders” assumption: given pre-treatment covariates $X$, treatment assignment is independent of the potential outcomes. Formally,

\[Y^0, Y^1 \perp\!\!\!\perp A ~\vert~ X\]

Among people with the same values of $X$, we can think of treatment $A$ as being “randomly” assigned. Here, “random” just means being independent of the potential outcomes.

Toy example: suppose older people ($X = 1$) are more likely to receive treatment $A = 1$, and are also more likely to have the outcome regardless of treatment.

Here, $Y^0$ and $Y^1$ are not (marginally) independent of $A$. However, within levels of $X$, treatment might be randomly assigned, in which case the ignorability assumption would hold.

Positivity

The positivity assumption:

\[P(A = a \vert X = x) > 0 \text{ for all } a \text{ and } x.\]

In the binary treatment setting, this essentially states that, for every set of values for $X$, treatment assignment was not deterministic. If, for some values of $X$, treatment was deterministic, then we would have no observed values of $Y$ for one of the treatment groups for those values of $X$.

Variability in treatment assignment is important for identification.
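Positivity can be eyeballed in data. A minimal sketch of my own (assuming a pandas DataFrame with a treatment column `A` and a stratum column `X`; `check_positivity` is a hypothetical helper, not course code):

```python
import pandas as pd

def check_positivity(df: pd.DataFrame, treatment: str = "A", stratum: str = "X") -> pd.DataFrame:
    """Cross-tabulate treatment by stratum and flag strata with an empty arm."""
    counts = pd.crosstab(df[stratum], df[treatment])
    counts["positivity_ok"] = (counts > 0).all(axis=1)
    return counts

# Toy data: the stratum X = 2 contains no treated subjects,
# so positivity fails there.
df = pd.DataFrame({"X": [0, 0, 1, 1, 2, 2], "A": [0, 1, 0, 1, 0, 0]})
print(check_positivity(df))
```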

Observed data and potential outcomes

We can put these assumptions together to identify causal effects from observed data. Note that $\mathbb{E}(Y \vert A = a, X = x)$ involves only observed data.

\[\begin{aligned} \mathbb{E}(Y \vert A = a, X = x) &= \mathbb{E}(Y^a \vert A = a, X = x) \text{ by consistency,} \\ &= \mathbb{E}(Y^a \vert X = x) \text{ by ignorability.} \end{aligned}\]

If we want a marginal causal effect, we can compute $\mathbb{E}(Y^a)$ by averaging over $X$.

Stratification / standardization

Conditioning and marginalizing

Previously we saw that under certain causal assumptions,

\[\mathbb{E}(Y \vert A = a, X = x) = \mathbb{E}(Y^a \vert X = x).\]

If we want a marginal causal effect, we can average over the distribution of $X$. For example, if $X$ is a single categorical variable, then

\[\mathbb{E}(Y^a) = \sum_{x} \mathbb{E}(Y \vert A = a, X = x) P(X = x).\]

This is known as standardization.

Standardization

Standardization involves stratifying and then averaging. We can obtain a treatment effect within each stratum and then pool across strata, weighting by the probability (size) of each stratum. From data, you could estimate a treatment effect by computing means under each treatment within each stratum, and then pooling across strata.
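In code, this stratify-then-average is only a few lines. A sketch of my own (assuming a pandas DataFrame with outcome `Y`, binary treatment `A`, and a single categorical confounder `X`):

```python
import pandas as pd

def standardized_mean(df: pd.DataFrame, a: int) -> float:
    """Estimate E(Y^a) = sum_x E(Y | A=a, X=x) * P(X=x)."""
    p_x = df["X"].value_counts(normalize=True)          # P(X = x)
    mean_y = df[df["A"] == a].groupby("X")["Y"].mean()  # E(Y | A=a, X=x)
    # Strata with no A == a subjects drop out silently here,
    # which is exactly a positivity violation.
    return (mean_y * p_x).sum()

# Tiny example: E(Y^1) = (0.0)(0.5) + (0.5)(0.5) = 0.25
df = pd.DataFrame({"X": [0, 0, 0, 0, 1, 1, 1, 1],
                   "A": [0, 0, 1, 1, 0, 0, 1, 1],
                   "Y": [0, 1, 0, 0, 1, 1, 1, 0]})
print(standardized_mean(df, a=1))
```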

Standardization Example

Consider a study comparing two diabetes treatments, where we have new initiators of saxagliptin vs. sitagliptin. The outcome is MACE (major adverse cardiovascular events), and the confounder of interest is prior use of oral antidiabetic (OAD) drugs.

Raw data (unstratified):

          MACE=yes   MACE=no   Total
Saxa=yes       350      3650    4000
Saxa=no        500      6500    7000
Total          850     10150   11000

But this crude comparison (350/4000 ≈ 8.8% vs. 500/7000 ≈ 7.1%) does not say anything about the causal effect, because the two treatment groups may differ in important ways. When we stratify the data by prior OAD use, we have the following tables.

  1. Prior OAD use = no:

               MACE=yes   MACE=no   Total
     Saxa=yes        50       950    1000
     Saxa=no        200      3800    4000
     Total          250      4750    5000

  2. Prior OAD use = yes:

               MACE=yes   MACE=no   Total
     Saxa=yes       300      2700    3000
     Saxa=no        300      2700    3000
     Total          600      5400    6000

Observe that saxagliptin users are more likely to have prior OAD use (3000/4000 = 75% vs. 3000/7000 ≈ 43%). Also, people with prior OAD use are at higher risk for MACE regardless of treatment (600/6000 = 10% vs. 250/5000 = 5%).

Within each stratum, we observe no difference in effectiveness between the two drugs: the risk of MACE is 5% on either drug among those without prior OAD use, and 10% on either drug among those with prior OAD use.

Mean potential outcome for saxagliptin is:

\[\begin{aligned} \mathbb{E}(Y^{\text{saxa}}) &= \frac{300}{3000} \cdot \frac{6000}{11000} + \frac{50}{1000} \cdot \frac{5000}{11000} \\ &= 0.077 \end{aligned}\]

Mean potential outcome for sitagliptin is:

\[\begin{aligned} \mathbb{E}(Y^{\text{sita}}) &= \frac{300}{3000} \cdot \frac{6000}{11000} + \frac{200}{4000} \cdot \frac{5000}{11000} \\ &= 0.077 \end{aligned}\]

Hence, the average causal effect is zero.
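As a sanity check, these standardized means can be reproduced with plain arithmetic, using only counts from the stratified tables above (a small script of my own):

```python
# Stratum sizes: prior OAD use = no (5000), yes (6000), total 11000
n_no, n_yes, n = 5000, 6000, 11000

# E(Y | A = a, X = x): risk of MACE in each (drug, stratum) cell
risk_saxa = {"no": 50 / 1000, "yes": 300 / 3000}   # 5% and 10%
risk_sita = {"no": 200 / 4000, "yes": 300 / 3000}  # 5% and 10%

# Standardize: weight stratum-specific risks by P(X = x)
e_y_saxa = risk_saxa["no"] * n_no / n + risk_saxa["yes"] * n_yes / n
e_y_sita = risk_sita["no"] * n_no / n + risk_sita["yes"] * n_yes / n

print(f"E(Y^saxa) = {e_y_saxa:.3f}")             # 0.077
print(f"E(Y^sita) = {e_y_sita:.3f}")             # 0.077
print(f"ACE       = {e_y_saxa - e_y_sita:.3f}")  # 0.000
```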

Problem with standardization

Typically, there will be many $X$ variables needed to achieve ignorability. In this case stratification would lead to many empty cells. For example, if you stratify on age and blood pressure, there will be many combinations of age and blood pressure for which you have no data. Thus we need alternatives to standardization.
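A quick made-up simulation (mine, not from the course) illustrates the problem: with 1,000 people stratified on integer age and integer systolic blood pressure, nearly every occupied stratum is missing one of the two treatment arms.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

df = pd.DataFrame({
    "age": rng.integers(40, 80, size=n),    # 40 possible ages
    "sbp": rng.integers(100, 180, size=n),  # 80 possible SBP values
    "A": rng.binomial(1, 0.5, size=n),
})

# Fraction of occupied (age, sbp) strata lacking one treatment arm
counts = df.groupby(["age", "sbp"])["A"].agg(["size", "nunique"])
frac_bad = (counts["nunique"] < 2).mean()
print(f"{frac_bad:.0%} of occupied strata are missing a treatment arm")
```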

In the coming weeks, we will explore several popular methods for estimating causal effects: matching, inverse probability of treatment weighting, propensity score methods, etc.
