How do we infer causality?

Much of our day-to-day analytics concerns analyzing correlations among variables, e.g., the correlation between weather and oil futures, and we increasingly use advanced machine learning (or AI) tools to uncover ever subtler correlations between variables. Correlation describes the statistical relationship between two observed variables. Correlation has no necessary relationship with cause and effect. You can measure the correlation between happiness and longevity with great precision and yet know nothing about whether making someone happier will improve their longevity. (Suppose happier people live longer. Perhaps these people are happy precisely because they feel fortunate to have lived so long.)

Science advances primarily through analyzing cause-and-effect relationships, not by documenting correlations (though correlations are not irrelevant). At Maplerivertree, we believe that good causal inference can help businesses form first principles.

The fundamental problem of causal inference

Causal questions, however, are much harder to answer than correlational questions. Here’s why: correlations are readily measured from observational data, but causal effects can never be measured directly (to be explained). The reason causal questions are difficult to answer is that they intrinsically concern a counterfactual state, that is, something that by definition did not occur. To be concrete, let’s examine a simple case where we have a binary causal variable X ∈ {0, 1} and a response variable Y (which may be continuous). If X has a causal effect on Y, this implies that the value of Y we observe depends on the value of X. We can define Y0 and Y1 as the values that Y would take if X were equal to 0 and 1, respectively. Y0 and Y1 are counterfactuals of one another. If X = 0, then we will have Y = Y0. The counterfactual question is: what value of Y would we have observed if instead we had X = 1? Obviously, the answer is Y1. But Y1 is a notional concept. We can posit that both Y0 and Y1 have well-defined values, but we will never see both of them.

Let Yi be the outcome of interest for unit i, where i could be a person, a cell, a drop of water, or a sovereign country. We’ll suppress the i subscript where possible. We want to consider two possible outcomes for i. Let Y0 be the value of Y for X = 0 and Y1 be the value of Y for X = 1, as above. Thus, for every unit i, we can imagine two potential outcomes {Y0, Y1} that we would observe if the unit were untreated (X = 0) or treated (X = 1). We observe either Y0 or Y1, but we assume that both are well defined. That is, there is a precise alternative state of the world that would have occurred had we chosen X = 1 instead of X = 0, or vice versa.

In this framework, the causal effect of X on Y is T = Y1 − Y0, where T stands for Treatment Effect. The problem that this immediately reveals is that we never observe Y1 − Y0 for an individual i. Instead, we observe Yi = Y1i Xi + Y0i (1 − Xi).

That is, we observe Y1 or Y0 but not both.

Fundamental Problem of Causal Inference: It is not possible to observe the values Y1i and Y0i for the same unit i, so we cannot measure the causal effect of X on Y for unit i.
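To make the switching equation concrete, here is a minimal simulation sketch in Python (all variable names and numbers are invented for illustration). Inside the simulation we know both potential outcomes, but the observed data reveal only one per unit:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Hypothetical potential outcomes for n units.
y0 = rng.normal(200.0, 10.0, n)   # Y0: outcome if untreated (X = 0)
y1 = y0 - 20.0                    # Y1: outcome if treated (X = 1); true T_i = -20
x = rng.integers(0, 2, n)         # treatment indicator X

# The switching equation: Yi = Y1i*Xi + Y0i*(1 - Xi).
y_obs = y1 * x + y0 * (1 - x)

# From (y_obs, x) alone, T_i = y1 - y0 cannot be computed for any unit,
# because the other potential outcome is never observed.
```

The simulation "knows" that T_i = −20 for everyone only because we wrote it down; an analyst holding just `y_obs` and `x` faces the fundamental problem above.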

Natural question: Why can’t we just switch X from 0 to 1 and back to observe both Y1 and Y0? In fact, this procedure is not informative about Y1 − Y0 without further assumptions (discussed below). One useful observation to build intuition: many causal relationships are irreversible. If X corresponds to attending MIT vs. another college and Y corresponds to post-college earnings, we can either observe your post-MIT earnings or your post-non-MIT earnings, not both.

Solving the fundamental problem of causal inference

Since the problem is fundamental, there is no solution. But there are several “work-arounds.”

Work-around I: Postulate stability and reversibility (AKA ‘causal transience’)

One work-around is to assume stability and reversibility (what Holland calls temporal stability and causal transience). If the causal effect of X on Y is the same at every point in time (now and in the future) and the causal effect of X on Y is reversible (so that having once been exposed to X doesn’t permanently change the effect of X on Y), then we can observe Y1i − Y0i simply by repeatedly changing X from 0 to 1. Formally, these assumptions are: Y1it = Y1i and Y0it = Y0i for all i and t, where t indexes time. Of course, temporal stability and causal transience are postulates. They cannot be tested.

Example: You can turn water from ice to steam and back repeatedly to analyze the causal effect of temperature change on water molecules. But what allows you to make the causal inference that steam is the counterfactual for ice when the treatment is 100 degrees versus 0 degrees Celsius are the postulates that (1) water molecules are not fundamentally altered by heating and cooling; and (2) the relationship between temperature and the behavior of water is stable (e.g., does not depend on the phase of the moon).

Counter-example: It would probably not be valid to assess the effectiveness of a treatment for high cholesterol for patient i by repeatedly administering the cholesterol-reducing treatment, testing the patient’s cholesterol level, then withdrawing the treatment, testing the patient’s cholesterol level again, and so on. Cholesterol levels are sluggish state variables, and they might be permanently affected by even a one-time treatment.

Work-around II: Postulate homogeneity

We may alternatively assume unit homogeneity. If Y1i and Y0i are identical for all i, we can measure the causal effect of X on Y simply by taking the difference Y1i − Y0j for i ≠ j. Of course, unit homogeneity is also a postulate; one cannot know that two things are identical in all respects. But under certain laboratory conditions, unit homogeneity seems quite reasonable (e.g., experimenting with two molecules of water). This assumption would clearly be invalid for two cholesterol patients, or for any two people more generally.

Work-around III: Estimate causal effects for populations rather than individuals

For human subjects, neither (1) temporal stability and causal transience nor (2) unit homogeneity can plausibly be expected to hold in any setting. No two people are alike. And no one person is identical to him or herself at a different point in time.

We should therefore acknowledge that we will never be able to credibly estimate Ti = Y1i − Y0i for a person i. We might, however, be satisfied to settle for some kind of population average treatment effect instead: T∗ = E[Y1 − Y0|X = 1], where E[·] is the expectations operator, denoting the mean of a random variable. This expression defines the Average Treatment Effect for the Treated (ATT), that is, the causal effect of the treatment on the people who received the treatment (i.e., those for whom X = 1).

The ATT should be distinguished from the Average Treatment Effect (ATE), defined as T† = E[Y1 − Y0]. The difference between T∗ and T† is this: the ATT measures the causal effect only for those who receive treatment, whereas the ATE is the causal effect one would notionally obtain if everyone were treated. These can be quite different. The ATT for a cholesterol-lowering drug given to morbidly obese patients is probably not comparable to the ATE for a cholesterol-lowering drug given to the entire population of adults.
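The gap between the ATT and the ATE is easy to see in a simulation where the treatment effect is heterogeneous and treatment tends to go to those who benefit most. A sketch (the effect sizes and the selection rule are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Baseline (untreated) cholesterol.
y0 = rng.normal(200.0, 30.0, n)
# Hypothetical heterogeneous effect: the drug helps high-cholesterol patients more.
y1 = y0 - 10.0 - 0.1 * (y0 - 200.0)
# Treatment is taken mostly by people with high baseline cholesterol.
x = (y0 + rng.normal(0.0, 20.0, n) > 230.0).astype(int)

ate = (y1 - y0).mean()           # effect if everyone were treated: about -10
att = (y1 - y0)[x == 1].mean()   # effect among the treated: noticeably larger drop
```

Because the treated group is drawn from the high-cholesterol end of the distribution, where the (simulated) drug works best, `att` is several points more negative than `ate`.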

Returning to our discussion of the ATT, how do we estimate this quantity? Since we cannot directly observe T for any given individual i, how do we measure E[Y1 − Y0|X = 1] for some population of i’s? One idea: we could compare E[Y|X = 1] and E[Y|X = 0] to form T̃ = E[Y|X = 1] − E[Y|X = 0]. For example, let X be the cholesterol treatment and Y be a measure of serum cholesterol level. We could compare cholesterol levels among those taking the treatment (E[Y|X = 1]) versus those not taking the treatment (E[Y|X = 0]) to estimate the causal effect of the treatment on cholesterol levels. Is this a good idea?

A moment’s thought should suggest that T̃ is not a good estimator of T∗. The problem is that people who take the cholesterol treatment are likely to have abnormally high cholesterol, whereas those who do not take the treatment are likely to have normal cholesterol levels. Thus, even if the treatment lowered cholesterol, we might erroneously conclude the opposite, because our comparison group (X = 0) had low cholesterol to begin with, whereas our treatment group (X = 1) had abnormally high cholesterol, and may still have above-average cholesterol even if the treatment lowered it somewhat.
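This selection bias can be demonstrated with a small simulation in which the drug truly lowers cholesterol by 20 points, yet the naive treated-vs-untreated contrast comes out with the wrong sign (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

y0 = rng.normal(200.0, 30.0, n)    # cholesterol without treatment
y1 = y0 - 20.0                     # treatment truly lowers cholesterol by 20
x = (y0 > 230.0).astype(int)       # only high-cholesterol people take the drug
y = y1 * x + y0 * (1 - x)          # observed outcome

t_naive = y[x == 1].mean() - y[x == 0].mean()
# The treated group still averages far above the untreated group,
# so t_naive is large and positive even though the true effect is -20.
```

The naive contrast mistakes the pre-existing difference between the groups for an effect of the treatment.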

So, if T̃ is not a good measure of T∗, what would a good comparison look like? We need to find treatment and control populations that have the same expected levels of cholesterol but for the treatment. Formally, we want to identify a set of people for whom the counterfactual outcomes are comparable between the treatment and comparison (AKA control) groups.

Specifically:

E[Y1|X = 1] = E[Y1|X = 0]

E[Y0|X = 1] = E[Y0|X = 0]

(We denote this pair of equalities as equation (1).)

These equalities imply that the treatment and control groups are ‘exchangeable’: if we swapped the treatment and control groups prior to the experiment, we’d estimate the same treatment effect as we’d get from the initial assignment. If these conditions are satisfied, then it’s straightforward to see that a contrast of the outcomes of the treatment and control groups provides a valid estimate of the causal effect of treatment for the treated group. Specifically, E[Y1|X = 1] − E[Y0|X = 0] = E[Y1|X = 1] − E[Y0|X = 1] = T∗.

Notice that our substitution above of E[Y0|X = 1] for E[Y0|X = 0] is justified by the assumption of treatment-control balance in equation (1). If the subjects who didn’t receive the treatment are just like those who did, but for not having received the treatment, then the contrast between the treated and untreated groups provides an unbiased estimate of the causal effect of the treatment on the treated group (the ATT).
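Random assignment delivers the balance required above by construction: whether a unit is treated carries no information about its potential outcomes. Continuing the same invented cholesterol numbers, the simple treated-vs-control contrast now recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

y0 = rng.normal(200.0, 30.0, n)
y1 = y0 - 20.0                     # true treatment effect: -20
x = rng.integers(0, 2, n)          # coin-flip assignment -> exchangeable groups
y = y1 * x + y0 * (1 - x)

t_hat = y[x == 1].mean() - y[x == 0].mean()   # close to -20 in large samples
```

With selection on cholesterol replaced by a coin flip, the two groups have the same expected Y0 and Y1, so the contrast is an unbiased estimate of the treatment effect.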

Method: Difference-in-Difference Estimation

Often, we don’t simply measure the level of Y but its change as a function of X (the treatment) and time. For example, if we have a treatment group j and a control group k, we can form:

Group       Before   After   Change
Treatment   Yjb      Yja     ΔYj
Control     Ykb      Yka     ΔYk

(where b stands for “before” and a stands for “after”)

Why do we want to make a pre-post comparison? We actually do not need to do this if we have a very large population of (randomly assigned) treatment and control units to work with. In that case, we could simply calculate T̂ = E[Y|X = 1] − E[Y|X = 0] = E[Y1 − Y0|X = 1].

If X is randomly assigned and the population of treated units is large, then (1) should apply and hence the cross-sectional (as opposed to over-time) comparison should provide a valid estimate of the causal effect of interest. However, we often don’t have very large samples of treatment and control individuals to work with. Let’s say we are assessing the effect of a new drug treatment on cholesterol levels. We could pick 10 people each for the treatment and control groups, give the treatment group the drug treatment and the control group the placebo, and then compare the average cholesterol level between these two groups. There is nothing wrong with this approach. But we might be concerned that, just by chance, these two groups started out with somewhat different cholesterol levels.

Because of this concern, we could also take baseline data (prior to treatment) to ensure that these groups were comparable. Let’s say the baseline averages were comparable but not identical; by chance, the treatment group had a slightly lower cholesterol level than the control group. We’d be concerned that our experiment would be biased in favor of finding that the treatment lowered cholesterol (since the treatment group started with a better outcome). It’s that concern that motivates us to compare the change in cholesterol in the treatment group to the change in cholesterol in the control group. By studying the change in the outcome variable, we subtract off initial differences in levels that could potentially prove confounding in small samples. Thus, we focus on the improvement (or change) in the treatment group relative to the control group.

Formally, let’s say that prior to treatment, we observe:

Yjb = αj

Ykb = αk

We would hope that αj ≃ αk, but this does not strictly have to be the case. Now, imagine that after treatment, we observe Yja = αj + δj + T, where T is the causal effect and δj is any effect of time on group j. For example, cholesterol levels may tend to rise over time as people age. So, if we take the first difference for Yj, we get: ΔYj = Yja − Yjb = (αj − αj) + δj + T = δj + T

This does not recover T by itself, but it does remove the “level effect” αj. Similarly, ΔYk = Yka − Ykb = (αk − αk) + δk = δk, so differencing removes the level effect for each group. If we are willing to postulate that the time effect operates identically on the treatment and control groups (often called the common-trends assumption), δj = δk = δ, then we have ΔYj − ΔYk = (δ + T) − δ = T.

So, the difference-in-difference estimator allows us to potentially recover the causal effect of treatment even when the treatment and control groups are not entirely identical and when there is a potentially confounding effect of time.
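Putting the pieces together, here is a sketch of the difference-in-difference computation under the assumptions above (the group baselines, trend, and effect are all invented, and the common time effect δ holds by construction):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10                                # small groups, as in the example

true_t = -20.0                        # causal effect T (assumed)
delta = 5.0                           # time effect (assumed identical across groups)

alpha_j = rng.normal(195.0, 10.0, n)  # treatment-group baselines (lower by chance)
alpha_k = rng.normal(205.0, 10.0, n)  # control-group baselines

yj_before, yk_before = alpha_j, alpha_k
yj_after = alpha_j + delta + true_t   # treated: level + trend + effect
yk_after = alpha_k + delta            # control: level + trend only

# Difference-in-difference: mean(ΔYj) - mean(ΔYk) = (δ + T) - δ = T
did = (yj_after - yj_before).mean() - (yk_after - yk_before).mean()
```

Note that a simple after-period comparison, `yj_after.mean() - yk_after.mean()`, would mix the true effect with the chance baseline gap between the groups; differencing strips that gap out.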