Quantifying Error in the Presence of Confounders for Causal Inference

  • 2019-07-10 15:53:07
  • Rathin Desai, Amit Sharma


 


Rathin Desai, Amit Sharma
Microsoft Research India
Bangalore, Karnataka
Abstract

Estimating average causal effect (ACE) is useful whenever we want to know the effect of an intervention on a given outcome. In the absence of a randomized experiment, many methods such as stratification and inverse propensity weighting have been proposed to estimate ACE. However, it is hard to know which method is optimal for a given dataset or which hyperparameters to use for a chosen method. To this end, we provide a framework to characterize the loss of a causal inference method against the true ACE, by framing causal inference as a representation learning problem. We show that many popular methods, including back-door methods can be considered as weighting or representation learning algorithms, and provide general error bounds for their causal estimates. In addition, we consider the case when unobserved variables can confound the causal estimate and extend proposed bounds using principles of robust statistics, considering confounding as contamination under the Huber contamination model. These bounds are also estimable; as an example, we provide empirical bounds for the Inverse Propensity Weighting (IPW) estimator and show how the bounds can be used to optimize the threshold of clipping extreme propensity scores. Our work provides a new way to reason about competing estimators, and opens up the potential of deriving new methods by minimizing the proposed error bounds.

 


Preprint. Under review.

1 Introduction

Consider the canonical causal inference problem where the goal is to find the effect of a treatment T on some outcome Y, as shown in the structural causal model in Figure 1(a). This is relevant for estimating the effect of any fixed intervention, such as setting a system parameter, a medical intervention [3], or a policy in social science settings [12]. Here W and U are observed and unobserved common causes respectively, which affect the observed conditional distribution Pr(Y|T). To estimate the causal effect of T, methods typically condition on the observed common causes W using the “back-door” formula [13], including methods such as stratification [10], matching [17], and inverse weighting [16]. All of these methods work under the “ignorability” or “selection on observables” assumption, where U is assumed to have no effect once we condition on W (i.e., Pr(Y|T,W)=Pr(Y|T,W,U)). In practice, however, ignorability is seldom satisfied and its violation can lead to significant errors, even changing the direction of the effect estimate. Because U is unobserved, current methods provide no bounds on the error in a causal effect estimate when the assumption is violated. This makes it hard to compare methods for a given dataset, or to assess the sensitivity of an estimate to unobserved confounding, except by simplistic simulations of the effect of U [15].

In this paper, we provide a general framework for estimating error for causal inference methods, both in the presence and absence of U. Our insight is that the causal inference problem can be framed as a domain adaptation [11] problem, where the target distribution is generated from a (hypothetical) randomized experiment on T, as shown in Figure 1(b). Under this target distribution P, the observed effect P(Y|T) is the same as the causal effect P(Y|do(T)), since T is no longer affected by W or U [13]. The goal of causal inference then is to use data from a source distribution Q and estimate a function that approximates P(Y|T). Alternatively, one can consider this as a task of learning an intermediate distribution R (or a representation), such that R(Y|T) will be as close as possible to P(Y|T). Using the lens of domain adaptation [11], we provide bounds on the error of such estimators for the average causal effect (ACE), based on the distance (bias) of the intermediate distribution R from P and the variance in estimating it. In particular, we show that many causal inference methods such as stratification and inverse propensity weighting (IPW) can be considered as learning an intermediate representation.

When U is ignorable, we provide bounds that separate out the effects of bias and variance in choosing R and derive a procedure to estimate them from data. Empirical simulations show the value of the proposed error bound in evaluating different intermediate representations, and correspondingly, causal inference algorithms. For instance, our bound can be used to select the optimal threshold for clipping extreme probabilities (a common technique in weighting algorithms such as IPW) in order to minimize error. When U is not ignorable, we utilize theory from robust estimators to characterize U’s effect on Y. The intuition is that the confounding effect of U on Y can be considered as contamination (noise) added to the true function between T and Y. In addition, we assume that this noise affects only a fraction of input data rows. Such an assumption is plausible whenever the effect of U is specific to certain units; for example, unobserved genes may only affect some people’s health outcome and be ignorable for other people. We use the Huber contamination model [4] to model this noise, provide a robust estimator for the causal effect [8], and bound its error under the assumption that U only affects a fraction of all outcomes Y. When such an assumption is not plausible, the bounds still allow us to study the sensitivity of the error as the amount of contamination (confounding) by U is changed. Overall, our error bounds on causal estimators provide a principled way to compare different estimators and conduct sensitivity analysis of causal estimates with minimal parametric assumptions.

Figure 1: Causal graphical models denoting source and target distributions in the presence and absence of unobserved confounders U. Panels: (a) source distribution Q; (b) target distribution P; (c) source distribution; (d) target distribution.

2 Background & Contributions

2.1 Defining predictive and causal effect

We first define the average causal effect (ACE) and show its connection to the average predictive effect. Let V={W,T,Y} be the set of observed variables. T represents the treatment variable and Y the outcome variable. W represents the set of all observed common causes of T and Y, and U denotes the set of all unobserved common causes of T and Y. Throughout, we assume that the treatment is binary, $T \in \{0,1\}$, where T=1 denotes that the treatment was assigned and T=0 that it was not. Y and W can be discrete or continuous. We make no assumptions about U. Figure 1(a) shows this observed data distribution as the source distribution Q, using the structural causal graph notation [13]. Vertices represent variables and edges represent the potential causal links between these variables.

Under the source distribution Q, we define the average predictive effect (APE) of T on Y as:

$$\mathrm{APE}_Q = E_Q[Y \mid T=1] - E_Q[Y \mid T=0] \tag{1}$$

Intuitively, APE captures the correlation between T and Y. In general, correlation is not a sufficient condition to imply that the treatment actually caused the observed outcome. This is because, by Reichenbach’s common cause principle, if two random variables T and Y are statistically dependent ($T \not\perp Y$), the dependence may be explained by a third variable, say W, that causally influences both. Thus, using the do-operator [13], we can write the average causal effect of T on Y as,

$$\mathrm{ACE} := E[Y \mid do(T=1)] - E[Y \mid do(T=0)] \tag{2}$$

where the do(T=1) operator denotes setting the value of T=1 independent of all ancestors of T in the causal graph. A randomized experiment, where one randomizes T and then observes the effect on Y, is one way of estimating the ACE. Due to randomization, any effect of W or U on T is wiped out and thus the method is considered a “gold standard” for ACE. Effectively, randomization constructs a new distribution P in which there are no back-door paths that confound the effect of T on Y, and thus the average predictive effect equals the ACE (formally, due to Rule 2 of do-calculus [13]). We call this the target distribution P and write:

$$\mathrm{APE}_P = E_P[Y \mid T=1] - E_P[Y \mid T=0] = E[Y \mid do(T=1)] - E[Y \mid do(T=0)] = \mathrm{ACE} \tag{3}$$

2.2 Causal inference methods

Without a randomized experiment, however, the ACE cannot be identified from observational data from Q. Methods for causal inference typically make the ignorability assumption, implying that U does not have any additional effect after conditioning on W. That is, Pr(Y|T,W)=Pr(Y|T,W,U). In graphical language, conditioning on W “d-separates” T and Y in a modified graph that has no outgoing edges from T. Under this assumption, various methods have been proposed using the ideas of conditioning or weighting; for a review see [3, 15, 17].

In conditioning-based methods, we separate data into strata based on W, estimate the predictive effect in each stratum (which is equal to the causal effect), and then use the back-door formula [13] to aggregate the estimates. This method is called stratification (or matching when each stratum is of size 1).

$$Q(Y \mid W, T=1) = P(Y \mid W, T=1) \;\Rightarrow\; E(Y \mid do(T=1)) = \sum_{W} E_Q(Y \mid W, T=1)\, Q(W) \tag{4}$$
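As an illustration of Equation 4, the following minimal sketch computes the stratified (back-door) estimate for a single discrete covariate. The function names `stratified_mean` and `stratified_ace` are assumptions made for this example and do not come from the paper; the sketch also assumes that every stratum contains units from both treatment arms.

```python
import numpy as np


def stratified_mean(y, t, w, arm):
    """Back-door estimate of E[Y | do(T=arm)] for one discrete covariate w."""
    y, t, w = (np.asarray(a) for a in (y, t, w))
    total = 0.0
    for value in np.unique(w):
        in_stratum = w == value
        in_arm = in_stratum & (t == arm)
        if not in_arm.any():
            raise ValueError(f"no units with T={arm} in stratum W={value}")
        # E_Q[Y | W=value, T=arm] weighted by the stratum frequency Q(W=value)
        total += y[in_arm].mean() * in_stratum.mean()
    return total


def stratified_ace(y, t, w):
    """Difference of the two stratified means (hypothetical helper)."""
    return stratified_mean(y, t, w, 1) - stratified_mean(y, t, w, 0)
```

With continuous or high-dimensional W, the strata would have to be defined by binning or by a propensity score, which this sketch does not attempt.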

Alternatively, in weighting-based methods, one can weight samples from the source data Q to resemble a sample from P. In other words, we ensure that the treatment assignment probability Q(T=1|W) matches the target distribution P as far as possible. This is achieved using importance sampling, or a common variant called inverse propensity weighting, where each sample point’s weight is inversely proportional to its probability of occurrence in the data. This weighting gives more weight to samples that occur infrequently due to the effect of W, thus compensating for selection bias in Q(T=1). Assuming n is the number of samples from Q, we write:

$$\widehat{\mathrm{IPW}} = \frac{1}{n}\left(\sum_{i=1}^{n} \frac{T_i\, Y_i}{Q(T_i=1 \mid W_i)} - \sum_{i=1}^{n} \frac{(1-T_i)\, Y_i}{1 - Q(T_i=1 \mid W_i)}\right) \tag{5}$$
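Equation 5 can be sketched in a few lines. The function name `ipw_ace` is an assumption for this example; the propensity scores Q(T=1|W) are taken as given here, whereas in practice they would be estimated, for instance by logistic regression.

```python
import numpy as np


def ipw_ace(y, t, propensity):
    """Inverse propensity weighting estimate of the ACE, following Equation 5.

    propensity[i] stands for Q(T=1 | W_i) and must lie strictly in (0, 1).
    """
    y, t, e = (np.asarray(a, dtype=float) for a in (y, t, propensity))
    if np.any((e <= 0.0) | (e >= 1.0)):
        raise ValueError("propensity scores must lie strictly between 0 and 1")
    treated_sum = np.sum(t * y / e)                  # first sum in Equation 5
    control_sum = np.sum((1.0 - t) * y / (1.0 - e))  # second sum in Equation 5
    return (treated_sum - control_sum) / len(y)
```

Scores close to 0 or 1 produce very large weights, which is the source of the variance that clipping is meant to control.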

2.3 Our contributions

We make the following contributions:

  • Using the relationship between APE and ACE, we formulate causal inference as the problem of learning a representation R such that $\mathrm{APE}_R$ approximates $\mathrm{APE}_P$. Specifically, we use a probability weighting method to construct a representation R, and show that popular methods such as stratification and IPW are special cases of the weighting method. (Section 3)

  • We provide bounds for the loss in estimating ACE as $\mathrm{APE}_R$ and separate out the loss incurred due to bias and variance in selecting R. We apply these bounds to develop a data-driven method for selecting the clipping threshold of an IPW estimator. (Section 4)

  • When unobserved confounders U may be present, we extend these bounds using recent work in robust estimation and provide the first results that can characterize error in the presence of unobserved confounding. (Section 5)

3 Causal Inference as Representation Learning

As discussed above, the problem of estimating ACE can be considered as learning the target distribution P given data from Q and then estimating the observed conditional expectation $E_P[Y \mid T]$. Q can be considered as the factual distribution, and P the counterfactual distribution corresponding to the question: what would have happened if we intervened on T without changing anything else? Our goal is to learn an intermediate distribution R that approximates P. This setup is similar to domain adaptation, except that instead of learning a function f as in Mansour et al. [11], we learn a new representation of the data and estimate the same APE function.

3.1 Defining the weighting method

Given this formulation, a key question is how to generate a representation such that its APE will be close to ACE. We first define a consistent estimator for APE under any distribution R, $\hat{h}(x_R)$. Then, the estimator ($\hat{h}$) and the APE under infinite samples ($h$) can be written as:

$$\hat{h}(x_R) = \hat{E}_R[Y \mid T=1] - \hat{E}_R[Y \mid T=0]; \qquad h(x_R) = E_R[Y \mid T=1] - E_R[Y \mid T=0] \tag{6}$$

By the above definition of h, $\mathrm{ACE} = h(x_P)$. Next, we define a class of distributions given by weighting of Q. Following Johansson et al. [6], we generate a weighted representation R from our source distribution Q such that $\hat{h}(x_R)$ is an estimator for ACE.

Definition 3.1.

Let Q(W,T,Y) be the source distribution. We define a weighting function β(W,T) to generate a representation R such that,

$$\beta(W,T) = \frac{R(W \mid T)}{Q(W \mid T)} = \frac{R(T \mid W)}{Q(T \mid W)}$$

and that R is a valid probability distribution: $\forall\, W,T:\ R(W,T) \ge 0$ and $\sum_{W}\sum_{T} R(W,T) = 1$.

3.2 IPW and stratification as weighting methods

We now show that the IPW estimator and back-door methods such as stratification can be considered as a weighted β estimator.

Theorem 3.2.

Consider the causal graphical model in Figure 1(c), where the observed common causes W are the only confounders. The IPW estimator can be written as a representation R where $\beta(W,T=t) = \frac{R(W \mid T=t)}{Q(W \mid T=t)} = \frac{Q(W)}{Q(W \mid T=t)}$.

Proof.

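Here we consider only the T=1 part of the IPW estimator from the RHS of Equation 5. The proof is symmetric for T=0.

$$\mathrm{IPW}_{T=1} = \frac{1}{n}\sum_{i=1}^{n} \frac{T_i\, Y_i}{Q(T_i=1 \mid W_i)} = E_Q\!\left[\frac{\mathbb{1}_{T=1}\, Y}{Q(T=1 \mid W)}\right] = \sum_{W} \frac{E_Q[Y \mid T=1, W]\, Q(T=1 \mid W)\, Q(W)}{Q(T=1 \mid W)} \tag{7}$$

$$= \sum_{W}\sum_{Y} \frac{Y\, Q(Y \mid T=1, W)\, Q(T=1 \mid W)\, Q(W)}{Q(T=1 \mid W)} = \sum_{W}\sum_{Y} Y\, Q(Y \mid T=1, W)\, Q(W) \tag{8}$$

where $\mathbb{1}_{T=1}$ is an indicator function that is 1 whenever T is 1 and 0 otherwise. Expanding the expectation over (W, T) uses that T is binary, so the indicator keeps only the T=1 term.

Similarly, we can write the first part (T=1) of the APE under R as:

$$\mathrm{APE}^{R}_{T=1} = \sum_{Y} Y\, R(Y \mid T=1) = \sum_{W}\sum_{Y} Y\, R(Y \mid T=1, W)\, R(W \mid T=1) = \sum_{W}\sum_{Y} Y\, Q(Y \mid T=1, W)\, R(W \mid T=1)$$

where the last equality holds since Q(Y|T=1,W)=R(Y|T=1,W) (ignorability assumption from Equation 4). Further, using $\beta(W,T=1)\, Q(W \mid T=1) = R(W \mid T=1)$ (by definition),

$$\mathrm{APE}^{R}_{T=1} = \sum_{W}\sum_{Y} Y\, Q(Y \mid T=1, W)\, \beta(W,T=1)\, Q(W \mid T=1)$$

Comparing the two terms for $\mathrm{IPW}_{T=1}$ and $\mathrm{APE}^{R}_{T=1}$, if $\beta(W,T=1) = \frac{Q(W)}{Q(W \mid T=1)}$, then $\mathrm{IPW}_{T=1} = \mathrm{APE}^{R}_{T=1}$. ∎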

The above proof also shows the equivalence of IPW and back-door-based stratification [3]. Under the conditions of Theorem 3.2, and using $\sum_{W} E_Q(Y \mid W, T=1)\, Q(W) = \sum_{W}\sum_{Y} Y\, Q(Y \mid T=1, W)\, Q(W)$, we have:

Corollary 3.2.1.

The stratification estimator from Equation 4, $\sum_{W} E_Q(Y \mid W, T=1)\, Q(W)$, is equivalent to Equation 8 and thus is also a weighting method with $\beta(W,T=1) = \frac{Q(W)}{Q(W \mid T=1)}$.
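The equivalence stated in Theorem 3.2 and Corollary 3.2.1 can be checked numerically for a discrete W when all probabilities are replaced by empirical frequencies. The sketch below is only an illustration: the data-generating numbers and all variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
w = rng.integers(0, 3, size=n)                      # one discrete covariate with 3 levels
t = rng.binomial(1, np.array([0.2, 0.5, 0.8])[w])   # treatment probability depends on w
y = rng.binomial(1, 0.3 + 0.2 * t + 0.1 * w)        # outcome depends on t and w

levels = np.unique(w)
q_w = np.array([np.mean(w == v) for v in levels])                   # Q(W)
q_t1_given_w = np.array([t[w == v].mean() for v in levels])         # Q(T=1 | W)
q_w_given_t1 = np.array([np.mean(w[t == 1] == v) for v in levels])  # Q(W | T=1)
mean_y = np.array([y[(w == v) & (t == 1)].mean() for v in levels])  # E_Q[Y | T=1, W]

# T=1 part of the IPW estimator (Equation 5) with empirical propensities
ipw_t1 = np.mean(t * y / q_t1_given_w[w])

# Stratification estimator (Equation 4)
strat_t1 = np.sum(mean_y * q_w)

# Weighted representation: beta(W, T=1) = Q(W) / Q(W | T=1), averaged over treated units
beta = q_w / q_w_given_t1
weighted_t1 = np.mean(beta[w[t == 1]] * y[t == 1])
```

The three quantities `ipw_t1`, `strat_t1` and `weighted_t1` agree up to floating-point error, because each reduces to the sum over strata of the stratum frequency times the treated outcome mean.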

4 Bounds for ACE without unobserved confounders

Let us first consider a setting where the latent confounder U has no effect on T or Y. That is, the treatment T is assigned to a unit according to only the observed covariates W (shown in Figure 1(c)).

Based on this assumption, we showed that a causal inference method can be characterized by the weighted distribution R that it outputs. We now provide error bounds based on a given distribution R. We use a setup similar to that of Mansour et al. [11], where the loss function L is assumed to be symmetric and to obey the triangle inequality. Common loss functions such as the L1 and L2 loss satisfy these properties. We are interested in the loss between an estimated effect $\hat{h}(x_R)$ and the ACE, $h(x_P)$. If the loss function is assumed to be L1, the loss can be defined as: $L(\hat{h}(x_R), h(x_P)) := |\hat{h}(x_R) - h(x_P)|$.

4.1 Loss Bound: A tradeoff between bias and variance

Before we state the loss bounds, we define two terms that characterize the loss. Intuitively, if R is chosen to be similar to Q ($\beta \approx 1$), then $\hat{h}(x_R)$ will have low sample variance as the weights will be bounded, but high bias since $h(x_R)$ may be very different from the ACE, $h(x_P)$. Conversely, if we choose R to be close to P, then $\hat{h}(x_R)$ will have low bias, but possibly high variance as the β weights can be large. Thus, for any R, the error is a combination of two factors: the bias in choosing R, and the variance in estimating $h(x_R)$.

To capture the error due to bias, we define a weighted L1 distance between R and P.

Definition 4.1.

(Weighted L1 Distance) Assume R and P are distributions over W,T,Y. We define the weighted L1 distance (WLD) between R and P as follows:

$$\mathrm{WLD}_{T=t}(R,P) = \sum_{W} \big(R(W \mid T=t) - P(W \mid T=t)\big)\, E_Q[Y \mid T=t, W] \tag{9}$$

We also define a VR term due to variance in estimation: $\mathrm{VR}_{T=t} = \alpha_{T=t}(\hat{Q},\hat{\beta}) - \alpha_{T=t}(Q,\beta)$.

Definition 4.2.

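(Sample Error Terms) Define

$$\alpha_{T=t}(\hat{Q},\hat{\beta}) := \sum_{W} \hat{\beta}\, \hat{Q}(W \mid T=t) \sum_{Y} Y\, \hat{R}(Y \mid T=t, W) \tag{10}$$

Using the same notation, the population α is defined as

$$\alpha_{T=t}(Q,\beta) := \sum_{W} \beta\, Q(W \mid T=t) \sum_{Y} Y\, R(Y \mid T=t, W) \tag{11}$$

Note 4.3.

The causal mechanism does not change across the distributions P, Q and R, which means P(Y|T,W)=Q(Y|T,W)=R(Y|T,W).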

For ease of exposition, we’ll assume the loss function is L1. We have the following result.

Theorem 4.4.

Assume that the loss function L is symmetric and obeys the triangle inequality. Let h be a function on a representation R such that $h(x_R) = E_R[Y \mid T=1] - E_R[Y \mid T=0]$. Then, for any valid weighted representation R, if there are no unobserved confounders and L=L1:

$$L(\hat{h}(x_R), h(x_P)) \le |\alpha_{T=1}(\hat{Q},\hat{\beta}) - \alpha_{T=1}(Q,\beta)| + |\alpha_{T=0}(\hat{Q},\hat{\beta}) - \alpha_{T=0}(Q,\beta)| + |\mathrm{WLD}_{T=1}(\beta Q, P)| + |\mathrm{WLD}_{T=0}(\beta Q, P)|$$

The proof is in Supplementary Materials.

4.2 Estimating the loss bound from observed data

Given a causal inference algorithm (as defined by its weights β), we now describe how to estimate these bounds from data.

Estimating VR term

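For the VR term, we use McDiarmid’s inequality [14]. We can rewrite $\alpha_{T=t}$ as:

$$\sum_{W} \beta\, Q(W \mid T=t) \sum_{Y} Y\, R(Y \mid T=t, W) = \sum_{W}\sum_{Y} \beta\, Y\, Q(Y, W \mid T=t) = \mathbb{E}_{Q(Y,W \mid T=t)}[\beta Y] \tag{12}$$

where we used that R(Y|T=t,W)=Q(Y|T=t,W). Thus $\alpha_{T=t}$ is an expected value, and $\mathrm{VR}_{T=t}$ is the deviation of its sample estimate from that expectation. The estimate $\hat{\alpha}_{T=t}$ can be written as $\frac{1}{N_{T=t}} \sum_{i=1}^{N_{T=t}} \hat{\beta}_i Y_i$. Since $g(X) = Y\beta$ is a function of i.i.d. samples X=(W,T,Y), we can apply McDiarmid’s inequality,

$$\Pr\big[g(X^n) - \mathbb{E}(g(X^n)) \le t\big] \ge 1 - \exp\!\left(\frac{-2t^2}{\sum_{i}^{n} c_i^2}\right)$$

where $c_i$ is the maximum change in $g(X^n)$ after replacing $X_i$ with another value $X_i'$. We compute a data-dependent bound for each $c_i$ by considering all possible discrete values for $X_i$ and computing the resultant difference in g. We provide the code to estimate $c_i$ in github/anonymizedcode.

Fixing the RHS as p, we obtain $t = \sqrt{\frac{\sum_{i}^{n} c_i^2\, \log\frac{1}{1-p}}{2}}$. Thus, we can estimate the difference $\mathrm{VR}_{T=1}$ as

$$\text{With } \Pr = p: \quad |\alpha_{T=1} - \hat{\alpha}_{T=1}| \le \sqrt{\frac{\sum_{i}^{n} c_i^2\, \log\frac{1}{1-p}}{2}} \tag{13}$$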
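The radius in Equation 13 is straightforward to evaluate once the bounded differences $c_i$ are available. The sketch below uses a hypothetical helper name, `mcdiarmid_radius`, and takes the $c_i$ as an input; it does not reproduce the data-dependent procedure for estimating them.

```python
import math


def mcdiarmid_radius(c, p):
    """Deviation t from Equation 13 for bounded differences c_i and confidence p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    sum_sq = sum(ci * ci for ci in c)
    return math.sqrt(sum_sq * math.log(1.0 / (1.0 - p)) / 2.0)


# Simplifying assumption for illustration only: if Y lies in [0, 1] and every
# weight is at most beta_max, replacing one of N samples changes the weighted
# mean by at most beta_max / N, so c_i = beta_max / N for all i.
beta_max, N = 10.0, 1000
radius = mcdiarmid_radius([beta_max / N] * N, p=0.95)
```

Under that simplification the radius scales as beta_max divided by the square root of N, which makes the variance cost of large weights explicit.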

Estimating WLD

For some estimators like IPW, we can prove that they are unbiased and hence $\mathrm{WLD}_{T=t}=0$.

Lemma 4.5.

For the IPW estimator, if P(W)=Q(W)=R(W), then $L(h(x_R),h(x_P))=0$.

The proof is in Supplementary Materials. For other estimators, our estimation depends on assuming that Q(T=1|W) is bounded within [ρ,1-ρ] for some sufficiently small ρ. The intuition is that the assignment of T depends on W, but for every W=w there is a minimum probability that T=1 or T=0. This assumption can be stated as “no extreme selection based on W” and is a generalization of the overlap assumption [19], a requirement for IPW and other causal inference methods. Under this assumption, WLD can be written as:

$$\mathrm{WLD}_{T=t}(R,P) = \sum_{W} \big(R(W \mid T=t) - P(W \mid T=t)\big)\, E_Q[Y \mid T=t, W] \tag{14}$$

$$= \sum_{W}\sum_{Y} \big(R(W \mid T=t) - P(W)\big)\, Y\, Q(Y \mid T=t, W) \tag{15}$$

$$= \sum_{W}\sum_{Y} Y\, R(W \mid T=t)\, Q(Y \mid T=t, W) - \sum_{W}\sum_{Y} Y\, Q(W)\, Q(Y \mid T=t, W) \tag{16}$$

$$= \sum_{W}\sum_{Y} Y\, \beta\, Q(Y, W \mid T=t) - \sum_{W}\sum_{Y} Y\, \beta^{*}\, Q(W \mid T=t)\, Q(Y \mid T=t, W) \tag{17}$$

where Equations 15 and 16 use P(W|T=t)=P(W)=Q(W) and Equation 17 uses the definition of β from above. Here β corresponds to a causal inference method given by the representation R, and $\beta^{*}$ corresponds to unbiased IPW weights, estimated by using IPW and then clipping the estimated propensity scores $\hat{Q}(T=1 \mid W)$ to the interval [ρ,1-ρ] (assuming bounded Q(T=1|W)). For t=1, the first term of Equation 17 can be estimated as $\mathbb{E}_Q(Y\beta \mid T=1)$ and the second term as $\mathbb{E}_Q(Y\beta^{*} \mid T=1)$. We show applications of estimating these bounds in Section 6.
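A rough sketch of how the two terms of Equation 17 might be estimated for a clipped IPW estimator is given below. It relies on one step that is not spelled out in the text: by Bayes’ rule, Q(W)/Q(W|T=1) = Q(T=1)/Q(T=1|W), so the IPW weight can be computed from the propensity score. The function names `ipw_beta` and `wld_treated`, and the choice to clip at `clip` for the method under study and at `rho` for the reference weights, are assumptions for this example.

```python
import numpy as np


def ipw_beta(propensity, treated_rate, lower):
    """beta(W, T=1) = Q(T=1) / Q(T=1 | W), with the propensity clipped to [lower, 1 - lower]."""
    return treated_rate / np.clip(propensity, lower, 1.0 - lower)


def wld_treated(y, t, propensity, clip, rho=0.01):
    """Estimate WLD for T=1 as E_Q(Y * beta | T=1) - E_Q(Y * beta_star | T=1)."""
    y, t, e = (np.asarray(a, dtype=float) for a in (y, t, propensity))
    treated = t == 1
    treated_rate = treated.mean()                          # empirical Q(T=1)
    beta = ipw_beta(e[treated], treated_rate, clip)        # weights of the method being evaluated
    beta_star = ipw_beta(e[treated], treated_rate, rho)    # reference weights under the [rho, 1 - rho] bound
    return np.mean(y[treated] * (beta - beta_star))
```

When `clip` equals `rho` the estimate is zero; raising `clip` shrinks the largest weights and moves the estimate away from zero, which is the bias side of the tradeoff.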

5 Bounds for ACE with unobserved confounders

We now provide bounds for the general case of causal inference in the presence of unobserved confounders. Let V={W,T,Y,U}, where W,T,Y are the same as before, but U is introduced.

Our insight is that principles of robust statistics can be used to bound the loss due to confounding by U. Let us consider the example from Section 1 where U are unobserved genes that affect the outcome Y as well as the choice of treatment T. In many cases, it can be reasonable to assume that U will affect the outcome Y for only a subset of the population (especially so when the outcome has discrete levels). Specifically, we assume that U does not change the outcomes for all the units, but only for a fraction η of units. This assumption can be written in terms of the Huber contamination model [4], where U’s effect is the contamination in the observed Y. Formally, we can write,

$$Y \sim (1-\eta)\, Q(Y \mid T, W) + \eta\, Q(Y \mid T, W, U)$$

where η is the contaminated fraction of samples. Further, we assume U to be adversarial in nature as described in [8], i.e., U is allowed to observe the values of W and T and change the value of Y accordingly.
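To make the contamination model concrete, the sketch below corrupts a fraction η of outcomes and compares the sample mean with a simple robust alternative. The trimmed mean is swapped in purely for illustration: it is not the agnostic mean estimator of Lai et al., and the helper names, the outlier value and the sample size are all assumptions.

```python
import numpy as np


def contaminate(clean, eta, outlier_value, rng):
    """Replace a fraction eta of the outcomes with an arbitrary value (the effect of U)."""
    y = clean.copy()
    n_bad = int(round(eta * len(y)))
    bad = rng.choice(len(y), size=n_bad, replace=False)
    y[bad] = outlier_value
    return y


def trimmed_mean(y, eta):
    """Drop the lowest and highest eta fraction of values before averaging."""
    y = np.sort(y)
    k = int(np.ceil(eta * len(y)))
    return y[k:len(y) - k].mean()


rng = np.random.default_rng(0)
true_mean, eta = 1.0, 0.05
clean = rng.normal(true_mean, 1.0, size=5000)
y = contaminate(clean, eta, outlier_value=50.0, rng=rng)

naive_error = abs(y.mean() - true_mean)
robust_error = abs(trimmed_mean(y, eta) - true_mean)
```

The sample mean is pulled far from the true value by the contaminated rows, while the trimmed mean stays close at the cost of a small bias from discarding clean rows as well.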

Under these settings, we show that it is possible to bound $L(h(x_R),h(x_P))$ by estimating $E_W E_Y[Y \mid T,W]$ robustly and plugging in the additional error due to contamination. In effect, this amounts to a two-step procedure: learn a new representation $Q_B$ robustly from the distribution Q and then learn β(W,T) on this representation $Q_B$ (i.e., weight $Q_B$ to get R). In practice, since the bounds from Section 4 only depend on E[Y|T,W], we do not need to estimate $Q_B$ but rather just a robust estimate of the conditional means for Y|T,W. Estimating $E_{Q_B}[Y \mid T,W]$ with a robust estimator implies removing the back-door path as in Figure 1(b), and thus the error of the estimate can be bounded given a contamination fraction η. The proof proceeds in an analogous way to the previous section; we next show an application of the bound by estimating the error for the IPW estimator.

5.1 Bounds for IPW under unobserved confounding

Recall from Theorem 3.2 that $\beta(W,T) = \frac{Q(W)}{Q(W \mid T=1)}$ for IPW. To provide a concrete bound, we use the robust mean estimator from Lai et al. [8] for Y and assume that the fourth moment of Y is bounded, $E\big((Y-\mu)^4 \mid T,W\big) \le C\sigma^4$, where σ is the standard deviation and C is some constant. We assume an η fraction of contamination (confounding due to U), and ϵ is a parameter for the running time of the robust mean algorithm.

Note 5.1.

Define $\gamma := \sum_{W} Q(W_i)\; O\big(C^{1/4}(\eta+\epsilon)^{3/4}\sigma\big)$.

Theorem 5.2.

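Assume that the loss function L is symmetric and obeys the triangle inequality. Let h be a function on a representation R such that $h(x_R) = E[Y \mid T=1] - E[Y \mid T=0]$. Then, for any valid weighted representation R, if $U \neq \phi$, the following holds with probability $(1 - 1/\mathrm{poly}(n))^{2|W|}$:

$$L(\hat{h}(x_R), h(x_P)) \le \mathrm{WLD}^{W}_{T=1}(Q_B, P) + \mathrm{WLD}^{W}_{T=0}(Q_B, P) + |\alpha_{T=1}(\hat{Q}_B, \hat{\beta}_B) - \alpha_{T=1}(Q, \beta_B) - \gamma| + |\alpha_{T=0}(\hat{Q}_B, \hat{\beta}_B) - \alpha_{T=0}(Q_B, \beta_B) - \gamma|$$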

where QB is the “robust” version of the distribution Q. The proof is in Supplementary Materials.

Corollary 5.2.1.

For the IPW estimator, if P(W)=Q(W)=R(W), then $L(h(x_R),h(x_P)) = 2\gamma$.

The proof is in Supplementary Materials. Note that depending on the nature of corruption and the adversary model, different robust estimation methods can be used which may provide tighter bounds.

6 Evaluation: Applying the loss bounds

We now evaluate our bounds on simulated data and describe their utility for choosing hyperparameters for causal inference. When there are unobserved confounders, we also propose a new method, robust IPW that relies on a robust estimator.

When U is ignorable (τ=κ=0).

We generate data using the following structural equations:

$$w_j \sim \mathrm{Binomial}(p=0.5)\ \ \forall j \in [1, |W|]; \qquad u \sim \mathrm{Normal}(\mu, \sigma)$$

$$t = \mathrm{Bernoulli}\big(p = \mathrm{sigmoid}(\psi w + \kappa u)\big); \qquad y = \mathrm{Bernoulli}\big(p = \mathrm{sigmoid}(\nu w + \tau u + \lambda t)\big)$$

where $\psi, \nu \in \mathbb{R}^{M}$ and λ, τ, κ are scalars. T is always binary. We chose this formulation since the generation of T maps directly to logistic regression, which makes it easy to estimate propensity scores when computing the causal estimate. The true ATE can be obtained by simulating $y_{\text{counterfactual}}$ by setting t=1-t in the equation for y above and computing the average difference. We present results for |W|=5.
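The structural equations above can be simulated along the following lines. This is a sketch under stated assumptions rather than the authors’ code: the function name `simulate` and its defaults are hypothetical, each w_j is drawn as a single Bernoulli(0.5) trial since the number of binomial trials is not given, and the true effect is computed from the difference of outcome probabilities instead of a sampled counterfactual outcome.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def simulate(n, psi, nu, lam, tau=0.0, kappa=0.0, mu=0.0, sigma=1.0, seed=0):
    """Draw (w, t, y) from the structural equations and return the true average effect."""
    rng = np.random.default_rng(seed)
    psi = np.asarray(psi, dtype=float)
    nu = np.asarray(nu, dtype=float)
    w = rng.binomial(1, 0.5, size=(n, len(psi)))    # observed common causes
    u = rng.normal(mu, sigma, size=n)               # unobserved common cause
    t = rng.binomial(1, sigmoid(w @ psi + kappa * u))
    p_treated = sigmoid(w @ nu + tau * u + lam)     # Pr(y=1) when t is set to 1
    p_control = sigmoid(w @ nu + tau * u)           # Pr(y=1) when t is set to 0
    y = rng.binomial(1, np.where(t == 1, p_treated, p_control))
    true_effect = np.mean(p_treated - p_control)
    return w, t, y, true_effect


# Ignorable case (tau = kappa = 0) with |W| = 5
w, t, y, true_effect = simulate(n=1000, psi=[1.0] * 5, nu=[0.5] * 5, lam=1.0)
```

Setting `tau` and `kappa` to non-zero values makes U a confounder of both T and Y, which corresponds to the non-ignorable setting.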

In Figure 2, we show that the bounds correctly follow the IPW estimate over different levels of confounding by W (values of ψ), and different sample sizes. Since IPW is unbiased, the bounds effectively estimate the variance of the estimator: as ψ increases, the error bound is expected to increase. The empirical error is the L1 distance between the actual IPW estimate and the true ATE. Across sample sizes and different values of ψ, we find that the proposed bound tracks the empirical error in the IPW estimate (Figure 2).

Figure 2: L1-error bound and IPW estimate for different levels of confounding by W.
Figure 3: Choosing the clipping threshold for IPW propensity that minimizes L1-error bound.

These bounds can have practical significance in choosing hyperparameters in causal inference methods. For instance, consider the popular technique of clipping extremely high propensity scores [9] to reduce IPW variance. This introduces bias in the estimator, and an important question is how to select the clipping threshold. We estimate the loss bound for IPW under different values of the threshold (Figure 3). To estimate the WLD term, we generate the treatment T such that the true probability is bounded within [ρ,1-ρ] where ρ=0.01, as discussed in Section 4. We find that the optimal clipping threshold (the one that minimizes the loss bound), marked by the dotted vertical line, varies with sample size. The optimal threshold decreases with sample size: a higher threshold reduces the variance in smaller samples, which matters less as the sample size increases.

When U is not ignorable.

Finally, we consider the setting where there are unobserved confounders. Based on the bounds, we propose a robust version of IPW using the estimator from [8] and evaluate it for a continuous Y. For a fixed contamination (η), our proposed robust IPW recovers the true estimate up to an error, and its error increases as η is increased. Critically, the variance of the estimator is substantially lower than that of the standard IPW estimator, but it is biased. At η=0.05, for a true causal effect of 1, we obtain an error of 0.2. Details are in the Supplementary Materials.

7 Related Work

Our work is related to domain adaptation and representation learning for causal inference. In the domain adaptation problem, the goal is to learn a function that generalizes from a source distribution Q to a target distribution P. Mansour et al. [11] provided bounds for generalization of a function between distributions and proposed weighting as a technique to minimize the distance between source and target distributions. Gretton et al. [1] and Kallus [7] have also proposed methods to learn weights from data samples so that the distance between the weighted source and the target is reduced. Weighting of the source distribution can be considered as learning a representation. Based on this idea, Johansson et al. [5] proposed a domain adaptation framework to learn a counterfactual distribution from the factual distribution. The estimated counterfactual distribution is then used to evaluate the causal effect conditional on specific covariates, also known as the conditional average treatment effect (CATE).

For estimating ACE, there is a rich literature in statistics that proposes estimators based on the back-door formula, including stratification, matching and propensity score-based methods like IPW [18]. In the absence of latent confounders, error in estimating the causal effect has been well studied for estimators like IPW [16]. For instance, the Horvitz-Thompson and Hájek estimators [2] provide an unbiased variance estimate for IPW. However, none of the above methods for CATE and ACE focus on producing general error bounds, and all assume that U is ignorable.

8 Conclusion

We have provided general error bounds for any causal estimator that can be written as a weighted representation learner. The error naturally decomposes into the sampling error in estimating R and measure of distance between the weighted distribution and target distribution P. The error terms also yield important insights for developing new methods by minimizing the error bounds.

References

  • (1) Gretton, A., Smola, A. J., Huang, J., Schmittfull, M., Borgwardt, K. M., and Schölkopf, B. Covariate shift by kernel mean matching.
  • (2) Henderson, T., Anakotta, T., et al. Estimating the variance of the horvitz-thompson estimator.
  • (3) Hernán, M., and Robins, J. Causal inference book, 2015.
  • (4) Huber, P. J. Robust estimation of a location parameter. In Breakthroughs in statistics. Springer, 1992, pp. 492–518.
  • (5) Johansson, F., Shalit, U., and Sontag, D. Learning representations for counterfactual inference. In International Conference on Machine Learning (2016), pp. 3020–3029.
  • (6) Johansson, F. D., Kallus, N., Shalit, U., and Sontag, D. Learning weighted representations for generalization across designs. arXiv preprint arXiv:1802.08598 (2018).
  • (7) Kallus, N. Generalized optimal matching methods for causal inference. arXiv preprint arXiv:1612.08321 (2016).
  • (8) Lai, K. A., Rao, A. B., and Vempala, S. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on (2016), IEEE, pp. 665–674.
  • (9) Lee, B. K., Lessler, J., and Stuart, E. A. Weight trimming and propensity score weighting. PloS one 6, 3 (2011), e18174.
  • (10) Lunceford, J. K., and Davidian, M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in medicine 23, 19 (2004), 2937–2960.
  • (11) Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430 (2009).
  • (12) Morgan, S. L., and Winship, C. Counterfactuals and causal inference. Cambridge University Press, 2015.
  • (13) Pearl, J. Causality. Cambridge university press, 2009.
  • (14) Raginsky, M., Sason, I., et al. Concentration of measure inequalities in information theory, communications, and coding. Foundations and Trends® in Communications and Information Theory 10, 1-2 (2013), 1–246.
  • (15) Rosenbaum, P. R. Observational studies. In Observational studies. Springer, 2002, pp. 1–17.
  • (16) Rosenbaum, P. R., and Rubin, D. B. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 1 (1983), 41–55.
  • (17) Rubin, D. B., and Thomas, N. Matching using estimated propensity scores: relating theory to practice. Biometrics (1996), 249–264.
  • (18) Shah, B. R., Laupacis, A., Hux, J. E., and Austin, P. C. Propensity score methods gave similar results to traditional regression modeling in observational studies: a systematic review. Journal of clinical epidemiology 58, 6 (2005), 550–559.
  • (19) Shalit, U., Johansson, F. D., and Sontag, D. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 (2017), JMLR.org, pp. 3076–3085.

9 Supplementary Materials

Appendix A Proof of Theorem 4.4

Definition A.1.

(Weighted L1 Distance) Assume $R$ and $P$ are distributions over $W, T, Y$. We define the weighted L1 distance (WLD) between $R$ and $P$ as follows:

$$\mathrm{WLD}_{T=t}(R,P) = \sum_{W} \big(R(W|T=t) - P(W|T=t)\big)\, E_Q[Y|T=t,W] \tag{18}$$

Similarly, we define a VR term due to variance in estimation: $\mathrm{VR}_{T=t} = \alpha_{T=t}(\hat{Q},\hat{\beta}) - \alpha_{T=t}(Q,\beta)$.
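For a discrete $W$, the WLD of Definition A.1 is a finite sum and can be evaluated directly. The following is a minimal sketch under that assumption; the function name `weighted_l1_distance` and the toy probabilities are hypothetical and chosen only for illustration.

```python
def weighted_l1_distance(r_w_given_t, p_w_given_t, mean_y_given_tw):
    """WLD_{T=t}(R, P) = sum_w (R(w|T=t) - P(w|T=t)) * E_Q[Y | T=t, W=w].

    All three arguments are dicts keyed by the (discrete) values of W,
    for one fixed treatment value t.
    """
    support = set(r_w_given_t) | set(p_w_given_t)
    return sum(
        (r_w_given_t.get(w, 0.0) - p_w_given_t.get(w, 0.0)) * mean_y_given_tw[w]
        for w in support
    )

# Hypothetical toy inputs for a binary W and a fixed t.
r = {0: 0.5, 1: 0.5}          # R(W | T=t)
p = {0: 0.3, 1: 0.7}          # P(W | T=t)
mean_y = {0: 1.0, 1: 2.0}     # E_Q[Y | T=t, W]

# (0.5 - 0.3) * 1.0 + (0.5 - 0.7) * 2.0 = -0.2
wld = weighted_l1_distance(r, p, mean_y)
```

Note that the sum is signed; the bounds in Theorem A.5 use its absolute value, and the term vanishes whenever the two conditional distributions of $W$ coincide.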

Definition A.2.

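(Sample Error Terms) Define

$$\alpha_{T=t}(\hat{Q},\hat{\beta}) = \sum_{W} \hat{\beta}\,\hat{Q}(W|T=t) \sum_{Y} Y\,\hat{R}(Y|T=t,W) \tag{19}$$

Using the same notation, the population $\alpha$ is defined as

$$\alpha_{T=t}(Q,\beta) = \sum_{W} \beta\,Q(W|T=t) \sum_{Y} Y\,R(Y|T=t,W) \tag{20}$$

Definition A.3.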

Let $Q(W,T,Y)$ be the source distribution. We define a weighting function $\beta(W,T)$ to generate a representation $R$ such that

$$\beta(W,T) = \frac{R(W|T)}{Q(W|T)} = \frac{R(T|W)}{Q(T|W)}$$

and such that $R$ is a valid probability distribution: $\forall\, W,T:\ R(W,T) \geq 0$ and $\sum_{W}\sum_{T} R(W,T) = 1$.
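As an illustration of Definition A.3, the sketch below builds the weights for the IPW choice $R(W|T=t) = Q(W)$ (the choice stated for IPW in Appendix C) on a small discrete joint distribution, and checks that the reweighted conditional distribution is still a valid distribution. The function name `ipw_style_weights` and the joint probabilities are hypothetical; the sketch assumes every $(w,t)$ cell has positive probability.

```python
def ipw_style_weights(q_joint):
    """q_joint maps (w, t) -> Q(W=w, T=t), with every cell strictly positive.

    Returns beta[(w, t)] = Q(w) / Q(w | t), together with the marginals
    Q(W) and Q(T).
    """
    q_w, q_t = {}, {}
    for (w, t), prob in q_joint.items():
        q_w[w] = q_w.get(w, 0.0) + prob
        q_t[t] = q_t.get(t, 0.0) + prob
    beta = {}
    for (w, t), prob in q_joint.items():
        q_w_given_t = prob / q_t[t]          # Q(w | t)
        beta[(w, t)] = q_w[w] / q_w_given_t  # R(w | t) / Q(w | t) with R(w | t) = Q(w)
    return beta, q_w, q_t

# Hypothetical source distribution over binary W and T.
q_joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.4, (1, 1): 0.3}
beta, q_w, q_t = ipw_style_weights(q_joint)

# Reweighted conditional R(w | t) = beta(w, t) * Q(w | t); it should sum to 1 for each t.
r_w_given_t = {
    (w, t): beta[(w, t)] * q_joint[(w, t)] / q_t[t] for (w, t) in q_joint
}
```

Other weighting schemes fit the same template by changing the target $R(W|T)$; only the validity conditions on $R$ need to be rechecked.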

Note A.4.

The causal mechanism does not change across the distributions $P$, $Q$, $R$, which means $P(Y|T,W) = Q(Y|T,W) = R(Y|T,W)$.

For ease of exposition, we assume the loss function is $L_1$. We have the following result.

Theorem A.5.

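Assume that the loss function $L$ is symmetric and obeys the triangle inequality. Let $h$ be a function on a representation $R$ such that $h(x_R) = E_R[Y|T=1] - E_R[Y|T=0]$. Then, for any valid weighted representation $R$, if $U=\phi$ and $L=L_1$, the following holds:

$$\begin{aligned}
L(h(x_{\hat{R}}),h(x_P)) \leq{} & |\alpha_{T=1}(\hat{Q},\hat{\beta}) - \alpha_{T=1}(Q,\beta)| + |\alpha_{T=0}(\hat{Q},\hat{\beta}) - \alpha_{T=0}(Q,\beta)| \\
& + |\mathrm{WLD}_{T=1}(\beta Q,P)| + |\mathrm{WLD}_{T=0}(\beta Q,P)|
\end{aligned}$$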
Proof.

By the triangle inequality,

$$L(h(x_{\hat{R}}),h(x_P)) \leq L(h(x_{\hat{R}}),h(x_R)) + L(h(x_R),h(x_P)) \tag{21}$$

PART I.

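Consider the second term on the RHS, $L(h(x_R),h(x_P))$:

$$\begin{aligned}
&= L\big((E_R[Y|T=1] - E_R[Y|T=0]) - (E_P[Y|T=1] - E_P[Y|T=0])\big) \\
&= L\big((E_R[Y|T=1] - E_P[Y|T=1]) + (E_P[Y|T=0] - E_R[Y|T=0])\big)
\end{aligned}$$

Expanding the first term, $E_R[Y|T=1] - E_P[Y|T=1]$:

$$\begin{aligned}
&= \sum_{Y} Y \big(R(Y|T=1) - P(Y|T=1)\big) \\
&= \sum_{Y} Y \sum_{W} \big(R(Y|T=1,W)\,R(W|T=1) - P(Y|T=1,W)\,P(W|T=1)\big) \\
&= \sum_{W} \sum_{Y} Y\,R(Y|T=1,W) \big(R(W|T=1) - P(W|T=1)\big) && \text{by A.4} \\
&= \sum_{W} \big(R(W|T=1) - P(W|T=1)\big) \sum_{Y} Y\,R(Y|T=1,W) && \text{by A.4} \\
&= \sum_{W} \big(R(W|T=1) - P(W|T=1)\big) \sum_{Y} Y\,Q(Y|T=1,W) && \text{by A.4}
\end{aligned}$$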

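Now using the definition of expectation and of $\beta(W,T)$ (Definition A.3),

$$\begin{aligned}
&= \sum_{W} \big(R(W|T=1) - P(W|T=1)\big)\, E_Q[Y|T=1,W] \\
&= \sum_{W} \big(\beta(W,T=1)\,Q(W|T=1) - P(W|T=1)\big)\, E_Q[Y|T=1,W] \\
&= \mathrm{WLD}_{T=1}(\beta Q,P)
\end{aligned}$$

Similarly, expanding $E_R[Y|T=0] - E_P[Y|T=0]$ (by symmetry):

$$\begin{aligned}
&= \sum_{W} \big(\beta(W,T=0)\,Q(W|T=0) - P(W|T=0)\big)\, E_Q[Y|T=0,W] \\
&= \mathrm{WLD}_{T=0}(\beta Q,P)
\end{aligned}$$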

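PART II.

Now let us consider the first term on the RHS of Equation 21, $L(h(x_{\hat{R}}),h(x_R))$:

$$\begin{aligned}
&= L\big((E_{\hat{R}}[Y|T=1] - E_{\hat{R}}[Y|T=0]) - (E_R[Y|T=1] - E_R[Y|T=0])\big) \\
&= L\big((E_{\hat{R}}[Y|T=1] - E_R[Y|T=1]) + (E_R[Y|T=0] - E_{\hat{R}}[Y|T=0])\big)
\end{aligned}$$

Solving for $E_{\hat{R}}[Y|T=1] - E_R[Y|T=1]$:

$$\begin{aligned}
&= \sum_{Y} Y \big(\hat{R}(Y|T=1) - R(Y|T=1)\big) \\
&= \sum_{Y} \sum_{W} \big(Y\,\hat{R}(Y|T=1,W)\,\hat{R}(W|T=1) - Y\,R(Y|T=1,W)\,R(W|T=1)\big) \\
&= \sum_{W} \sum_{Y} Y\,\hat{R}(Y|T=1,W)\,\hat{R}(W|T=1) - \sum_{W} \sum_{Y} Y\,R(Y|T=1,W)\,R(W|T=1) \\
&= \sum_{W} \hat{R}(W|T=1) \sum_{Y} Y\,\hat{R}(Y|T=1,W) - \sum_{W} R(W|T=1) \sum_{Y} Y\,R(Y|T=1,W) \\
&= \alpha_{T=1}(\hat{Q},\hat{\beta}) - \alpha_{T=1}(Q,\beta)
\end{aligned}$$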

where the last equality follows from Definition A.2.

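Similarly, expanding $E_{\hat{R}}[Y|T=0] - E_R[Y|T=0]$ (by symmetry):

$$\begin{aligned}
&= \sum_{W} \hat{R}(W|T=0) \sum_{Y} Y\,\hat{R}(Y|T=0,W) - \sum_{W} R(W|T=0) \sum_{Y} Y\,R(Y|T=0,W) \\
&= \alpha_{T=0}(\hat{Q},\hat{\beta}) - \alpha_{T=0}(Q,\beta)
\end{aligned}$$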

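PART III.

Finally, we derive the result assuming the loss function is $L_1$. For the estimation term,

$$\begin{aligned}
L(h(x_{\hat{R}}),h(x_R)) &= L\big((E_{\hat{R}}[Y|T=1] - E_R[Y|T=1]) + (E_R[Y|T=0] - E_{\hat{R}}[Y|T=0])\big) \\
&= \big|(E_{\hat{R}}[Y|T=1] - E_R[Y|T=1]) + (E_R[Y|T=0] - E_{\hat{R}}[Y|T=0])\big| \\
&\leq |E_{\hat{R}}[Y|T=1] - E_R[Y|T=1]| + |E_R[Y|T=0] - E_{\hat{R}}[Y|T=0]| \\
&\leq |\alpha_{T=1}(\hat{Q},\hat{\beta}) - \alpha_{T=1}(Q,\beta)| + |\alpha_{T=0}(\hat{Q},\hat{\beta}) - \alpha_{T=0}(Q,\beta)|
\end{aligned}$$

and for the distribution-shift term,

$$\begin{aligned}
L(h(x_R),h(x_P)) &= L\big((E_R[Y|T=1] - E_P[Y|T=1]) + (E_P[Y|T=0] - E_R[Y|T=0])\big) \\
&= \big|(E_R[Y|T=1] - E_P[Y|T=1]) + (E_P[Y|T=0] - E_R[Y|T=0])\big| \\
&\leq |E_R[Y|T=1] - E_P[Y|T=1]| + |E_P[Y|T=0] - E_R[Y|T=0]| \\
&\leq |\mathrm{WLD}_{T=1}(\beta Q,P)| + |\mathrm{WLD}_{T=0}(\beta Q,P)|
\end{aligned}$$

Hence, substituting both bounds into Equation 21, we obtain the result:

$$\begin{aligned}
L(h(x_{\hat{R}}),h(x_P)) \leq{} & |\alpha_{T=1}(\hat{Q},\hat{\beta}) - \alpha_{T=1}(Q,\beta)| + |\alpha_{T=0}(\hat{Q},\hat{\beta}) - \alpha_{T=0}(Q,\beta)| \\
& + |\mathrm{WLD}_{T=1}(\beta Q,P)| + |\mathrm{WLD}_{T=0}(\beta Q,P)|
\end{aligned}$$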

Appendix B Proof of Lemma 4.5

Lemma B.1.

For the IPW estimator, if $P(W)=Q(W)=R(W)$, then $L(h(x_R),h(x_P))=0$.

Proof.

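For the sake of the proof, we assume the loss function is $L_1$. Then

$$\begin{aligned}
L(h_R,h_P|T=1) &= L\big(E_R[Y|T=1] - E_P[Y|T=1]\big) \\
&= |E_R[Y|T=1] - E_P[Y|T=1]| \\
&\leq \sum_{w} \big|R(W|T=1)\,E_R[Y|T=1,W] - P(W|T=1)\,E_P[Y|T=1,W]\big| \\
&= \sum_{w} \big|\beta_{ipw}(Q,W)\,Q(W|T=1)\,E_Q[Y|T=1,W] - P(W|T=1)\,E_P[Y|T=1,W]\big| \\
&\leq \sum_{w} \big|Q(W)\,E_P[Y|T=1,W] - P(W|T=1)\,E_P[Y|T=1,W]\big| \\
&= \sum_{w} \big|(Q(W) - P(W))\,E_P[Y|T=1,W]\big| \\
&= 0
\end{aligned}$$

A similar argument can be made for $T=0$, and hence $L(h(x_R),h(x_P))=0$.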
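The lemma can be checked numerically on a small discrete example. The sketch below is illustrative only: it assumes a binary $W$, a target distribution $P$ in which treatment is independent of $W$ (so $P(W|T)=P(W)=Q(W)$), and a shared outcome model as in Note A.4. All probabilities and conditional means are hypothetical.

```python
# Hypothetical discrete example with binary W and a shared outcome model.
q_w = {0: 0.3, 1: 0.7}                  # Q(W), assumed equal to P(W)
q_t1_given_w = {0: 0.2, 1: 0.6}         # Q(T=1 | W): treatment depends on W
mean_y = {(1, 0): 1.0, (1, 1): 3.0,     # E[Y | T=t, W=w], keyed by (t, w)
          (0, 0): 0.5, (0, 1): 2.0}

def q_w_given_t(t):
    """Q(W | T=t) obtained from Q(W) and Q(T | W) by Bayes' rule."""
    p_t_given_w = {w: q_t1_given_w[w] if t == 1 else 1.0 - q_t1_given_w[w]
                   for w in q_w}
    joint = {w: q_w[w] * p_t_given_w[w] for w in q_w}
    total = sum(joint.values())
    return {w: joint[w] / total for w in q_w}

def effect(w_given_t1, w_given_t0):
    """E[Y | T=1] - E[Y | T=0] for given conditional distributions of W."""
    e1 = sum(w_given_t1[w] * mean_y[(1, w)] for w in q_w)
    e0 = sum(w_given_t0[w] * mean_y[(0, w)] for w in q_w)
    return e1 - e0

def ipw_reweighted(t):
    """R(W | T=t) = beta * Q(W | T=t) with beta = Q(W) / Q(W | T=t)."""
    cond = q_w_given_t(t)
    return {w: (q_w[w] / cond[w]) * cond[w] for w in q_w}

naive = effect(q_w_given_t(1), q_w_given_t(0))       # unweighted contrast under Q
ipw = effect(ipw_reweighted(1), ipw_reweighted(0))   # contrast under R
target = effect(q_w, q_w)                            # contrast under P, where P(W | T) = P(W)
```

In this example the unweighted contrast differs from the target, while the reweighted contrast matches it up to floating-point error, which is the population-level statement of the lemma.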

Appendix C Proof of Theorem 5.2

Theorem C.1.

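Assume that the loss function $L$ is symmetric and obeys the triangle inequality. Let $h$ be a function on a representation $R$ such that $h(x_R)=E[Y|T=1]-E[Y|T=0]$. Then, for any valid weighted representation $R$, if $U \neq \phi$, the following holds with probability $(1-1/\mathrm{poly}(n))^{2|W|}$:

$$\begin{aligned}
L(h(x_{\hat{R}}),h(x_P)) \leq{} & \mathrm{WLD}^{W}_{T=1}(Q_B,P) + \mathrm{WLD}^{W}_{T=0}(Q_B,P) \\
& + |\alpha_{T=1}(\hat{Q}_B,\hat{\beta}_B) - \alpha_{T=1}(Q_B,\beta_B) - \gamma| + |\alpha_{T=0}(\hat{Q}_B,\hat{\beta}_B) - \alpha_{T=0}(Q_B,\beta_B) - \gamma|
\end{aligned}$$

where $Q_B$ represents the ‘robust’ version of the distribution $Q$.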

Proof.

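For the sake of the proof, we assume the loss function is $L_1$. For IPW, $\beta_B(Q,W)=\frac{Q_B(W)}{Q_B(W|T=t)}$ and $R(W|T=t)=Q_B(W)$. From Appendix A,

$$L(h_R,h_P) \leq \mathrm{WLD}^{W}_{T=1}(Q_B,P) + \mathrm{WLD}^{W}_{T=0}(Q_B,P)$$

From Lai et al. (8), we know that with probability $1-1/\mathrm{poly}(n)$, $|\hat{\mu}-\mu| \leq O(C_4^{1/4}(\eta+\epsilon)^{3/4}\sigma)$. Consider $\sum_{Y} Y\,Q_B(Y|T=1,W=w_i)$. This evaluates the conditional mean of $Y$ robustly for a fixed value $w_i$.

For a fixed value of $W=w_i$, $T=t$, with probability $1-1/\mathrm{poly}(n)$,

$$\bar{Y} \leq \hat{\bar{Y}} + O(C_4^{1/4}(\eta+\epsilon)^{3/4}\sigma)$$

Since $Q_B(w_i)>0$,

$$Q_B(w_i)\,\bar{Y} \leq Q_B(w_i)\big\{\hat{\bar{Y}} + O(C_4^{1/4}(\eta+\epsilon)^{3/4}\sigma)\big\}$$

Since all instances of the conditional mean of $Y$ are independent, $\sum_{W} Q_B(w_i)\,\bar{Y} \leq \sum_{W} Q_B(w_i)\big\{\hat{\bar{Y}} + O(C_4^{1/4}(\eta+\epsilon)^{3/4}\sigma)\big\}$ with probability at least $(1-1/\mathrm{poly}(n))^{|W|}$. Therefore,

$$L(h_{\hat{R}},h_R|T=t) \leq \sum_{W} \big(\hat{\bar{Y}}\,|\,W=w_i,T=t\big)\big(Q_B(w_i)-\hat{Q}_B(w_i)\big) - \sum_{W} Q(w_i)\,O(C_4^{1/4}(\eta+\epsilon)^{3/4}\sigma)$$

$$\begin{aligned}
L(h_{\hat{R}},h_R) &\leq |h(x_{\hat{R}})-h(x_R)|_{T=1} + |h(x_{\hat{R}})-h(x_R)|_{T=0} \\
L(h_{\hat{R}},h_R) &\leq |\alpha_{T=1}(\hat{Q}_B,\hat{\beta}_B) - \alpha_{T=1}(Q_B,\beta_B) - \gamma| + |\alpha_{T=0}(\hat{Q}_B,\hat{\beta}_B) - \alpha_{T=0}(Q_B,\beta_B) - \gamma|
\end{aligned}$$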
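The proof above relies on a mean estimate that stays accurate when a fraction $\eta$ of the sample is contaminated. The sketch below illustrates that idea only; it uses a symmetric trimmed mean as a simple stand-in and is not the agnostic estimator of Lai et al. (8). The sample sizes, the outlier value, and the trimming fraction are hypothetical.

```python
import random
import statistics

def trimmed_mean(values, trim_fraction=0.1):
    """Mean after discarding the lowest and highest trim_fraction of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return statistics.fmean(kept)

rng = random.Random(0)

# Huber-style contamination: 95% of points from Normal(0, 1) (true mean 0),
# 5% replaced by a far-away value.
clean = [rng.gauss(0.0, 1.0) for _ in range(950)]
outliers = [100.0] * 50
sample = clean + outliers

plain = statistics.fmean(sample)      # pulled toward the outliers
robust = trimmed_mean(sample, 0.1)    # stays close to the clean mean
```

The trimmed mean is still slightly biased here because trimming removes more clean points from one tail than the other; the estimator cited in the proof comes with the explicit error rate quoted above, which this stand-in does not.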

Appendix D Proof of Corollary 5.2.1

Corollary D.0.1.

For the IPW estimator, if $P(W)=Q(W)=R(W)$, then $L(h(x_R),h(x_P))=2\gamma$.

Proof.

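For the sake of the proof, we assume the loss function is $L_1$. Then

$$\begin{aligned}
L(h_R,h_P|T=1) &= L\big(E_R[Y|T=1] - E_P[Y|T=1]\big) \\
&= |E_R[Y|T=1] - E_P[Y|T=1]| \\
&\leq \sum_{w} \big|R(W|T=1)\,E_R[Y|T=1,W] - P(W|T=1)\,E_P[Y|T=1,W]\big| \\
&= \sum_{w} \big|\beta_{ipw}(Q_B,W)\,Q_B(W|T=1)\,E_{Q_B}[Y|T=1,W] - P(W|T=1)\,E_P[Y|T=1,W]\big| \\
&\leq \sum_{w} \big|Q_B(W)\,(E_P[Y|T=1,W] + \gamma) - P(W|T=1)\,E_P[Y|T=1,W]\big| \\
&= \sum_{w} \big|\gamma\,Q_B(W) + (Q_B(W) - P(W))\,E_P[Y|T=1,W]\big| \\
&= \sum_{w} |\gamma\,Q_B(W)| \\
&= \gamma
\end{aligned}$$

A similar argument can be made for $T=0$, and hence $L(h(x_R),h(x_P))=2\gamma$.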

Appendix E Results of Robust IPW

Setup.

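We generate data using the following structural equations (assuming both $W$ and $U$ are unidimensional), and present results for the following set of parameters:

$$n=10000,\quad \alpha=0.0,\quad \beta=0.01,\quad \nu=0.3,\quad \gamma=1.0,\quad \delta=10.0 \tag{22}$$

$$\mathrm{noise}_y \sim \mathrm{Normal}(0,0.5), \qquad w \sim \mathrm{Binomial}(p=0.7) \tag{23}$$

$$u \sim \mathrm{Normal}(\mu=5.0,\sigma=1.0) \tag{24}$$

$$t \sim \mathrm{Bernoulli}(p=\mathrm{sigmoid}(\alpha w+\beta u)) \tag{25}$$

$$y = \nu w + \gamma t + \delta u + \mathrm{noise}_y \tag{26}$$

We simulate the Huber contamination due to $U$ as follows: with probability $\eta$, $\delta$ keeps the value given above, and with probability $1-\eta$, $\delta=0$.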
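The setup above can be sketched in a few lines of standard-library Python. This is an illustrative reading of the structural equations, not the authors' code, and it makes several assumptions: the Binomial draw for $w$ is taken as a single trial, the second argument of Normal(0, 0.5) is read as a standard deviation, contamination is applied independently per unit, and the estimator shown is a normalized IPW with propensities estimated as empirical frequencies within each value of $w$. The robust IPW estimator is not reproduced. The function names `simulate` and `ipw_estimate` are hypothetical.

```python
import math
import random

def simulate(n=10000, eta=0.0, alpha=0.0, beta=0.01, nu=0.3,
             gamma=1.0, delta=10.0, seed=0):
    """Draw (w, t, y) rows; u is generated but left unobserved."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = 1 if rng.random() < 0.7 else 0           # assumption: single-trial Binomial
        u = rng.gauss(5.0, 1.0)
        p_treat = 1.0 / (1.0 + math.exp(-(alpha * w + beta * u)))
        t = 1 if rng.random() < p_treat else 0
        # Huber contamination: delta is active with probability eta, else 0.
        delta_i = delta if rng.random() < eta else 0.0
        y = nu * w + gamma * t + delta_i * u + rng.gauss(0.0, 0.5)
        rows.append((w, t, y))
    return rows

def ipw_estimate(rows):
    """Normalized IPW estimate of the ACE using empirical Pr(T=1 | W=w)."""
    n_w, n_w_treated = {}, {}
    for w, t, _ in rows:
        n_w[w] = n_w.get(w, 0) + 1
        n_w_treated[w] = n_w_treated.get(w, 0) + t
    propensity = {w: n_w_treated[w] / n_w[w] for w in n_w}

    num1 = den1 = num0 = den0 = 0.0
    for w, t, y in rows:
        e = propensity[w]
        if t == 1:
            num1 += y / e
            den1 += 1.0 / e
        else:
            num0 += y / (1.0 - e)
            den0 += 1.0 / (1.0 - e)
    return num1 / den1 - num0 / den0

rows = simulate(eta=0.0, seed=0)
estimate = ipw_estimate(rows)   # with eta = 0 there is no contamination from u
```

Calling `simulate` with a larger `eta` and several seeds gives a rough sense of how the spread of a non-robust estimate grows with contamination; the exact figures will depend on the assumptions listed above.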

Since $\gamma=1$, the true ACE (Average Causal Effect) is 1.0. The following table shows the range (min, max) of the robust IPW and standard IPW estimates over 10 different runs.

| $\eta$ | Robust IPW (min, max) | Standard IPW (min, max) |
|---|---|---|
| 0.0 | (0.979, 1.011) | (0.979, 1.011) |
| 0.05 | (0.884, 0.914) | (0.860, 1.377) |
| 0.1 | (0.793, 0.831) | (0.820, 1.737) |
| 0.15 | (0.716, 0.753) | (0.709, 1.887) |
| 0.20 | (0.646, 0.692) | (0.227, 1.640) |