Using Text Embeddings for Causal Inference
Abstract
We address causal inference with text documents. For example, does adding a theorem to a paper affect its chance of acceptance? Does reporting the gender of a forum post author affect the popularity of the post? We estimate these effects from observational data, where they may be confounded by features of the text such as the subject or writing quality. Although the text suffices for causal adjustment, it is prohibitively high-dimensional. The challenge is to find a low-dimensional text representation that can be used in causal inference. A key insight is that causal adjustment requires only the aspects of text that are predictive of both the treatment and outcome. Our proposed method adapts deep language models to learn low-dimensional embeddings from text that predict these values well; these embeddings suffice for causal adjustment. We establish theoretical properties of this method. We study it empirically on semi-simulated and real data on paper acceptance and forum post popularity. Code is available at github.com/bleilab/causaltextembeddings.
1 Introduction
We develop a method for causal inference from observed text documents. We consider a binary treatment, an outcome of interest, and a document of text. We assume that the text carries sufficient information to identify the causal effect; it is either an observed confounder or an observed mediator.
Example 1.1.
Consider a corpus of scientific papers submitted to a conference. Some have theorems; others do not. We want to infer the causal effect of including a theorem on paper acceptance. The effect is confounded by the subject of the paper—more technical topics demand theorems, but may have different rates of acceptance. The data does not explicitly list the subject, but it does include each paper’s abstract. We want to use the text to adjust for the subject and estimate the causal effect.
Example 1.2.
Consider comments from Reddit.com, an online forum. Each post has a popularity score and the author of the post may (optionally) list their gender. We want to know the direct effect of a ‘male’ label on the score of the post. However, the author’s gender may affect the text of the post, e.g., through tone, style, or topic choices, which also affects its score. Again, we want to use the text to accurately estimate the causal effect.
In these two examples, we assume that the text carries sufficient information to identify the causal effect. In theory, we can use classical methods of causal inference to adjust for the text of the document. But in practice we have finite data and the text is high-dimensional, prohibiting efficient and accurate causal inference. The challenge is to reduce the text to a low-dimensional representation that both suffices for causal identification and allows effective estimation with finite data.
Our strategy is to draw on text embedding methods to reduce the dimension of the text [e.g., Mikolov:Chen:Corrado:Dean:2013, Mikolov:Sutskever:Chen:Corrado:2013, Devlin:Chang:Lee:Toutanova:2018, Peters:Neumann:Iyyer:Gardner:Clark:Lee:Zettlemoyer:2018]. Informally, a text embedding method distills the text of each document to a real-valued vector, and these embeddings can be used as features for prediction problems. Black-box embedding methods are state-of-the-art for a range of natural language understanding tasks [Devlin:Chang:Lee:Toutanova:2018, Peters:Neumann:Iyyer:Gardner:Clark:Lee:Zettlemoyer:2018]. Here, we will adapt embedding methods in the service of causal inference.
The key insight is that to adjust for variables in causal inference, it suffices to use only the information relevant to the prediction of the treatment and outcome. Thus we harness modern embedding methods, BERT [Devlin:Chang:Lee:Toutanova:2018] in particular, to extract the information from the text required for this prediction problem. The learned embeddings capture information sufficient for causal identification and provide the necessary ingredients for various causal estimators.
Contribution. The main contribution of this paper is a method for adapting off-the-shelf text embedding methods to estimate treatment effects. We show that the method is theoretically sound, demonstrate its utility on semi-synthetic data, and apply it to real datasets for estimating causal effects of the properties of papers on acceptance and gender label on popularity.
2 Related work
This paper connects to several areas of related work.
The first area is causal inference for text. [roberts2018adjusting] also discuss how to estimate effects of treatments applied to text documents. They rely (in part) on topic modeling to reduce the dimension of the text. This strategy is reasonable if the learned topics reflect the confounding aspects of the text. In contrast, we replace the assumption that the topics capture confounding with the assumption that an embedding method can effectively extract predictive information. We compare to a topic-model-based approach in Section 5.
In other work, [egami2018make] reduce raw text to interpretable outcomes; [wood2018challenges] estimate treatment effects when confounders are observed, but missing or noisy treatments are inferred from text. In contrast, we are concerned with text as the confounder.
A second area of related work addresses causal inference with unobserved confounding when there is an observed proxy for the confounder [Kuroki:Miyakawa:1999, Pearl:2012, Kuroki:Pearl:2014, Miao:Geng:TchetgenTchetgen:2018, kallus2018causal]. This work usually assumes that the observed proxy variables are noisy realizations of the unobserved confounder, and then derives conditions under which causal identification is possible. One view of our problem is that each unit has a latent attribute (e.g., topic) such that observing it would suffice for causal identification, and the text is a proxy for this attribute. Unlike the proxy variable approach, however, we assume the text fully captures confounding. Our interest is in methods for finite-sample estimation rather than infinite-data identification.
[Louizos:Shalit:Mooij:Sontag:Zemel:Welling:2017] also work with proxy variables, and consider the estimation problem. They fit a variational autoencoder using observed data and assume that it exactly recovers the true data generating distribution (including the latent confounder). We require weaker assumptions than the full recovery of the data generating distribution.
Work on causal inference with hidden confounding and many treatments is in the same vein [Wang:Blei:2018, Ranganath:Perotte:2018, damour:2019]. The idea is to use the treatments to infer the latent confounders. In contrast, we assume that the text suffices to adjust for confounding.
Finally, [Veitch:Wang:Blei:2019] also use the reduction of causal estimation to prediction. In their case, the goal is to address unobserved confounding in the presence of network data.
3 Background
We begin by fixing notation and recalling some ideas from the estimation of causal effects. Each statistical unit is a document represented as a tuple ${O}_{i}=({Y}_{i},{T}_{i},{\mathbf{W}}_{i})$, where ${Y}_{i}$ is the outcome, ${T}_{i}$ is the treatment, and ${\mathbf{W}}_{i}$ is the sequence of words. The observed dataset consists of $n$ observations drawn independently and identically at random from some distribution, ${O}_{i}\sim P$.
We review estimation of the average treatment effect and the natural direct effect. For both, we assume that the words are sufficient for adjustment.
Average treatment effect. The average treatment effect (ATE) is defined as
$$\psi =\mathbb{E}[Y\mid \mathrm{do}(T=1)]-\mathbb{E}[Y\mid \mathrm{do}(T=0)].$$
The use of Pearl’s $\mathrm{do}$ notation indicates that the effect of interest is causal: what happens if we intervene by adding a theorem to a paper? We assume that the words ${\mathbf{W}}_{i}$ carry sufficient information to adjust for confounding (common causes) between ${T}_{i}$ and ${Y}_{i}$. Figure 1 (left) depicts this assumption. We define ${Z}_{i}=f({\mathbf{W}}_{i})$ to be the part of ${\mathbf{W}}_{i}$ which blocks all ‘backdoor paths’ between ${Y}_{i}$ and ${T}_{i}$. The causal effect is then identifiable from observational data as:
$$\psi =\mathbb{E}\left[\mathbb{E}[Y\mid Z,T=1]-\mathbb{E}[Y\mid Z,T=0]\right].$$  (3.1)
Our task is to estimate the ATE $\psi $ from a finite data sample. Define $Q(t,z)=\mathbb{E}[Y\mid t,z]$ to be the conditional expected outcome and $\widehat{Q}$ to be an estimate for $Q$. Following (3.1), a natural estimator is:
$${\widehat{\psi}}^{Q}=\frac{1}{n}\sum _{i}\left[\widehat{Q}(1,{z}_{i})-\widehat{Q}(0,{z}_{i})\right].$$  (3.2)
That is, $\psi $ is estimated by a two-stage procedure: first produce an estimate $\widehat{Q}$ of $Q$ through a predictive model; then plug $\widehat{Q}$ into a predetermined statistic to compute the estimate of the ATE.
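The two-stage recipe can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy data, the scalar confounder `z`, and the ordinary-least-squares outcome model stand in for whatever predictive model is actually used. Only the final averaging step corresponds to the estimator (3.2).

```python
import numpy as np

def plugin_ate(q1_hat, q0_hat):
    """Plug-in ATE estimate: average of Q_hat(1, z_i) - Q_hat(0, z_i) over units."""
    q1_hat = np.asarray(q1_hat, dtype=float)
    q0_hat = np.asarray(q0_hat, dtype=float)
    return float(np.mean(q1_hat - q0_hat))

# Toy data (hypothetical): a scalar confounder z drives both treatment and outcome.
rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-z)))
y = 1.0 * t + 2.0 * z + rng.normal(scale=0.5, size=n)  # true ATE is 1.0

# Stage 1: fit Q(t, z) by ordinary least squares on the features [1, t, z].
X = np.column_stack([np.ones(n), t, z])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Stage 2: predict both potential outcomes for every unit and plug them in.
q1 = np.column_stack([np.ones(n), np.ones(n), z]) @ coef
q0 = np.column_stack([np.ones(n), np.zeros(n), z]) @ coef
ate_hat = plugin_ate(q1, q0)

# Unadjusted difference in means, for comparison; it is confounded by z.
naive = y[t == 1].mean() - y[t == 0].mean()
```

In this toy setting the adjusted estimate lands near the true effect, while the unadjusted difference in means is inflated by the confounder.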
The estimator (3.2) is not the only possible choice. In principle, it is possible to do better by using estimators that also incorporate estimates $\widehat{g}$ of the propensity scores $g(z)=\mathrm{P}(T=1\mid z)$ [e.g., Robins:2000, vanderLaan:Rose:2011, Robins:Rotnitzky:Zhao:1994, Chernozhukov:Chetverikov:Demirer:Duflo:Hansen:Newey:Robins:2017]. The general approach is a two-stage procedure: first fit a model for the propensity scores and conditional outcomes; then plug the fitted model into a downstream estimator. What is important is that these estimators depend on ${z}_{i}$ only through $\widehat{g}({z}_{i})$ and $\widehat{Q}(t,{z}_{i})$.
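As one concrete instance of an estimator that combines $\widehat{Q}$ and $\widehat{g}$, the sketch below implements the standard augmented inverse-probability-weighted (AIPW) form. AIPW is swapped in here purely as an illustration of the general pattern; it is not claimed to be the estimator used in this paper's experiments, which rely on TMLE. Note that the function touches the covariates only through the fitted values, as the paragraph above emphasizes.

```python
import numpy as np

def aipw_ate(y, t, q1_hat, q0_hat, g_hat):
    """AIPW estimate of the ATE from fitted outcomes and propensities.

    Inputs are per-unit arrays: observed outcome y, binary treatment t,
    fitted outcomes Q_hat(1, z_i) and Q_hat(0, z_i), and fitted propensity g_hat(z_i).
    The covariates z_i themselves never appear.
    """
    y, t, q1_hat, q0_hat, g_hat = (
        np.asarray(a, dtype=float) for a in (y, t, q1_hat, q0_hat, g_hat)
    )
    # Residual correction: inverse-propensity-weighted errors of the outcome model.
    correction = t * (y - q1_hat) / g_hat - (1.0 - t) * (y - q0_hat) / (1.0 - g_hat)
    return float(np.mean(q1_hat - q0_hat + correction))
```

When the outcome model is exactly right, the correction term averages to zero and the estimate coincides with the plug-in estimator; the correction matters when $\widehat{Q}$ is imperfect.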
Natural direct effect. The direct effect is the expected change in outcome if we apply the treatment while holding fixed any mediating variables that are affected by the treatment and that affect the outcome. Figure 1 (right) depicts the text as mediator of the treatment and outcome. For the estimation of the direct effect, we take $Z=f(\mathbf{W})$ to be the parts of $\mathbf{W}$ that mediate $T$ and $Y$. The natural direct effect of treatment $\beta $ is the average difference in outcome induced by giving each unit the treatment, if the distribution of $Z$ had been as though each unit received treatment. That is,
$$\beta ={\mathbb{E}}_{\mathrm{P}(Z\mid T=1)}\left[\mathbb{E}[Y\mid Z,\mathrm{do}(T=1)]-\mathbb{E}[Y\mid Z,\mathrm{do}(T=0)]\right].$$
In the gender example, this is the expected difference in score between a post labeled as written by a man versus labeled as written by a woman, where the expectation is taken over the distribution of posts written by men.
Under minimal conditions, this quantity may be estimated from observational data [Pearl:2014]. The natural estimator is [vanderLaan:Rose:2011, Ch. 8]
$${\widehat{\beta}}^{\mathrm{plugin}}=\frac{1}{n}\sum _{i}\left[\widehat{Q}(1,{z}_{i})-\widehat{Q}(0,{z}_{i})\right]\widehat{g}({z}_{i})\Big/\left(\frac{1}{n}\sum _{i}{t}_{i}\right).$$
As with the ATE, there are also more sophisticated estimators [e.g., vanderLaan:Rose:2011, Ch. 8]. Again, all such estimators rely on $Z$ only through the estimated conditional outcomes and propensity scores.
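The plug-in direct-effect estimator can be sketched in a few lines. The weight $\widehat{g}({z}_{i})/\left(\frac{1}{n}\sum_i t_i\right)$ is what moves the average from the marginal distribution of $Z$ to its distribution among treated units, since by Bayes' rule $\mathrm{P}(z\mid T=1)=g(z)\mathrm{P}(z)/\mathrm{P}(T=1)$. The function name and the array inputs are illustrative assumptions.

```python
import numpy as np

def plugin_nde(q1_hat, q0_hat, g_hat, t):
    """Plug-in natural direct effect.

    Averages Q_hat(1, z_i) - Q_hat(0, z_i) with weights g_hat(z_i) / mean(t),
    which reweights the sample toward the covariate distribution of the treated.
    """
    q1_hat, q0_hat, g_hat, t = (
        np.asarray(a, dtype=float) for a in (q1_hat, q0_hat, g_hat, t)
    )
    return float(np.mean((q1_hat - q0_hat) * g_hat) / np.mean(t))
```

If the per-unit difference $\widehat{Q}(1,z_i)-\widehat{Q}(0,z_i)$ is constant and the propensities average to the treated fraction, the reweighting has no effect and the estimate equals that constant.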
4 Causal text embeddings
We first focus on estimation of the average treatment effect. Following the previous section, we want to produce estimates of the propensity score $g({z}_{i})$ and the conditional expected outcome $Q({t}_{i},{z}_{i})$. We assume that some property ${z}_{i}=f({\mathbf{w}}_{i})$ of the text suffices for identification. The obstacle motivating this paper is that we do not directly observe the confounding features ${z}_{i}$. Instead, we must work with the raw text.
A simple approach is to abandon ${z}_{i}$ altogether and learn models for the propensities and conditional outcomes directly from the words ${\mathbf{w}}_{i}$. Since ${\mathbf{w}}_{i}$ contains all information about ${z}_{i}$, the direct adjustment will also render the causal effect identifiable. Indeed, in an infinite-data setting this would be a sound approach. However, the dimensionality of the problem is prohibitive.
We require a reduction of the words ${\mathbf{w}}_{i}$ to a feature ${z}_{i}$ that both contains sufficient information to render the causal effect identifiable, and that will allow us to effectively learn the propensity scores and conditional outcomes with a finite data sample. A key insight follows from [Rosenbaum:Rubin:1983, Thm. 3]. Recall $Q(t,z)=\mathbb{E}[Y\mid t,z]$ and $g(z)=\mathrm{P}(T=1\mid z)$.
Theorem 4.1.
Suppose $\lambda (\mathbf{w})$ is some function of the words such that at least one of the following is $\lambda (\mathbf{W})$-measurable:

1. $(Q(0,\mathbf{W}),Q(1,\mathbf{W}))$,

2. $g(\mathbf{W})$,

3. $g((Q(0,\mathbf{W}),Q(1,\mathbf{W})))$ or $(Q(0,g(\mathbf{W})),Q(1,g(\mathbf{W})))$.

If adjusting for $\mathbf{W}$ suffices to render the average treatment effect identifiable, then adjusting for only $\lambda (\mathbf{W})$ also suffices. That is, $\psi =\mathbb{E}\left[\mathbb{E}[Y\mid \lambda (\mathbf{W}),T=1]-\mathbb{E}[Y\mid \lambda (\mathbf{W}),T=0]\right]$.
In words: the random variable $\lambda (\mathbf{W})$ carries the information about $\mathbf{W}$ relevant to the prediction of both the propensity score and the conditional expected outcome. While $\lambda (\mathbf{W})$ will typically throw away much of the information in the words, Theorem 4.1 says that adjusting for it suffices to estimate causal effects. Item 3 says that this holds even if we throw away information relevant to $Y$, so long as this information is not also relevant to $T$ (and vice versa). The utility of Theorem 4.1 is that if we can find features of $\mathbf{w}$ that suffice for the prediction problem, then adjusting for these features also suffices for the causal estimation problem.
Our strategy is to use the words of each document to produce an embedding vector $\lambda (\mathbf{w})$ that captures the confounding aspects of the text. These embeddings are satisfactory if we can use them to estimate the propensities and conditional outcomes required by the downstream effect estimator.
We will use embedding-based prediction models from the natural language processing literature. For our purposes, these models may be viewed as black boxes that take in words ${\mathbf{w}}_{i}$ and produce a tuple $({\lambda}_{i},\tilde{Q}({t}_{i},{\lambda}_{i}),\tilde{g}({\lambda}_{i}))$, which contains an embedding ${\lambda}_{i}$ and estimates of $g$ and $Q$ that use that embedding. The idea is that such models provide an effective black-box tool both for distilling the words into the information relevant to prediction problems and for solving those prediction problems.
Finally, to estimate the average treatment effect, we follow the general strategy of Section 3. First, we fit the embedding-based prediction model to produce estimated embeddings ${\widehat{\lambda}}_{i}$, propensity scores $\tilde{g}({\widehat{\lambda}}_{i})$ and conditional outcomes $\tilde{Q}({t}_{i},{\widehat{\lambda}}_{i})$. We then plug these values into a downstream estimator. We will see an explicit example below.
Validity. The next result gives conditions for this procedure to be valid.
Theorem 4.2.
Let $\eta (z)=(\mathbb{E}[Y\mid T=0,z],\mathbb{E}[Y\mid T=1,z],\mathrm{P}(T=1\mid z))$ be the conditional outcomes and propensities given $z$. Suppose that $\widehat{\psi}(\{({t}_{i},{y}_{i},{z}_{i})\};\eta )=\frac{1}{n}\sum _{i}\varphi ({t}_{i},{y}_{i},\eta ({z}_{i}))+{o}_{p}(1)$ is some consistent estimator for the average treatment effect $\psi $. Further suppose that there is some function $\lambda $ of the words such that

1. (identification) $\lambda $ satisfies the condition of Theorem 4.1,

2. (consistency) ${\parallel \eta (\lambda ({\mathbf{W}}_{i}))-\tilde{\eta}({\widehat{\lambda}}_{i})\parallel}_{P,2}\to 0$ as $n\to \infty $, where $\tilde{\eta}$ is the estimated conditional outcome and propensity model,

3. (well-behaved estimator) ${\parallel {\nabla}_{\eta}\varphi (t,y,\eta )\parallel}_{2}\le C$ for some constant $C\in {\mathbb{R}}_{+}$.

Then $\widehat{\psi}(\{({t}_{i},{y}_{i},{\widehat{\lambda}}_{i})\};\tilde{\eta})\stackrel{p}{\to}\psi $.
Remark 4.3.
The requirement that the estimator $\widehat{\psi}$ behaves asymptotically as a sample mean is not an important restriction; most commonly used estimators have this property [Kennedy:2016]. The third condition is a technical requirement on the estimator. In the cases we consider, it suffices that the ranges of $Y$ and $Q$ are bounded and that $g$ is bounded away from 0 and 1. This latter requirement is the common ‘overlap’ condition, which is required for the estimation of causal effects in any case.
Proof.
By Theorem 4.1 and assumption 1, $\widehat{\psi}(\{({t}_{i},{y}_{i},\lambda ({\mathbf{w}}_{i}))\};\eta )\stackrel{p}{\to}\psi $.
For brevity, we write ${\lambda}_{i}=\lambda ({\mathbf{w}}_{i})$. By Taylor’s theorem,
$$\frac{1}{n}\sum _{i}\varphi ({t}_{i},{y}_{i},\tilde{\eta}({\widehat{\lambda}}_{i}))=\frac{1}{n}\sum _{i}\varphi ({t}_{i},{y}_{i},\eta ({\lambda}_{i}))+\frac{1}{n}\sum _{i}{\nabla}_{\eta}\varphi ({t}_{i},{y}_{i},{\eta}_{i}^{*})\left(\tilde{\eta}({\widehat{\lambda}}_{i})-\eta ({\lambda}_{i})\right),$$
for some $\{{\eta}_{i}^{*}\}$. By continuous mapping, it suffices to show that the second term goes to 0 in probability. By Cauchy-Schwarz and assumption 3,
$$\left|\frac{1}{n}\sum _{i}{\nabla}_{\eta}\varphi ({t}_{i},{y}_{i},{\eta}_{i}^{*})\left(\tilde{\eta}({\widehat{\lambda}}_{i})-\eta ({\lambda}_{i})\right)\right|\le C\sqrt{\frac{1}{n}\sum _{i}{\parallel \tilde{\eta}({\widehat{\lambda}}_{i})-\eta ({\lambda}_{i})\parallel}_{2}^{2}}.$$
By Markov’s inequality, $\mathrm{P}\left(\frac{1}{n}\sum _{i}{\parallel \tilde{\eta}({\widehat{\lambda}}_{i})-\eta ({\lambda}_{i})\parallel}_{2}^{2}>\epsilon \right)\le {\parallel \eta ({\lambda}_{i})-\tilde{\eta}({\widehat{\lambda}}_{i})\parallel}_{P,2}^{2}/\epsilon $ for all $\epsilon >0$. The result follows by assumption 2. ∎
As with all causal inference, the validity of the procedure relies on uncheckable assumptions that the practitioner must assess on a case-by-case basis. In particular, we require that:

1. (properties $z$ of) the document text renders the effect identifiable,

2. the embedding method extracts text information relevant to the prediction of both $t$ and $y$,

3. the conditional outcome and propensity score models are consistent.

Only the second assumption is nonstandard. In practice, we use the best possible embedding method and take the strong performance on (predictive) natural language tasks in many contexts as evidence that the method effectively extracts information relevant to prediction tasks. Implicitly, we are assuming that features that are useful for language understanding tasks are also useful for eliminating confounding. This is reasonable in settings where we expect the confounding to be aspects such as topic, writing quality, or sentiment. Informally, assumption 2 is satisfied if we use a good natural-language model, so we satisfy it by using the best available model.
Causal BERT. We modify BERT, a state-of-the-art language model [Devlin:Chang:Lee:Toutanova:2018]. Each input to BERT is the document text, a sequence of wordpiece tokens ${\mathbf{w}}_{i}=({w}_{i1},\dots ,{w}_{il})$. The model is tasked with producing three kinds of outputs: 1) document-level embeddings, 2) a map from the embeddings to treatment probability, 3) a map from the embeddings to expected outcomes for the treated and untreated.
The model assigns an embedding ${\xi}_{w}$ to each wordpiece $w$. It then produces a document-level embedding for document text ${\mathbf{w}}_{i}$ as ${\lambda}_{i}=f(({\xi}_{{w}_{i1}},\dots ,{\xi}_{{w}_{il}}),{\gamma}^{\text{U}})$ for a particular function $f$. The embeddings and global parameter ${\gamma}^{\text{U}}$ are trained by minimizing an unsupervised objective, denoted as ${L}_{\text{U}}({\mathbf{w}}_{i};\xi ,{\gamma}^{\text{U}})$. Informally, random wordpiece tokens are censored from each document and the model is tasked with predicting their identities. (Footnote 1: BERT also considers a ‘next sentence’ prediction task, which we do not use.)
Following [Devlin:Chang:Lee:Toutanova:2018], we use a fine-tuning approach to solve the prediction problem. We add a logit-linear layer mapping ${\lambda}_{i}\to \tilde{g}({\lambda}_{i};{\gamma}^{g})$ and a 2-hidden-layer neural net for each of ${\lambda}_{i}\to \tilde{Q}(0,{\lambda}_{i};{\gamma}^{{Q}_{0}})$ and ${\lambda}_{i}\to \tilde{Q}(1,{\lambda}_{i};{\gamma}^{{Q}_{1}})$. We learn the parameters for the embedding model and the prediction model jointly. Intuitively, this adapts the embeddings to be useful for the downstream prediction task, i.e., for causal inference.
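The shape of the three prediction heads can be sketched with plain NumPy. This is only a schematic of the head structure described above, not the released implementation: the embedding is replaced by random vectors, and the hidden width, ReLU activation, and initialisation scale are all assumptions chosen for illustration.

```python
import numpy as np

def init_heads(dim, hidden, rng):
    """Randomly initialise the three heads (sizes and scale are illustrative)."""
    def dense(n_in, n_out):
        return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

    def two_hidden_layer_net():
        return [dense(dim, hidden), dense(hidden, hidden), dense(hidden, 1)]

    return {
        "g": dense(dim, 1),             # logit-linear treatment head
        "q0": two_hidden_layer_net(),   # outcome head for untreated units
        "q1": two_hidden_layer_net(),   # outcome head for treated units
    }

def mlp_forward(x, layers):
    """Apply hidden layers with ReLU (an assumption), then a linear output."""
    for w, b in layers[:-1]:
        x = np.maximum(x @ w + b, 0.0)
    w, b = layers[-1]
    return (x @ w + b)[:, 0]

def predict(lam, heads):
    """Map embeddings lam (n x dim) to (g, Q(0, .), Q(1, .)) per document."""
    w, b = heads["g"]
    g = 1.0 / (1.0 + np.exp(-(lam @ w + b)[:, 0]))
    return g, mlp_forward(lam, heads["q0"]), mlp_forward(lam, heads["q1"])

rng = np.random.default_rng(0)
lam = rng.normal(size=(4, 8))  # stand-in for 4 document embeddings of dimension 8
g, q0, q1 = predict(lam, init_heads(dim=8, hidden=16, rng=rng))
```

In the actual model the embeddings come from BERT and all parameters are trained jointly; here they are fixed random values so that only the data flow is visible.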
We write $\gamma $ for the full collection of global parameters. The final model is trained as:
$$\begin{aligned}{\widehat{\lambda}}_{i}&=f(({\widehat{\xi}}_{n,{w}_{i1}},\dots ,{\widehat{\xi}}_{n,{w}_{il}}),{\widehat{\gamma}}^{\text{U}}),\\ \widehat{\xi},\widehat{\gamma}&=\underset{\xi ,\gamma}{\operatorname{argmin}}\ \frac{1}{n}\sum _{i}L({\mathbf{w}}_{i};\xi ,\gamma ),\end{aligned}$$
where the objective is designed to predict both the treatment and outcome. It is
$$L({\mathbf{w}}_{i};\xi ,\gamma )={\left({y}_{i}-\tilde{Q}({t}_{i},{\lambda}_{i};\gamma )\right)}^{2}+\mathsf{CrossEnt}({t}_{i},\tilde{g}({\lambda}_{i};\gamma ))+{L}_{\text{U}}({\mathbf{w}}_{i};\xi ,\gamma ).$$
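The three terms of the objective can be written out as a small NumPy function. This is a sketch under stated assumptions: the head outputs `q0`, `q1`, `g` and the unsupervised loss value are passed in as given numbers, the clipping constant `eps` is added only for numerical safety, and no gradient computation or training loop is shown.

```python
import numpy as np

def causal_bert_loss(y, t, q0, q1, g, unsup_loss=0.0, eps=1e-7):
    """Per-document objective: squared outcome error + treatment cross-entropy + L_U.

    y, t       observed outcome and binary treatment
    q0, q1     model outputs Q~(0, lambda_i) and Q~(1, lambda_i)
    g          model output g~(lambda_i), a probability
    unsup_loss value of the unsupervised objective L_U (taken as given here)
    """
    y, t, q0, q1, g = (np.asarray(a, dtype=float) for a in (y, t, q0, q1, g))
    # Only the head matching the observed treatment is penalised: Q~(t_i, lambda_i).
    q_obs = np.where(t == 1, q1, q0)
    outcome_term = (y - q_obs) ** 2
    g = np.clip(g, eps, 1.0 - eps)  # guard against log(0); not part of the formula
    treatment_term = -(t * np.log(g) + (1.0 - t) * np.log(1.0 - g))
    return outcome_term + treatment_term + unsup_loss
```

Because each outcome head only receives a training signal from units with the matching treatment, the counterfactual predictions used later for effect estimation are extrapolations through the shared embedding.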
Effect estimation. Computing causal effect estimates simply requires plugging in the propensity scores and expected outcomes that the trained model predicts on the held-out units. For example, using the plug-in estimator (3.2),
$${\widehat{\psi}}^{Q}:=\frac{1}{n}\sum _{i}\left[\tilde{Q}(1,{\widehat{\lambda}}_{n,i};{\widehat{\gamma}}_{n}^{Q})-\tilde{Q}(0,{\widehat{\lambda}}_{n,i};{\widehat{\gamma}}_{n}^{Q})\right].$$  (4.1)
The same procedure applies to other estimators as well.
Natural direct effect. We now discuss the analogous development for the natural direct effect. In this setting, the text serves as a mediator between the treatment and the outcome. We are interested in understanding the causal effect of the treatment that does not go through the text.
The key result is the analogue of Theorem 4.1. Namely, suppose $\lambda $ is some function of the words such that $\lambda (\mathbf{W})$ carries all information relevant to both the prediction of the treatment and outcome. Then the natural direct effect is equal to
$$\beta ={\mathbb{E}}_{\mathrm{P}(\lambda (\mathbf{W})\mid T=1)}\left[\mathbb{E}[Y\mid \lambda (\mathbf{W}),\mathrm{do}(T=1)]-\mathbb{E}[Y\mid \lambda (\mathbf{W}),\mathrm{do}(T=0)]\right].$$  (4.2)
That is, adjusting for $\lambda (\mathbf{W})$ suffices to adjust for any mediating effect in the words. This result is essentially by definition: any mediator must be predictive of both the treatment and outcome, so it suffices to adjust only for the parts of $\mathbf{w}$ that are predictive of both treatment and outcome.
The remaining development is identical to the average treatment effect case. We estimate embeddings, propensities, and conditional expected outcomes using Causal BERT, and then plug these estimates into a downstream direct effect estimator. For example,
$${\widehat{\beta}}^{\mathrm{plugin}}=\frac{1}{n}\sum _{i}\left[\tilde{Q}(1,{\widehat{\lambda}}_{i})-\tilde{Q}(0,{\widehat{\lambda}}_{i})\right]\tilde{g}({\widehat{\lambda}}_{i})\Big/\left(\frac{1}{n}\sum _{i}{t}_{i}\right).$$  (4.3)
The proof of validity is the same as for Theorem 4.2.
5 Experiments
We now empirically study the quality of Causal BERT embeddings for causal estimation. The questions of interest are: 1) do the learned embeddings identify causal effects in realistic simulations? 2) what happens in the presence of unobserved confounding exogenous to the text? Additionally, we apply the proposed method to the two motivating examples in the introduction. We estimate causal effects on paper acceptance and post popularity on Reddit.com. (Footnote 2: Software and data at github.com/bleilab/causaltextembeddings.)
We find that 1) the method is able to effectively adjust for confounding, and 2) it is robust to exogenous confounding. Our application suggests that much of the apparent effect of the treatments we study is attributable to confounding in the text.
5.1 Setup
PeerRead. PeerRead is a corpus of computer-science papers [Kang:Ammar:Dalvi:vanZuylen:Kohlmeier:Hovy:Schwartz:2018]. We consider a subset of the corpus consisting of papers posted to the arXiv under cs.cl, cs.lg, or cs.ai between 2007 and 2017 inclusive. The data only includes papers that are not cross-listed with any non-cs categories and are within a month of the submission deadline for a target conference. The conferences are: ACL, EMNLP, NAACL, EACL, TACL, NeurIPS, ICML, ICLR and AAAI. A paper is marked as accepted if it appeared in one of the target venues. Otherwise, the paper is marked as rejected. The dataset includes 11,778 papers, of which 2,891 are accepted.
For each paper, we consider the text of the abstract, the accept/reject decision, and two attributes:

1. buzzy: the title contains any of ‘deep’, ‘neural’, ‘embed’, or ‘adversarial net’.

2. theorem: the word ‘Theorem’ appears in the paper.

These attributes can be predicted from the abstract text.
Reddit. Reddit is an online forum divided into topic-specific subforums called ‘subreddits’. We consider three subreddits: keto, okcupid, and childfree. In these subreddits, we identify users whose username flair includes a gender label (usually ‘M’ or ‘F’). We collect all top-level comments from these users in 2018. We use each comment’s text and score, the number of likes minus dislikes from other users. The dataset includes 90k comments in the selected subreddits. We consider the direct effect of the labeled gender on posts’ scores.
Estimator. We use Causal BERT, explained in Section 4. We truncate PeerRead abstracts to 250 wordpiece tokens, and Reddit posts to 128 wordpiece tokens. We begin with a BERT model pretrained on a general English language corpus. We further pretrain a BERT model on each dataset, running training on the unsupervised objective until convergence. In all cases, we use a logit-linear layer to predict treatment from embeddings, and a 2-hidden-layer neural network for the expected outcome predictor.
For each experiment, we consider two downstream estimators: the simple estimators (4.3) and (4.1), and ‘one-step’ TMLE estimators [vanderLaan:Gruber:2016]. The latter are more sophisticated estimators that combine estimated conditional outcomes and propensities to achieve asymptotic robustness and efficiency properties. For all estimators, we exclude units that have a predicted propensity score greater than 0.97 or less than 0.03.
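The exclusion rule in the last sentence amounts to a boolean mask over units. The sketch below assumes the predicted propensities are available as an array; the function name and the choice to keep scores exactly equal to a threshold are illustrative.

```python
import numpy as np

def overlap_mask(g_hat, lower=0.03, upper=0.97):
    """True for units kept: predicted propensity neither below lower nor above upper."""
    g_hat = np.asarray(g_hat, dtype=float)
    return (g_hat >= lower) & (g_hat <= upper)

g_hat = np.array([0.01, 0.03, 0.50, 0.97, 0.99])
keep = overlap_mask(g_hat)
```

The same mask would then be applied to the outcomes, treatments, and fitted conditional outcomes before computing any of the downstream estimators.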
5.2 Results
Estimator Evaluation
Reddit (NDE, simulated real-valued outcomes):

| Noise: | $\sigma =1.0$ | $\sigma =1.0$ | $\sigma =1.0$ | $\sigma =4.0$ | $\sigma =4.0$ | $\sigma =4.0$ |
| --- | --- | --- | --- | --- | --- | --- |
| Confounding: | Low | Med. | High | Low | Med. | High |
| Ground truth | $1.00$ | $1.00$ | $1.00$ | $1.00$ | $1.00$ | $1.00$ |
| Unadjusted | $1.03$ | $1.24$ | $3.48$ | $\mathbf{0.99}$ | $1.22$ | $3.51$ |
| Words ${\widehat{\beta}}^{\mathrm{plugin}}$ | $1.01$ | $1.17$ | $2.69$ | $1.04$ | $1.16$ | $2.63$ |
| Words ${\widehat{\beta}}^{\mathrm{TMLE}}$ | $1.02$ | $1.18$ | $2.71$ | $1.04$ | $1.17$ | $2.65$ |
| LDA ${\widehat{\beta}}^{\mathrm{plugin}}$ | $1.01$ | $1.20$ | $2.95$ | $1.02$ | $1.19$ | $2.91$ |
| LDA ${\widehat{\beta}}^{\mathrm{TMLE}}$ | $\mathbf{1.01}$ | $1.20$ | $2.96$ | $1.02$ | $1.19$ | $2.91$ |
| Causal BERT ${\widehat{\beta}}^{\mathrm{plugin}}$ | $0.96$ | $\mathbf{1.05}$ | $\mathbf{1.24}$ | $0.83$ | $0.63$ | $\mathbf{1.31}$ |
| Causal BERT ${\widehat{\beta}}^{\mathrm{TMLE}}$ | $0.98$ | $1.05$ | $1.58$ | $0.95$ | $\mathbf{1.00}$ | $1.51$ |

PeerRead (ATE, simulated binary outcomes):

| Confounding: | Low | Med. | High |
| --- | --- | --- | --- |
| Ground truth | $0.06$ | $0.05$ | $0.03$ |
| Unadjusted | $0.08$ | $0.15$ | $0.16$ |
| Words ${\widehat{\psi}}^{Q}$ | $0.07$ | $0.13$ | $0.15$ |
| Words ${\widehat{\psi}}^{\mathrm{TMLE}}$ | $0.07$ | $0.13$ | $0.15$ |
| LDA ${\widehat{\psi}}^{Q}$ | $0.06$ | $0.06$ | $0.06$ |
| LDA ${\widehat{\psi}}^{\mathrm{TMLE}}$ | $0.06$ | $0.06$ | $0.06$ |
| Causal BERT ${\widehat{\psi}}^{Q}$ | $0.07$ | $\mathbf{0.06}$ | $0.01$ |
| Causal BERT ${\widehat{\psi}}^{\mathrm{TMLE}}$ | $\mathbf{0.06}$ | $0.07$ | $\mathbf{0.04}$ |
Empirical evaluation of causal estimation procedures requires semi-synthetic data because ground truth causal effects are usually not available for real-world data. For such evaluations to be compelling, the semi-synthetic model must be reflective of real-world data. This is challenging for text data: there are no realistic generative models of text, so it is not possible to generate a confounder and then generate the text, treatment, and outcome on the basis of this confounder.
To circumvent this, we use real metadata (subreddit and title buzziness) as the confounders $\tilde{z}$ for the simulation. We simulate only the outcomes, using the treatment and the confounder. We compute the true propensity score $\pi (\tilde{z})$ as the proportion of units with ${t}_{i}=1$ in each stratum of $\tilde{z}$. Then, ${Y}_{i}$ is simulated from the model:
$${Y}_{i}={t}_{i}+{b}_{1}(\pi ({\tilde{z}}_{i})-0.5)+{\epsilon}_{i},\qquad {\epsilon}_{i}\sim N(0,\sigma ).$$
Or, for binary outcomes,
$${Y}_{i}\sim \text{Bernoulli}(\text{sigmoid}(0.25{t}_{i}+{b}_{1}(\pi ({\tilde{z}}_{i})-0.2))).$$
The parameter ${b}_{1}$ controls the level of confounding; e.g., the bias of the unadjusted difference $\mathbb{E}[Y\mid T=1]-\mathbb{E}[Y\mid T=0]$ increases with ${b}_{1}$. For PeerRead, we report estimates of the ATE for binary simulated outcomes. For Reddit, we compute the NDE for simulated real-valued outcomes.
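The simulation design can be sketched as follows. The code is an illustration under assumptions, not the released simulation script: the function names are hypothetical, the confounder is taken to be a categorical array, and $\sigma$ in $N(0,\sigma)$ is interpreted as a standard deviation, which the text does not specify.

```python
import numpy as np

def stratum_propensity(t, z):
    """pi(z~): share of treated units within each stratum of a categorical confounder."""
    t = np.asarray(t, dtype=float)
    z = np.asarray(z)
    pi = np.empty(len(t), dtype=float)
    for level in np.unique(z):
        in_stratum = z == level
        pi[in_stratum] = t[in_stratum].mean()
    return pi

def simulate_continuous(t, pi, b1, sigma, rng):
    """Y_i = t_i + b1 * (pi_i - 0.5) + eps_i, with sigma read as a standard deviation."""
    t = np.asarray(t, dtype=float)
    return t + b1 * (pi - 0.5) + rng.normal(0.0, sigma, size=len(t))

def simulate_binary(t, pi, b1, rng):
    """Y_i ~ Bernoulli(sigmoid(0.25 * t_i + b1 * (pi_i - 0.2)))."""
    t = np.asarray(t, dtype=float)
    prob = 1.0 / (1.0 + np.exp(-(0.25 * t + b1 * (pi - 0.2))))
    return rng.binomial(1, prob)

# Tiny hypothetical example: two strata with treated shares 2/3 and 1/3.
t = np.array([1, 1, 0, 0, 1, 0])
z = np.array(["a", "a", "a", "b", "b", "b"])
pi = stratum_propensity(t, z)
y_real = simulate_continuous(t, pi, b1=10.0, sigma=1.0, rng=np.random.default_rng(0))
y_bin = simulate_binary(t, pi, b1=10.0, rng=np.random.default_rng(0))
```

Because the outcome depends on the confounder only through $\pi(\tilde{z})$, units in strata with a higher treated share receive systematically shifted outcomes, and larger ${b}_{1}$ widens the gap between the unadjusted difference and the true effect.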
Additionally, we compare against two baselines. The first is a two-stage procedure that uses LDA to estimate document-topic proportions $\widehat{z}$ and linear/logistic regression for $\widehat{Q}(\widehat{z})$ and $\widehat{g}(\widehat{z})$. The second fits linear/logistic regression for the expected outcomes and treatments using word counts directly without dimensionality reduction.
Results are summarized in Tables 1 and 2. Compared to the unadjusted estimate, all methods for adjustment reduce confounding. However, Causal BERT does substantially better for moderate to high confounding. This is even in a simulation setting favorable to LDA (the true confounding is topic, and has a simple relation to outcome). The benefits of dimensionality reduction on text are clear in PeerRead, where adjustment based on LDA is much better than using the words alone.
The effect of exogeneity.
We assume that the text carries all information about the confounding (or mediation) necessary to identify the causal effect. In many situations, this assumption may not be fully realistic. For example, in the simulations just discussed, it may not be possible to exactly recover the confounding from the text. We study the effect of violating this assumption by simulating both treatment and outcome from a confounder that consists of a part that can be fully inferred from the text and part that is wholly exogenous.
The challenge is finding a realistic confounder that can be exactly inferred from the text. Our approach is to (i) train BERT to predict the actual treatment of interest, producing propensity scores ${\widehat{g}}_{i}$ for each $i$, and (ii) use ${\widehat{g}}_{i}$ as the inferrable part of the confounding. Precisely, we simulate propensity scores as $\operatorname{logit}{g}_{\text{sim}}=(1-p)\operatorname{logit}{\widehat{g}}_{i}+p{\xi}_{i}$, with ${\xi}_{i}\stackrel{iid}{\sim}\mathrm{N}(0,1)$. The outcome is simulated as above. When $p=0$, the simulation is fully inferrable and closely matches real data. Increasing $p$ allows us to study the effect of exogeneity; see Figure 2. As expected, the adjustment quality decays. Remarkably, the adjustment improves the naive estimate at all levels of exogeneity: the method is robust to violations of the theoretical assumptions.
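The mixing of inferrable and exogenous confounding on the logit scale can be sketched as below. The function names are hypothetical and the fitted propensities ${\widehat{g}}_{i}$ are passed in as a plain array; in the experiment they come from a BERT model trained to predict the treatment.

```python
import numpy as np

def logit(p):
    return np.log(p) - np.log1p(-p)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_propensity(g_hat, p, rng):
    """g_sim with logit(g_sim) = (1 - p) * logit(g_hat) + p * xi, xi ~ N(0, 1).

    p = 0 reproduces g_hat exactly (fully inferrable from text);
    p = 1 gives propensities driven only by the exogenous noise xi.
    """
    g_hat = np.asarray(g_hat, dtype=float)
    xi = rng.normal(0.0, 1.0, size=g_hat.shape)
    return sigmoid((1.0 - p) * logit(g_hat) + p * xi)

g_hat = np.array([0.1, 0.5, 0.9])  # hypothetical fitted propensities
g_sim = simulate_propensity(g_hat, p=0.3, rng=np.random.default_rng(0))
```

Simulated treatments could then be drawn as Bernoulli variables with these probabilities, with the outcome model left unchanged.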
|  | buzzy | theorem |
| --- | --- | --- |
| Unadjusted | $0.08\pm 0.01$ | $0.21\pm 0.01$ |
| ${\widehat{\psi}}^{Q}$ | $0.01\pm 0.03$ | $0.03\pm 0.03$ |
| ${\widehat{\psi}}^{\mathrm{TMLE}}$ | $0.06\pm 0.04$ | $0.10\pm 0.03$ |

|  | okcupid | childfree | keto |
| --- | --- | --- | --- |
| Unadjusted | $0.18\pm 0.01$ | $0.19\pm 0.01$ | $0.00\pm 0.00$ |
| ${\widehat{\beta}}^{\mathrm{plugin}}$ | $0.10\pm 0.04$ | $0.10\pm 0.04$ | $0.03\pm 0.02$ |
| ${\widehat{\beta}}^{\mathrm{TMLE}}$ | $0.15\pm 0.05$ | $0.16\pm 0.05$ | $0.01\pm 0.00$ |
Application. We apply Causal BERT to estimate the treatment effect of buzzy and theorem, and the effect of gender on log-score in each subreddit; see Tables 3 and 4. Although unadjusted estimates suggest strong effects, our results show this is in large part explainable by confounding or mediation. On PeerRead, the TMLE estimate ${\widehat{\psi}}^{\mathrm{TMLE}}$ suggests a positive effect from including a theorem on paper acceptance, but the $Q$-only estimator does not. On Reddit, both estimates suggest a positive effect from labeling a post as female on its score in okcupid and childfree.