Abstract
We propose to metalearn causal structures based on how fast a learner adaptsto new distributions arising from sparse distributional changes, e.g. due tointerventions, actions of agents and other sources of nonstationarities. Weshow that under this assumption, the correct causal structural choices lead tofaster adaptation to modified distributions because the changes areconcentrated in one or just a few mechanisms when the learned knowledge ismodularized appropriately. This leads to sparse expected gradients and a lowereffective number of degrees of freedom needing to be relearned while adaptingto the change. It motivates using the speed of adaptation to a modifieddistribution as a metalearning objective. We demonstrate how this can be usedto determine the causeeffect relationship between two observed variables. Thedistributional changes do not need to correspond to standard interventions(clamping a variable), and the learner has no direct knowledge of theseinterventions. We show that causal structures can be parameterized viacontinuous variables and learned endtoend. We then explore how these ideascould be used to also learn an encoder that would map lowlevel observedvariables to unobserved causal variables leading to faster adaptationoutofdistribution, learning a representation space where one can satisfy theassumptions of independent mechanisms and of small and sparse changes in thesemechanisms due to actions and nonstationarities.
Quick Read (beta)
A MetaTransfer Objective for Learning to Disentangle Causal Mechanisms
Abstract
We propose to metalearn causal structures based on how fast a learner adapts to new distributions arising from sparse distributional changes, e.g. due to interventions, actions of agents and other sources of nonstationarities. We show that under this assumption, the correct causal structural choices lead to faster adaptation to modified distributions because the changes are concentrated in one or just a few mechanisms when the learned knowledge is modularized appropriately. This leads to sparse expected gradients and a lower effective number of degrees of freedom needing to be relearned while adapting to the change. It motivates using the speed of adaptation to a modified distribution as a metalearning objective. We demonstrate how this can be used to determine the causeeffect relationship between two observed variables. The distributional changes do not need to correspond to standard interventions (clamping a variable), and the learner has no direct knowledge of these interventions. We show that causal structures can be parameterized via continuous variables and learned endtoend. We then explore how these ideas could be used to also learn an encoder that would map lowlevel observed variables to unobserved causal variables leading to faster adaptation outofdistribution, learning a representation space where one can satisfy the assumptions of independent mechanisms and of small and sparse changes in these mechanisms due to actions and nonstationarities.
Olexa Bilaniuk${}^{\mathrm{1}}$, Anirudh Goyal ${}^{\mathrm{1}}$ and Christopher Pal${}^{\mathrm{3}\mathrm{,}\mathrm{5}}$
Mila, Montréal, Québec, Canada
${}^{\mathrm{1}}$ Université de Montréal
${}^{\mathrm{2}}$ CIFAR Senior Fellow
${}^{\mathrm{3}}$ École Polytechnique Montréal
${}^{\mathrm{4}}$ RuprechtKarlsUniversität Heidelberg
${}^{\mathrm{5}}$ Canada CIFAR AI Chair
1 Introduction
Current machine learning methods seem weak when they are required to generalize beyond the training distribution, which is what is often needed in practice. It is not enough to obtain good generalization on a test set sampled from the same distribution as the training data, we would also like what has been learned in one setting to generalize well in other related distributions. These distributions may involve the same concepts that were seen previously by the learner, with the changes typically arising because of actions of agents. More generally, we would like what has been learned previously to form a rich base from which very fast adaptation to a new but related distribution can take place, i.e., obtain good transfer. Some new concept may have to be learned but because most of the other relevant concepts have already been captured by the learner (as well as how they can be composed), learning can be very fast on the transfer distribution.
Short of any assumption, it is impossible to have a successful transfer to an unrelated distribution. In this paper we focus on the assumption that the changes are sparse when the knowledge is represented in an appropriately modularized way, with only one or a few of the modules having changed. This is especially relevant when the distributional change is due to actions by one or more agents, such as the interventions discussed in the causality literature (Pearl, 2009; Peters et al., 2017), where a single causal variable is clamped to a particular value. In general, it is difficult for agents to influence many underlying causal variables at a time, and although this paper is not about agent learning as such, this is a property of the world that we propose to exploit here, to help discovering these variables and how they are causally related to each other.
To motivate the need for inferring causal structure, consider that interventions may be actually performed or may be imagined. In order to properly plan in a way that takes into account interventions, one needs to imagine a possible change to the joint distribution of the variables of interest due to an intervention, even one that has never been observed before. This goes beyond good transfer learning and requires causal learning and causal reasoning. For this purpose, it is not sufficient to learn the joint distribution of the observed variables. One also should learn enough about the underlying highlevel variables and their causal relations to be able to properly infer the effect of an intervention. For example, $A$=Raining causes $B$=Open Umbrella (and not viceversa). Changing the marginal probability of Raining (say because the weather changed) does not change the mechanism that relates $A$ and $B$ (captured by $P(BA)$), but will have an impact on the marginal $P(B)$. Conversely, an agent’s intervention on $B$ (Open umbrella) will have no effect on the marginal distribution of $A$ (Raining). That asymmetry is generally not visible from the $(A,B)$ training pairs alone, until a change of distribution occurs, e.g. due to an intervention. This motivates the setup of this paper, where one learns from a set of distributions arising from not necessarily known interventions, not simply to capture a joint distribution but to discover the some underlying causal structure.
Machine learning methods are often exploiting some form of assumption about the data distribution (or else, the no free lunch theorem tells us that we cannot have any confidence in generalization). In this paper, we are considering not just assumptions on the data distribution but also on how it changes (e.g., when going from a training distribution to a transfer distribution, possibly resulting from some agent’s actions). We propose to rely on the assumption that, when the knowledge about the distribution is appropriately represented, these changes would be small. This arises because of an underlying assumption (but more difficult to verify directly) that only one or few of the ground truth mechanisms have been changed, due to some generalized form of intervention leading to the modified distribution.
How can we exploit this assumption? As we explain theoretically and verify experimentally here, if we have the right knowledge representation, then we should get fast adaptation to the transfer distribution when starting from a model that is well trained on the training distribution. This arises because of our assumption that the ground truth data generative process is obtained as the composition of independent mechanisms and that, very few ground truth mechanisms and parameters need to change when going from the training distribution to the transfer distribution. A model capturing a corresponding factorization of knowledge would thus require just a few updates, a few examples, for this adaptation to the transfer distribution. As shown below, the expected gradient on the unchanged parameters would be near 0 (if the model was already well trained on the training distribution), so the effective search space during adaptation to the transfer distribution would be greatly reduced, which tends to produce fast adaptation, as found experimentally.
Thus, based on the assumption of small change in the right knowledge representation space, we can define a metalearning objective that measures the speed of adaptation, i.e., a form of regret, in order to optimize the way in which knowledge should be represented, factorized and structured. This is the core idea presented in this paper. Note that a stronger signal can be obtained when there are more nonstationarities, i.e., many changes in distribution, just like in metalearning we get better results with more metaexamples.
In this way, we can take what is normally considered a nuisance in machine learning (changes in distribution due to nonstationarity, uncontrolled interventions, etc.) and turn that into a training signal to find a good way to factorize knowledge into components and mechanisms that match the assumption of small change. Thus, we end up optimizing in an endtoend way the very thing we care about at the end, i.e. fast transfer and robustness to distributional changes. If the data was really generated from the composition of independent causal mechanisms (Peters et al., 2017), then there exists a good factorization of knowledge that mimics that structure. If in addition, at each time step, agents in the real world tend to only be able to change one or very few highlevel variables (or the associated mechanisms producing them), then our assumption of small change (in the right representation) should be generally valid. Also, in addition to obtaining fast transfer, we may be able to recover a good approximation of the true causal decomposition into independent mechanisms (to the extent that the observations and interventions can reveal those mechanisms).
In this paper, we begin exploring the above ideas with specific experiments on synthetically generated data in order to validate them and demonstrate the existence of simple algorithms to exploit them. However it is clear to us that much more work will be needed to evaluate the proposed approach in a diversity of settings and with different specific parametrizations, training objectives, environments, etc. We begin with what are maybe the simplest possible settings and evaluate whether the above approach can be used to learn the direction of causality. We then study the crucial question of obtaining a training signal about how to transform raw observed data into a representation space where the latent variables can be modeled by a sparse causal graph with sparse distributional changes and show results that confirm that the correct encoder leads to a better value of our expected regret metalearning objective.
2 Which is Cause and Which is Effect?
To anchor ideas and show an example of application of the aboveproposed metaobjective for knowledge decomposition, we consider in this section the problem of determining if variable $A$ causes variable $B$ or viceversa. The learner observes training samples $(a,b)$ from a pair of related distributions, which by convention we call the training distribution and the transfer distribution. Note that based only on samples from a single (training) distribution, in general both the $A\to B$ model ($A$ causes $B$) and the $B\to A$ model (viceversa, see Equation (2.1) below) tend to perform as well in terms of ordinary generalization (to a test set sampled from the training distribution), see also a theoretical argument and simulation results in Appendix A. To highlight the power of the proposed metalearning objective, we consider the situation where lots of examples are available for the training distribution but very few for the transfer distribution. In fact, as we will argue below, the training signal that will allow us to infer the correct causal direction will be stronger if we have access to many short transfer adaptation episodes. Short episodes are most informative because after having seen a lot of data from the transfer distribution, it will not matter much whether $A$ causes $B$ or viceversa (when there is enough training data compared to the number of free parameters, both models converge towards an optimal estimation of the joint). However, in order to generalize quickly from very few examples of the transfer distribution, it does matter to have made the correct choice of the causal direction. Let us now justify this in more detail below and then demonstrate this by simulations.
2.1 Learning a Causal Graph with two Discrete Variables
Let both $A$ and $B$ be discrete variables each taking $N$ possible values and consider the following two parametrizations (the $A\to B$ model and the $B\to A$ model) to estimate their joint distribution:
${P}_{A\to B}(A,B)$  $={P}_{A\to B}(A){P}_{A\to B}(B\mid A)$  
${P}_{B\to A}(A,B)$  $={P}_{B\to A}(B){P}_{B\to A}(A\mid B)$  (1) 
Each of these two graphical models (denoted $A\to B$ and $B\to A$) decomposes the joint into two separately parametrized modules, each corresponding to a different causal mechanism associated with the probability of a variable given its parents in the graph. This amounts to four modules: ${P}_{A\to B}(A)$, ${P}_{A\to B}(B\mid A)$, ${P}_{B\to A}(B)$ and ${P}_{B\to A}(A\mid B)$. We will train both models independently. Since we assume in this section that the pairs $(A,B)$ are completely observed, we can use a simple maximum likelihood estimator to independently train all four modules (the loglikelihood of the joint decomposes into separate objective functions, one for each conditional, in a directed graphical model with fully observed variables). In the discrete case with tabular parametrization, the maximum likelihood estimator can be computed analytically, and corresponds to the appropriately normalized relative frequencies. Let $\theta $ denote the parameters of all these models, split into subvectors for each module, e.g., ${\theta}_{AB}$ for the ${N}^{2}$ conditional probabilities for each possible value of $B$ and each possible value of $A$. In our experiments, we parametrized these probabilities via softmax of unnormalized quantities.
2.1.1 The Advantage of the Correct Causal Model
First, let us consider simply the likelihood of the training data only (i.e., no change of distribution) for the different causal models considered. Both models have $O({N}^{2})$ parameters, and maximum likelihood estimation leads to indistinguishable test set performance (where the test set is sampled from the training distribution). See Appendix A for a demonstration that both models would have the same likelihood, and associated experimental results. These results are not surprising in light of the existing literature on nonidentifiability of causality from observations (Pearl, 2009; Peters et al., 2017), but they highlight the importance of using changes in distribution to provide a signal about the causal structure.
Now instead let us compare the performance of our two hypotheses ($A\to B$ vs $B\to A$) in terms of how fast the two models adapt on a transfer distribution after having been trained on the training distribution. We will assume simple stochastic gradient descent on the parameters for this adaptation but other procedures could be used, of course. Without loss of generality, let $A\to B$ be the correct causal model. To make the case stronger, let us consider that the change between the two distributions amounts to a random change in the parameters of the true $P(A)$ for the cause $A$ (because this will have an impact on the effect $B$, which can be picked up and reveal the causal direction). We do not assume that the learner knows what intervention was performed, unlike in more common approaches to causal discovery and controlled experiments. We only assume that some change happened and we try to exploit that to reveal structural causal information.
2.2 Experiments on Adaptation to the transfer distribution
We present experiments comparing the learning curve of the correct causal model on the transfer distribution vs the learning curve of the incorrect model. The adaptation with only a few gradient steps on data coming from a different, but related, transfer distribution is critical in getting a signal that can be leveraged by our metalearning algorithm. To show the effect of this adaptation, and motivate our use of only a small amount of data from the transfer distribution, we experimented with a model on discrete random variables taking $N=10$ possible values.
In this experiment, we fixed the underlying causal model to be $A\to B$, and trained the modules for each marginal and conditional distributions with maximum likelihood on a large amount of data from some training distribution, as explained in Appendix A. See also Appendix G.1 and Table G.1 for details on the definitions of these modules.
We then adapt all the modules on data coming from a transfer distribution, corresponding on an intervention on the random variable $A$ (i.e., the marginal $P(A)$ of the ground truth model is modified, while leaving $P(B\mid A)$ fixed). We used RMSprop for the adaptation, with the same learning rate. For assessing reproducibility and statistical robustness, the experiment was repeated over 100 different training distributions, and over 100 transfer distributions for each training distributions, leading to 10 000 experiments overall. The procedure to acquire different training/transfer distributions is details in Appendix G.1.
In Figure 1, we report the loglikelihoods of both models, evaluated on a large test set of 10 000 from the transfer distribution. We can see that as the number of examples from the transfer distribution (equal to the number of adaptation steps) increases, the two models eventually reach the same loglikelihood, reflecting our observation from Appendix A. However the causal model $A\to B$ adapts faster than the other model $B\to A$, with the most informative part of the trajectory (where the difference is the largest) is within the first 10 to 20 examples.
2.2.1 Parameter Counting Argument
A simple parameter counting arguments helps us understand what we are observing in Figure 1. First, consider the expected gradient on the parameters of the different modules, during the adaptation phase to the transfer distribution, which we designate as adaptation episode, and corresponds to learning from a metaexample.
Proposition 1
The expected gradient over the transfer distribution of the regret (accumulated negative loglikelihood during the adaptation episode) with respect to the module parameters is zero for the parameters of the modules that (a) were correctly learned in the training phase, and (b) have the correct set of causal parents, corresponding to the ground truth causal graph, if (c) the corresponding ground truth conditional distributions did not change from the training distribution to the transfer distribution.
The proof is given in Appendix B. The basic justification for this proposition is that for the modules that were correctly learned in the training distribution and whose ground truth conditional distribution did not change with the transfer distribution, the parameters already are at a maximum of the loglikelihood over the transfer distribution, so the expected gradient is zero.
As a consequence, the effective number of parameters that need to be adapted, when one has the correct causal graph structure, is reduced to those of the mechanisms that actually changed from the training to the transfer distribution. Since sample complexity  the number of training examples necessary to learn a model  grows approximately linearly (Ehrenfeucht et al., 1989) with VCdimension (Vapnik and Chervonenkis, 1971), and since VCdimension grows approximately linearly in the number of parameters in linear models and neural networks ShalevShwartz and BenDavid (2014), the learning curve on the transfer distribution will tend to improve faster for the model with the correct causal structure, for which fewer parameters need to be changed. Interestingly, we do not need to have the whole causal graph correctly specified before getting benefits from this phenomenon. If we only have part of the causal graph correctly specified and we change our causal hypothesis to include one more correctly specified mechanism, then we will obtain a gain in terms of the adaptation sample complexity (which shows up when the change in distribution does not touch that mechanism). This nice property also shows up in Proposition 1 (Appendix F), showing a decoupling of the metaobjective across the independent mechanisms.
Let us consider the special case we have been studying up to now. We have four modules, two of which (${P}_{A\to B}(A)$ and ${P}_{B\to A}(B)$) are marginal discrete distributions over $N$ values, which require each $N1$ free parameters. The other two modules are conditional probability tables that have $N$ rows each with $N1$ free parameters, i.e., a total of $N(N1)$ free parameters. If $A\to B$ is the correct model and the transfer distribution only changed the true $P(A)$ (the cause), and if $P(B\mid A)$ had been correctly estimated on the training distribution, then for the correct model only $N1$ parameters need to be reestimated. On the other hand, because of Bayes’ rule, under the incorrect model ($B\to A$), a change in $P(A)$ leads to new parameters for both $P(B)$ and $P(A\mid B)$, i.e., all $N(N1)+(N1)={N}^{2}1$ parameters must be reestimated. In this case we see that sample complexity may be $O({N}^{2})$ for the incorrect model while it would be $O(N)$ for the correct model (assuming linear relationship between sample complexity and number of free parameters). Of course, if the change in distribution had been over $P(B\mid A)$ instead of $P(A)$, the advantage would not have been as great. This would motivate information gathering actions generally resulting in a very sparse change in the mechanisms.
2.3 Smooth parameterization of the causal structure
In the more general case with many more than two hypotheses for the structure of the causal graph, there will be an exponentially large set of possible causal structures explaining the data and we won’t be able to enumerate all of them (and pick the best one after observing episodes of adaptation). However, we can parameterize our belief about an exponentially large set of hypotheses by keeping track of the probability for each directed edge of the graph to be present, i.e., specify for each variable $B$ whether some variable $A$ is a direct causal parent of $B$ (for all pairs $(A,B)$ in the graph). We will develop such a smooth parametrization further in Appendix F, but it hinges on gradually changing our belief in the individual binary decisions associated with each edge of the causal graph, so we can jointly do gradient descent on all these beliefs at the same time.
In this section, we study the simplest possible version of this idea, representing that edge belief via a structural parameter $\gamma $ with $\mathrm{sigmoid}(\gamma )=\mathrm{sigmoid}(\gamma )$, our believed probability that $A\to B$ is the correct choice. For that single pair of variables scenario, let us consider two explanations for the data (as in the above sections, for models $A\to B$ and $B\to A$), one with probability $p(A\to B)=\mathrm{sigmoid}(\gamma )$ and the other with probability $p(B\to A)=1\mathrm{sigmoid}(\gamma )$. We can write down our transfer objective as a loglikelihood over the mixture of these two models. Note this is different from the usual mixture models, which assume separately for each example that it was sampled from one component or another with some probability. Here, we assume that all of the observed data was sampled from one component or the other. The transfer data regret (negative loglikelihood accumulated along the online adaptation trajectory) under that mixture is therefore as follows:
$$\mathcal{R}=\mathrm{log}\left[\mathrm{sigmoid}(\gamma ){\mathcal{L}}_{A\to B}+(1\mathrm{sigmoid}(\gamma )){\mathcal{L}}_{B\to A}\right]$$  (2) 
where ${\mathcal{L}}_{A\to B}$ and ${\mathcal{L}}_{B\to A}$ are the online likelihoods of both models respectively on the transfer data. They are defined as
${\mathcal{L}}_{A\to B}$  $={\displaystyle \prod _{t=1}^{T}}{P}_{A\to B}({a}_{t},{b}_{t};{\theta}_{t})$  
${\mathcal{L}}_{B\to A}$  $={\displaystyle \prod _{t=1}^{T}}{P}_{B\to A}({a}_{t},{b}_{t};{\theta}_{t}),$ 
where ${\{({a}_{t},{b}_{t})\}}_{t}$ is the set of transfer examples for a given episode and ${\theta}_{t}$ aggregates all the modules’ parameters as of time step $t$ (since the parameters could be updated after each observation of an example $({a}_{t},{b}_{t})$ from the transfer distribution). ${P}_{\mathrm{model}}(a,b;\theta )$ is the likelihood of example $(a,b)$ under some model that has parameters $\theta $.
The quantity of interest here is $\frac{\partial \mathcal{R}}{\partial \gamma}$, which is our training signal for updating $\gamma $. In the experiments below, after each episode involving $T$ transfer examples we update $\gamma $ by doing one step of gradient descent, to reduce the transfer negative loglikelihood or regret $\mathcal{R}$. What we are proposing is a metalearning framework in which the inner training loop updates the module parameters (separately) as examples are seen (from either distribution being currently observed), while the outer loop updates the structural parameters (here it is only the scalar $\gamma $) with respect to the transfer negative loglikelihood.
The gradient of the transfer loglikelihood with respect to the structural parameter $\gamma $ is pushing $\mathrm{sigmoid}(\gamma )$ towards the posterior probability that the correct model is $A\to B$ and $(1\mathrm{sigmoid}(\gamma ))$ towards the posterior probability that the correct model is $B\to A$:
Proposition 2
The gradient of the negative loglikelihood of the transfer data in Equation (2) wrt. the structural parameter $\frac{\mathrm{\partial}\mathit{}\mathrm{R}}{\mathrm{\partial}\mathit{}\gamma}$ is given by
$$\frac{\partial \mathcal{R}}{\partial \gamma}=\sigma (\gamma )P(A\to B\mid {D}_{2}),$$  (3) 
where ${D}_{\mathrm{2}}$ is the transfer data, and $P\mathrm{(}A\mathrm{\to}B\mathrm{\mid}{D}_{\mathrm{2}}\mathrm{)}$ is the posterior probability of the hypothesis $A\mathrm{\to}B$ (when the alternative is $B\mathrm{\to}A$). Furthermore, this can be equivalently written as
$$\frac{\partial \mathcal{R}}{\partial \gamma}=\sigma (\gamma )\sigma (\gamma +\mathrm{\Delta}),$$  (4) 
where $\mathrm{\Delta}\mathrm{=}\mathrm{log}\mathit{}{\mathrm{L}}_{A\mathrm{\to}B}\mathrm{}\mathrm{log}\mathit{}{\mathrm{L}}_{B\mathrm{\to}A}$ is the difference between the loglikelihoods of the two hypotheses on the transfer data ${D}_{\mathrm{2}}$.
The proof is given in Appendix D. Note how this posterior probability is basically measuring which hypothesis is better explaining the episode transfer data ${D}_{2}$ overall along the adaptation trajectory. ${D}_{2}$ is a metaexample for updating the structural parameters like $\gamma $. Larger $\mathrm{\Delta}$ of one hypothesis over the other leads to moving metaparameters faster towards the favoured hypothesis. This difference in online accumulated loglikelihoods $\mathrm{\Delta}$ also relates to loglikelihood scores in scorebased methods for structure learning of graphical models (Koller and Friedman, 2009)^{1}^{1} 1 One can see $\mathrm{log}{\mathcal{L}}_{A\to B}$ as a score attributed to graph $A\to B$, analogously for $\mathrm{log}{\mathcal{L}}_{B\to A}$. The gradient is then pushing toward the graph with the highest score..
To find where SGD converges, note that the actual posterior depends on the prior $\mathrm{sigmoid}(\gamma )$ and thus keeps changing after each gradient step. We are really doing SGD on the expected value of $\mathcal{R}$ over transfer sets ${D}_{2}$. Equating the gradient of this expected value to zero to look for the stationary convergence point, we thus see $\mathrm{sigmoid}(\gamma )$ on both sides of the equation, and we obtain convergence when the new value of $\mathrm{sigmoid}(\gamma )$ is consistent with the old value, as clarified in this proposition.
Proposition 3
Stochastic gradient descent (with appropriately decreasing learning rate) on ${E}_{{D}_{\mathrm{2}}}\mathit{}\mathrm{[}\mathrm{R}\mathrm{]}$ with steps from $\frac{\mathrm{\partial}\mathit{}\mathrm{R}}{\mathrm{\partial}\mathit{}\gamma}$ converges towards $\mathrm{sigmoid}\mathit{}\mathrm{(}\gamma \mathrm{)}\mathrm{=}\mathrm{1}$ if ${E}_{{D}_{\mathrm{2}}}\mathit{}\mathrm{[}\mathrm{log}\mathit{}{\mathrm{L}}_{A\mathrm{\to}B}\mathrm{]}\mathrm{>}{E}_{{D}_{\mathrm{2}}}\mathit{}\mathrm{[}\mathrm{log}\mathit{}{\mathrm{L}}_{B\mathrm{\to}A}\mathrm{]}$, or $\sigma \mathit{}\mathrm{(}\gamma \mathrm{)}\mathrm{=}\mathrm{0}$ otherwise.
The proof is given in Section E of the Appendix, and shows that optimizing $\gamma $ will end up picking the correct hypothesis, i.e., the one that has the smallest regret (or fastest convergence), measured as the accumulated loglikelihood as adaptation proceeds on the transfer distributions sampled from the distribution ${D}_{2}$, which we can think of like a distribution over tasks, in metalearning. This analogy with metalearning also appears in our gradientbased adaptation procedure, which is linked to existing methods like the firstorder approximation of MAML (Finn et al., 2017), and its related algorithms (Nichol et al., 2018). Algorithm 1 (Appendix C) illustrates the general pseudocode for the proposed metalearning framework.
2.3.1 Experimental Results
To illustrate the convergence result from Proposition 3, we experiment with learning the structural parameter $\gamma $ in a bivariate model, with discrete random variables, each taking $N=10$ and $N=100$ possible values. In this experiment, we assume that the underlying causal model (unknown to the algorithm) is fixed to $A\to B$, so that we want the structural parameter to eventually converge to $\sigma (\gamma )=1$. The details of the experimental setup can be found in Appendix G.1.
In Figure 2, we show the evolution of $\sigma (\gamma )$ (which is the model’s belief of $A\to B$ being the correct causal model) as the number of episodes increases. Starting from an equal belief for both $A\to B$ and $B\to A$ to occur ($\sigma (\gamma )=0.5$), the structural parameter converges to $\sigma (\gamma )=1$ within $500$ episodes.
This observation is consistent across a range of domains, including models with multimodal or multivariate continuous variables, and different parametrizations of the models. In Appendix G.2, we present results for two discrete variables but using MLPs to parametrize the conditional distributions, and where there are more causal hypotheses: we consider one binary choice for each directed edge in the graph, to decide whether one variable is a direct causal parent or not. Figure 3 shows that the correct causal graph is quickly recovered. To estimate the gradient, we use a generalization of the regret loss (introduced above, Equation (2)) and its gradient, described in Appendix F.
In Appendix G.3, we consider the case of continuous scalar multimodal variables. The ground truth joint distribution is obtained by making the effect $B$ a nonlinear function $f(A)$ of cause $A$, where $f$ is a randomly generated spline. Figure G.1 shows an example of a resulting joint distribution. We model the conditionals with mixture density networks (Bishop, 1994) and the marginals by a Gaussian mixture. We obtain results that are similar to the discrete case, with the correct causal interpretation being recovered quickly, as illustrated in Figure G.2.
We also show in Appendix G.4 results on models with two continuous random variables, each being distributed as a multivariate Gaussian, with $N=10$ dimensions. Similar to the experiment with discrete random variables, the same argument about parameter counting mentioned in Section 2.2.1 holds here. Again, we obtain results consistent with the previous examples, where the structural parameter $\gamma $ converges to $1$, effectively recovering the correct causal model $A\to B$.
3 Representation Learning
So far, we have assumed that the system has unrestricted access to the true underlying causal variables, $A$ and $B$. However in many realistic scenarios for learning agents, the observations available to the learner might not be instances of the true causal variables but sensorylevel data instead, like pixels and sounds. If this is the case, our working assumption – that the correct causal graph will be sparsely connected, made of independent components, and affected sparsely by distributional shifts – can not be expected to hold true in general in the space of observed variables. To tackle this, we propose to follow the deep learning objective of disentangling the underlying causal variables (Bengio et al., 2013), and learn a representation in which these properties hold. In the simplest form of this setting, the learner must map its raw observations to a hidden representation space $H$ via an encoder $\mathcal{E}$. The encoder is trained such that the hidden space $H$ helps to optimize the metatransfer objective described above, i.e., we consider the encoder, along with $\gamma $, as part of the set of structural or metaparameters to be optimized with respect to the metatransfer objective.
To study this simplified setting, we consider that our raw observations $(X,Y)$ originate from the true causal variables $(A,B)$ via the action of a ground truth decoder $\mathcal{D}$ (or generator network) that the learner is not aware of but is implicitly trying to invert, as illustrated in Figure 4. The variables $A$, $B$, $X$ and $Y$ are assumed to be scalars, and we first consider $\mathcal{D}$ be a rotation matrix such that:
$$\left[\begin{array}{c}\hfill X\hfill \\ \hfill Y\hfill \end{array}\right]=R({\theta}_{\mathcal{D}})\left[\begin{array}{c}\hfill A\hfill \\ \hfill B\hfill \end{array}\right]$$  (5) 
The encoder is set to another rotation matrix, one that maps the observations $X,Y$ to the hidden representation $U,V$ as follows:
$$\left[\begin{array}{c}\hfill U\hfill \\ \hfill V\hfill \end{array}\right]=R({\theta}_{\mathcal{E}})\left[\begin{array}{c}\hfill X\hfill \\ \hfill Y\hfill \end{array}\right]$$  (6) 
The causal modules are now to be trained on the variables $U$ and $V$ in the same way as detailed in Section 2. Indeed, if the encoder is valid one would obtain either $(U,V)=(A,B)$ or $(U,V)=(B,A)$ up to a negative sign, but we say without loss of generality that $(U,V)=(A,B)$, corresponding to the solution ${\theta}_{\mathcal{E}}={\theta}_{\mathcal{D}}$. In this case, the model $U\to V$ is causal and should therefore have an advantage over the anticausal model $V\to U$, as far as adaptation speed on the transfer distribution is concerned. However, if the encoder is not valid, one would obtain superpositions of the form:
$U$  $=\mathrm{cos}(\theta )A\mathrm{sin}(\theta )B$  (7)  
$V$  $=\mathrm{sin}(\theta )A+\mathrm{cos}(\theta )B$  (8) 
where $\theta ={\theta}_{\mathcal{E}}+{\theta}_{\mathcal{D}}$. In the extremum where $\theta =\frac{\pi}{4}$, it is clear that the model $U\to V$ will not have an advantage over the model $V\to U$ in terms of regret on the transfer distribution. However, the question we are interested in is whether it is possible to learn the encoder ${\theta}_{\mathcal{E}}$. We verify this experimentally using Algorithm 1, but where the metaparameters are now both $\gamma $ (choosing between cause and effect which is which) and the parameters of the encoder (here the angle of a rotation matrix). The details of that experiment are provided in Appendix H, which illustrates – see Figure 5 – how the proposed objective can disentangle (here in a very simple setting) the ground truth variables (up to permutation).
4 Related Work
Although this paper focuses on the causal graph, the proposed objective is motivated by the more general question of discovering the underlying causal variables (and their dependencies) that explain the environment of a learner and make it possible for that learner to plan appropriately. The discovery of underlying explanatory variables has come under different names, in particular the notion of disentangling underlying variables (Bengio et al., 2013). As stated already by Bengio et al. (2013) and clearly demonstrated by Locatello et al. (2018), assumptions, priors or biases are necessary to identify the underlying explanatory variables. The latter paper (Locatello et al., 2018) also reviews and evaluates recent work on disentangling, and discusses different metrics that have been proposed. An extreme view of disentangling is that the explanatory variables should be marginally independent, and many deep generative models (Goodfellow et al., 2016) and independent component analysis models (Hyvärinen et al., 2001, 2018) are built on this assumption. However, the kinds of highlevel variables that we manipulate with natural language are not marginally independent: they are related to each other through statements that are usually expressed in sentences (e.g., a classical symbolic AI fact or rule), involving only a few concepts at a time. This kind of assumption has been proposed to help discover such linguistically relevant highlevel representations from raw observations, such as the consciousness prior (Bengio, 2017), with the idea that humans focus at any particular time on just a few concepts that are present to our consciousness. The work presented here could provide an interesting metalearning objective to help learn such encoders as well as figure out how the resulting variables are related to each other. In that case, one should distinguish two important assumptions: the first one is that the causal graph is sparse (has few edges, as in the consciousness prior (Bengio, 2017) and in some methods to learn Bayes net structure, e.g. (Schmidt et al., 2007)), corresponding to the consciousness prior; and the second one is that it changes sparsely due to interventions (which is the focus of this work).
Approaches for Bayesian network structure learning based on discrete search over model structures and simulated annealing are reviewed in Heckerman et al. (1995). There, it has been common to use Minimum Description Length (MDL) principles to score and search over models Lam and Bacchus (1993); Friedman and Goldszmidt (1998), or the Bayesian Information Criterion (BIC) to search for models with high relative posterior probability Heckerman et al. (1995). Prior work such as Heckerman et al. (1995) has also relied upon purely observational data, without the possibility of interventions and therefore focused on learning likelihood or hypothesis equivalence classes for network structures. Since then, numerous methods have also been devised to infer the causal direction from purely observational data (Peters et al., 2017), based on specific, generally parametric assumptions, on the underlying causal graph. Pearl’s seminal work on docalculus Pearl (1995, 2009); Bareinboim and Pearl (2016) lays a foundation for expressing the impact of interventions on probabilistic graphical models – we use it in our work. In contrast, here we are proposing a metalearning objective function for learning causal structure, not requiring any specific constraints on causal graph structure, only on the sparsity of the changes in distribution in the correct causal graph parametrization.
Our work is also related to other recent advances in causation, domain adaptation and transfer learning. Magliacane et al. (2018) have sought to identify a subset of features that lead to the best predictions for a variable of interest in a source domain such that the conditional distribution of the variable of interest given these features is the same in the target domain. Johansson et al. (2016) examine counterfactual inference and formulate it as a domain adaptation problem. Shalit et al. (2017) propose a technique called counterfactual regression for estimating individual treatment effects from observational data. RojasCarulla et al. (2018) propose a method to find an optimal subset that makes the target independent from the selection variables. To do so, they make the assumption that if the conditional distribution of the target given some subset is invariant across different source domains, then this conditional distribution must also be the same in the target domain. Parascandolo et al. (2017) propose an algorithm to recover a set of independent causal mechanisms by establishing competition between mechanisms, hence driving specialization. More recently, Dasgupta et al. (2019) adopted a metalearning approach to draw causal inferences from purely observational data.
5 Conclusion and Future Work
We have established in very simple bivariate settings that the rate at which a learner adapts to sparse changes in the distribution of observed data can be exploited to select or optimize causal structure and disentangle the causal variables. This relies on the assumption that with the correct causal structure, those distributional changes are localized and sparse. We have demonstrated these ideas through theoretical results as well as experimental validation. See https://github.com/ec6dde01667145e58de60f864e05a4/CausalOptimizationAnon for source code of the experiments.
This work is only a first step in the direction of optimizing causal structure based on the speed of adaptation to modified distributions. On the experimental side, many settings other than those studied here should be considered, with different kinds of parametrizations, richer and larger causal graphs, different kinds of optimization procedures, etc. Also, much more needs to be done in exploring how the proposed ideas can be used to learn good representations in which the causal variables are disentangled, since we have only experimented at this point with the simplest possible encoder with a single degree of freedom. Scaling up these ideas would permit their application towards improving the way in which learning agents deal with nonstationarities, and thus improving sample complexity and robustness of learning agents.
References
 Bareinboim and Pearl (2016) Elias Bareinboim and Judea Pearl. Causal inference and the datafusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352, 2016.
 Bengio (2017) Yoshua Bengio. The consciousness prior. arXiv preprint arXiv:1709.08568, 2017.
 Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 35(8):1798–1828, 2013.
 Bishop (1994) Christopher M Bishop. Mixture density networks. Technical report, Citeseer, 1994.
 Dasgupta et al. (2019) Ishita Dasgupta, Jane Wang, Silvia Chiappa, Jovana Mitrovic, Pedro Ortega, David Raposo, Edward Hughes, Peter Battaglia, Matthew Botvinick, and Zeb KurthNelson. Causal reasoning from metareinforcement learning, 2019. URL https://openreview.net/forum?id=H1ltQ3R9KQ.
 Ehrenfeucht et al. (1989) Andrzej Ehrenfeucht, David Haussler, Michael Kearns, and Leslie Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3), 1989.
 Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. ModelAgnostic MetaLearning for Fast Adaptation of Deep Networks. International Conference on Machine Learning (ICML), 2017.
 Friedman and Goldszmidt (1998) Nir Friedman and Moises Goldszmidt. Learning bayesian networks with local structure. In Learning in graphical models, pages 421–459. Springer, 1998.
 Goodfellow et al. (2016) Ian J. Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. URL http://deeplearningbook.org.
 Heckerman et al. (1995) David Heckerman, Dan Geiger, and David M Chickering. Learning bayesian networks: The combination of knowledge and statistical data. Machine learning, 20(3):197–243, 1995.
 Hyvärinen et al. (2001) Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent Component Analysis. WileyInterscience, May 2001. ISBN 047140540X.
 Hyvärinen et al. (2018) Aapo Hyvärinen, Hiroaki Sasaki, and Richard E. Turner. Nonlinear ica using auxiliary variables and generalized contrastive learning. CoRR, arXiv:1805.08651, 2018.
 Johansson et al. (2016) Fredrik Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. In International Conference on Machine Learning, pages 3020–3029, 2016.
 Koller and Friedman (2009) Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT press, 2009.
 Lam and Bacchus (1993) Wai Lam and Fahiem Bacchus. Using causal information and local measures to learn bayesian networks. In Proceedings of the Ninth international conference on Uncertainty in artificial intelligence, pages 243–250. Morgan Kaufmann Publishers Inc., 1993.
 Locatello et al. (2018) Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. CoRR, arXiv:1811.12359, 2018.
 Magliacane et al. (2018) Sara Magliacane, Thijs van Ommen, Tom Claassen, Stephan Bongers, Philip Versteeg, and Joris M Mooij. Domain adaptation by using causal inference to predict invariant conditional distributions. In Advances in Neural Information Processing Systems, pages 10869–10879, 2018.
 Nichol et al. (2018) Alex Nichol, Joshua Achiam, and John Schulman. On FirstOrder MetaLearning Algorithms. arXiv:1803.02999, 2018.
 Parascandolo et al. (2017) Giambattista Parascandolo, Niki Kilbertus, Mateo RojasCarulla, and Bernhard Schölkopf. Learning independent causal mechanisms. arXiv preprint arXiv:1712.00961, 2017.
 Pearl (1995) Judea Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.
 Pearl (2009) Judea Pearl. Causality. Cambridge university press, 2009.
 Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. MIT press, 2017.
 RojasCarulla et al. (2018) Mateo RojasCarulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. The Journal of Machine Learning Research, 19(1):1309–1342, 2018.
 Schmidt et al. (2007) Mark W. Schmidt, Alexandru NiculescuMizil, and Kevin P. Murphy. Learning graphical model structure using l1regularization paths. In AAAI, 2007.
 ShalevShwartz and BenDavid (2014) Shai ShalevShwartz and Shai BenDavid. Understanding Machine Learning  from Theory to Algorithms. Cambridge University Press, 2014.
 Shalit et al. (2017) Uri Shalit, Fredrik D Johansson, and David Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 3076–3085. JMLR. org, 2017.
 Vapnik and Chervonenkis (1971) V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971.
A Results on NonIdentifiability of Causal Structure
We show here that the maximum likelihood estimation of both models specified in Equation (2.1) yields the same estimated distribution over $A$ and $B$, i.e., the joint likelihood on the training distribution is not sufficient to distinguish the $A\to B$ and $B\to A$ causal models, in the nonparametric case (no assumption at all on the family of distributions). Let
${\theta}_{i}$  $={P}_{A\to B}(A=i)$  ${\theta}_{ji}$  $={P}_{A\to B}(B=j\mid A=i)$  
${\eta}_{j}$  $={P}_{B\to A}(B=j)$  ${\eta}_{ij}$  $={P}_{B\to A}(A=i\mid B=j).$ 
We now state the maximum likelihood estimators for each models:
${\widehat{\theta}}_{i}$  $={n}_{i}/n$  ${\widehat{\theta}}_{ji}$  $={n}_{ij}/{n}_{i}$  
${\widehat{\eta}}_{j}$  $={n}_{j}/n$  ${\widehat{\eta}}_{ij}$  $={n}_{ij}/{n}_{j}$  (9) 
where $n$ is the total number of observations, ${n}_{i}$ the number of times we observed $A=i$, ${n}_{j}$ the number of times we observed $B=j$ and ${n}_{ij}$ the number of times we observed $A=i$ and $B=j$ jointly. We can now compute the likelihood for each model:
${\widehat{P}}_{A\to B}(A,B)$  $={\widehat{\theta}}_{i}{\widehat{\theta}}_{ji}={n}_{ij}/n$  
${\widehat{P}}_{B\to A}(A,B)$  $={\widehat{\eta}}_{j}{\widehat{\eta}}_{ij}={n}_{ij}/n$  (10) 
which is what we intended to show. To illustrate this result, we also experiment with learning the modules for both models $A\to B$ and $B\to A$ with SGD. In Figure A.1, we show the difference in loglikelihoods between these two models, evaluated on training and test data sampled from the same distribution, during training. We can see that while the model $A\to B$ fits the data faster than the other model (corresponding to a positive difference in Figure A.1), both models achieve the same loglikelihoods on both models at convergence. This shows that the two models are indistinguishable based on data sampled from the same distribution, even on test data.
B Proof of the ZeroGradient Proposition
Let us restate more formally and prove Proposition 1.
Proposition 0
Consider conditional probability modules ${P}_{{\theta}_{i}}\mathit{}\mathrm{(}{V}_{i}\mathrm{}\mathrm{pa}\mathit{}\mathrm{(}i\mathrm{,}V\mathrm{,}{B}_{i}\mathrm{)}\mathrm{)}$ where ${B}_{i\mathit{}j}\mathrm{=}\mathrm{1}$ indicates that ${V}_{j}$ is among the parents $\mathrm{pa}\mathit{}\mathrm{(}i\mathrm{,}V\mathrm{,}{B}_{i}\mathrm{)}$ of ${V}_{i}$ in a directed acyclic causal graph. Consider ground truth training distribution ${P}_{\mathrm{1}}$ and transfer distribution ${P}_{\mathrm{2}}$ over these variables, and ground truth causal structure $B$. The joint loglikelihood $\mathrm{L}\mathit{}\mathrm{(}V\mathrm{)}$ for a sample $V$ with respect to the module parameters $\theta $ decomposed into module parameters ${\theta}_{i}$ is $\mathrm{L}\mathit{}\mathrm{(}V\mathrm{)}\mathrm{=}{\mathrm{\sum}}_{i}\mathrm{log}\mathit{}{P}_{{\theta}_{i}}\mathit{}\mathrm{(}{V}_{i}\mathrm{}\mathrm{pa}\mathit{}\mathrm{(}i\mathrm{,}V\mathrm{,}{B}_{i}\mathrm{)}\mathrm{)}$. If (a) a model has the correct causal structure $B$, and (b) it been trained perfectly on ${P}_{\mathrm{1}}$, leading to estimated parameters $\theta $, and (c) the ground truth ${P}_{\mathrm{1}}$ and ${P}_{\mathrm{2}}$ only differ from each other only for some $P\mathit{}\mathrm{(}{V}_{i}\mathrm{}\mathrm{pa}\mathit{}\mathrm{(}i\mathrm{,}V\mathrm{,}{B}_{i}\mathrm{)}\mathrm{)}$ for $i\mathrm{\in}C$, then ${E}_{V\mathrm{\sim}{P}_{\mathrm{2}}}\mathit{}\mathrm{[}\frac{\mathrm{\partial}\mathit{}\mathrm{L}\mathit{}\mathrm{(}V\mathrm{)}}{\mathrm{\partial}\mathit{}{\theta}_{i}}\mathrm{]}\mathrm{=}\mathrm{0}$ for $i\mathrm{\notin}C$.
Proof Let ${V}_{i}$ be the subset of $V$ excluding ${V}_{i}$. We can simplify the expected gradient as follows.
${E}_{V\sim {P}_{2}}[$  $\frac{\partial \mathcal{L}(V)}{\partial {\theta}_{i}}}]=$  
$\sum _{V}}{P}_{2}(V){\displaystyle \sum _{k}}{\displaystyle \frac{\partial}{\partial {\theta}_{i}}}\mathrm{log}{P}_{{\theta}_{k}}({V}_{k}\mathrm{pa}(k,V,{B}_{k}))$  
$={\mathrm{\U0001d7d9}}_{i\in C}{\displaystyle \sum _{{V}_{i}}}{P}_{2}({V}_{i}){\displaystyle \sum _{{V}_{i}}}{P}_{2}({V}_{i}\mathrm{pa}(i,V,{B}_{i}))$  
$\mathrm{\hspace{0.33em}}{\displaystyle \frac{\partial}{\partial {\theta}_{i}}}\mathrm{log}{P}_{{\theta}_{i}}({V}_{i}\mathrm{pa}(i,V,{B}_{i}))+$  
${\mathrm{\U0001d7d9}}_{i\notin C}{\displaystyle \sum _{{V}_{i}}}{P}_{2}({V}_{i}){\displaystyle \sum _{{V}_{i}}}{P}_{1}({V}_{i}\mathrm{pa}(i,V,{B}_{i}))$  
$\frac{\partial}{\partial {\theta}_{i}}}\mathrm{log}{P}_{{\theta}_{i}}({V}_{i}\mathrm{pa}(i,V,{B}_{i}))$ 
where the second equality is obtained because ${\theta}_{i}$ does not influence module $k\ne i$, and ${P}_{2}$ is the same ${P}_{1}$ for conditionals with $i\notin C$ (assumption (c)). Now for the special case of $i\notin C$, we obtain
${E}_{V\sim {P}_{2}}\left[{\displaystyle \frac{\partial \mathcal{L}(V)}{\partial {\theta}_{i}}}\right]$  $={\displaystyle \sum _{{V}_{i}}}{P}_{2}({V}_{i}){\displaystyle \sum _{{V}_{i}}}{P}_{1}({V}_{i}\mathrm{pa}(i,V,{B}_{i}))$  
$\mathrm{\hspace{1em}\hspace{1em}}{\displaystyle \frac{\partial}{\partial {\theta}_{i}}}\mathrm{log}{P}_{{\theta}_{i}}({V}_{i}\mathrm{pa}(i,V,{B}_{i}))$  
$={\displaystyle \sum _{{V}_{i}}}{P}_{2}({V}_{i}){\displaystyle \sum _{{V}_{i}}}{P}_{{\theta}_{i}}({V}_{i}\mathrm{pa}(i,V,{B}_{i}))$  
$\mathrm{\hspace{1em}\hspace{1em}}{\displaystyle \frac{\partial}{\partial {\theta}_{i}}}\mathrm{log}{P}_{{\theta}_{i}}({V}_{i}\mathrm{pa}(i,V,{B}_{i}))$  
$=0$  (12) 
where the second equality arises from assumption (b), and the last line from zeroing the inner sum via the general identity
$$\sum _{v}{p}_{\theta}(v)\frac{\partial}{\partial \theta}\mathrm{log}{p}_{\theta}(v)=\frac{\partial}{\partial \theta}\sum _{v}{p}_{\theta}(v)=\frac{\partial 1}{\partial \theta}=0.$$ 
C PseudoCode
\SetKwRepeatRepeatRepeat \DontPrintSemicolon
D Proof of the Structural Parameter Gradient Proposition
Let us restate more formally and prove Proposition 2.
Proposition 0
The gradient of the negative loglikelihood regret of the transfer data
$$\mathcal{R}=\mathrm{log}\left[\mathrm{sigmoid}(\gamma ){\mathcal{L}}_{A\to B}+(1\mathrm{sigmoid}(\gamma )){\mathcal{L}}_{B\to A}\right]$$ 
with respect to the structural parameter $\gamma $ (where $\sigma \mathrm{(}\gamma \mathrm{)}\mathrm{=}P\mathrm{(}A\mathrm{\to}B\mathrm{)}$) is given by
$$\frac{\partial \mathcal{R}}{\partial \gamma}=\sigma (\gamma )P(A\to B\mid {D}_{2}),$$  (13) 
where ${D}_{\mathrm{2}}$ is the transfer data, and $P\mathrm{(}A\mathrm{\to}B\mathrm{\mid}{D}_{\mathrm{2}}\mathrm{)}$ is the posterior probability of the hypothesis $A\mathrm{\to}B$ (when the alternative is $B\mathrm{\to}A$), defined by applying Bayes rule to $P\mathrm{(}{D}_{\mathrm{2}}\mathrm{\mid}A\mathrm{\to}B\mathrm{)}\mathrm{=}{\mathrm{\prod}}_{t\mathrm{=}\mathrm{1}}^{T}P\mathrm{(}{a}_{t}\mathrm{,}{b}_{t}\mathrm{}A\mathrm{\to}B\mathrm{,}{\theta}_{t}\mathrm{)}\mathrm{=}{\mathrm{L}}_{A\mathrm{\to}B}$. Furthermore, this can be equivalently written as
$$\frac{\partial \mathcal{R}}{\partial \gamma}=\sigma (\gamma )\sigma (\gamma +\mathrm{\Delta}),$$  (14) 
where $\mathrm{\Delta}\mathrm{=}\mathrm{log}\mathit{}{\mathrm{L}}_{A\mathrm{\to}B}\mathrm{}\mathrm{log}\mathit{}{\mathrm{L}}_{B\mathrm{\to}A}$ is the difference between the loglikelihoods of the two hypotheses on transfer data ${D}_{\mathrm{2}}$.
Proof First note that, using Bayes rule,
$P(A\to B\mid {D}_{2})=$  
$\frac{P({D}_{2}\mid A\to B)P(A\to B)}{P({D}_{2}\mid A\to B)P(A\to B)+P({D}_{2}\mid B\to A)P(B\to A)}$  
$={\displaystyle \frac{{\mathcal{L}}_{A\to B}\sigma (\gamma )}{{\mathcal{L}}_{A\to B}\sigma (\gamma )+{\mathcal{L}}_{B\to A}(1\sigma (\gamma ))}}$  
$={\displaystyle \frac{\sigma (\gamma ){\mathcal{L}}_{A\to B}}{M}},$  (15) 
where $M=\sigma (\gamma ){\mathcal{L}}_{A\to B}+(1\sigma (\gamma )){\mathcal{L}}_{B\to A}$ is the online likelihood of the transfer data under the mixture, so that the regret is $\mathcal{R}=\mathrm{log}M$. For the second line above, note that
$P({D}_{2}A\to B)$  $={\displaystyle \prod _{t=1}^{T}}P({a}_{t},{b}_{t}A\to B,{\{({a}_{s},{b}_{s})\}}_{s=1}^{t1})$  
$={\displaystyle \prod _{t=1}^{T}}P({a}_{t},{b}_{t}A\to B,{\theta}_{t})={\mathcal{L}}_{A\to B}$  (16) 
where ${\theta}_{t}$ encapsulates the information about ${\{({a}_{s},{b}_{s})\}}_{s=1}^{t1})$ (through some adaptation procedure). Since we only consider the two hypotheses $A\to B$ and $B\to A$, we also have $P(B\to A\mid {D}_{2})=\frac{(1\sigma (\gamma )){\mathcal{L}}_{B\to A}}{M}=1P(A\to B\mid {D}_{2})$. Then
$\frac{\partial \mathcal{R}}{\partial \gamma}$  $={\displaystyle \frac{\sigma (\gamma )(1\sigma (\gamma )){\mathcal{L}}_{A\to B}\sigma (\gamma )(1\sigma (\gamma )){\mathcal{L}}_{B\to A}}{M}}$  
$=\sigma (\gamma )P(B\to A\mid {D}_{2})$  
$\mathrm{\hspace{1em}\hspace{1em}}(1\sigma (\gamma ))P(A\to B\mid {D}_{2})$  
$=\sigma (\gamma )+\sigma (\gamma )P(A\to B\mid {D}_{2})$  
$\mathrm{\hspace{1em}\hspace{1em}}P(A\to B\mid {D}_{2})\sigma (\gamma )P(A\to B\mid {D}_{2})$  
$=\sigma (\gamma )P(A\to B\mid {D}_{2})$  (17) 
which concludes the first part of the proof. Moreover, in order to prove the equivalent formulation in Equation (14), it is sufficient to prove that $P(A\to B\mid {D}_{2})=\sigma (\gamma +\mathrm{\Delta})$. Using the logit function ${\sigma}^{1}(z)=\mathrm{log}\frac{z}{1z}$, and the expression in Equation (15), we have
${\sigma}^{1}(P(A\to B\mid {D}_{2}))$  $=\mathrm{log}{\displaystyle \frac{\sigma (\gamma ){\mathcal{L}}_{A\to B}}{M\sigma (\gamma ){\mathcal{L}}_{A\to B}}}$  
$=\mathrm{log}{\displaystyle \frac{\sigma (\gamma ){\mathcal{L}}_{A\to B}}{(1\sigma (\gamma )){\mathcal{L}}_{B\to A}}}$  
$=\underset{=\gamma}{\underset{\u23df}{\mathrm{log}{\displaystyle \frac{\sigma (\gamma )}{1\sigma (\gamma )}}}}+$  
$\mathrm{\hspace{1em}\hspace{1em}}\underset{=\mathrm{\Delta}}{\underset{\u23df}{\mathrm{log}{\mathcal{L}}_{A\to B}\mathrm{log}{\mathcal{L}}_{B\to A}}}$  
$=\gamma +\mathrm{\Delta}$  (18) 
E Proof of the Proposition on the Convergence Point of Gradient Descent on the Structural Parameter
We use the same notation as in the above proof and statement.
Proposition 0
Stochastic gradient descent (with appropriately decreasing learning rate) on ${E}_{{D}_{\mathrm{2}}}\mathit{}\mathrm{[}\mathrm{R}\mathrm{]}$, with $\mathrm{R}\mathrm{=}\mathrm{}\mathrm{log}\mathit{}\mathrm{\left[}\mathrm{sigmoid}\mathit{}\mathrm{(}\gamma \mathrm{)}\mathit{}{\mathrm{L}}_{A\mathrm{\to}B}\mathrm{+}\mathrm{(}\mathrm{1}\mathrm{}\mathrm{sigmoid}\mathit{}\mathrm{(}\gamma \mathrm{)}\mathrm{)}\mathit{}{\mathrm{L}}_{B\mathrm{\to}A}\mathrm{\right]}$ and with steps following $\frac{\mathrm{\partial}\mathit{}\mathrm{R}}{\mathrm{\partial}\mathit{}\gamma}$ converges towards $\mathrm{sigmoid}\mathit{}\mathrm{(}\gamma \mathrm{)}\mathrm{=}\mathrm{1}$ if ${E}_{{D}_{\mathrm{2}}}\mathit{}\mathrm{[}\mathrm{log}\mathit{}{\mathrm{L}}_{A\mathrm{\to}B}\mathrm{]}\mathrm{>}{E}_{{D}_{\mathrm{2}}}\mathit{}\mathrm{[}\mathrm{log}\mathit{}{\mathrm{L}}_{B\mathrm{\to}A}\mathrm{]}$, or $\sigma \mathit{}\mathrm{(}\gamma \mathrm{)}\mathrm{=}\mathrm{0}$ otherwise.
Proof We are going to consider the fixed point of gradient descent when the gradient is zero, since we already know that SGD converges with an appropriately decreasing learning rate. Let us introduce some notation to simplify the algebra: $p=\mathrm{sigmoid}(\gamma )$, $M=p{\mathcal{L}}_{A\to B}+(1p){\mathcal{L}}_{B\to A}$, so $\mathcal{R}=\mathrm{log}M$, and define ${P}_{1}=\frac{p{\mathcal{L}}_{A\to B}}{M}=P(A\to B\mid {D}_{2})$, and ${P}_{2}=\frac{(1p){\mathcal{L}}_{B\to A}}{M}=1{P}_{1}$. Framing the stationary point in terms of $p$ rather than $\gamma $ gives us the inequality constraints $p\le 0$ and $p1\le 0$ and no equality constraint. Applying the KKT conditions with constraint functions $p$ and $p1$ gives us
${E}_{{D}_{2}}\left[{\displaystyle \frac{\partial \mathcal{R}}{\partial p}}\right]$  $=$  ${\mu}_{1}+{\mu}_{2}$  
${\mu}_{i}$  $\ge $  $0$  
${\mu}_{1}p$  $=$  $0$  
${\mu}_{2}(p1)$  $=$  $0$  (19) 
We already see from the last two equations that if $p\in (0,1)$ (i.e. excluding 0 and 1), we must have ${\mu}_{1}={\mu}_{2}=0$, i.e., $E[\frac{\partial \mathcal{R}}{\partial p}]=0$ (with drop the ${D}_{2}$ subscript on $E$ when it is clear from context). Let us study that case first and show that it leads to an inconsistent set of equations (thus forcing the solution to be either $p=0$ or $p=1$). Let us rewrite the gradient to highlight $p$ in it:
$\frac{\partial \mathcal{R}}{\partial p}}=$  $P(A\to B\mid {D}_{2})p$  
$=$  $\frac{p{\mathcal{L}}_{A\to B}}{p{\mathcal{L}}_{A\to B}+(1p){\mathcal{L}}_{B\to A}}}p$  
$=$  $\frac{p{\mathcal{L}}_{A\to B}p(p{\mathcal{L}}_{A\to B}+(1p){\mathcal{L}}_{B\to A})}{M}$  
$=$  $\frac{p(1p){\mathcal{L}}_{A\to B}p(1p){\mathcal{L}}_{B\to A}}{M}$  
$=$  $p(1p){\displaystyle \frac{{\mathcal{L}}_{A\to B}{\mathcal{L}}_{B\to A}}{M}}$  (20) 
The KKT conditions with the above two inequality constraints for $0\le p\le 1$ give
$E\left[{\displaystyle \frac{\partial \mathcal{R}}{\partial p}}\right]$  $={\mu}_{2}{\mu}_{1}.$  (21) 
If we consider the solutions $p\in (0,1)$ (i.e., ${\mu}_{1}={\mu}_{2}=0$) we now show that we get a contradiction. First note that to satisfy the above equation with ${\mu}_{1}={\mu}_{2}=0$ means that either $p=0$ or $p=1$ (which is inconsistent with the assumption that $p\in (0,1)$) or that $E\left[\frac{{\mathcal{L}}_{A\to B}{\mathcal{L}}_{B\to A}}{M}\right]=0$. Let us consider that equation, and since $p\ne 0$ and $p\ne 1$ we can either multiply by $p$ or by $1p$ on both sides. Assuming $p\ne 0$ and multiplying by $p$ gives
$0$  $=E\left[{\displaystyle \frac{p({\mathcal{L}}_{A\to B}{\mathcal{L}}_{B\to A})}{M}}\right]=E\left[{P}_{1}{\displaystyle \frac{p{\mathcal{L}}_{B\to A}}{M}}\right]$  
$=E\left[{P}_{1}{\displaystyle \frac{{\mathcal{L}}_{B\to A}{\mathcal{L}}_{B\to A}p{\mathcal{L}}_{B\to A}}{M}}\right]$  
$=E\left[{P}_{1}+{\displaystyle \frac{{\mathcal{L}}_{B\to A}}{M}}{P}_{2}\right]$  
$=E\left[{P}_{1}+{\displaystyle \frac{{\mathcal{L}}_{B\to A}}{M}}(1{P}_{1})\right]=E\left[{\displaystyle \frac{{\mathcal{L}}_{B\to A}}{M}}1\right].$  (22) 
For this equation to be satisfied, we need ${\mathcal{L}}_{B\to A}=M$ all the time, since ${\mathcal{L}}_{B\to A}\le M$ by construction. This would however correspond to $p=0$. Similarly, assuming $p\ne 1$ we can multiply the stationarity equation by $1p$ and get
$0$  $=E\left[{\displaystyle \frac{(1p)({\mathcal{L}}_{A\to B}{\mathcal{L}}_{B\to A})}{M}}\right]$  
$=E\left[{\displaystyle \frac{(1p){\mathcal{L}}_{A\to B}}{M}}{P}_{2}\right]$  
$=E\left[{\displaystyle \frac{{\mathcal{L}}_{A\to B}}{M}}{P}_{1}(1{P}_{1})\right]=E\left[{\displaystyle \frac{{\mathcal{L}}_{A\to B}}{M}}1\right]$  (23) 
Again, this can only be 0 if ${\mathcal{L}}_{A\to B}=M$ all the time, i.e., $p=1$. We conclude that the solutions $p\in (0,1)$ are not possible because they would lead to inconsistent conclusions, which leaves only $p=0$ or $p=1$. When $p=0$ we have $E[\mathcal{R}]=E[\mathrm{log}{\mathcal{L}}_{A\to B}]$, and when $p=1$ we have $E[\mathcal{R}]=E[\mathrm{log}{\mathcal{L}}_{B\to A}]$. Thus the minimum will be achieved at $p=1$ when ${E}_{{D}_{2}}[\mathrm{log}{\mathcal{L}}_{A\to B}]>{E}_{{D}_{2}}[\mathrm{log}{\mathcal{L}}_{B\to A}]$, or $p=0$ otherwise.
F More Than Two Causal Hypotheses
In this section, we consider one approach to generalize to more than two causal structures. We consider $m$ variables, corresponding to $O({2}^{{m}^{2}})$ possible causal graphs, since each variable ${V}_{j}$ could be (or not) a direct cause of any variable ${V}_{i}$, leading to ${m}^{2}$ binary decisions. Note that a causal graph can in principle have cycles (if time is not measured with sufficient precision), although having a directed acyclic graph allows a much simpler sampling procedure (ancestral sampling). In our experiments the ground truth graph will always be directed, to make sampling easier and faster, but the learning procedure will not directly assume that. Motivated by the mechanism independence assumption, we propose a heuristic to learn the causal graph in which we independently parametrize the binary probability ${p}_{ij}$ that ${V}_{j}$ is a parent (direct cause) of ${V}_{i}$. As was the case for Section 2, we parametrize this Binomial distribution via binary edges ${B}_{ij}$ that specify the graph structure:
${B}_{ij}$  $\sim \mathrm{Bernoulli}({p}_{ij}),$  
$P(B)$  $={\displaystyle \prod _{ij}}P({B}_{ij}).$  (24) 
where ${p}_{ij}=\mathrm{sigmoid}({\gamma}_{ij})$. Let us define the parents of ${V}_{i}$, given $B$, as the set of ${V}_{j}$’s such that ${B}_{ij}=1$:
$$\mathrm{pa}(i,V,{B}_{i})=\{{V}_{j}\mid {B}_{ij}=1,j\ne i\}$$  (25) 
where ${B}_{i}$ is the bit vector with elements ${B}_{ij}$ (and ${B}_{ii}=0$ is ignored). Similarly, we could parametrize the causal graph with a structural causal model where some of the inputs (from variable $j$) of each function (for variable $i$) can be ignored with some probability ${p}_{ij}$:
${V}_{i}={f}_{i}({\theta}_{i},{B}_{i},V,{N}_{i})$  (26) 
where ${N}_{i}$ is an independent noise source to generate ${V}_{i}$ and ${f}_{i}$ parametrizes the generator (as in a GAN), while not being allowed to use variable ${V}_{j}$ unless ${B}_{ij}=1$ (and of course not being allowed to use ${V}_{i}$). We can consider that ${f}_{i}$ is a kind of neural network similar to the denoising autoencoders or with dropout on the input, where ${B}_{i}$ is a binary mask vector that prevents ${f}_{i}$ from using some of the ${V}_{j}$’s (for which ${B}_{ij}=0$).
The conditional likelihood ${P}_{{B}_{i}}({V}_{i}={v}_{ti}\mid \mathrm{pa}(i,{v}_{t},{B}_{i}))$ measures how well the model that uses the incoming edges ${B}_{i}$ for node $i$ performs for example ${v}_{t}$. We build a multiplicative (or exponentiated) form of regret by multiplying these likelihoods as ${\theta}_{t}$ changes during an adaptation episode, for node $i$:
$${\mathcal{L}}_{{B}_{i}}=\prod _{t}{P}_{{B}_{i}}({V}_{i}={v}_{ti}\mid \mathrm{pa}(i,{v}_{t},{B}_{i})).$$  (27) 
The overall exponentiated regret for the given graph structure $B$ is ${\mathcal{L}}_{B}={\prod}_{i}{\mathcal{L}}_{{B}_{i}}$. Similarly to the bivariate case, we want to consider a mixture over all the possible graph structures, but where each component must explain the whole adaptation sequence, thus we define as a loss for the generalized multivariable case
$\mathcal{R}$  $=\mathrm{log}{E}_{B}[{\mathcal{L}}_{B}]$  (28) 
Note the expectation over the ${2}^{{m}^{2}}$ possible values of $B$, which is intractable. However, we can still get an efficient stochastic gradient estimator, which can be computed separately for each node of the graph (with samples arising only out of ${B}_{i}$, the incoming edges into ${V}_{i}$):
Proposition 1
The overall regret (Equation (28)) rewrites
$$\mathcal{R}=\sum _{i}\mathrm{log}\sum _{{B}_{i}}P({B}_{i}){\mathcal{L}}_{{B}_{i}}$$  (29) 
and if we are willing to consider multiple samples of $B$ in parallel, a biased but asymptotically unbiased (as the number $K$ of these samples ${B}^{\mathrm{(}k\mathrm{)}}$ increases to infinity) estimator of the gradient of the overall regret with respect to metaparameters can be defined:
$${g}_{ij}=\frac{{\sum}_{k}(\sigma ({\gamma}_{ij}){B}_{ij}^{(k)}){\mathcal{L}}_{{B}_{i}}^{(k)}}{{\sum}_{k}{\mathcal{L}}_{{B}_{i}}^{(k)}}$$  (30) 
where the ${}^{\mathrm{(}k\mathrm{)}}$ index indicates the values obtained for the $k$th draw of $B$.
Proof Recall that ${\mathcal{L}}_{B}={\prod}_{i}{\mathcal{L}}_{{B}_{i}}$ so we can rewrite the regress loss as follows:
$\mathcal{R}$  $=\mathrm{log}{E}_{B}[{\mathcal{L}}_{B}]$  
$=\mathrm{log}{\displaystyle \sum _{B}}P(B){\mathcal{L}}_{B}$  
$=\mathrm{log}{\displaystyle \sum _{{B}_{1}}}{\displaystyle \sum _{{B}_{2}}}\mathrm{\dots}{\displaystyle \sum _{{B}_{M}}}{\displaystyle \prod _{i}}P({B}_{i}){\mathcal{L}}_{{B}_{i}}$  
$=\mathrm{log}{\displaystyle \prod _{i}}\left({\displaystyle \sum _{{B}_{i}}}P({B}_{i}){\mathcal{L}}_{{B}_{i}}\right)$  
$={\displaystyle \sum _{i}}\mathrm{log}{\displaystyle \sum _{{B}_{i}}}P({B}_{i}){\mathcal{L}}_{{B}_{i}}$  (31) 
So the regret gradient on metaparameters ${\gamma}_{i}$ of node $i$ is
$\frac{\partial \mathcal{R}}{\partial {\gamma}_{i}}$  $={\displaystyle \frac{{\sum}_{{B}_{i}}P({B}_{i}){\mathcal{L}}_{{B}_{i}}\frac{\partial \mathrm{log}P({B}_{i}){\mathcal{L}}_{{B}_{i}}}{\partial {\gamma}_{i}}}{{\sum}_{{B}_{i}}P({B}_{i}){\mathcal{L}}_{{B}_{i}}}}$  
$={\displaystyle \frac{{E}_{{B}_{i}}[{\mathcal{L}}_{{B}_{i}}\frac{\partial \mathrm{log}P({B}_{i})}{\partial {\gamma}_{i}}]}{{E}_{{B}_{i}}[{\mathcal{L}}_{{B}_{i}}]}}$  (32) 
Note that with the sigmoidal parametrization of $P({B}_{ij})$,
$$\mathrm{log}P({B}_{ij})={B}_{ij}\mathrm{log}\mathrm{sigmoid}({\gamma}_{ij})+(1{B}_{ij})\mathrm{log}(1\mathrm{sigmoid}({\gamma}_{ij}))$$ 
as in the crossentropy loss. Its gradient can similarly be simplified to
$\frac{\partial \mathrm{log}P({B}_{ij})}{\partial {\gamma}_{ij}}$  $={\displaystyle \frac{{B}_{ij}}{\mathrm{sigmoid}({\gamma}_{ij})}}\mathrm{sigmoid}({\gamma}_{ij})(1\mathrm{sigmoid}({\gamma}_{ij}))$  
$\mathrm{\hspace{0.33em}}{\displaystyle \frac{(1{B}_{ij})}{(1\mathrm{sigmoid}({\gamma}_{ij}))}}\mathrm{sigmoid}({\gamma}_{ij})(1\mathrm{sigmoid}({\gamma}_{ij})))$  
$={B}_{ij}\mathrm{sigmoid}({\gamma}_{ij})$  (33) 
A biased but asymptotically unbiased estimator of $\frac{\partial \mathcal{R}}{\partial {\gamma}_{ij}}$ is thus obtained by sampling $K$ graphs (over which the means below are run):
${g}_{ij}={\displaystyle \sum _{k}}(\sigma ({\gamma}_{ij}){B}_{ij}^{(k)}){\displaystyle \frac{{\mathcal{L}}_{{B}_{i}^{(k)}}}{{\sum}_{{k}^{\prime}}{\mathcal{L}}_{{B}_{i}^{({k}^{\prime})}}}}$  (34) 
where index ${}^{(k)}$ indicates the $k$th draw of $B$, and we obtain a weighted sum of the individual binomial gradients weighted by the relative regret of each draw ${B}_{i}^{(k)}$ of ${B}_{i}$,
leading to Equation (30).
This decomposition is good news because the loss is a sum of independent
terms, one per node $i$, depending only of ${B}_{i}$ and and similarly ${g}_{ij}$ only
depends on ${B}_{i}$ rather than the
full graph structure. We use the
estimator from Equation (30) in the general pseudocode for metatransfer learning of causal structure displayed in Algorithm 1.
G Results on Learning which is Cause and which is Effect
In order to assess the performance of our metalearning algorithm, we applied it on generated data from three different domains: discrete random variables, multimodal continuous random variables and multivariate gaussiandistributed variables. In this section, we describe the setups for all three experiments, along with additional results to complement the results described in the main text. Note that in all these experiments, we fix the structure of the groundtruth to be $A\to B$, and only perform interventions on the cause $A$.
G.1 Discrete variables and Two Causal Hypotheses
We consider a bivariate model, where both random variables are sampled from a categorical distribution. The underlying groundtruth model can be described as
$A$  $\sim \mathrm{Categorical}({\pi}_{A})$  
$B\mid A=a$  $\sim \mathrm{Categorical}({\pi}_{Ba}),$  (35) 
with ${\pi}_{A}$ is a probability vector of size $N$, and ${\pi}_{Ba}$ is a probability vector of size $N$, which depends on the value of the variable $A$. In our experiment, each random variable can take one of $N=10$ values. Since we are working with only two variables, the only two possible models are:

•
Model $A\mathrm{\to}B$: $P(A,B)=P(A)P(B\mid A)$

•
Model $B\mathrm{\to}A$: $P(A,B)=P(B)P(A\mid B)$
We build 4 different modules, corresponding to the model of each possible marginal and conditional distribution. These modules’ definition and their corresponding parameters are shown in Table G.1.
Distribution  Module  Parameters  Dimension  

Model $A\mathrm{\to}B$  $P(A)$  $P({x}_{A}=i;{\theta}_{A})={[\mathrm{softmax}({\theta}_{A})]}_{i}$  ${\theta}_{A}$  $N$ 
$P(B\mid A)$  $P({x}_{B}=j\mid {x}_{A}=i;{\theta}_{BA})={[\mathrm{softmax}({\theta}_{BA}(i))]}_{j}$  ${\theta}_{BA}$  ${N}^{2}$  
Model $B\mathrm{\to}A$  $P(B)$  $P({x}_{B}=j;{\theta}_{B})={[\mathrm{softmax}(v)]}_{j}$  ${\theta}_{B}$  $N$ 
$P(A\mid B)$  $P({x}_{A}=i\mid {x}_{B}=j;{\theta}_{AB})={[\mathrm{softmax}({V}_{j})]}_{i}$  ${\theta}_{AB}$  ${N}^{2}$ 
In order to get a set of initial parameters, we first train all 4 modules on a training distribution. This training distribution corresponds to a fixed choice of ${\pi}_{A}^{(1)}$ and ${\pi}_{Ba}$ (for all $N$ possible values of $a$). Note that the superscript in ${\pi}_{A}^{(1)}$ emphasizes the fact that this defines the distribution prior to intervention, with the mechanism $P(B\mid A)$ being unchanged by the intervention. These probability vectors are sampled randomly from a uniform Dirichlet distribution
${\pi}_{A}^{(1)}$  $\sim \mathrm{Dirichlet}({\mathrm{\U0001d7cf}}_{N})$  
${\pi}_{Ba}$  $\sim \mathrm{Dirichlet}({\mathrm{\U0001d7cf}}_{N})\mathit{\hspace{1em}\hspace{1em}}\forall a\in [1,N].$  (36) 
Given this initial training distribution, we can sample a large dataset of training examples ${\{({a}_{i},{b}_{i})\}}_{i=1}^{n}$ from the groundtruth model, using ancestral sampling.
$a$  $\sim \mathrm{Categorical}({\pi}_{A}^{(1)})$  
$b$  $\sim \mathrm{Categorical}({\pi}_{Ba}).$  (37) 
Using this large dataset from the training distribution, we can train all 4 modules using gradient descent, or any other advanced firstorder optimizer, like RMSprop. The parameters ${\theta}_{A}$, ${\theta}_{BA}$, ${\theta}_{B}$ & ${\theta}_{AB}$ of the different modules found after this initial training will be used as the initial parameters for the adaptation on a new transfer distribution.
Similar to the way we defined the training distribution, we can define a transfer distribution as a soft intervention on the random variable $A$. In this experiment, this accounts for changing the distribution of $A$, that is with a new probability vector ${\pi}_{A}^{(2)}$, also sampled randomly from a uniform Dirichlet distribution
${\pi}_{A}^{(2)}$  $\sim \mathrm{Dirichlet}({\mathrm{\U0001d7cf}}_{N})$  (38) 
To perform adaptation on the transfer distribution, we also sample a smaller dataset of transfer examples ${D}_{2}=\{{({a}_{i},{b}_{i}\}}_{i=1}^{m}$, with $m\ll n$ the size of the training set. In our experiment, we used $m=20$ transfer examples. We also used ancestral sampling on this new transfer distribution to acquire samples, similar to Equation (37) (with ${\pi}_{A}^{(2)}$ instead of ${\pi}_{A}^{(1)}$).
Starting from the parameters estimated after the initial training on the training distribution, we perform a few steps of adaptation on the modules parameters ${\theta}_{A}$, ${\theta}_{BA}$, ${\theta}_{B}$ & ${\theta}_{AB}$ using $T$ steps of gradient descent based on the transfer dataset ${D}_{2}$. The value of the likelihoods for both models is recorded as well, and computed as
${\mathcal{L}}_{A\to B}$  $={\displaystyle \prod _{t=1}^{T}}P({\mathbf{a}}_{t}\mid {\theta}_{A}^{(t)})P({\mathbf{b}}_{t}\mid {\mathbf{a}}_{t};{\theta}_{BA}^{(t)})$  
${\mathcal{L}}_{B\to A}$  $={\displaystyle \prod _{t=1}^{T}}P({\mathbf{b}}_{t}\mid {\theta}_{B}^{(t)})P({\mathbf{a}}_{t}\mid {\mathbf{b}}_{t};{\theta}_{AB}^{(t)}),$  (39) 
where $({\mathbf{a}}_{t},{\mathbf{b}}_{t})$ represents a minibatch of examples from ${D}_{2}$, and the superscript $t$ on the parameters highlights the fact that these likelihoods are computed after $t$ steps of adaptation. This product over $t$ ensures that we monitor the progress of adaptation along the whole trajectory. In this experiment, we used $T=2$ steps of gradient descent on minibatch of size 10 for the adaptation.
Finally, in order to update the structural parameter $\gamma $, we can use Proposition 2 to compute the gradient of the loss $L$ with respect to $\gamma $:
$\mathcal{R}(\gamma )$  $=\mathrm{log}[\sigma (\gamma ){\mathcal{L}}_{A\to B}+(1\sigma (\gamma )){\mathcal{L}}_{B\to A}]$  (40)  
$\frac{\partial \mathcal{R}}{\partial \gamma}$  $=\sigma (\gamma +\mathrm{\Delta})\sigma (\gamma ),$  (41) 
where $\mathrm{\Delta}=\mathrm{log}{\mathcal{L}}_{A\to B}\mathrm{log}{\mathcal{L}}_{B\to A}$. The update of $\gamma $ can be one step of gradient descent, or using any firstorder optimizer like RMSprop. We perform multiple interventions over the course of metatraining by sampling multiple transfer distributions, and following the same steps of adaptation and update of the structural parameter $\gamma $.
In Figure 2, we report the evolution of the structural parameter $\gamma $ (or rather, $\sigma (\gamma )$) as a function of the number of metatraining steps or, similarly, the number of different interventions made on the causal model. The model’s belief $P(A\to B)=\sigma (\gamma )$ indeed converges to $1$, proving that the algorithm was capable of recovering the correct causal direction $A\to B$.
G.2 Discrete Variables with MLP Parametrization
We consider a bivariate model similar to the ones defined above, where each random variable is sampled from a categorical distribution. Instead of expressing probabilities in a tabular form, we train $M=2$ simple feedfoward neural networks (MLP), one per conditional variable. MLP $i$ is the independent mechanism of causal variable $i$ that determines the conditional probability of the $N$ discrete choices for variable $i$, given its parents.
Each MLP receives $M$ concatenated $N$dimensional onehot vectors, masked appropriately according to the chosen causal structure $B$, with each directed edge presence indicated by ${B}_{ij}$, and ${B}_{ij}=1$ if variable $j$ is a direct causal parent of variable $i$. The MLP maps the $MN$ input units through one hidden layer that contains $H=4M$ hidden units and a ReLU nonlinearity, and then maps the $H$ hidden units to $N$ output units and a softmax representing a predicted categorical distribution.
The causal structure belief is specified by an $M\times M$ matrix $\gamma $, with $\sigma ({\gamma}_{ij})$ the estimated probability that variable $i$ is directly caused by variable $j$. The causal structure ${B}_{ij}$ is drawn from $\mathrm{Ber}({\gamma}_{ij})$, as per Algorithm 1. We generalize the estimator introduced for the 2hypotheses case as per Appendix F, i.e., we use the gradient estimator in Equation 30.
To evaluate the correctness of the structure being learnt, we measure the cross entropy between the groundtruth SCM and the learned SCM. In Figure 3 we show this crossentropy over different episodes of training for bivariate discrete distributions with either 10 categories or 100 categories. Both models are first pretrained for 100 examples with fully connected edges before starting training on the transfer distributions.
G.3 Continuous Multimodal Variables
Consider a family of joint distributions ${P}_{\mu}(A,B)$ over the causal variables $A$ and $B$ sampled from the structural causal model (SCM):
$A$  $\sim {P}_{\mu}(A)=\mathcal{N}(\mu ,{\sigma}^{2}=4)$  
$B$  $:=f(A)+{N}_{B}\mathit{\hspace{1em}\hspace{1em}}{N}_{B}\sim \mathcal{N}(\mu =0,{\sigma}^{2}=1)$  (42) 
where $f$ is a randomly generated spline and ${N}_{B}$ is sampled i.i.d from the unitnormal distribution.
To obtain the spline, we sample the $K$ points ${\{{x}_{k}\}}_{k=1}^{K}$ uniformly spaced from the interval $[{R}_{A},{R}_{A}]$, and another $K$ points ${\{{y}_{k}\}}_{k=1}^{K}$ uniform randomly from the interval $[{R}_{B},{R}_{B}]$. This yields $K$ pairs ${\{({x}_{k},{y}_{k})\}}_{k=1}^{K}$, which make the knots of a secondorder spline. We set $K=8$, ${R}_{A}={R}_{B}=8$ for our experiments. In Figure G.1, we plot samples from one such SCM for the training distribution ($\mu =0$) and two transfer distributions ($\mu =\pm 4$).
The conditionals $P(B\mid A;{\theta}_{BA})$ and $P(A\mid B;{\theta}_{AB})$ are parameterized as 2layer Mixture Density Networks Bishop (1994) with $32$ hidden units and $10$ components. The marginals $P(A\mid {\theta}_{A})$ and $P(B\mid {\theta}_{B})$ are parameterized as Gaussian Mixture Models, also with $10$ components. The training now follows as described below.
Similar to Appendix G.1, we first pretrain the modules corresponding to the conditionals and marginals on the training distribution. To that end, we select ${P}_{\mu =0}(A,B)$ as the training distribution, sample a (large) training dataset ${\{({a}_{i},{b}_{i})\}}_{i=1}^{n}$ from it using ancestral sampling, and solve the following two problems independently until convergence:
$\underset{{\theta}_{A},{\theta}_{BA}}{\mathrm{max}}$  $\sum _{i=1}^{n}}\mathrm{log}P({a}_{i}\mid {\theta}_{A})P({b}_{i}\mid {a}_{i};{\theta}_{BA})$  (43)  
$\underset{{\theta}_{B},{\theta}_{AB}}{\mathrm{max}}$  $\sum _{i=1}^{n}}\mathrm{log}P({b}_{i}\mid {\theta}_{B})P({a}_{i}\mid {b}_{i};{\theta}_{AB})$  (44) 
The adaptation performance of $A\to B$ and $B\to A$ models can now be evaluated on transfer distributions. For a $\mu $ sampled uniformly in $[4,4]$, we select ${P}_{\mu}({A}^{\prime},{B}^{\prime})$ as the transfer distribution, and denote with $({A}^{\prime},{B}^{\prime})$ samples from it. Both models are finetuned on ${P}_{\mu}({A}^{\prime},{B}^{\prime})$ for $T=10$ iterations (see Algorithm 1), and the area under the corresponding negativeloglikelihood curves becomes the regret:
${\mathcal{R}}_{A\to B}={\displaystyle \sum _{t=1}^{T}}\mathrm{log}P({B}^{\prime}{A}^{\prime};{\theta}_{A\to B}^{(t)})P({A}^{\prime}{\theta}_{A\to B}^{(T)})$  (45) 
and likewise for ${\mathcal{R}}_{B\to A}$. In these experiments, the modules corresponding to the marginals (ie. GMM) are learned offline via Expectation Maximization, and we denote with $P({A}^{\prime}{\theta}_{A\to B}^{(T)})$ the trained model. These can now be used to define the following metaobjective for the structural metaparameter $\gamma $:
$\mathcal{R}(\gamma )=\mathrm{log}[\sigma (\gamma ){e}^{{\mathcal{R}}_{A\to B}}+(1\sigma (\gamma )){e}^{{\mathcal{R}}_{B\to A}}]$  (46) 
The structural regret $\mathcal{R}(\gamma )$ is now minimized with respect to $\gamma $ for 200 iterations (updates of $\gamma $). Figure G.2 shows the evolution of $\sigma (\gamma )$ as training progresses. This is expected, given that we expect the causal model to perform better on the transfer distributions, i.e. we expect $$ in expection. Consequently, assigning a larger weight to ${\mathcal{R}}_{A\to B}$ optimizes the objective.
G.4 Linear Gaussian Model
In this experiment, the two variables we consider are vectors (i.e. $A\in {\mathbb{R}}^{d}$ and $B\in {\mathbb{R}}^{d}$). The ground truth causal model is given by
$A$  $\sim \mathcal{N}({\mu}_{A},{\mathrm{\Sigma}}_{A})$  
$B$  $:={\beta}_{1}A+{\beta}_{0}+{N}_{B}\mathit{\hspace{1em}\hspace{1em}}{N}_{B}\sim \mathcal{N}(0,{\mathrm{\Sigma}}_{B})$  (47) 
where ${\mu}_{A}\in {\mathbb{R}}^{d}$, ${\beta}_{0}\in {\mathbb{R}}^{d}$ and ${\beta}_{1}\in {\mathbb{R}}^{d\times d}$. ${\mathrm{\Sigma}}_{A}$ and ${\mathrm{\Sigma}}_{B}$ are $d\times d$ covariance matrices^{2}^{2} 2 Ground truth parameters ${\mu}_{A}$, ${\beta}_{1}$ and ${\beta}_{0}$ are sampled from a Gaussian distribution, while ${\mathrm{\Sigma}}_{A}$ and ${\mathrm{\Sigma}}_{B}$ are sampled from an inverse Wishart distribution.. In our experiments, $d=100$. Once again, we want to identify the correct causal direction between $A$ and $B$. To do so, we consider two models: $A\to B$ and $B\to A$. We parameterize both models symmetrically:
${P}_{A\to B}(A)$  $=\mathcal{N}(A;{\widehat{\mu}}_{A},{\widehat{\mathrm{\Sigma}}}_{A})$  
${P}_{A\to B}(B\mid A=a)$  $=\mathcal{N}(B;{\widehat{W}}_{1}a+{\widehat{W}}_{0},{\widehat{\mathrm{\Sigma}}}_{A\to B})$  
${P}_{B\to A}(B)$  $=\mathcal{N}(B;{\widehat{\mu}}_{B},{\widehat{\mathrm{\Sigma}}}_{B})$  
${P}_{B\to A}(A\mid B=b)$  $=\mathcal{N}(B;{\widehat{V}}_{1}b+{\widehat{V}}_{0},{\widehat{\mathrm{\Sigma}}}_{B\to A})$  (48) 
Note that each covariance matrix is parameterized using the Cholesky decomposition. Unlike previous experiments, we are not conducting any pretraining on actual data. Instead, we fix the parameters of both models to their exact values according to the ground truth parameters introduced in Equation G.4. For model $A\to B$, this can be done trivially. For the second model, we can compute its exact parameters analytically. Once the exact parameters are set, both models are equivalent in the sense that ${P}_{A\to B}(A,B)={P}_{B\to A}(A,B)$ $\forall A,B$.
Each metalearning episode starts by initializing the parameters of both models to the values identified during the pretraining. Afterward, a transfer distribution is sampled (i.e. ${\mu}_{A}\sim \mathcal{N}(0,I)$). Then, both models are trained on samples from this distribution, for 10 iterations only. During this adaptation, the loglikelihoods of both models are accumulated in order to compute ${\mathcal{L}}_{A\to B}$ and ${\mathcal{L}}_{B\to A}$. At this stage, we compute the meta objective estimate $\mathcal{R}=\mathrm{log}\left[\mathrm{sigmoid}(\gamma ){\mathcal{L}}_{A\to B}+(1\mathrm{sigmoid}(\gamma )){\mathcal{L}}_{B\to A}\right]$, compute its gradient w.r.t. $\gamma $ and update $\gamma $.
Figure G.3 shows that, after 200 episodes, $\sigma (\gamma )$ converges to 1, indicating the success of the method on this particular task. The good performance of the algorithm was dependent on $d$ being sufficiently large. We explain this observation by the fact that the parameter counting argument presented in Section 2.2.1 gets stronger as the $d$ grows. Indeed, model $A\to B$ needs to adapt only $O(d)$ scalar parameters, while model $B\to A$ needs to adapt $O({d}^{2})$ scalar parameters, even though both models have the same number of parameters.
H Results on Learning the Correct Encoder
The causal variables $(A,B)$ are sampled from the distribution described in Eqn G.3, and are mapped to observations $(X,Y)\sim {P}_{\mu}(X,Y)$ via a hidden (and a priori unknown) decoder $\mathcal{D}=R({\theta}_{\mathcal{D}})$, where $R$ is a rotation matrix. The observations are then mapped to the hidden state $(U,V)\sim {P}_{\mu}(U,V)$ via the encoder $\mathcal{E}=R({\theta}_{\mathcal{E}})$. The computational graph is depicted in Figure 4.
Analogous to Equation 46 in Appendix G.3, we now define the regret over the variables $(U,V)$ instead of $(A,B)$:
$\mathcal{R}(\gamma ,{\theta}_{\mathcal{E}})=\mathrm{log}[\sigma (\gamma ){e}^{{\mathcal{R}}_{U\to V}}+(1\sigma (\gamma )){e}^{{\mathcal{R}}_{V\to U}}]$  (49) 
where the dependence on ${\theta}_{\mathcal{E}}$ is implicit in $(U,V)$. In every metatraining iteration, the $U\to V$ and $V\to U$ models are trained on the training distribution ${P}_{\mu =0}(U,V)$ for ${T}^{\prime}=20$ iterations. Subsequently, the regrets ${\mathcal{R}}_{U\to V}$ and ${\mathcal{R}}_{V\to U}$ are obtained by a process identical to that described in Equation 45 of Appendix G.3 (albeit with variables $(U,V)$ and $T=5$). Finally, the gradients of $\mathcal{R}(\gamma ,{\theta}_{\mathcal{E}})$ are evaluated and the metaparameters $\gamma $ and ${\theta}_{\mathcal{E}}$ are updated. This process is repeated for $1000$ metaiterations, and Figure 5 shows the evolution of ${\theta}_{\mathcal{E}}$ as training progresses (where ${\theta}_{\mathcal{D}}$ has been set to $\frac{\pi}{4}$). Further, Figure H.1 shows the corresponding evolution of the structural parameter $\gamma $.