Abstract
We introduce a fully stochastic gradient based approach to Bayesian optimalexperimental design (BOED). This is achieved through the use of variationallower bounds on the expected information gain (EIG) of an experiment that canbe simultaneously optimized with respect to both the variational and designparameters. This allows the design process to be carried out through a singleunified stochastic gradient ascent procedure, in contrast to existingapproaches that typically construct an EIG estimator on a pointwise basis,before passing this estimator to a separate optimizer. We show that this, inturn, leads to more efficient BOED schemes and provide a number of a differentvariational objectives suited to different settings. Furthermore, we show thatour gradientbased approaches are able to provide effective design optimizationin substantially higher dimensional settings than existing approaches.
Quick Read (beta)
Abstract
We introduce a fully stochastic gradient based approach to \acrfullOED. This is achieved through the use of variational lower bounds on the \acrfullEIG of an experiment that can be simultaneously optimized with respect to both the variational and design parameters. This allows the design process to be carried out through a single unified stochastic gradient ascent procedure, in contrast to existing approaches that typically construct an \acrshortEIG estimator on a pointwise basis, before passing this estimator to a separate optimizer. We show that this, in turn, leads to more efficient \acrshortOED schemes and provide a number of a different variational objectives suited to different settings. Furthermore, we show that our gradientbased approaches are able to provide effective design optimization in substantially higher dimensional settings than existing approaches.
algocfAlgorithmAlgorithms \crefnamealgorithmAlgorithmAlgorithms \crefnamefigureFigureFigure \crefnamesection§§§ \Crefnamesection§§§ \crefformatequation(#2#1#3) \glsdisablehyper \newacronymKLklKullbackLeibler \newacronymSGDsgdstochastic gradient descent \newacronymOEDBOEDBayesian optimal experimental design \newacronymMIMImutual information \newacronymEIGEIGexpected information gain \newacronymNMCNMCnested Monte Carlo \newacronymACEACEadaptive contrastive estimation \newacronymNCENCEnoise contrastive estimation \newacronymBABABarberAgakov \newacronymBOBOBayesian optimization \newacronymMCMCMonte Carlo \newacronymRMSERMSEroot mean squared error
A Unified Stochastic Gradient Approach to Designing BayesianOptimal Experiments
Adam Foster^{†} Martin Jankowiak^{‡} Matthew O’Meara^{§} Yee Whye Teh^{†} Tom Rainforth^{†✠}
^{†}Department of Statistics University of Oxford Oxford, UK ^{‡}Uber AI San Francisco, CA, USA ^{§}University of Michigan Ann Arbor, MI, USA ^{✠}Christ Church University of Oxford Oxford, UK
1 INTRODUCTION
The design of experiments is a key problem in almost every scientific discipline. Namely, one wishes to construct an experiment that is most informative about the process being investigated, while minimizing its cost. For example, in a psychological trial, we want to ensure questions posed to participants are pertinent and do not have predictable responses. In a pharmaceutical trial, we want to minimize the number of participants needed to successfully test our hypotheses. In an online automated help system, we want to ensure we ask questions that identify the user’s problem as quickly as possible.
In all these scenarios, our ultimate highlevel aim is to choose designs that maximize the information gathered by the experiment. A powerful and broadly used approach for formalizing this aim is \acrfullOED (chaloner1995; lindley1956; myung2013). In \acrshortOED, we specify a Bayesian model for the experiment and then choose the design that maximizes the \acrfullEIG from running it. More specifically, let $\theta $ denote the latent variables we wish to learn about from running the experiment and let $\xi \in \mathrm{\Xi}$ represent the experimental design. By introducing a prior $p(\theta )$ and a predictive distribution $p(y\theta ,\xi )$ for experiment outcomes $y$, we can calculate the \acrshortEIG under this model by taking the expected reduction in posterior entropy
$$I(\xi )\triangleq {\mathbb{E}}_{p(y\xi )}\left[H[p(\theta )]H[p(\theta y,\xi )]\right],$$  (1) 
where $H[\cdot ]$ represents the entropy of a distribution and $p(\theta y,\xi )\propto p(y\theta ,\xi )p(\theta )$. Our experimental design process now becomes that of the finding the design ${\xi}^{*}$ that maximizes $I(\xi )$.
Unfortunately, finding ${\xi}^{*}$ is typically a very challenging problem in practice. Even evaluating $I(\xi )$ for a single design is computationally difficult because it represents a nested expectation and thus has no direct \acrlongMC estimator (nmc). Though a large variety of approaches for performing this estimation have been suggested (myung2013; watson2017quest+; kleinegesse2018efficient; foster2019variational), the resulting \acrshortOED strategies share a critical common feature: they estimate $I(\xi )$ on a pointbypoint basis and feed this estimator to an outerlevel optimizer that selects which points to evaluate.
This framework can be highly inefficient for a number of reasons. For example, it adds an extra level of nesting to the overall computation process: $I(\xi )$ must be separately estimated for each $\xi $, substantially increasing the overall computational cost. Furthermore, one must typically resort to gradientfree methods to carry out the resulting optimization, which means it is difficult to scale the overall \acrshortOED process to high dimensional design settings due to a dearth of optimization schemes which remain effective in such settings.
To alleviate these inefficiencies and open the door to applying \acrshortOED in highdimensional settings, we introduce an alternative to this twostage framework by introducing unified objectives that can be directly maximized to simultaneously estimate $I(\xi )$ and optimize $\xi $. Specifically, by building on the work of foster2019variational, we construct variational lower bounds to $I(\xi )$ that can be simultaneously optimized with respect to both the variational and design parameters. Optimizing the former ensures that we achieve a tight bound that in turn gives accurate estimates of $I(\xi )$, while simultaneously optimizing the latter circumvents the need for an expensive outer optimization process. Critically, this approach allows the optimization to be performed using stochastic gradient ascent (SGA) (robbins1951stochastic) and therefore scaled to substantially higher dimensional design problems than existing approaches.
To account for the varying needs of different problem settings, we introduce several classes of suitable variational lower bounds. Most notably, we introduce the \acrfullACE bound: an \acrshortEIG variational lower bound that can be made arbitrarily tight, while remaining amenable to simultaneous SGA on both the variational parameters and designs.
We demonstrate the applicability of our unified gradient approach using a wide range of experimental design problems, including a realworld highdimensional example from the pharmacology literature (lyu2019ultra). We find that our approaches are able to effectively optimize the \acrshortEIG, consistently outperforming baseline twostage approaches, with particularly large gains achieved for highdimensional problems. These gains lead, in turn, to improved designs and more informative experiments.
2 BACKGROUND
2.1 Bayesian optimal experimental design
When experimentation is costly, time consuming, or dangerous, it is essential to design experiments to learn the most from them. To choose between potential designs, we require a metric of the quality of a candidate design. In the \acrshortOED framework dating back to lindley1956, this metric represents how much more certain we will become in our knowledge of the world after doing the experiment and analyzing the data. We prefer designs that will lead to strong conclusions even if we are not yet sure what those conclusions will be.
Specifically, the \acrshortOED framework begins with Bayesian model of the experimental process. This is defined through the combination of a likelihood $p(y\theta ,\xi )$ that predicts the experimental outcome under design $\xi $ and latent variable $\theta $, and a prior $p(\theta )$ which incorporates initial beliefs about the unknown $\theta $. After conducting the experiment, our beliefs are updated to the posterior $p(\theta y,\xi )$. The information gained about $\theta $ from doing the experiment with design $\xi $ and obtaining outcome $y$ is the reduction in entropy from the prior to the posterior
$$\text{IG}(y,\xi )=H[p(\theta )]H[p(\theta y,\xi )].$$  (2) 
As it stands, information gain cannot be evaluated until after the experiment. To define a metric that will let us choose between designs before experimentation, we define the expected information gain (EIG), $I(\xi )$, by taking the expectation of IG over hypothesized outcomes $y$ using the marginal distribution under our model, $p(y\xi )$, to give
$$I(\xi )\triangleq {\mathbb{E}}_{p(y\xi )}\left[H[p(\theta )]H[p(\theta y,\xi )]\right]$$  (3) 
which can be rewritten in the form of a mutual information between $\theta $ and $y$ with $\xi $ fixed, namely
$$I(\xi )={\text{MI}}_{\xi}(\theta ;y)={\mathbb{E}}_{p(\theta )p(y\theta ,\xi )}\left[\mathrm{log}\frac{p(y\theta ,\xi )}{p(y\xi )}\right].$$  (4) 
The Bayesian optimal design, ${\xi}^{*}$, is now the one which maximizes EIG over the set of feasible designs $\mathrm{\Xi}$
${\xi}^{*}=\underset{\xi \in \mathrm{\Xi}}{\mathrm{arg}\mathrm{max}}I(\xi ).$  (5) 
In iterated experiment design, we design a sequence ${\xi}_{1},\mathrm{\dots},{\xi}_{T}$ of experiments. At time $t$, the prior $p(\theta )$ in (4) is replaced by the posterior given the previous experiments ${\xi}_{1},\mathrm{\dots},{\xi}_{t1}$ and their observed outcomes ${y}_{1},\mathrm{\dots},{y}_{t1}$, namely
$$p(\theta {\xi}_{1:t1},{y}_{1:t1})\propto p(\theta )\prod _{\tau =1}^{t1}p({y}_{\tau}\theta ,{\xi}_{\tau}).$$  (6) 
This now provides a means to design a sequence of adaptive experiments, wherein we use information gathered from previous iterations to improve the designs used at future iterations.
2.2 Estimating expected information gain
Making even a single point estimate of \acrshortEIG when solving (5) can be challenging because we must first estimate the unknown $p(y\xi )$ or $p(\theta y,\xi )$, and then take an expectation over $p(\theta )p(y\theta ,\xi )$. Nested Monte Carlo (NMC) estimators (nmc), which make a Monte Carlo approximation of both the inner and outer integrals, converge relatively slowly: at a rate $\mathcal{O}({T}^{1/3})$ in the total computational budget $T$.
foster2019variational noted that this approach is inefficient because it makes a separate \acrlongMC approximation of the integrand for every sample of the outer integral. To share information between different samples, they proposed a number of variational estimators that used amortization, i.e. they attempted to learn the functional form of the integrand rather than approximating it afresh each time. One of their approaches was based on amortized variational inference and required an inference network ${q}_{\varphi}(\theta y)$ which takes as input $\varphi ,y$ and outputs a distribution over $\theta $. For any ${q}_{\varphi}(\theta y)$, we can construct a lower bound on $I(\xi )$. This is the \acrfullBA, or posterior, lower bound (ba)
$${I}_{BA}(\xi ,\varphi )\triangleq {\mathbb{E}}_{p(\theta )p(y\theta ,\xi )}[\mathrm{log}{q}_{\varphi}(\theta y)]+H\left[p(\theta )\right],$$  (7) 
which has also found use representation learning (poole2018variational) and maximizing information transmission in noisy channels (ba).
To make highquality approximations to $I(\xi )$, and simultaneously learn a good posterior approximation, foster2019variational maximize this bound with respect to $\varphi $. This approach is most effective when the bound is tight, i.e. ${\mathrm{max}}_{\varphi}{I}_{BA}(\xi ,\varphi )=I(\xi )$. For ${I}_{BA}(\xi ,\varphi )$, this occurs when it is possible to have ${q}_{\varphi}(\theta y)=p(y\theta ,\xi )$, i.e. when the inference network is powerful enough to find the true posterior distribution for every $y$.
The required optimization is performed using SGA (robbins1951stochastic), which requires unbiased gradient estimators. For $\partial {I}_{BA}/\partial \varphi $, this is obtained by drawing $N$ Monte Carlo samples ${\theta}_{n},{y}_{n}\stackrel{\text{i.i.d.}}{\sim}p(\theta )p(y\theta ,\xi )$ and forming the estimator
$$\widehat{\frac{\partial {I}_{BA}}{\partial \varphi}}=\frac{1}{N}\sum _{n=1}^{N}\frac{\partial}{\partial \varphi}\mathrm{log}{q}_{\varphi}({\theta}_{n}{y}_{n}).$$  (8) 
To obtain highquality approximations of $I(\xi )$ even when the inference network cannot capture the true posterior, foster2019variational also considered another variational estimator: variational nested Monte Carlo (VNMC). This uses the inference network ${q}_{\varphi}(\theta y)$ in conjunction with additional samples to improve the estimate of the integrand. They showed that this leads to the following upper bound on $I(\xi )$
$${I}_{VNMC}(\xi ,\varphi ,L)\triangleq \mathbb{E}\left[\mathrm{log}\frac{p(y{\theta}_{0},\xi )}{\frac{1}{L}{\sum}_{\mathrm{\ell}=1}^{L}\frac{p({\theta}_{\mathrm{\ell}})p(y{\theta}_{\mathrm{\ell}},\xi )}{{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)}}\right],$$  (9) 
where the expectation is over $p({\theta}_{0})p(y{\theta}_{0},\xi ){q}_{\varphi}({\theta}_{1:L}y)$. The inference network in VNMC is trained by minimization, in the same way ${I}_{BA}$ is trained by maximization. ${I}_{VNMC}$ has the attractive feature that the bound becomes tight as the number of inner samples becomes large, i.e. $L\to \mathrm{\infty}$, even if ${q}_{\varphi}({\theta}_{\mathrm{\ell}}y)$ is not powerful enough to directly represent the true posterior.
2.3 Optimizing the EIG
The experimental design problem is to find the design that maximizes the EIG. Therefore, as well as finding a way to estimate EIG, existing approaches subsequently need to find a way of searching across $\mathrm{\Xi}$ to find promising designs. At a highlevel, most existing approaches propose a twostage procedure in which noisy estimates of $I(\xi )$ are made, and a separate optimization procedure selects the candidate design $\xi $ to evaluate next.
kleinegesse2018efficient and foster2019variational both use \acrfullBO for this outer optimization step, a blackbox optimization method (snoek2012practical) that is tolerant to noise in the estimates of the objective function, in this case $I(\xi )$. Some approaches (watson2017quest+; lyu2019ultra) instead select a finite number of candidate designs in $\mathrm{\Xi}$ and estimate $I(\xi )$ at each candidate, with some refining this process further by adaptively allocating computational resources between these designs (vincent2017; rainforth2017thesis). Another suggested approach is to use MCMC methods to carry out this outer optimization (amzal2006bayesian; muller2005simulation).
3 GRADIENTBASED DESIGN OPTIMIZATION
Our central proposal is to replace the twostage procedure outlined above with a single stage that both estimates $I(\xi )$ and optimizes $\xi $ simultaneously. This has the critical advantage of allowing SGA to be directly applied to the design optimization. Not only does this provide substantial computational gains over approaches which must construct separate estimates for each hypothetical design considered, but it also provides the potential to scale to substantially higher dimensional design problems than those which can be effectively tackled with existing approaches. Since we take gradients with respect to $\xi $, we henceforth assume that $\mathrm{\Xi}$ is continuous.
In our approach, we utilize variational lower bounds on $I$. Specifically, suppose we have a bound $\mathcal{L}(\xi ,\varphi )\le I(\xi )$ with variational parameters $\varphi $. For fixed $\xi $, the estimate of $I(\xi )$ improves as we maximize with respect to $\varphi $. We propose to maximize $\mathcal{L}$ jointly with respect to $(\xi ,\varphi )$. As we train $\varphi $, the variational approximation improves; as we train $\xi $ our design moves to regions where the lower bound on \acrshortEIG is largest. By tackling this as a single optimization problem over $(\xi ,\varphi )$, we obviate the need to have an outer optimizer for $\xi $. Using a lower bound is important because it allows us to perform a single maximization over $(\xi ,\varphi )$, rather than a more complex optimization such as the maxmin optimization that would result if we used an upper bound.
In practice, we do not have lower bounds on $I$ that we can evaluate and differentiate in closed form. Instead, we have lower bounds in the form of expectations over $p(\theta )p(y\theta ,\xi )$. Fortunately, we can still maximize the lower bound with respect to $(\xi ,\varphi )$ in this case by using SGA. Stochastic gradient methods have been applied in machine learning to train millions of parameters (bottou2010large) and are a good fit for highdimensional experimental design.
3.1 \acrfullBA
We now make our first concrete proposal for the lower bound $\mathcal{L}(\xi ,\varphi )$: the BA bound ${I}_{BA}$, as defined in (7). We now optimize $(\xi ,\varphi )$ jointly where previously only $\varphi $ was trained using gradients. To perform SGA, we require unbiased estimators for both $\partial {I}_{BA}/\partial \varphi $ and $\partial {I}_{BA}/\partial \xi $. We use equation (8) to estimate the $\varphi $gradient. For the $\xi $gradient we have
$\widehat{{\displaystyle \frac{\partial {I}_{BA}}{\partial \xi}}}={\displaystyle \frac{1}{N}}{\displaystyle \sum _{n=1}^{N}}\mathrm{log}{q}_{\varphi}({\theta}_{n}{y}_{n}){\displaystyle \frac{\partial}{\partial \xi}}\mathrm{log}p({y}_{n}{\theta}_{n},\xi )$  (10) 
where ${\theta}_{n},{y}_{n}\stackrel{\text{i.i.d.}}{\sim}p(\theta )p(y\theta ,\xi )$, which is an unbiased estimator of $\partial {I}_{BA}/\partial \xi $. This is a score function estimator (williams1992simple) and other possibilities are discussed in Section 3.6.
3.2 Adaptive contrastive estimation (ACE)
The BA bound provides one specific case of our onestage procedure for optimal experimental design. We now introduce a new lower bound that improves upon ${I}_{BA}$. The potential issue with the BA bound is that it may not be sufficiently tight, which happens when the inference network cannot represent the true posterior. One possible solution is to introduce additional samples, as in the VNMC estimator (9). However, we cannot use VNMC directly for a onestage procedure: since it is an upper bound, we must minimize it with respect to $\varphi $, but we still wish to maximize with respect to $\xi $. This precludes a joint maximization over $(\xi ,\varphi )$.
Looking more closely at the VNMC bound, we see that its main failure case is when the denominator strongly underestimates $p(y\xi )$, which can happen when all the inner samples ${\theta}_{1},\mathrm{\dots},{\theta}_{L}$ miss regions where the joint $p({\theta}_{\mathrm{\ell}})p(y{\theta}_{\mathrm{\ell}},\xi )$ is large. In addition to the samples ${\theta}_{1:L}$, we also have the original sample ${\theta}_{0}$ from which $y$ was sampled. One way to avoid the underestimation in the denominator would be to include this sample, giving
$${I}_{ACE}(\xi ,\varphi ,L)=\mathbb{E}\left[\mathrm{log}\frac{p(y{\theta}_{0},\xi )}{\frac{1}{L+1}{\sum}_{\mathrm{\ell}=0}^{L}\frac{p({\theta}_{\mathrm{\ell}})p(y{\theta}_{\mathrm{\ell}},\xi )}{{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)}}\right]$$  (11) 
where the expectation is with respect to $p({\theta}_{0})p(y{\theta}_{0},\xi )q({\theta}_{1:L}y)$. The samples ${\theta}_{1:L}$ can now be seen as contrasts to the original sample ${\theta}_{0}$. For this reason, we call ${\theta}_{1:L}$ contrastive samples and we call (11) the adaptive contrastive estimate (ACE) of EIG. The following theorem establishes that ${I}_{ACE}$ is a valid lower bound on the EIG and that the bound becomes tight as $L\to \mathrm{\infty}$.
Theorem 1.
For any model $p\mathit{}\mathrm{(}\theta \mathrm{)}\mathit{}p\mathit{}\mathrm{(}y\mathrm{}\theta \mathrm{,}\xi \mathrm{)}$ and inference network ${q}_{\varphi}\mathit{}\mathrm{(}\theta \mathrm{}y\mathrm{)}$, we have the following:

1.
${I}_{ACE}$ is a lower bound on the \acrshortEIG and we can characterize the error term as an expected KL divergence:
$I$ $(\xi ){I}_{ACE}(\xi ,\varphi ,L)$ $={\mathbb{E}}_{p(y\xi )}\left[KL\left(P({\theta}_{0:L}y)\right\left{\displaystyle \prod _{\mathrm{\ell}}}{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)\right)\right]\ge 0,$ $P$ $({\theta}_{0:L}y)={\displaystyle \frac{1}{L+1}}{\displaystyle \sum _{\mathrm{\ell}=0}^{L}}p({\theta}_{\mathrm{\ell}}y,\xi ){\displaystyle \prod _{k\ne \mathrm{\ell}}}{q}_{\varphi}({\theta}_{k}y).$ 
2.
As $L\to \mathrm{\infty}$, we recover the true \acrshortEIG:
${lim}_{L\to \mathrm{\infty}}{I}_{ACE}(\xi ,\varphi ,L)=I(\xi )$. 
3.
The ACE bound is monotonically increasing in $L$: ${I}_{ACE}(\xi ,\varphi ,{L}_{2})\ge {I}_{ACE}(\xi ,\varphi ,{L}_{1})$ for ${L}_{2}\ge {L}_{1}\ge 0$.

4.
If the inference network equals the true posterior ${q}_{\varphi}(\theta y)=p(\theta y,\xi )$, then ${I}_{ACE}(\xi ,\varphi ,L)=I(\xi ),\forall L$.
The proof is presented in Appendix A, along with Theorem 3 which concerns properties of the maximum of the ACE bound. Gradient estimation for ACE is discussed in Section 3.6. We note that, to the best of our knowledge, ${I}_{ACE}$ has not previously appeared in the \acrshortOED literature.^{1}^{1} 1 Aside from a recent blog post (sobolev2019thoughts) we do not believe this bound has previously been studied in any context.
3.3 Prior contrastive estimation (PCE)
Theorem 1 tells us that ${I}_{ACE}$ can become close to $I(\xi )$ in two scenarios: 1) the inference network becomes close to the true posterior $p(\theta y,\xi )$, 2) we increase the number of contrastive samples $L$. The BA bound only becomes tight in case 1). A special case of ACE is to replace the inference network ${q}_{\varphi}(\theta y)$ with a fixed distribution and rely on the contrastive samples to make good estimates of $I(\xi )$, only becoming tight in case 2), i.e. as $L\to \mathrm{\infty}$. This simplification can speed up training, since we no longer need to learn additional parameters $\varphi $.
To explore this, we propose the prior contrastive estimation (PCE) bound, in which the prior $p(\theta )$ is used to generate contrastive samples:
$${I}_{PCE}(\xi ,L)\triangleq \mathbb{E}\left[\mathrm{log}\frac{p(y\theta ,\xi )}{\frac{1}{L+1}{\sum}_{\mathrm{\ell}=0}^{L}p(y{\theta}_{\mathrm{\ell}},\xi )}\right],$$  (12) 
where the expectation is over $p({\theta}_{0})p(y{\theta}_{0},\xi )p({\theta}_{1:L})$. Whilst inherently less powerful than ACE, PCE can be effective when the prior and posterior are not too different, so $p(\theta )$ is a suitable proposal to estimate $p(y\xi )$.
Though, to the best of our knowledge, this bound has not been applied to \acrshortOED before, we note that it shares a connection to the information noise contrastive estimation (InfoNCE) bound on mutual information used in representation learning (oord2018representation). Given $K$ data samples ${x}_{k}$, corresponding representations ${z}_{k}$, and a learned critic ${f}_{\psi}(x,z)\ge 0$, we have
$$\text{MI}(x;z)\ge \mathbb{E}\left[\frac{1}{K}\sum _{k=1}^{K}\mathrm{log}\frac{{f}_{\psi}({x}_{k},{z}_{k})}{\frac{1}{K}{\sum}_{\mathrm{\ell}=1}^{K}{f}_{\psi}({x}_{\mathrm{\ell}},{z}_{k})}\right]$$  (13) 
where the expectation is over $p(x)p(zx)$, $p(x)$ is the data distribution, and $p(zx)$ is the encoder. poole2018variational showed that the encoder density $p(zx)$ is the optimal critic, although it is rarely known in closed form in the representation learning context. Writing $\theta $ for $x$ and $y$ for $z$, we note the mathematical connection between this optimal case and ${I}_{PCE}$.
3.4 Likelihoodfree ACE
In some models such as random effects models, the likelihood $p(y\theta ,\xi )$ is not known in closed form but can be sampled from. This presents a problem when computing ${I}_{ACE}$ or its derivatives because the likelihood appears in (11). To allow ACE to be used for these kinds of models, we now show that using a unnormalized approximation to the likelihood still results in a valid lower bound on \acrshortEIG. In fact, if using a parametrized likelihood approximation ${f}_{\psi}$, it is then possible to train $\psi $ jointly with $(\xi ,\varphi )$ to approximate the likelihood, learn an inference network, and find the optimal design through the solution to a single optimization problem. The following theorem, whose proof is presented in Appendix A, shows that replacing the likelihood with an unnormalized approximation does result in a valid lower bound on \acrshortEIG.
Theorem 2.
Consider a model $p\mathit{}\mathrm{(}\theta \mathrm{)}\mathit{}p\mathit{}\mathrm{(}y\mathrm{}\theta \mathrm{,}\xi \mathrm{)}$ and inference network ${q}_{\varphi}\mathit{}\mathrm{(}\theta \mathrm{}y\mathrm{)}$. Let ${f}_{\psi}\mathit{}\mathrm{(}\theta \mathrm{,}y\mathrm{)}\mathrm{\ge}\mathrm{0}$ be an unnormalized likelihood approximation. Then,
$$I(\xi )\ge \mathbb{E}\left[\mathrm{log}\frac{{f}_{\psi}({\theta}_{0},y)}{\frac{1}{L+1}{\sum}_{\mathrm{\ell}=0}^{L}\frac{p({\theta}_{\mathrm{\ell}}){f}_{\psi}({\theta}_{\mathrm{\ell}},y)}{{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)}}\right]$$  (14) 
where the expectation is over $p\mathit{}\mathrm{(}{\theta}_{\mathrm{0}}\mathrm{)}\mathit{}p\mathit{}\mathrm{(}y\mathrm{}{\theta}_{\mathrm{0}}\mathrm{,}\xi \mathrm{)}\mathit{}{q}_{\varphi}\mathit{}\mathrm{(}{\theta}_{\mathrm{1}\mathrm{:}L}\mathrm{}y\mathrm{)}$.
3.5 Iterated experimental design with ACE
In iterated experimental design, we replace $p(\theta )$ by $p(\theta {y}_{1:t1},{\xi}_{1:t1})$ as per (6). We can sample $p(\theta {y}_{1:t1},{\xi}_{1:t1})$ by performing inference. Whilst variational inference also provides a closed form estimate of the posterior density, some other inference methods do not. This is problematic because the prior density appears in (11). Fortunately, it is sufficient to know the density up to proportionality (foster2019variational). Indeed if $p(\theta )=A\cdot \gamma (\theta )$ where $A$ does not depend on $(\xi ,\varphi ,y)$ and $\gamma $ is an unnormalized density, then
$$I(\xi )\ge \mathbb{E}\left[\mathrm{log}\frac{p(y{\theta}_{0},\xi )}{\frac{1}{L+1}{\sum}_{\mathrm{\ell}=0}^{L}\frac{\gamma ({\theta}_{\mathrm{\ell}})p(y{\theta}_{\mathrm{\ell}},\xi )}{{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)}}\right]\mathrm{log}A$$  (15) 
and the derivatives of $\mathrm{log}A$ are simply zero.
3.6 Gradient estimation for ACE
To optimize the ACE bound with respect to $(\xi ,\varphi )$ it is necessary to have unbiased gradient estimators of $\partial {I}_{ACE}/\partial \xi $ and $\partial {I}_{ACE}/\partial \varphi $.
The simplest form of the $\xi $gradient is
$$\frac{\partial {I}_{ACE}}{\partial \xi}=\mathbb{E}\left[\frac{\partial g}{\partial \xi}+g\cdot \frac{\partial}{\partial \xi}\mathrm{log}p(y{\theta}_{0},\xi )\right].$$  (16) 
where the expectation is with respect to $p({\theta}_{0})p(y\theta ,\xi )q({\theta}_{1:L}y)$, and
$$g(y,{\theta}_{0:L},\varphi ,\xi )=\mathrm{log}\frac{p(y{\theta}_{0},\xi )}{\frac{1}{L+1}{\sum}_{\mathrm{\ell}=0}^{L}\frac{p({\theta}_{\mathrm{\ell}})p(y{\theta}_{\mathrm{\ell}},\xi )}{{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)}}.$$  (17) 
Estimating the expectation (16) directly using Monte Carlo gives the score function, or REINFORCE, estimator (williams1992simple). Unfortunately, this is often high variance, and reducing gradient estimator variance is often important in solving challenging experimental design problems.
One variance reduction method is reparameterization. For this, we introduce random variables $\u03f5,{\u03f5}_{1:L}^{\prime}$ whose distributions do not depend on $(\xi ,\varphi )$ along with representations of $y$ and $\theta $ as deterministic functions of these variables:
$y=y({\theta}_{0},\xi ,\u03f5)\mathit{\hspace{1em}\hspace{1em}}{\theta}_{\mathrm{\ell}}=\theta (y,\varphi ,{\u03f5}_{\mathrm{\ell}}^{\prime}).$  (18) 
This now permits a reparameterized form of the gradient:
$$\frac{\partial {I}_{ACE}}{\partial \xi}=\mathbb{E}\left[\frac{\partial g}{\partial \xi}+\frac{\partial g}{\partial y}\frac{\partial y}{\partial \xi}+\sum _{\mathrm{\ell}=1}^{L}\frac{\partial g}{\partial {\theta}_{\mathrm{\ell}}}\frac{\partial {\theta}_{\mathrm{\ell}}}{\partial y}\frac{\partial y}{\partial \xi}\right]$$  (19) 
where the expectation is over $p({\theta}_{0})p(\u03f5)p({\u03f5}_{1:L}^{\prime})$. A Monte Carlo approximation of this expectation is typically a much lower variance estimator for the true $\xi $gradient.
Alternatively, if $y$ is a discrete random variable we can sum over the possible values $\mathcal{Y}$. This approach is known as RaoBlackwellization and gives
$$\frac{\partial {I}_{ACE}}{\partial \xi}=\sum _{y\in \mathcal{Y}}\mathbb{E}\left[\frac{\partial g}{\partial \xi}p(y{\theta}_{0},\xi )+g\frac{\partial}{\partial \xi}p(y{\theta}_{0},\xi )\right]$$  (20) 
where the expectation is now over $p({\theta}_{0}){\prod}_{\mathrm{\ell}=1}^{L}{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)$.
4 EXPERIMENTS
In our experiments, we learn optimal experimental designs in five scenarios: the death process, a well known twodimensional design problem from epidemiology; a nonconjugate regression model with a 400dimensional design; an ablation study in the setting of advertising; a realworld biomolecular docking problem from pharmacology in 100 dimensions; and a constant elasticity of substitution iterated experimental design problem in behavioural economics with 6 dimensional designs.
4.1 Evaluating experimental designs
We first discuss which metrics we will use to judge the quality of the designs we obtain.
Our primary metric on designs is, of course, the \acrshortEIG. We prefer designs with high \acrshortEIGs. In some cases, we can evaluate the \acrshortEIG analytically. In other cases, we can use a sufficiently large number of samples in a \acrshortNMC (nmc) estimator to be sure that we have estimates that are sufficiently accurate to compare designs.
However, to explore the limits of our methods, we will also consider scenarios where neither of these approaches is suitable. In these cases, we pair the ACE lower bound (with $\xi $ fixed for evaluation) with the VNMC upper bound (foster2019variational) to trap the true \acrshortEIG value—if the lower bound of one design is higher than the upper bound for another, we can be sure that the first design is superior (noting that the bounds themselves can be tractably estimated to a very high accuracy).
In some settings, when we know the true optimal design ${\xi}^{*}$, we will also consider the design error $\parallel {\xi}^{*}\xi \parallel $, i.e. how close our design is to the optimal design.
In iterated experiment design, as well as designing experiments, we must also perform inference on the latent variable $\theta $ after each iteration. Here, we also investigate the quality of the final posterior. Specifically, if $p(\theta {y}_{1:t},{\xi}_{1:t})$ is the posterior after $t$ experiments, we use the posterior entropy, and the posterior RMSE ${\mathbb{E}}_{\theta \sim p(\theta {y}_{1:t},{\xi}_{1:t})}{\left[{(\theta {\theta}^{*})}^{2}\right]}^{1/2}$. We prefer low entropy and low RMSE values.
4.2 Death process
We consider an example from epidemiology, the death process (cook2008optimal; kleinegesse2018efficient), in which a population of $N=10$ individuals transitions from healthy to infected states at a constant but unknown rate $\theta $. We can measure the number of infected individuals at two different times ${\xi}_{1}$ and ${\xi}_{1}+{\xi}_{2}$ where ${\xi}_{1},{\xi}_{2}\ge 0$. Our aim is to infer the infection rate $\theta $ from these observations. For full details of the prior and likelihood used, see Appendix B.1.
On this problem, we apply gradient methods with RaoBlackwellization (there are 66 possible outcomes). Figure 1 shows a sample optimization trajectory with the approximate EIG surface for illustration. We compare against \acrshortBO using the RaoBlackwellized NMC estimator of vincent2017. Figure 2 shows that, for the allowed time budget, all gradient methods perform better than BO even on this twodimensional problem.
4.3 Regression
We now compare our onestage gradient approaches to experimental design against a twostage baseline on a highdimensional design problem. We choose a general purpose Bayesian linear regression model with $n$ observations and $p$ features. The design $\xi $ is an $n\times p$ matrix; the latent variables are $\theta =(\mathbf{w},\sigma )$, where $\mathbf{w}$ is the $p$ dimensional regression coefficient and ${\sigma}^{2}$ is the scalar variance. The $n$ outcomes are generated using a Normal likelihood ${y}_{i}\sim N({\bm{\xi}}_{i}\cdot \mathbf{w},\sigma )$ for $i=1,\mathrm{\dots},n$. Here ${\bm{\xi}}_{i}$ is the $i$th row of $\xi $. To avoid trivial solutions, we enforce the constraint ${\parallel {\bm{\xi}}_{i}\parallel}_{1}=1$ for all $i$. We use independent priors ${w}_{j}\sim \text{Laplace}(1)$ for $j=1,\mathrm{\dots},p$ and $\sigma \sim \text{Exp}(1)$. See Appendix B.2 for complete details.
We set $n=p=20$ and applied five methods to this 400 dimensional design problem: BA, ACE and PCE, as well as the VNMC estimator of foster2019variational, with both \acrshortBO and random search to optimize over $\mathrm{\Xi}$. The results are presented in Table 1. We note that the gradient methods strongly outperform the gradientfree baselines, with about double the final EIG.
Method 
EIG l.b.  EIG u.b. 

ACE  $16.1\pm 0.1$  $20.7\pm 0.2$ 
PCE  $16.6\pm 0.1$  $21.5\pm 0.2$ 
BA  $16.4\pm 0.2$  $21.1\pm 0.2$ 
BO + VNMC  $7.3\pm 0.1$  $9.6\pm 0.1$ 
Random Search + VNMC  $7.1\pm 0.1$  $9.4\pm 0.1$ 
4.4 Advertising
We now conduct a detailed ablation study on the effects of dimension on the quality of experimental designs produced using our gradient approaches and \acrshortBO. To further isolate the distinction between onestage and twostage approaches to \acrshortOED, we choose a setting in which we can compute $I(\xi )$ analytically. We give \acrshortBO, but not the gradient methods, access to the EIG oracle when making point evaluations of $I(\xi )$. Thus we put \acrshortBO in the best possible position and ensure any gains are due to improvements from using gradientbased optimization.
Suppose that we are given an advertising budget of $B$ dollars that we need to allocate among $D$ regions, i.e. we choose $\bm{\xi}\ge \mathrm{\U0001d7ce}$ with ${\sum}_{i=1}^{D}{\xi}_{i}=B$. After conducting an ad campaign, we observe a vector of sales $\bm{y}$. We use this data to make inferences about the underlying market opportunities $\bm{\theta}$ in each region. Our prior incorporates the knowledge that neighbouring regions are more correlated than distant ones—this leads to an interesting experimental design problem because information can be pooled between regions. We can also compute the true EIG and optimal design ${\xi}^{*}$ analytically. For full details, see Appendix B.3.
We compare the performance of four estimation and optimization methods on this problem, see Fig. 3 for the results. The three gradientbased methods (ACE, PCE, BA) perform best, with the BO baseline struggling in dimensions $D\ge 6$, even though the latter has access to an EIG oracle. PCE performed well in low dimensions, but degraded as the dimension increases and sampling from the prior becomes increasingly inefficient—something that ACE and BA avoid by learning adaptive proposal distributions. We note that since in this case the family of variational distributions used in ACE and BA include the true posterior, both methods yield similar performance.
4.5 Biomolecular docking
We now consider an experimental design problem of interest to the pharmacology community. Having demonstrated that our onestage gradient methods compare favourably with two stage approaches, we now compare against designs crafted by domain experts.
In molecular docking, computational techniques are used to predict the binding affinity between a compound and a receptor. When synthesized in the lab, the two may bind—this is called a hit. Learning a wellcalibrated hitrate model can guide how many compounds to evaluate for additional objectives, such as druglikeness or toxicity, before experimental testing. lyu2019ultra modelled the probability of outcome ${y}_{i}$ being a hit, given the predicted binding affinity, or docking score, ${\xi}_{i}\in [75,0]$, as
$$p({y}_{i}=1\theta ,\xi )=\text{bottom}+\frac{\text{top}\text{bottom}}{1+{e}^{({\xi}_{i}\text{ee50})\times \text{slope}}}$$  (21) 
where $\theta =(\text{top},\text{bottom},\text{ee50},\text{slope})$ with priors given in Appendix B.4.
Of 150 million compounds, lyu2019ultra selected a batch of compounds to experimentally test to best fit the sigmoid hitrate model. They considered 6 candidate designs and selected one that maximized the EIG estimated by NMC. Here, we instead apply gradientbased BOED to search across candidate designs which consist of 100 docking scores ${\xi}_{1},\mathrm{\dots},{\xi}_{100}$. To evaluate our final designs, we present upper and lower bounds on the final EIG: see Table 2. We see that all gradient methods are able to outperform experts in terms of \acrshortEIG, and that ACE appears the best of the gradient methods. Figure 5 shows our designs are qualitatively different to those produced by experts.
4.6 Constant elasticity of substitution
Method  EIG lower bound  EIG upper bound 

ACE  $\mathbf{1.0835}\pm \mathbf{0.0003}$  $\mathbf{1.0852}\pm \mathbf{0.0001}$ 
PCE  $1.0825\pm 0.0002$  $1.0839\pm 0.0002$ 
BA  $1.0780\pm 0.0003$  $1.0794\pm 0.0003$ 
Expert  $1.0191$  $1.0227$ 
We now turn to iterated experimental design in which we produce designs, generate data and make inference repeatedly. This problem therefore captures the endtoendprocess of experimentation and inference.
We consider an experiment in behavioural economics that was previously also considered by foster2019variational. In this experiment, a participant is asked to compare baskets $\mathbf{x},{\mathbf{x}}^{\prime}$ of goods. The model assumes that their response (on a slider) is based on the difference in utility of the baskets, and the constant elasticity of substitution (CES) model (arrow1961capital) governed by latent variables $(\rho ,\bm{\alpha},u)$ is then used for this utility. The aim is to learn $(\rho ,\bm{\alpha},u)$ characterizing the participant’s utility. In the experiment, there are 20 sequential steps of experimentation with the same participant. We compare our gradientbased approach against the most successful approach of foster2019variational that approximates the marginal density to form an upper bound on EIG, and BO to optimize $\xi $. For full details, see Appendix B.5.
Figure 4 shows that gradientbased methods are effective on this problem; both ACE and PCE decrease the posterior entropy and RMSEs on the latent variables faster and further than the baseline, whereas BA does not do so well. We suggest that the similar performance of ACE and PCE is due to the smaller changes in the posterior at middle and late steps, after much data has been accumulated: when the posterior does not change much at each step, $p(\theta {y}_{1:t1},{\xi}_{1:t1})$ forms an effective proposal for estimating $p({y}_{t}{\xi}_{t})$.
5 CONCLUSIONS
We have introduced a new approach for Bayesian experimental design that does away with the two stages of estimating EIG and separately optimizing over $\mathrm{\Xi}$. We use stochastic gradients to maximize a lower bound on $I(\xi )$ and so find optimal designs by solving a single optimization problem. This unification leads to substantially improved performance, especially on highdimensional design problems.
Of the three lower bounds, ${I}_{BA},{I}_{ACE}$ and ${I}_{PCE}$, we note that in all five experiments ACE generally did as well as the better of BA and PCE: we therefore recommend it as the default choice. BA performed well when the inference network could closely approximate the true posterior; PCE performed well when the prior was an adequate proposal for estimating $p(y\xi )$ and does not require the training of variational parameters.
Acknowledgements
AF gratefully acknowledges funding from EPSRC grant no. EP/N509711/1. YWT’s research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/20072013) ERC grant agreement no. 617071. TR gratefully acknowledges funding from Tencent AI Labs and a junior research fellowship supported by Christ Church, Oxford.
References
Appendix A GRADIENTBASED DESIGN OPTIMIZATION
We begin with the proof of Theorem 1, which we restate for convenience.
See 1
Proof.
To begin with 1., we have the error term $\delta =I(\xi ){I}_{ACE}(\xi ,\varphi ,L)$ which can be written
$\delta $  $=\mathbb{E}\left[\mathrm{log}{\displaystyle \frac{\frac{1}{L+1}{\sum}_{\mathrm{\ell}=0}^{L}\frac{p({\theta}_{\mathrm{\ell}})p(y{\theta}_{\mathrm{\ell}},\xi )}{{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)}}{p(y\xi )}}\right]$  (22)  
$=\mathbb{E}\left[\mathrm{log}{\displaystyle \frac{\frac{1}{L+1}{\sum}_{\mathrm{\ell}=0}^{L}p({\theta}_{\mathrm{\ell}}y){\prod}_{k\ne \mathrm{\ell}}{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)}{{\prod}_{\mathrm{\ell}=0}^{L}{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)}}\right]$  (23)  
$=\mathbb{E}\left[\mathrm{log}{\displaystyle \frac{P({\theta}_{0:\mathrm{\ell}}y)}{{\prod}_{\mathrm{\ell}=0}^{L}{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)}}\right]$  (24) 
where the expectation is over $p(y\xi )p({\theta}_{0}y,\xi ){\prod}_{\mathrm{\ell}=1}^{L}{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)$. Note that the integrand is symmetric under a permutation of the labels $0,\mathrm{\dots},L$, so its expectation will be the same under the distribution $p(y\xi )p({\theta}_{\mathrm{\ell}}y,\xi ){\prod}_{k\ne \mathrm{\ell}}{q}_{\varphi}({\theta}_{k}y)$. This in turn implies that we can consider this as an expectation under $P$
$$\delta ={\mathbb{E}}_{p(y\xi )P({\theta}_{0:L}y)}\left[\mathrm{log}\frac{P({\theta}_{0:L}y)}{{\prod}_{\mathrm{\ell}=0}^{L}{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)}\right]$$  (25) 
which is the expected KL divergence required. We therefore have $\delta \ge 0$.
For 2., we assume that $p(\theta )p(y\theta ,\xi )/{q}_{\varphi}(\theta y)$ is bounded. The ACE denominator is a consistent estimator of the marginal likelihood. Indeed,
$$\frac{1}{L+1}\frac{p({\theta}_{0})p(y{\theta}_{0},\xi )}{{q}_{\varphi}({\theta}_{0}y)}\to 0\text{by boundedness}$$  (26) 
and
$$\frac{1}{L+1}\sum _{\mathrm{\ell}=1}^{L}\frac{p({\theta}_{\mathrm{\ell}})p(y{\theta}_{\mathrm{\ell}},\xi )}{{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)}\to p(y\xi )a.s.$$  (27) 
as $L\to \mathrm{\infty}$ by the Strong Law of Large Numbers, since
$${\mathbb{E}}_{{q}_{\varphi}(\theta y)}\left[\frac{p(\theta )p(y\theta ,\xi )}{{q}_{\varphi}(\theta y)}\right]=p(y\xi ).$$  (28) 
This establishes the a.s. pointwise convergence of the ACE integrand to $\mathrm{log}p(y{\theta}_{0},\xi )/p(y\xi )$. Hence by Bounded Convergence Theorem,
$${\widehat{I}}_{ACE}(\xi ,\varphi ,L)\to I(\xi )$$  (29) 
as $L\to \mathrm{\infty}$.
To establish 3., we let $\epsilon ={I}_{ACE}(\xi ,\varphi ,{L}_{2}){I}_{ACE}(\xi ,\varphi ,{L}_{1})$. Then
$\epsilon $  $=\mathbb{E}\left[\mathrm{log}{\displaystyle \frac{\frac{1}{{L}_{1}+1}{\sum}_{\mathrm{\ell}=0}^{{L}_{1}}\frac{p({\theta}_{\mathrm{\ell}})p(y{\theta}_{\mathrm{\ell}},\xi )}{q({\theta}_{\mathrm{\ell}}y)}}{\frac{1}{{L}_{2}+1}{\sum}_{\mathrm{\ell}=0}^{{L}_{2}}\frac{p({\theta}_{\mathrm{\ell}})p(y{\theta}_{\mathrm{\ell}},\xi )}{q({\theta}_{\mathrm{\ell}}y)}}}\right]$  (30)  
$=\mathbb{E}\left[\mathrm{log}{\displaystyle \frac{Q({\theta}_{0:{L}_{2}}y)}{\frac{1}{{L}_{2}+1}{\sum}_{\mathrm{\ell}=0}^{{L}_{2}}p({\theta}_{\mathrm{\ell}}y){\prod}_{k\ne \mathrm{\ell}}q({\theta}_{\mathrm{\ell}}y)}}\right]$  (31) 
where the expectation is over $p(y\xi )p({\theta}_{0}y,\xi ){\prod}_{\mathrm{\ell}=1}^{{L}_{2}}q({\theta}_{\mathrm{\ell}}y)$ and
$$Q({\theta}_{0:{L}_{2}}y)=\frac{1}{{L}_{1}+1}\sum _{\mathrm{\ell}=0}^{{L}_{1}}p({\theta}_{\mathrm{\ell}}y)\prod _{k\ne \mathrm{\ell}}^{{L}_{2}}q({\theta}_{\mathrm{\ell}}y).$$  (32) 
By symmetry, we can consider this as an expectation over $Q$ (we can permute the labels $0,\mathrm{\dots},{L}_{1}$). We therefore recognise $\epsilon $ as the expectation of a KL divergence. Hence $\epsilon \ge 0$ as required.
4. follows by Bayes Theorem, i.e.
$$\frac{p(\theta )p(y\theta ,\xi )}{p(\theta y,\xi )}=p(y\xi ).$$  (33) 
which completes the proof. ∎
We also present the proof of Theorem 2.
See 2
Proof.
Initially, we note that the contrastive samples ${\theta}_{1},\mathrm{\dots},{\theta}_{L}$ do not carry additional information about ${\theta}_{0}$. We consider the mutual information between ${\theta}_{0}$ and the random variable $(y,{\theta}_{1},\mathrm{\dots},{\theta}_{L})$. Using the Chain Rule for mutual information we have
$$\text{MI}({\theta}_{0};(y,{\theta}_{1},\mathrm{\dots},{\theta}_{L}))=\text{MI}({\theta}_{0};y)+\text{MI}({\theta}_{0};({\theta}_{1},\mathrm{\dots},{\theta}_{L})y)$$  (34) 
Now $\text{MI}({\theta}_{0};({\theta}_{1},\mathrm{\dots},{\theta}_{L})y)=0$ since ${\theta}_{\mathrm{\ell}}$ ($\mathrm{\ell}>0$) are conditionally independent of ${\theta}_{0}$ given $y$. Therefore
$$\text{MI}({\theta}_{0};(y,{\theta}_{1},\mathrm{\dots},{\theta}_{L}))=\text{MI}({\theta}_{0};y)=I(\xi ).$$  (35) 
We now use the DonskerVaradhan representation of mutual information (donsker1975asymptotic). Specifically, for random variables $A,B$ with joint distribution $p(a,b)$ and any measurable function $T(a,b)$ we have
$$\text{MI}(a;b)\ge {\mathbb{E}}_{p(a,b)}[T(a,b)]\mathrm{log}{\mathbb{E}}_{p(a)p(b)}\left[{e}^{T(a,b)}\right].$$  (36) 
We now use this representation with $a={\theta}_{0},b=(y,{\theta}_{1},\mathrm{\dots},{\theta}_{L})$ and $T(a,b)$ the integrand
$$T({\theta}_{0},(y,{\theta}_{1:L}))=\mathrm{log}\frac{{f}_{\psi}({\theta}_{0},y)}{\frac{1}{L+1}{\sum}_{\mathrm{\ell}=0}^{L}\frac{p({\theta}_{\mathrm{\ell}}){f}_{\psi}({\theta}_{\mathrm{\ell}},y)}{{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)}}.$$  (37) 
We compute the second term in (36), $Z={\mathbb{E}}_{p(a)p(b)}\left[{e}^{T(a,b)}\right]$.
$Z$  $={\mathbb{E}}_{p({\theta}_{0})p(y\xi ){q}_{\varphi}({\theta}_{1:L}y)}\left[{\displaystyle \frac{{f}_{\psi}({\theta}_{0},y)}{\frac{1}{L+1}{\sum}_{\mathrm{\ell}=0}^{L}\frac{p({\theta}_{\mathrm{\ell}}){f}_{\psi}({\theta}_{\mathrm{\ell}},y)}{{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)}}}\right]$  (38)  
$={\mathbb{E}}_{p(y\xi ){q}_{\varphi}({\theta}_{0:L}y)}\left[{\displaystyle \frac{\frac{p({\theta}_{0}){f}_{\psi}({\theta}_{0},y)}{{q}_{\varphi}({\theta}_{0}y)}}{\frac{1}{L+1}{\sum}_{\mathrm{\ell}=0}^{L}\frac{p({\theta}_{\mathrm{\ell}}){f}_{\psi}({\theta}_{\mathrm{\ell}},y)}{{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)}}}\right]$  (39)  
$={\mathbb{E}}_{p(y\xi ){q}_{\varphi}({\theta}_{0:L}y)}\left[{\displaystyle \frac{\frac{1}{L+1}{\sum}_{\mathrm{\ell}=0}^{L}\frac{p({\theta}_{\mathrm{\ell}}){f}_{\psi}({\theta}_{\mathrm{\ell}},y)}{{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)}}{\frac{1}{L+1}{\sum}_{\mathrm{\ell}=0}^{L}\frac{p({\theta}_{\mathrm{\ell}}){f}_{\psi}({\theta}_{\mathrm{\ell}},y)}{{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)}}}\right]$  (40)  
$=1$  (41) 
where the second to last line follows by symmetry. This establishes that $\mathrm{log}Z=0$, and so (14) constitutes a valid lower bound on $I(\xi )$. That is
$I(\xi )$  $\ge \mathbb{E}\left[\mathrm{log}{\displaystyle \frac{{f}_{\psi}(y,{\theta}_{0})}{\frac{1}{L+1}{\sum}_{\mathrm{\ell}=0}^{L}\frac{p(\theta ){f}_{\psi}(y,{\theta}_{\mathrm{\ell}})}{{q}_{\varphi}({\theta}_{\mathrm{\ell}},y)}}}\right]$  (42) 
which completes the proof. ∎
The following theorem establishes a condition under which the maximum of the ACE objective converges to the maximum of the EIG as $L\to \mathrm{\infty}$.
Theorem 3.
Consider a model $p\mathit{}\mathrm{(}\theta \mathrm{)}\mathit{}p\mathit{}\mathrm{(}y\mathrm{}\theta \mathrm{,}\xi \mathrm{)}$ such that
$$  (43) 
and $$. Let ${q}_{\varphi}\mathit{}\mathrm{(}\theta \mathrm{}y\mathrm{)}$ be an inference network and let
$${I}_{L}=\underset{\xi \in \mathrm{\Xi},\varphi \in \mathrm{\Phi}}{sup}{I}_{ACE}(\xi ,\varphi ,L).$$  (44) 
Then,
$$0\le {I}^{*}{I}_{L}\le \frac{C1}{L+1}$$  (45) 
and in particular ${I}_{L}\mathrm{\to}{I}^{\mathrm{*}}$ as $L\mathrm{\to}\mathrm{\infty}$.
Proof.
We have $0\le {I}^{*}{I}_{L}$ since ${I}_{ACE}$ is a lower bound on $I(\xi )$ by Theorem 1.
Next, we consider $\mathrm{\Delta}(\xi ,\varphi ,L)=I(\xi ){I}_{ACE}(\xi ,\varphi ,L)$. We have
$$\mathrm{\Delta}={\mathbb{E}}_{p({\theta}_{0})p(y{\theta}_{0},\xi ){q}_{\varphi}({\theta}_{1:L}y)}\left[\mathrm{log}\frac{{Y}_{L}}{p(y\xi )}\right]$$  (46) 
where
$${Y}_{L}=\frac{1}{L+1}\sum _{\mathrm{\ell}=0}^{L}{w}_{\mathrm{\ell}}\mathit{\hspace{1em}}\text{and}\mathit{\hspace{1em}}{w}_{\mathrm{\ell}}=\frac{p({\theta}_{\mathrm{\ell}})p(y{\theta}_{\mathrm{\ell}},\xi )}{{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)};$$  (47) 
we write (46) as
$$\mathrm{\Delta}=\mathbb{E}\left[\mathrm{log}\left(1+\frac{{Y}_{L}p(y\xi )}{p(y\xi )}\right)\right]$$  (48) 
and we apply the inequality $\mathrm{log}(1+x)\le x$ to give
$$\mathrm{\Delta}\le \mathbb{E}\left[\frac{{Y}_{L}p(y\xi )}{p(y\xi )}\right].$$  (49) 
We now observe that for $\mathrm{\ell}>0$, ${\mathbb{E}}_{{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)}[{w}_{\mathrm{\ell}}]=p(y\xi )$ and hence, taking a partial expectation over ${\theta}_{1:L}$ we have
$\mathrm{\Delta}$  $\le {\mathbb{E}}_{p({\theta}_{0})p(y{\theta}_{0},\xi )}\left[{\displaystyle \frac{{w}_{0}p(y\xi )}{(L+1)p(y\xi )}}\right]$  (50)  
$\le {\displaystyle \frac{1}{L+1}}\left({\mathbb{E}}_{p({\theta}_{0})p(y{\theta}_{0},\xi )}\left[{\displaystyle \frac{p({\theta}_{0}y,\xi )}{{q}_{\varphi}({\theta}_{0}y)}}\right]1\right)$  (51) 
Hence
${I}^{*}{I}_{L}$  $=\underset{\xi \in \mathrm{\Xi}}{sup}I(\xi )\underset{\xi \in \mathrm{\Xi},\varphi \in \mathrm{\Phi}}{sup}{I}_{ACE}(\xi ,\varphi ,L)]$  (52)  
$\le \underset{\xi \in \mathrm{\Xi}}{sup}[I(\xi )\underset{\varphi \in \mathrm{\Phi}}{sup}{I}_{ACE}(\xi ,\varphi ,L)]$  (53)  
$\le \underset{\xi \in \mathrm{\Xi}}{sup}\underset{\varphi \in \mathrm{\Phi}}{inf}[\mathrm{\Delta}(\xi ,\varphi ,L)]$  (54)  
$\le {\displaystyle \frac{C1}{L+1}}$  (55) 
as required. ∎
A.1 Double reparametrization
We have the $\varphi $gradient
$$\frac{\partial {I}_{ACE}}{\partial \varphi}=\mathbb{E}[\frac{\partial}{\partial \varphi}{\mathbb{E}}_{q({\theta}_{1:L}y)}\left[\mathrm{log}\left(\sum _{\mathrm{\ell}=0}^{L}{w}_{\mathrm{\ell}}\right)\right{\theta}_{0},y]]$$  (56) 
where
$${w}_{\mathrm{\ell}}=\frac{p({\theta}_{\mathrm{\ell}})p(y{\theta}_{\mathrm{\ell}},\xi )}{{q}_{\varphi}({\theta}_{\mathrm{\ell}}y)}$$  (57) 
and the outer expectation is over $p({\theta}_{0})p(y{\theta}_{0},\xi )$. If ${q}_{\varphi}(\theta y)$ is reparameterizable as a function of $\varphi $, then we can apply double reparameterization to this gradient. Indeed, were it not for the ${w}_{0}$ term, this would be exactly the IWAE of burda2015importance. We can exploit the double reparameterization of tucker2018doubly with a minor variation to account for ${w}_{0}$ to obtain a low variance gradient estimator.
The doubly reparametrized gradient for ACE takes the form
$$\frac{\partial {I}_{ACE}}{\partial \varphi}={\mathbb{E}}_{p({\theta}_{0})p(y{\theta}_{0},\xi ){q}_{\varphi}({\theta}_{1:L}y)}\left[\sum _{\mathrm{\ell}=0}^{L}{v}_{\mathrm{\ell}}\right]$$  (58) 
where
$${v}_{0}=\frac{{w}_{0}}{{\sum}_{m=0}^{L}{w}_{m}}\frac{\partial}{\partial \varphi}\mathrm{log}q({\theta}_{0}y)$$  (59) 
and for $\mathrm{\ell}>0$
$${v}_{\mathrm{\ell}}={\left(\frac{{w}_{\mathrm{\ell}}}{{\sum}_{m=0}^{L}{w}_{m}}\right)}^{2}\frac{\partial \mathrm{log}{w}_{\mathrm{\ell}}}{\partial {\theta}_{\mathrm{\ell}}}\frac{\partial {\theta}_{\mathrm{\ell}}}{\partial \varphi}.$$  (60) 
A.2 Alternative gradient
We begin with an observation: the true integrand when computing the EIG as an expectation over $p({\theta}_{0})p(y{\theta}_{0},\xi )$ is given by
$${g}_{*}(y,{\theta}_{0},\xi )=\mathrm{log}\frac{p(y{\theta}_{0},\xi )}{p(y\xi )}.$$  (61) 
Recall the score function identity
$${\mathbb{E}}_{p(x\xi )}\left[\frac{\partial}{\partial \xi}\mathrm{log}p(x\xi )\right]=0.$$  (62) 
We have
${\mathbb{E}}_{p(\theta )p(y\theta ,\xi )}$  $\left[{\displaystyle \frac{\partial {g}_{*}}{\partial \xi}}\right]$  (63)  
$={\mathbb{E}}_{p(\theta )p(y\theta ,\xi )}\left[{\displaystyle \frac{\partial}{\partial \xi}}\mathrm{log}{\displaystyle \frac{p(y\theta ,\xi )}{p(y\xi}}\right]$  (64)  
$\begin{array}{cc}& ={\mathbb{E}}_{p(\theta )}\left({\mathbb{E}}_{p(y\theta ,\xi )}\left[{\displaystyle \frac{\partial}{\partial \xi}}p(y\theta ,\xi )\right]\right)\hfill \\ & {\mathbb{E}}_{p(y\xi )}\left[{\displaystyle \frac{\partial}{\partial \xi}}\mathrm{log}p(y\xi )\right]\hfill \end{array}$  (65)  
$=0$  (66) 
by two applications of the score function identity. This suggests that, as $g$ becomes close to ${g}_{*}$, the $\partial g/\partial \xi $ term in (16) has expectation close to zero, and primarily contributes variance to the gradient estimator.
Theorem 2 justifies removing the $\partial g/\partial \xi $ term. Removing this term is equivalent to the following gradientcoordinate algorithm. First, we choose the family ${f}_{\psi}(\theta ,y)$ to be $p(y\theta ,\psi )$. Then at time step $t$ we do the following

1.
Set ${\psi}_{t}={\xi}_{t}$

2.
Take a gradient step with respect to $(\xi ,\varphi )$ to update ${\xi}_{t},{\varphi}_{t}$
Importantly, the new gradient does not include a $\partial g/\partial \xi $ term, but is the gradient of a valid lower bound on EIG.
Appendix B EXPERIMENTS
B.1 Death process
Method  EIG mean $\pm 1$ s.e. 

ACE  $\mathbf{0.9830}\pm \mathbf{0.0001}$ 
PCE  $0.9822\pm 0.0001$ 
BA  $0.9822\pm 0.0002$ 
ACE without RB  $0.9789\pm 0.0006$ 
PCE without RB  $0.9710\pm 0.0025$ 
BA without RB  $0.9322\pm 0.0045$ 
BO with NMC  $0.9732\pm 0.0009$ 
We place the prior $\theta \sim \text{LogNormal}(0,1)$ on the infection rate and have the likelihood
$\begin{array}{cc}\hfill {I}_{1}& \sim \text{Binomial}(N,{e}^{\theta {\xi}_{1}})\hfill \\ \hfill {I}_{2}& \sim \text{Binomial}(N{I}_{1},{e}^{\theta {\xi}_{2}}).\hfill \end{array}$  (67) 
We also have the constraint ${\xi}_{1},{\xi}_{2}\ge 0$.
For each method, we fixed a computational budget of 120 seconds, and did 100 independent runs. For gradient methods, we used the Adam optimizer with learning rate ${10}^{3}$ and the default momentum parameters. The inference network made a separate Gaussian approximation to the posterior for each of the 66 outcomes. To evaluate $I(\xi )$ for comparison we used NMC with a large number of samples: 20000 for Figure 2 and 200000 for the final values in the caption and in Table 3. For the BO, we used a Matern52 kernel with variance 1 and lengthscale 0.25, and the GBUCB1 algorithm for acquisition.
We used the following number of samples for our RaoBlackwellized estimators
Method  Number of samples 

ACE  10 + 660 
PCE  10 
BA  10 
NMC  2000 
B.2 Regression
We consider the following prior on $\theta =(\mathbf{w},\sigma )$
${w}_{j}$  $\stackrel{\text{i.i.d.}}{\sim}\text{Laplace}(1)\text{for}j=1,\mathrm{\dots},p$  (68)  
$\sigma $  $\sim \text{Exponential}(1)$  (69) 
with the likelihood
$${y}_{i}\sim N(\sum _{j=1}^{p}{\xi}_{ij}{w}_{j},\sigma )\text{for}i=1,\mathrm{\dots}n.$$  (70) 
This represents a standard regression model, although with nonGaussian prior distributions we cannot compute the posterior or true EIG analytically. To ensure the EIG has a finite maximum, we impose the following constraint
$$\sum _{j}{\xi}_{ij}=1\text{for}i=1,\mathrm{\dots},n.$$  (71) 
In practice, we set $n=p=20$.
For each of our five methods, we fixed the computational budget to 15 minutes and did 10 independent runs. For gradient methods, we used a learning rate of ${10}^{3}$ and the Adam optimizer (kingma2014adam) with default momentum parameters. The inference network used the following variational family
$\mathbf{w}$  $\sim N(\bm{\mu},s{\mathrm{\Sigma}}_{0})$  (72)  
$\sigma $  $\sim \mathrm{\Gamma}(\alpha ,\beta )$  (73) 
and we used a neural network with the following architecture
Operation  Size  Activation 

Input $\to $ H1  64  ReLU 
H1 $\to $ H2  64  ReLU 
H2 $\to \bm{\mu}$  20   
H2 $\to (\alpha ,\beta )$  2  Softplus 
H2 $\to s$  1  Softplus 
${\mathrm{\Sigma}}_{0}$  $20\times 20$   
For the other methods, point evaluations of $I(\xi )$ were made using VNMC. Each VNMC evaluation took 1000 steps, with the optimization as above (but with $\xi $ fixed). We used a GP with Matern52 kernel with lengthscale 5, variance 10. We used a GPUCB1 acquisition rule, and terminated once 15 minutes had passed. For random search, we sampled designs using a standard unit Gaussian.
We used the following number of samples
Method  Inner samples $L$  Outer samples $N$ 

ACE  10  10 
PCE  10  10 
BA  n/a  100 
VNMC  10  10 
To evaluate designs, we used ACE/VNMC. We first trained ACE using the same procedure as above, for 20000 steps. Then we made the final ACE/VNMC evaluations using the fixed inference network and $L=2.5\times {10}^{3}$ inner samples, $N={10}^{5}$ outer samples.
B.3 Advertising
We introduce a LogNormal likelihood and a $D$dimensional latent variable $\bm{\theta}$ governed by a Normal prior, the joint density of our model is
$$p(\bm{y},\bm{\theta}\bm{\xi})=\mathcal{L}\mathcal{N}(\bm{y}\bm{\theta}\odot \bm{\xi},{\sigma}^{2}\bm{\xi})\mathcal{N}(\bm{\theta}\mathrm{\U0001d7ce},{\mathbf{\Lambda}}_{0})$$  (74) 
where $\sigma $ controls the observation noise and ${\mathbf{\Lambda}}_{0}$ is a nondiagonal precision matrix. Since there are correlations among the $D$ regions, the optimal advertising budget (w.r.t. gaining information about $\bm{\theta}$) allocates more money to the regions that are tightly correlated.
Throughout we assume that the number of regions $D$ is even. We set the budget to scale with the number of dimensions, $B=\frac{D}{2}$, set $\sigma =1$ and choose the prior precision matrix to be
$${\mathbf{\Lambda}}_{0}=(1+\frac{1}{D}){\mathbb{I}}_{D}\frac{1}{D}\bm{u}{\bm{u}}^{\mathrm{T}}\mathit{\hspace{1em}\hspace{1em}}{\bm{u}}^{\mathrm{T}}\equiv (\alpha ,\mathrm{\dots},\alpha ,1,\mathrm{\dots},1)$$ 
where the first (last) $\frac{D}{2}$ components of $\bm{u}$ are given by $\alpha $ (1), respectively, and where $\alpha =0.1$ controls (as we shall see) the degree of asymmetry in the optimal design. Discarding an irrelevant constant, we can compute the exact EIG using the formula:
$$I(\bm{\xi})=\frac{1}{2}\mathrm{log}det{\mathbf{\Lambda}}_{\mathrm{post}}\mathit{\hspace{1em}\hspace{1em}}{\mathbf{\Lambda}}_{\mathrm{post}}={\mathbf{\Lambda}}_{0}+\frac{1}{{\sigma}^{2}}\mathrm{diag}(\bm{\xi})$$ 
Using the matrix determinant lemma for rank1 matrix updates we can then compute
$$\begin{array}{cc}& \mathrm{log}\mathrm{det}{\mathbf{\Lambda}}_{\mathrm{post}}=\sum _{i=1}^{D}\mathrm{log}(1+\frac{1}{D}+{\xi}_{i})+\hfill \\ & \mathrm{log}\left(1\sum _{i=1}^{{\scriptscriptstyle \frac{D}{2}}}\left\{\frac{{\alpha}^{2}}{1+{\scriptscriptstyle \frac{1}{D}}+{\xi}_{i}}\right\}\sum _{i=1+{\scriptscriptstyle \frac{D}{2}}}^{D}\left\{\frac{1}{1+{\scriptscriptstyle \frac{1}{D}}+{\xi}_{i}}\right\}\right)\hfill \end{array}$$ 
By symmetry the optimum (it is easy to check that it is a maximum) of $\mathrm{EIG}(\bm{\xi})$ will satisfy ${\xi}_{i}={\xi}_{i+1}$ for $i=1,\mathrm{\dots},\frac{D}{2}1,\frac{D}{2}+1,\mathrm{\dots},D$. In other words $\bm{\xi}$ is entirely specified by ${\xi}_{1}$ and ${\xi}_{D}$, which must satisfy ${\xi}_{1}+{\xi}_{D}=1$ because of the constraint on the budget $B=\frac{D}{2}$. Thus we have reduced the EIG maximization problem to a univariate optimization problem that can easily be solved to machine precision, for example by gradient methods or brute force bisection.
For each of the four methods, we fix the computational budget to 120 seconds per design optimization. For the gradientbased methods this corresponds to $10\times {10}^{3}$, $20\times {10}^{3}$, and $18\times {10}^{3}$ gradient steps for ACE, PCE, and Posterior, respectively. For the Bayesian Optimization baseline, we run 110 steps of a GPUCBlike algorithm (srinivas2009gaussian) in batchmode, resulting in a total budget of $1650$ function evaluations of the EIG oracle. Note that for all four methods the runtime dependence on the dimension $D$ is negligible in the regime in which we are operating; consequently we use the same number of gradient or BO steps for all $D$.
For the gradientbased methods, we use the kingma2014adam optimizer with default momentum hyperparameters and an initial learning rate of ${\mathrm{\ell}}_{0}=0.1$ that is exponentially decayed towards a final learning rate ${\mathrm{\ell}}_{f}$ that depends on the particular method. In particular we set ${\mathrm{\ell}}_{f}=1\times {10}^{4}$, ${\mathrm{\ell}}_{f}=1\times {10}^{5}$, and ${\mathrm{\ell}}_{f}=3\times {10}^{4}$ for the ACE, PCE, and Posterior methods, respectively. For the BO baseline, we used a Matern kernel with a fixed length scale $\mathrm{\ell}=0.2$. These hyperparameters were chosen by running a grid search with $D=16$ and choosing hyperparameters that minimized the mean absolute EIG error.
Finally we note that in Fig. 3 at each dimension $D$ we normalize the EIG by the factor
$$Z=\mathrm{EIG}({\bm{\xi}}^{*})\mathrm{EIG}({\bm{\xi}}_{\mathrm{uniform}})$$  (75) 
where ${\bm{\xi}}^{*}$ and ${\bm{\xi}}_{\mathrm{uniform}}$ are the optimal and uniform budget designs, respectively. Consequently after normalization the absolute error for the uniform budget design ${\bm{\xi}}_{\mathrm{uniform}}$ is equal to unity.
B.4 Biomolecular docking
For the docking model, we used the following independent priors
top  $\sim \text{Beta}(25,75)$  (76)  
bottom  $\sim \text{Beta}(4,96)$  (77)  
ee50  $\sim N(50,{15}^{2})$  (78)  
slope  $\sim N(0.15,{0.1}^{2}).$  (79) 
For the design $\xi =({\xi}_{1},\mathrm{\dots},{\xi}_{100})$ we had 100 binary responses
$${y}_{i}\sim \text{Bern}\left(\text{bottom}+\frac{\text{top}\text{bottom}}{1+{e}^{({\xi}_{i}\text{ee50})\times \text{slope}}}\right).$$  (80) 
For gradient methods, we used the Adam optimizer (kingma2014adam) with learning rate ${10}^{3}$ and default momentum parameters. For each method, we took $5\times {10}^{5}$ gradient steps (each method converged within this number of steps). The inference network was meanfield with the same distributional families as the prior. We used the following neural architecture
Operation  Size  Activation 

Input $\to $ H1  64  ReLU 
H1 $\to $ H2  64  ReLU 
H2 $\to $ top  2  Softplus 
H2 $\to $ bottom  2  Softplus 
H2 $\to $ ee50 mean  1   
H2 $\to $ ee50 s.d.  1  Softplus 
H2 $\to $ slope mean  1   
H2 $\to $ slope s.d.  1  Softplus 
We used the following number of samples
Method  Inner samples $L$  Outer samples $N$ 

ACE  10  10 
PCE  10  10 
BA  n/a  100 
For the expert method, the design of lyu2019ultra, which comprised 580 compounds, was subsampled to comprise 100 compounds for a fair comparison.
For evaluation, we used ACE/VNMC, first training ACE for 25000 steps using the same learning rate as above. With the fixed inference network, we made ACE and VNMC evaluations using $L=2\times {10}^{3}$ inner samples, $N=4\times {10}^{6}$ outer samples—a total of 9 billion samples.
B.5 Constant elasticity of substitution
We used the exact setup of foster2019variational. Specifically, we take $U(\mathbf{x})={\left({\sum}_{i}{x}_{i}^{\rho}{\alpha}_{i}\right)}^{1/\rho}$ and place the following priors on $\rho ,\bm{\alpha},u$
$\rho $  $\sim \text{Beta}(1,1)$  (81)  
$\bm{\alpha}$  $\sim \text{Dirichlet}([1,1,1])$  (82)  
$\mathrm{log}u$  $\sim N(1,3)$  (83)  
$\eta $  $\sim N(u\cdot (U(\mathbf{x})U({\mathbf{x}}^{\prime}),{0.005}^{2}{u}^{2}{(1+\parallel \mathbf{x}{\mathbf{x}}^{\prime}\parallel )}^{2})$  (84)  
$y$  $=f(\eta )$  (85) 
where $f$ is the censored sigmoid function. All designs $\xi =(\mathbf{x},{\mathbf{x}}^{\prime})$ were constrained to ${[0,100]}^{6}$.
For gradient methods, we used the Adam optimizer with learning rate ${10}^{3}$ and default momentum parameters. To make the design process 120 seconds per step, we used the following number of gradient steps
Method  Number of steps 

ACE  1500 
PCE  2500 
BA  5000 
We found that there was insufficient time to effectively train a neural network guide. Instead we used a meanfield variational family with the same distributional families as the prior, and a linear model using the following features: $\text{logit}(y),\mathrm{log}\text{logit}(y),\mathrm{\U0001d7cf}(y>0.5)$.
We used the following number of samples
Method  Inner samples $L$  Outer samples $N$ 

ACE  10  10 
PCE  10  10 
BA  n/a  100 
For the baseline, we used the marginal upper bound of foster2019variational with the same variational family used in that paper—an $f$transformed Normal with additional point masses at the endpoints. We used a GP with a Matern52 kernel, lengthscale 20, variance set from data, and a GPUCB1 algorithm to make acquisitions which were done in batches of 8.
At each stage, the posterior was fitted using meanfield variational inference using the same distributional families as the prior.