We introduce a fully stochastic gradient based approach to Bayesian optimalexperimental design (BOED). This is achieved through the use of variationallower bounds on the expected information gain (EIG) of an experiment that canbe simultaneously optimized with respect to both the variational and designparameters. This allows the design process to be carried out through a singleunified stochastic gradient ascent procedure, in contrast to existingapproaches that typically construct an EIG estimator on a pointwise basis,before passing this estimator to a separate optimizer. We show that this, inturn, leads to more efficient BOED schemes and provide a number of a differentvariational objectives suited to different settings. Furthermore, we show thatour gradient-based approaches are able to provide effective design optimizationin substantially higher dimensional settings than existing approaches.
Quick Read (beta)
We introduce a fully stochastic gradient based approach to \acrfullOED. This is achieved through the use of variational lower bounds on the \acrfullEIG of an experiment that can be simultaneously optimized with respect to both the variational and design parameters. This allows the design process to be carried out through a single unified stochastic gradient ascent procedure, in contrast to existing approaches that typically construct an \acrshortEIG estimator on a pointwise basis, before passing this estimator to a separate optimizer. We show that this, in turn, leads to more efficient \acrshortOED schemes and provide a number of a different variational objectives suited to different settings. Furthermore, we show that our gradient-based approaches are able to provide effective design optimization in substantially higher dimensional settings than existing approaches.
algocfAlgorithmAlgorithms \crefnamealgorithmAlgorithmAlgorithms \crefnamefigureFigureFigure \crefnamesection§§§ \Crefnamesection§§§ \crefformatequation(#2#1#3) \glsdisablehyper \newacronymKLklKullback-Leibler \newacronymSGDsgdstochastic gradient descent \newacronymOEDBOEDBayesian optimal experimental design \newacronymMIMImutual information \newacronymEIGEIGexpected information gain \newacronymNMCNMCnested Monte Carlo \newacronymACEACEadaptive contrastive estimation \newacronymNCENCEnoise contrastive estimation \newacronymBABABarber-Agakov \newacronymBOBOBayesian optimization \newacronymMCMCMonte Carlo \newacronymRMSERMSEroot mean squared error
A Unified Stochastic Gradient Approach to Designing Bayesian-Optimal Experiments
Adam Foster† Martin Jankowiak‡ Matthew O’Meara§ Yee Whye Teh† Tom Rainforth†✠
†Department of Statistics University of Oxford Oxford, UK ‡Uber AI San Francisco, CA, USA §University of Michigan Ann Arbor, MI, USA ✠Christ Church University of Oxford Oxford, UK
The design of experiments is a key problem in almost every scientific discipline. Namely, one wishes to construct an experiment that is most informative about the process being investigated, while minimizing its cost. For example, in a psychological trial, we want to ensure questions posed to participants are pertinent and do not have predictable responses. In a pharmaceutical trial, we want to minimize the number of participants needed to successfully test our hypotheses. In an online automated help system, we want to ensure we ask questions that identify the user’s problem as quickly as possible.
In all these scenarios, our ultimate high-level aim is to choose designs that maximize the information gathered by the experiment. A powerful and broadly used approach for formalizing this aim is \acrfullOED (chaloner1995; lindley1956; myung2013). In \acrshortOED, we specify a Bayesian model for the experiment and then choose the design that maximizes the \acrfullEIG from running it. More specifically, let denote the latent variables we wish to learn about from running the experiment and let represent the experimental design. By introducing a prior and a predictive distribution for experiment outcomes , we can calculate the \acrshortEIG under this model by taking the expected reduction in posterior entropy
where represents the entropy of a distribution and . Our experimental design process now becomes that of the finding the design that maximizes .
Unfortunately, finding is typically a very challenging problem in practice. Even evaluating for a single design is computationally difficult because it represents a nested expectation and thus has no direct \acrlongMC estimator (nmc). Though a large variety of approaches for performing this estimation have been suggested (myung2013; watson2017quest+; kleinegesse2018efficient; foster2019variational), the resulting \acrshortOED strategies share a critical common feature: they estimate on a point-by-point basis and feed this estimator to an outer-level optimizer that selects which points to evaluate.
This framework can be highly inefficient for a number of reasons. For example, it adds an extra level of nesting to the overall computation process: must be separately estimated for each , substantially increasing the overall computational cost. Furthermore, one must typically resort to gradient-free methods to carry out the resulting optimization, which means it is difficult to scale the overall \acrshortOED process to high dimensional design settings due to a dearth of optimization schemes which remain effective in such settings.
To alleviate these inefficiencies and open the door to applying \acrshortOED in high-dimensional settings, we introduce an alternative to this two-stage framework by introducing unified objectives that can be directly maximized to simultaneously estimate and optimize . Specifically, by building on the work of foster2019variational, we construct variational lower bounds to that can be simultaneously optimized with respect to both the variational and design parameters. Optimizing the former ensures that we achieve a tight bound that in turn gives accurate estimates of , while simultaneously optimizing the latter circumvents the need for an expensive outer optimization process. Critically, this approach allows the optimization to be performed using stochastic gradient ascent (SGA) (robbins1951stochastic) and therefore scaled to substantially higher dimensional design problems than existing approaches.
To account for the varying needs of different problem settings, we introduce several classes of suitable variational lower bounds. Most notably, we introduce the \acrfullACE bound: an \acrshortEIG variational lower bound that can be made arbitrarily tight, while remaining amenable to simultaneous SGA on both the variational parameters and designs.
We demonstrate the applicability of our unified gradient approach using a wide range of experimental design problems, including a real-world high-dimensional example from the pharmacology literature (lyu2019ultra). We find that our approaches are able to effectively optimize the \acrshortEIG, consistently outperforming baseline two-stage approaches, with particularly large gains achieved for high-dimensional problems. These gains lead, in turn, to improved designs and more informative experiments.
2.1 Bayesian optimal experimental design
When experimentation is costly, time consuming, or dangerous, it is essential to design experiments to learn the most from them. To choose between potential designs, we require a metric of the quality of a candidate design. In the \acrshortOED framework dating back to lindley1956, this metric represents how much more certain we will become in our knowledge of the world after doing the experiment and analyzing the data. We prefer designs that will lead to strong conclusions even if we are not yet sure what those conclusions will be.
Specifically, the \acrshortOED framework begins with Bayesian model of the experimental process. This is defined through the combination of a likelihood that predicts the experimental outcome under design and latent variable , and a prior which incorporates initial beliefs about the unknown . After conducting the experiment, our beliefs are updated to the posterior . The information gained about from doing the experiment with design and obtaining outcome is the reduction in entropy from the prior to the posterior
As it stands, information gain cannot be evaluated until after the experiment. To define a metric that will let us choose between designs before experimentation, we define the expected information gain (EIG), , by taking the expectation of IG over hypothesized outcomes using the marginal distribution under our model, , to give
which can be rewritten in the form of a mutual information between and with fixed, namely
The Bayesian optimal design, , is now the one which maximizes EIG over the set of feasible designs
In iterated experiment design, we design a sequence of experiments. At time , the prior in (4) is replaced by the posterior given the previous experiments and their observed outcomes , namely
This now provides a means to design a sequence of adaptive experiments, wherein we use information gathered from previous iterations to improve the designs used at future iterations.
2.2 Estimating expected information gain
Making even a single point estimate of \acrshortEIG when solving (5) can be challenging because we must first estimate the unknown or , and then take an expectation over . Nested Monte Carlo (NMC) estimators (nmc), which make a Monte Carlo approximation of both the inner and outer integrals, converge relatively slowly: at a rate in the total computational budget .
foster2019variational noted that this approach is inefficient because it makes a separate \acrlongMC approximation of the integrand for every sample of the outer integral. To share information between different samples, they proposed a number of variational estimators that used amortization, i.e. they attempted to learn the functional form of the integrand rather than approximating it afresh each time. One of their approaches was based on amortized variational inference and required an inference network which takes as input and outputs a distribution over . For any , we can construct a lower bound on . This is the \acrfullBA, or posterior, lower bound (ba)
which has also found use representation learning (poole2018variational) and maximizing information transmission in noisy channels (ba).
To make high-quality approximations to , and simultaneously learn a good posterior approximation, foster2019variational maximize this bound with respect to . This approach is most effective when the bound is tight, i.e. . For , this occurs when it is possible to have , i.e. when the inference network is powerful enough to find the true posterior distribution for every .
The required optimization is performed using SGA (robbins1951stochastic), which requires unbiased gradient estimators. For , this is obtained by drawing Monte Carlo samples and forming the estimator
To obtain high-quality approximations of even when the inference network cannot capture the true posterior, foster2019variational also considered another variational estimator: variational nested Monte Carlo (VNMC). This uses the inference network in conjunction with additional samples to improve the estimate of the integrand. They showed that this leads to the following upper bound on
where the expectation is over . The inference network in VNMC is trained by minimization, in the same way is trained by maximization. has the attractive feature that the bound becomes tight as the number of inner samples becomes large, i.e. , even if is not powerful enough to directly represent the true posterior.
2.3 Optimizing the EIG
The experimental design problem is to find the design that maximizes the EIG. Therefore, as well as finding a way to estimate EIG, existing approaches subsequently need to find a way of searching across to find promising designs. At a high-level, most existing approaches propose a two-stage procedure in which noisy estimates of are made, and a separate optimization procedure selects the candidate design to evaluate next.
kleinegesse2018efficient and foster2019variational both use \acrfullBO for this outer optimization step, a black-box optimization method (snoek2012practical) that is tolerant to noise in the estimates of the objective function, in this case . Some approaches (watson2017quest+; lyu2019ultra) instead select a finite number of candidate designs in and estimate at each candidate, with some refining this process further by adaptively allocating computational resources between these designs (vincent2017; rainforth2017thesis). Another suggested approach is to use MCMC methods to carry out this outer optimization (amzal2006bayesian; muller2005simulation).
3 GRADIENT-BASED DESIGN OPTIMIZATION
Our central proposal is to replace the two-stage procedure outlined above with a single stage that both estimates and optimizes simultaneously. This has the critical advantage of allowing SGA to be directly applied to the design optimization. Not only does this provide substantial computational gains over approaches which must construct separate estimates for each hypothetical design considered, but it also provides the potential to scale to substantially higher dimensional design problems than those which can be effectively tackled with existing approaches. Since we take gradients with respect to , we henceforth assume that is continuous.
In our approach, we utilize variational lower bounds on . Specifically, suppose we have a bound with variational parameters . For fixed , the estimate of improves as we maximize with respect to . We propose to maximize jointly with respect to . As we train , the variational approximation improves; as we train our design moves to regions where the lower bound on \acrshortEIG is largest. By tackling this as a single optimization problem over , we obviate the need to have an outer optimizer for . Using a lower bound is important because it allows us to perform a single maximization over , rather than a more complex optimization such as the max-min optimization that would result if we used an upper bound.
In practice, we do not have lower bounds on that we can evaluate and differentiate in closed form. Instead, we have lower bounds in the form of expectations over . Fortunately, we can still maximize the lower bound with respect to in this case by using SGA. Stochastic gradient methods have been applied in machine learning to train millions of parameters (bottou2010large) and are a good fit for high-dimensional experimental design.
We now make our first concrete proposal for the lower bound : the BA bound , as defined in (7). We now optimize jointly where previously only was trained using gradients. To perform SGA, we require unbiased estimators for both and . We use equation (8) to estimate the -gradient. For the -gradient we have
where , which is an unbiased estimator of . This is a score function estimator (williams1992simple) and other possibilities are discussed in Section 3.6.
3.2 Adaptive contrastive estimation (ACE)
The BA bound provides one specific case of our one-stage procedure for optimal experimental design. We now introduce a new lower bound that improves upon . The potential issue with the BA bound is that it may not be sufficiently tight, which happens when the inference network cannot represent the true posterior. One possible solution is to introduce additional samples, as in the VNMC estimator (9). However, we cannot use VNMC directly for a one-stage procedure: since it is an upper bound, we must minimize it with respect to , but we still wish to maximize with respect to . This precludes a joint maximization over .
Looking more closely at the VNMC bound, we see that its main failure case is when the denominator strongly under-estimates , which can happen when all the inner samples miss regions where the joint is large. In addition to the samples , we also have the original sample from which was sampled. One way to avoid the under-estimation in the denominator would be to include this sample, giving
where the expectation is with respect to . The samples can now be seen as contrasts to the original sample . For this reason, we call contrastive samples and we call (11) the adaptive contrastive estimate (ACE) of EIG. The following theorem establishes that is a valid lower bound on the EIG and that the bound becomes tight as .
For any model and inference network , we have the following:
is a lower bound on the \acrshortEIG and we can characterize the error term as an expected KL divergence:
As , we recover the true \acrshortEIG:
The ACE bound is monotonically increasing in : for .
If the inference network equals the true posterior , then .
The proof is presented in Appendix A, along with Theorem 3 which concerns properties of the maximum of the ACE bound. Gradient estimation for ACE is discussed in Section 3.6. We note that, to the best of our knowledge, has not previously appeared in the \acrshortOED literature.11 1 Aside from a recent blog post (sobolev2019thoughts) we do not believe this bound has previously been studied in any context.
3.3 Prior contrastive estimation (PCE)
Theorem 1 tells us that can become close to in two scenarios: 1) the inference network becomes close to the true posterior , 2) we increase the number of contrastive samples . The BA bound only becomes tight in case 1). A special case of ACE is to replace the inference network with a fixed distribution and rely on the contrastive samples to make good estimates of , only becoming tight in case 2), i.e. as . This simplification can speed up training, since we no longer need to learn additional parameters .
To explore this, we propose the prior contrastive estimation (PCE) bound, in which the prior is used to generate contrastive samples:
where the expectation is over . Whilst inherently less powerful than ACE, PCE can be effective when the prior and posterior are not too different, so is a suitable proposal to estimate .
Though, to the best of our knowledge, this bound has not been applied to \acrshortOED before, we note that it shares a connection to the information noise contrastive estimation (InfoNCE) bound on mutual information used in representation learning (oord2018representation). Given data samples , corresponding representations , and a learned critic , we have
where the expectation is over , is the data distribution, and is the encoder. poole2018variational showed that the encoder density is the optimal critic, although it is rarely known in closed form in the representation learning context. Writing for and for , we note the mathematical connection between this optimal case and .
3.4 Likelihood-free ACE
In some models such as random effects models, the likelihood is not known in closed form but can be sampled from. This presents a problem when computing or its derivatives because the likelihood appears in (11). To allow ACE to be used for these kinds of models, we now show that using a unnormalized approximation to the likelihood still results in a valid lower bound on \acrshortEIG. In fact, if using a parametrized likelihood approximation , it is then possible to train jointly with to approximate the likelihood, learn an inference network, and find the optimal design through the solution to a single optimization problem. The following theorem, whose proof is presented in Appendix A, shows that replacing the likelihood with an unnormalized approximation does result in a valid lower bound on \acrshortEIG.
Consider a model and inference network . Let be an unnormalized likelihood approximation. Then,
where the expectation is over .
3.5 Iterated experimental design with ACE
In iterated experimental design, we replace by as per (6). We can sample by performing inference. Whilst variational inference also provides a closed form estimate of the posterior density, some other inference methods do not. This is problematic because the prior density appears in (11). Fortunately, it is sufficient to know the density up to proportionality (foster2019variational). Indeed if where does not depend on and is an unnormalized density, then
and the derivatives of are simply zero.
3.6 Gradient estimation for ACE
To optimize the ACE bound with respect to it is necessary to have unbiased gradient estimators of and .
The simplest form of the -gradient is
where the expectation is with respect to , and
Estimating the expectation (16) directly using Monte Carlo gives the score function, or REINFORCE, estimator (williams1992simple). Unfortunately, this is often high variance, and reducing gradient estimator variance is often important in solving challenging experimental design problems.
One variance reduction method is reparameterization. For this, we introduce random variables whose distributions do not depend on along with representations of and as deterministic functions of these variables:
This now permits a reparameterized form of the gradient:
where the expectation is over . A Monte Carlo approximation of this expectation is typically a much lower variance estimator for the true -gradient.
Alternatively, if is a discrete random variable we can sum over the possible values . This approach is known as Rao-Blackwellization and gives
where the expectation is now over .
In our experiments, we learn optimal experimental designs in five scenarios: the death process, a well known two-dimensional design problem from epidemiology; a non-conjugate regression model with a 400-dimensional design; an ablation study in the setting of advertising; a real-world biomolecular docking problem from pharmacology in 100 dimensions; and a constant elasticity of substitution iterated experimental design problem in behavioural economics with 6 dimensional designs.
4.1 Evaluating experimental designs
We first discuss which metrics we will use to judge the quality of the designs we obtain.
Our primary metric on designs is, of course, the \acrshortEIG. We prefer designs with high \acrshortEIGs. In some cases, we can evaluate the \acrshortEIG analytically. In other cases, we can use a sufficiently large number of samples in a \acrshortNMC (nmc) estimator to be sure that we have estimates that are sufficiently accurate to compare designs.
However, to explore the limits of our methods, we will also consider scenarios where neither of these approaches is suitable. In these cases, we pair the ACE lower bound (with fixed for evaluation) with the VNMC upper bound (foster2019variational) to trap the true \acrshortEIG value—if the lower bound of one design is higher than the upper bound for another, we can be sure that the first design is superior (noting that the bounds themselves can be tractably estimated to a very high accuracy).
In some settings, when we know the true optimal design , we will also consider the design error , i.e. how close our design is to the optimal design.
In iterated experiment design, as well as designing experiments, we must also perform inference on the latent variable after each iteration. Here, we also investigate the quality of the final posterior. Specifically, if is the posterior after experiments, we use the posterior entropy, and the posterior RMSE . We prefer low entropy and low RMSE values.
4.2 Death process
We consider an example from epidemiology, the death process (cook2008optimal; kleinegesse2018efficient), in which a population of individuals transitions from healthy to infected states at a constant but unknown rate . We can measure the number of infected individuals at two different times and where . Our aim is to infer the infection rate from these observations. For full details of the prior and likelihood used, see Appendix B.1.
On this problem, we apply gradient methods with Rao-Blackwellization (there are 66 possible outcomes). Figure 1 shows a sample optimization trajectory with the approximate EIG surface for illustration. We compare against \acrshortBO using the Rao-Blackwellized NMC estimator of vincent2017. Figure 2 shows that, for the allowed time budget, all gradient methods perform better than BO even on this two-dimensional problem.
We now compare our one-stage gradient approaches to experimental design against a two-stage baseline on a high-dimensional design problem. We choose a general purpose Bayesian linear regression model with observations and features. The design is an matrix; the latent variables are , where is the dimensional regression coefficient and is the scalar variance. The outcomes are generated using a Normal likelihood for . Here is the th row of . To avoid trivial solutions, we enforce the constraint for all . We use independent priors for and . See Appendix B.2 for complete details.
We set and applied five methods to this 400 dimensional design problem: BA, ACE and PCE, as well as the VNMC estimator of foster2019variational, with both \acrshortBO and random search to optimize over . The results are presented in Table 1. We note that the gradient methods strongly outperform the gradient-free baselines, with about double the final EIG.
|EIG l.b.||EIG u.b.|
|BO + VNMC|
|Random Search + VNMC|
We now conduct a detailed ablation study on the effects of dimension on the quality of experimental designs produced using our gradient approaches and \acrshortBO. To further isolate the distinction between one-stage and two-stage approaches to \acrshortOED, we choose a setting in which we can compute analytically. We give \acrshortBO, but not the gradient methods, access to the EIG oracle when making point evaluations of . Thus we put \acrshortBO in the best possible position and ensure any gains are due to improvements from using gradient-based optimization.
Suppose that we are given an advertising budget of dollars that we need to allocate among regions, i.e. we choose with . After conducting an ad campaign, we observe a vector of sales . We use this data to make inferences about the underlying market opportunities in each region. Our prior incorporates the knowledge that neighbouring regions are more correlated than distant ones—this leads to an interesting experimental design problem because information can be pooled between regions. We can also compute the true EIG and optimal design analytically. For full details, see Appendix B.3.
We compare the performance of four estimation and optimization methods on this problem, see Fig. 3 for the results. The three gradient-based methods (ACE, PCE, BA) perform best, with the BO baseline struggling in dimensions , even though the latter has access to an EIG oracle. PCE performed well in low dimensions, but degraded as the dimension increases and sampling from the prior becomes increasingly inefficient—something that ACE and BA avoid by learning adaptive proposal distributions. We note that since in this case the family of variational distributions used in ACE and BA include the true posterior, both methods yield similar performance.
4.5 Biomolecular docking
We now consider an experimental design problem of interest to the pharmacology community. Having demonstrated that our one-stage gradient methods compare favourably with two stage approaches, we now compare against designs crafted by domain experts.
In molecular docking, computational techniques are used to predict the binding affinity between a compound and a receptor. When synthesized in the lab, the two may bind—this is called a hit. Learning a well-calibrated hit-rate model can guide how many compounds to evaluate for additional objectives, such as drug-likeness or toxicity, before experimental testing. lyu2019ultra modelled the probability of outcome being a hit, given the predicted binding affinity, or docking score, , as
where with priors given in Appendix B.4.
Of 150 million compounds, lyu2019ultra selected a batch of compounds to experimentally test to best fit the sigmoid hit-rate model. They considered 6 candidate designs and selected one that maximized the EIG estimated by NMC. Here, we instead apply gradient-based BOED to search across candidate designs which consist of 100 docking scores . To evaluate our final designs, we present upper and lower bounds on the final EIG: see Table 2. We see that all gradient methods are able to outperform experts in terms of \acrshortEIG, and that ACE appears the best of the gradient methods. Figure 5 shows our designs are qualitatively different to those produced by experts.
4.6 Constant elasticity of substitution
|Method||EIG lower bound||EIG upper bound|
We now turn to iterated experimental design in which we produce designs, generate data and make inference repeatedly. This problem therefore captures the end-to-end-process of experimentation and inference.
We consider an experiment in behavioural economics that was previously also considered by foster2019variational. In this experiment, a participant is asked to compare baskets of goods. The model assumes that their response (on a slider) is based on the difference in utility of the baskets, and the constant elasticity of substitution (CES) model (arrow1961capital) governed by latent variables is then used for this utility. The aim is to learn characterizing the participant’s utility. In the experiment, there are 20 sequential steps of experimentation with the same participant. We compare our gradient-based approach against the most successful approach of foster2019variational that approximates the marginal density to form an upper bound on EIG, and BO to optimize . For full details, see Appendix B.5.
Figure 4 shows that gradient-based methods are effective on this problem; both ACE and PCE decrease the posterior entropy and RMSEs on the latent variables faster and further than the baseline, whereas BA does not do so well. We suggest that the similar performance of ACE and PCE is due to the smaller changes in the posterior at middle and late steps, after much data has been accumulated: when the posterior does not change much at each step, forms an effective proposal for estimating .
We have introduced a new approach for Bayesian experimental design that does away with the two stages of estimating EIG and separately optimizing over . We use stochastic gradients to maximize a lower bound on and so find optimal designs by solving a single optimization problem. This unification leads to substantially improved performance, especially on high-dimensional design problems.
Of the three lower bounds, and , we note that in all five experiments ACE generally did as well as the better of BA and PCE: we therefore recommend it as the default choice. BA performed well when the inference network could closely approximate the true posterior; PCE performed well when the prior was an adequate proposal for estimating and does not require the training of variational parameters.
AF gratefully acknowledges funding from EPSRC grant no. EP/N509711/1. YWT’s research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071. TR gratefully acknowledges funding from Tencent AI Labs and a junior research fellowship supported by Christ Church, Oxford.
Appendix A GRADIENT-BASED DESIGN OPTIMIZATION
We begin with the proof of Theorem 1, which we restate for convenience.
To begin with 1., we have the error term which can be written
where the expectation is over . Note that the integrand is symmetric under a permutation of the labels , so its expectation will be the same under the distribution . This in turn implies that we can consider this as an expectation under
which is the expected KL divergence required. We therefore have .
For 2., we assume that is bounded. The ACE denominator is a consistent estimator of the marginal likelihood. Indeed,
as by the Strong Law of Large Numbers, since
This establishes the a.s. pointwise convergence of the ACE integrand to . Hence by Bounded Convergence Theorem,
To establish 3., we let . Then
where the expectation is over and
By symmetry, we can consider this as an expectation over (we can permute the labels ). We therefore recognise as the expectation of a KL divergence. Hence as required.
4. follows by Bayes Theorem, i.e.
which completes the proof. ∎
We also present the proof of Theorem 2.
Initially, we note that the contrastive samples do not carry additional information about . We consider the mutual information between and the random variable . Using the Chain Rule for mutual information we have
Now since () are conditionally independent of given . Therefore
We now use the Donsker-Varadhan representation of mutual information (donsker1975asymptotic). Specifically, for random variables with joint distribution and any measurable function we have
We now use this representation with and the integrand
We compute the second term in (36), .
where the second to last line follows by symmetry. This establishes that , and so (14) constitutes a valid lower bound on . That is
which completes the proof. ∎
The following theorem establishes a condition under which the maximum of the ACE objective converges to the maximum of the EIG as .
Consider a model such that
and . Let be an inference network and let
and in particular as .
We have since is a lower bound on by Theorem 1.
Next, we consider . We have
we write (46) as
and we apply the inequality to give
We now observe that for , and hence, taking a partial expectation over we have
as required. ∎
A.1 Double reparametrization
We have the -gradient
and the outer expectation is over . If is reparameterizable as a function of , then we can apply double reparameterization to this gradient. Indeed, were it not for the term, this would be exactly the IWAE of burda2015importance. We can exploit the double reparameterization of tucker2018doubly with a minor variation to account for to obtain a low variance gradient estimator.
The doubly reparametrized gradient for ACE takes the form
A.2 Alternative gradient
We begin with an observation: the true integrand when computing the EIG as an expectation over is given by
Recall the score function identity
by two applications of the score function identity. This suggests that, as becomes close to , the term in (16) has expectation close to zero, and primarily contributes variance to the gradient estimator.
Theorem 2 justifies removing the term. Removing this term is equivalent to the following gradient-coordinate algorithm. First, we choose the family to be . Then at time step we do the following
Take a gradient step with respect to to update
Importantly, the new gradient does not include a term, but is the gradient of a valid lower bound on EIG.
Appendix B EXPERIMENTS
B.1 Death process
|Method||EIG mean s.e.|
|ACE without RB|
|PCE without RB|
|BA without RB|
|BO with NMC|
We place the prior on the infection rate and have the likelihood
We also have the constraint .
For each method, we fixed a computational budget of 120 seconds, and did 100 independent runs. For gradient methods, we used the Adam optimizer with learning rate and the default momentum parameters. The inference network made a separate Gaussian approximation to the posterior for each of the 66 outcomes. To evaluate for comparison we used NMC with a large number of samples: 20000 for Figure 2 and 200000 for the final values in the caption and in Table 3. For the BO, we used a Matern52 kernel with variance 1 and lengthscale 0.25, and the GB-UCB1 algorithm for acquisition.
We used the following number of samples for our Rao-Blackwellized estimators
|Method||Number of samples|
|ACE||10 + 660|
We consider the following prior on
with the likelihood
This represents a standard regression model, although with non-Gaussian prior distributions we cannot compute the posterior or true EIG analytically. To ensure the EIG has a finite maximum, we impose the following constraint
In practice, we set .
For each of our five methods, we fixed the computational budget to 15 minutes and did 10 independent runs. For gradient methods, we used a learning rate of and the Adam optimizer (kingma2014adam) with default momentum parameters. The inference network used the following variational family
and we used a neural network with the following architecture
For the other methods, point evaluations of were made using VNMC. Each VNMC evaluation took 1000 steps, with the optimization as above (but with fixed). We used a GP with Matern52 kernel with lengthscale 5, variance 10. We used a GP-UCB1 acquisition rule, and terminated once 15 minutes had passed. For random search, we sampled designs using a standard unit Gaussian.
We used the following number of samples
|Method||Inner samples||Outer samples|
To evaluate designs, we used ACE/VNMC. We first trained ACE using the same procedure as above, for 20000 steps. Then we made the final ACE/VNMC evaluations using the fixed inference network and inner samples, outer samples.
We introduce a LogNormal likelihood and a -dimensional latent variable governed by a Normal prior, the joint density of our model is
where controls the observation noise and is a non-diagonal precision matrix. Since there are correlations among the regions, the optimal advertising budget (w.r.t. gaining information about ) allocates more money to the regions that are tightly correlated.
Throughout we assume that the number of regions is even. We set the budget to scale with the number of dimensions, , set and choose the prior precision matrix to be
where the first (last) components of are given by (1), respectively, and where controls (as we shall see) the degree of asymmetry in the optimal design. Discarding an irrelevant constant, we can compute the exact EIG using the formula:
Using the matrix determinant lemma for rank-1 matrix updates we can then compute
By symmetry the optimum (it is easy to check that it is a maximum) of will satisfy for . In other words is entirely specified by and , which must satisfy because of the constraint on the budget . Thus we have reduced the EIG maximization problem to a univariate optimization problem that can easily be solved to machine precision, for example by gradient methods or brute force bisection.
For each of the four methods, we fix the computational budget to 120 seconds per design optimization. For the gradient-based methods this corresponds to , , and gradient steps for ACE, PCE, and Posterior, respectively. For the Bayesian Optimization baseline, we run 110 steps of a GP-UCB-like algorithm (srinivas2009gaussian) in batch-mode, resulting in a total budget of function evaluations of the EIG oracle. Note that for all four methods the runtime dependence on the dimension is negligible in the regime in which we are operating; consequently we use the same number of gradient or BO steps for all .
For the gradient-based methods, we use the kingma2014adam optimizer with default momentum hyperparameters and an initial learning rate of that is exponentially decayed towards a final learning rate that depends on the particular method. In particular we set , , and for the ACE, PCE, and Posterior methods, respectively. For the BO baseline, we used a Matern kernel with a fixed length scale . These hyperparameters were chosen by running a grid search with and choosing hyperparameters that minimized the mean absolute EIG error.
Finally we note that in Fig. 3 at each dimension we normalize the EIG by the factor
where and are the optimal and uniform budget designs, respectively. Consequently after normalization the absolute error for the uniform budget design is equal to unity.
B.4 Biomolecular docking
For the docking model, we used the following independent priors
For the design we had 100 binary responses
For gradient methods, we used the Adam optimizer (kingma2014adam) with learning rate and default momentum parameters. For each method, we took gradient steps (each method converged within this number of steps). The inference network was mean-field with the same distributional families as the prior. We used the following neural architecture
|H2 ee50 mean||1||-|
|H2 ee50 s.d.||1||Softplus|
|H2 slope mean||1||-|
|H2 slope s.d.||1||Softplus|
We used the following number of samples
|Method||Inner samples||Outer samples|
For the expert method, the design of lyu2019ultra, which comprised 580 compounds, was subsampled to comprise 100 compounds for a fair comparison.
For evaluation, we used ACE/VNMC, first training ACE for 25000 steps using the same learning rate as above. With the fixed inference network, we made ACE and VNMC evaluations using inner samples, outer samples—a total of 9 billion samples.
B.5 Constant elasticity of substitution
We used the exact set-up of foster2019variational. Specifically, we take and place the following priors on
where is the censored sigmoid function. All designs were constrained to .
For gradient methods, we used the Adam optimizer with learning rate and default momentum parameters. To make the design process 120 seconds per step, we used the following number of gradient steps
|Method||Number of steps|
We found that there was insufficient time to effectively train a neural network guide. Instead we used a mean-field variational family with the same distributional families as the prior, and a linear model using the following features: .
We used the following number of samples
|Method||Inner samples||Outer samples|
For the baseline, we used the marginal upper bound of foster2019variational with the same variational family used in that paper—an -transformed Normal with additional point masses at the end-points. We used a GP with a Matern52 kernel, lengthscale 20, variance set from data, and a GP-UCB1 algorithm to make acquisitions which were done in batches of 8.
At each stage, the posterior was fitted using mean-field variational inference using the same distributional families as the prior.