A Unified Stochastic Gradient Approach to Designing Bayesian-Optimal Experiments

  • 2019-11-01 10:45:12
  • Adam Foster, Martin Jankowiak, Matthew O'Meara, Yee Whye Teh, Tom Rainforth
  • 37

Abstract

We introduce a fully stochastic gradient based approach to Bayesian optimalexperimental design (BOED). This is achieved through the use of variationallower bounds on the expected information gain (EIG) of an experiment that canbe simultaneously optimized with respect to both the variational and designparameters. This allows the design process to be carried out through a singleunified stochastic gradient ascent procedure, in contrast to existingapproaches that typically construct an EIG estimator on a pointwise basis,before passing this estimator to a separate optimizer. We show that this, inturn, leads to more efficient BOED schemes and provide a number of a differentvariational objectives suited to different settings. Furthermore, we show thatour gradient-based approaches are able to provide effective design optimizationin substantially higher dimensional settings than existing approaches.

 

Quick Read (beta)

Abstract

We introduce a fully stochastic gradient based approach to \acrfullOED. This is achieved through the use of variational lower bounds on the \acrfullEIG of an experiment that can be simultaneously optimized with respect to both the variational and design parameters. This allows the design process to be carried out through a single unified stochastic gradient ascent procedure, in contrast to existing approaches that typically construct an \acrshortEIG estimator on a pointwise basis, before passing this estimator to a separate optimizer. We show that this, in turn, leads to more efficient \acrshortOED schemes and provide a number of a different variational objectives suited to different settings. Furthermore, we show that our gradient-based approaches are able to provide effective design optimization in substantially higher dimensional settings than existing approaches.

\Crefname

algocfAlgorithmAlgorithms \crefnamealgorithmAlgorithmAlgorithms \crefnamefigureFigureFigure \crefnamesection§§§ \Crefnamesection§§§ \crefformatequation(#2#1#3) \glsdisablehyper \newacronymKLklKullback-Leibler \newacronymSGDsgdstochastic gradient descent \newacronymOEDBOEDBayesian optimal experimental design \newacronymMIMImutual information \newacronymEIGEIGexpected information gain \newacronymNMCNMCnested Monte Carlo \newacronymACEACEadaptive contrastive estimation \newacronymNCENCEnoise contrastive estimation \newacronymBABABarber-Agakov \newacronymBOBOBayesian optimization \newacronymMCMCMonte Carlo \newacronymRMSERMSEroot mean squared error

 

A Unified Stochastic Gradient Approach to Designing Bayesian-Optimal Experiments


 


Adam Foster                        Martin Jankowiak                        Matthew O’Meara§                        Yee Whye Teh                        Tom Rainforth†✠

Department of Statistics University of Oxford Oxford, UK                        Uber AI San Francisco, CA, USA                        §University of Michigan Ann Arbor, MI, USA                        Christ Church University of Oxford Oxford, UK

\glsresetall

1 INTRODUCTION

The design of experiments is a key problem in almost every scientific discipline. Namely, one wishes to construct an experiment that is most informative about the process being investigated, while minimizing its cost. For example, in a psychological trial, we want to ensure questions posed to participants are pertinent and do not have predictable responses. In a pharmaceutical trial, we want to minimize the number of participants needed to successfully test our hypotheses. In an online automated help system, we want to ensure we ask questions that identify the user’s problem as quickly as possible.

In all these scenarios, our ultimate high-level aim is to choose designs that maximize the information gathered by the experiment. A powerful and broadly used approach for formalizing this aim is \acrfullOED (chaloner1995; lindley1956; myung2013). In \acrshortOED, we specify a Bayesian model for the experiment and then choose the design that maximizes the \acrfullEIG from running it. More specifically, let θ denote the latent variables we wish to learn about from running the experiment and let ξΞ represent the experimental design. By introducing a prior p(θ) and a predictive distribution p(y|θ,ξ) for experiment outcomes y, we can calculate the \acrshortEIG under this model by taking the expected reduction in posterior entropy

I(ξ)𝔼p(y|ξ)[H[p(θ)]-H[p(θ|y,ξ)]], (1)

where H[] represents the entropy of a distribution and p(θ|y,ξ)p(y|θ,ξ)p(θ). Our experimental design process now becomes that of the finding the design ξ* that maximizes I(ξ).

Unfortunately, finding ξ* is typically a very challenging problem in practice. Even evaluating I(ξ) for a single design is computationally difficult because it represents a nested expectation and thus has no direct \acrlongMC estimator (nmc). Though a large variety of approaches for performing this estimation have been suggested (myung2013; watson2017quest+; kleinegesse2018efficient; foster2019variational), the resulting \acrshortOED strategies share a critical common feature: they estimate I(ξ) on a point-by-point basis and feed this estimator to an outer-level optimizer that selects which points to evaluate.

This framework can be highly inefficient for a number of reasons. For example, it adds an extra level of nesting to the overall computation process: I(ξ) must be separately estimated for each ξ, substantially increasing the overall computational cost. Furthermore, one must typically resort to gradient-free methods to carry out the resulting optimization, which means it is difficult to scale the overall \acrshortOED process to high dimensional design settings due to a dearth of optimization schemes which remain effective in such settings.

To alleviate these inefficiencies and open the door to applying \acrshortOED in high-dimensional settings, we introduce an alternative to this two-stage framework by introducing unified objectives that can be directly maximized to simultaneously estimate I(ξ) and optimize ξ. Specifically, by building on the work of foster2019variational, we construct variational lower bounds to I(ξ) that can be simultaneously optimized with respect to both the variational and design parameters. Optimizing the former ensures that we achieve a tight bound that in turn gives accurate estimates of I(ξ), while simultaneously optimizing the latter circumvents the need for an expensive outer optimization process. Critically, this approach allows the optimization to be performed using stochastic gradient ascent (SGA) (robbins1951stochastic) and therefore scaled to substantially higher dimensional design problems than existing approaches.

To account for the varying needs of different problem settings, we introduce several classes of suitable variational lower bounds. Most notably, we introduce the \acrfullACE bound: an \acrshortEIG variational lower bound that can be made arbitrarily tight, while remaining amenable to simultaneous SGA on both the variational parameters and designs.

We demonstrate the applicability of our unified gradient approach using a wide range of experimental design problems, including a real-world high-dimensional example from the pharmacology literature (lyu2019ultra). We find that our approaches are able to effectively optimize the \acrshortEIG, consistently outperforming baseline two-stage approaches, with particularly large gains achieved for high-dimensional problems. These gains lead, in turn, to improved designs and more informative experiments.

2 BACKGROUND

2.1 Bayesian optimal experimental design

When experimentation is costly, time consuming, or dangerous, it is essential to design experiments to learn the most from them. To choose between potential designs, we require a metric of the quality of a candidate design. In the \acrshortOED framework dating back to lindley1956, this metric represents how much more certain we will become in our knowledge of the world after doing the experiment and analyzing the data. We prefer designs that will lead to strong conclusions even if we are not yet sure what those conclusions will be.

Specifically, the \acrshortOED framework begins with Bayesian model of the experimental process. This is defined through the combination of a likelihood p(y|θ,ξ) that predicts the experimental outcome under design ξ and latent variable θ, and a prior p(θ) which incorporates initial beliefs about the unknown θ. After conducting the experiment, our beliefs are updated to the posterior p(θ|y,ξ). The information gained about θ from doing the experiment with design ξ and obtaining outcome y is the reduction in entropy from the prior to the posterior

IG(y,ξ)=H[p(θ)]-H[p(θ|y,ξ)]. (2)

As it stands, information gain cannot be evaluated until after the experiment. To define a metric that will let us choose between designs before experimentation, we define the expected information gain (EIG), I(ξ), by taking the expectation of IG over hypothesized outcomes y using the marginal distribution under our model, p(y|ξ), to give

I(ξ)𝔼p(y|ξ)[H[p(θ)]-H[p(θ|y,ξ)]] (3)

which can be rewritten in the form of a mutual information between θ and y with ξ fixed, namely

I(ξ)=MIξ(θ;y)=𝔼p(θ)p(y|θ,ξ)[logp(y|θ,ξ)p(y|ξ)]. (4)

The Bayesian optimal design, ξ*, is now the one which maximizes EIG over the set of feasible designs Ξ

ξ*=argmaxξΞI(ξ). (5)

In iterated experiment design, we design a sequence ξ1,,ξT of experiments. At time t, the prior p(θ) in (4) is replaced by the posterior given the previous experiments ξ1,,ξt-1 and their observed outcomes y1,,yt-1, namely

p(θ|ξ1:t-1,y1:t-1)p(θ)τ=1t-1p(yτ|θ,ξτ). (6)

This now provides a means to design a sequence of adaptive experiments, wherein we use information gathered from previous iterations to improve the designs used at future iterations.

2.2 Estimating expected information gain

Making even a single point estimate of \acrshortEIG when solving (5) can be challenging because we must first estimate the unknown p(y|ξ) or p(θ|y,ξ), and then take an expectation over p(θ)p(y|θ,ξ). Nested Monte Carlo (NMC) estimators (nmc), which make a Monte Carlo approximation of both the inner and outer integrals, converge relatively slowly: at a rate 𝒪(T-1/3) in the total computational budget T.

foster2019variational noted that this approach is inefficient because it makes a separate \acrlongMC approximation of the integrand for every sample of the outer integral. To share information between different samples, they proposed a number of variational estimators that used amortization, i.e. they attempted to learn the functional form of the integrand rather than approximating it afresh each time. One of their approaches was based on amortized variational inference and required an inference network qϕ(θ|y) which takes as input ϕ,y and outputs a distribution over θ. For any qϕ(θ|y), we can construct a lower bound on I(ξ). This is the \acrfullBA, or posterior, lower bound (ba)

IBA(ξ,ϕ)𝔼p(θ)p(y|θ,ξ)[logqϕ(θ|y)]+H[p(θ)], (7)

which has also found use representation learning (poole2018variational) and maximizing information transmission in noisy channels (ba).

To make high-quality approximations to I(ξ), and simultaneously learn a good posterior approximation, foster2019variational maximize this bound with respect to ϕ. This approach is most effective when the bound is tight, i.e. maxϕIBA(ξ,ϕ)=I(ξ). For IBA(ξ,ϕ), this occurs when it is possible to have qϕ(θ|y)=p(y|θ,ξ), i.e. when the inference network is powerful enough to find the true posterior distribution for every y.

The required optimization is performed using SGA (robbins1951stochastic), which requires unbiased gradient estimators. For IBA/ϕ, this is obtained by drawing N Monte Carlo samples θn,yni.i.d.p(θ)p(y|θ,ξ) and forming the estimator

IBAϕ^=1Nn=1Nϕlogqϕ(θn|yn). (8)

To obtain high-quality approximations of I(ξ) even when the inference network cannot capture the true posterior, foster2019variational also considered another variational estimator: variational nested Monte Carlo (VNMC). This uses the inference network qϕ(θ|y) in conjunction with additional samples to improve the estimate of the integrand. They showed that this leads to the following upper bound on I(ξ)

IVNMC(ξ,ϕ,L)𝔼[logp(y|θ0,ξ)1L=1Lp(θ)p(y|θ,ξ)qϕ(θ|y)], (9)

where the expectation is over p(θ0)p(y|θ0,ξ)qϕ(θ1:L|y). The inference network in VNMC is trained by minimization, in the same way IBA is trained by maximization. IVNMC has the attractive feature that the bound becomes tight as the number of inner samples becomes large, i.e. L, even if qϕ(θ|y) is not powerful enough to directly represent the true posterior.

2.3 Optimizing the EIG

The experimental design problem is to find the design that maximizes the EIG. Therefore, as well as finding a way to estimate EIG, existing approaches subsequently need to find a way of searching across Ξ to find promising designs. At a high-level, most existing approaches propose a two-stage procedure in which noisy estimates of I(ξ) are made, and a separate optimization procedure selects the candidate design ξ to evaluate next.

kleinegesse2018efficient and foster2019variational both use \acrfullBO for this outer optimization step, a black-box optimization method (snoek2012practical) that is tolerant to noise in the estimates of the objective function, in this case I(ξ). Some approaches (watson2017quest+; lyu2019ultra) instead select a finite number of candidate designs in Ξ and estimate I(ξ) at each candidate, with some refining this process further by adaptively allocating computational resources between these designs (vincent2017; rainforth2017thesis). Another suggested approach is to use MCMC methods to carry out this outer optimization (amzal2006bayesian; muller2005simulation).

3 GRADIENT-BASED DESIGN OPTIMIZATION

Our central proposal is to replace the two-stage procedure outlined above with a single stage that both estimates I(ξ) and optimizes ξ simultaneously. This has the critical advantage of allowing SGA to be directly applied to the design optimization. Not only does this provide substantial computational gains over approaches which must construct separate estimates for each hypothetical design considered, but it also provides the potential to scale to substantially higher dimensional design problems than those which can be effectively tackled with existing approaches. Since we take gradients with respect to ξ, we henceforth assume that Ξ is continuous.

In our approach, we utilize variational lower bounds on I. Specifically, suppose we have a bound (ξ,ϕ)I(ξ) with variational parameters ϕ. For fixed ξ, the estimate of I(ξ) improves as we maximize with respect to ϕ. We propose to maximize jointly with respect to (ξ,ϕ). As we train ϕ, the variational approximation improves; as we train ξ our design moves to regions where the lower bound on \acrshortEIG is largest. By tackling this as a single optimization problem over (ξ,ϕ), we obviate the need to have an outer optimizer for ξ. Using a lower bound is important because it allows us to perform a single maximization over (ξ,ϕ), rather than a more complex optimization such as the max-min optimization that would result if we used an upper bound.

In practice, we do not have lower bounds on I that we can evaluate and differentiate in closed form. Instead, we have lower bounds in the form of expectations over p(θ)p(y|θ,ξ). Fortunately, we can still maximize the lower bound with respect to (ξ,ϕ) in this case by using SGA. Stochastic gradient methods have been applied in machine learning to train millions of parameters (bottou2010large) and are a good fit for high-dimensional experimental design.

3.1 \acrfullBA

We now make our first concrete proposal for the lower bound (ξ,ϕ): the BA bound IBA, as defined in (7). We now optimize (ξ,ϕ) jointly where previously only ϕ was trained using gradients. To perform SGA, we require unbiased estimators for both IBA/ϕ and IBA/ξ. We use equation (8) to estimate the ϕ-gradient. For the ξ-gradient we have

IBAξ^=1Nn=1Nlogqϕ(θn|yn)ξlogp(yn|θn,ξ) (10)

where θn,yni.i.d.p(θ)p(y|θ,ξ), which is an unbiased estimator of IBA/ξ. This is a score function estimator (williams1992simple) and other possibilities are discussed in Section 3.6.

3.2 Adaptive contrastive estimation (ACE)

The BA bound provides one specific case of our one-stage procedure for optimal experimental design. We now introduce a new lower bound that improves upon IBA. The potential issue with the BA bound is that it may not be sufficiently tight, which happens when the inference network cannot represent the true posterior. One possible solution is to introduce additional samples, as in the VNMC estimator (9). However, we cannot use VNMC directly for a one-stage procedure: since it is an upper bound, we must minimize it with respect to ϕ, but we still wish to maximize with respect to ξ. This precludes a joint maximization over (ξ,ϕ).

Looking more closely at the VNMC bound, we see that its main failure case is when the denominator strongly under-estimates p(y|ξ), which can happen when all the inner samples θ1,,θL miss regions where the joint p(θ)p(y|θ,ξ) is large. In addition to the samples θ1:L, we also have the original sample θ0 from which y was sampled. One way to avoid the under-estimation in the denominator would be to include this sample, giving

IACE(ξ,ϕ,L)=𝔼[logp(y|θ0,ξ)1L+1=0Lp(θ)p(y|θ,ξ)qϕ(θ|y)] (11)

where the expectation is with respect to p(θ0)p(y|θ0,ξ)q(θ1:L|y). The samples θ1:L can now be seen as contrasts to the original sample θ0. For this reason, we call θ1:L contrastive samples and we call (11) the adaptive contrastive estimate (ACE) of EIG. The following theorem establishes that IACE is a valid lower bound on the EIG and that the bound becomes tight as L.

Theorem 1.

For any model p(θ)p(y|θ,ξ) and inference network qϕ(θ|y), we have the following:

  1. 1.

    IACE is a lower bound on the \acrshortEIG and we can characterize the error term as an expected KL divergence:

    I (ξ)-IACE(ξ,ϕ,L)
    =𝔼p(y|ξ)[KL(P(θ0:L|y)||qϕ(θ|y))]0,
    P (θ0:L|y)=1L+1=0Lp(θ|y,ξ)kqϕ(θk|y).
  2. 2.

    As L, we recover the true \acrshortEIG:
    limLIACE(ξ,ϕ,L)=I(ξ).

  3. 3.

    The ACE bound is monotonically increasing in L: IACE(ξ,ϕ,L2)IACE(ξ,ϕ,L1) for L2L10.

  4. 4.

    If the inference network equals the true posterior qϕ(θ|y)=p(θ|y,ξ), then IACE(ξ,ϕ,L)=I(ξ),L.

The proof is presented in Appendix A, along with Theorem 3 which concerns properties of the maximum of the ACE bound. Gradient estimation for ACE is discussed in Section 3.6. We note that, to the best of our knowledge, IACE has not previously appeared in the \acrshortOED literature.11 1 Aside from a recent blog post (sobolev2019thoughts) we do not believe this bound has previously been studied in any context.

3.3 Prior contrastive estimation (PCE)

Theorem 1 tells us that IACE can become close to I(ξ) in two scenarios: 1) the inference network becomes close to the true posterior p(θ|y,ξ), 2) we increase the number of contrastive samples L. The BA bound only becomes tight in case 1). A special case of ACE is to replace the inference network qϕ(θ|y) with a fixed distribution and rely on the contrastive samples to make good estimates of I(ξ), only becoming tight in case 2), i.e. as L. This simplification can speed up training, since we no longer need to learn additional parameters ϕ.

To explore this, we propose the prior contrastive estimation (PCE) bound, in which the prior p(θ) is used to generate contrastive samples:

IPCE(ξ,L)𝔼[logp(y|θ,ξ)1L+1=0Lp(y|θ,ξ)], (12)

where the expectation is over p(θ0)p(y|θ0,ξ)p(θ1:L). Whilst inherently less powerful than ACE, PCE can be effective when the prior and posterior are not too different, so p(θ) is a suitable proposal to estimate p(y|ξ).

Though, to the best of our knowledge, this bound has not been applied to \acrshortOED before, we note that it shares a connection to the information noise contrastive estimation (InfoNCE) bound on mutual information used in representation learning (oord2018representation). Given K data samples xk, corresponding representations zk, and a learned critic fψ(x,z)0, we have

MI(x;z)𝔼[1Kk=1Klogfψ(xk,zk)1K=1Kfψ(x,zk)] (13)

where the expectation is over p(x)p(z|x), p(x) is the data distribution, and p(z|x) is the encoder. poole2018variational showed that the encoder density p(z|x) is the optimal critic, although it is rarely known in closed form in the representation learning context. Writing θ for x and y for z, we note the mathematical connection between this optimal case and IPCE.

3.4 Likelihood-free ACE

In some models such as random effects models, the likelihood p(y|θ,ξ) is not known in closed form but can be sampled from. This presents a problem when computing IACE or its derivatives because the likelihood appears in (11). To allow ACE to be used for these kinds of models, we now show that using a unnormalized approximation to the likelihood still results in a valid lower bound on \acrshortEIG. In fact, if using a parametrized likelihood approximation fψ, it is then possible to train ψ jointly with (ξ,ϕ) to approximate the likelihood, learn an inference network, and find the optimal design through the solution to a single optimization problem. The following theorem, whose proof is presented in Appendix A, shows that replacing the likelihood with an unnormalized approximation does result in a valid lower bound on \acrshortEIG.

Theorem 2.

Consider a model p(θ)p(y|θ,ξ) and inference network qϕ(θ|y). Let fψ(θ,y)0 be an unnormalized likelihood approximation. Then,

I(ξ)𝔼[logfψ(θ0,y)1L+1=0Lp(θ)fψ(θ,y)qϕ(θ|y)] (14)

where the expectation is over p(θ0)p(y|θ0,ξ)qϕ(θ1:L|y).

3.5 Iterated experimental design with ACE

In iterated experimental design, we replace p(θ) by p(θ|y1:t-1,ξ1:t-1) as per (6). We can sample p(θ|y1:t-1,ξ1:t-1) by performing inference. Whilst variational inference also provides a closed form estimate of the posterior density, some other inference methods do not. This is problematic because the prior density appears in (11). Fortunately, it is sufficient to know the density up to proportionality (foster2019variational). Indeed if p(θ)=Aγ(θ) where A does not depend on (ξ,ϕ,y) and γ is an unnormalized density, then

I(ξ)𝔼[logp(y|θ0,ξ)1L+1=0Lγ(θ)p(y|θ,ξ)qϕ(θ|y)]-logA (15)

and the derivatives of logA are simply zero.

3.6 Gradient estimation for ACE

To optimize the ACE bound with respect to (ξ,ϕ) it is necessary to have unbiased gradient estimators of IACE/ξ and IACE/ϕ.

The simplest form of the ξ-gradient is

IACEξ=𝔼[gξ+gξlogp(y|θ0,ξ)]. (16)

where the expectation is with respect to p(θ0)p(y|θ,ξ)q(θ1:L|y), and

g(y,θ0:L,ϕ,ξ)=logp(y|θ0,ξ)1L+1=0Lp(θ)p(y|θ,ξ)qϕ(θ|y). (17)

Estimating the expectation (16) directly using Monte Carlo gives the score function, or REINFORCE, estimator (williams1992simple). Unfortunately, this is often high variance, and reducing gradient estimator variance is often important in solving challenging experimental design problems.

One variance reduction method is reparameterization. For this, we introduce random variables ϵ,ϵ1:L whose distributions do not depend on (ξ,ϕ) along with representations of y and θ as deterministic functions of these variables:

y=y(θ0,ξ,ϵ)  θ=θ(y,ϕ,ϵ). (18)

This now permits a reparameterized form of the gradient:

IACEξ=𝔼[gξ+gyyξ+=1Lgθθyyξ] (19)

where the expectation is over p(θ0)p(ϵ)p(ϵ1:L). A Monte Carlo approximation of this expectation is typically a much lower variance estimator for the true ξ-gradient.

Alternatively, if y is a discrete random variable we can sum over the possible values 𝒴. This approach is known as Rao-Blackwellization and gives

IACEξ=y𝒴𝔼[gξp(y|θ0,ξ)+gξp(y|θ0,ξ)] (20)

where the expectation is now over p(θ0)=1Lqϕ(θ|y).

Turning to IACE/ϕ, we note that if θ1:L are reparameterizable as in (18), then we can utilize the double reparameterization of tucker2018doubly; for full details see Appendix A.1.

4 EXPERIMENTS

Figure 1: A sample trajectory for the death process. The grayscale shows the EIG surface (white is maximal), whilst crosses show the optimization trajectory of ξ using ACE with pink representing later steps. See Sec. 4.2 for details.

In our experiments, we learn optimal experimental designs in five scenarios: the death process, a well known two-dimensional design problem from epidemiology; a non-conjugate regression model with a 400-dimensional design; an ablation study in the setting of advertising; a real-world biomolecular docking problem from pharmacology in 100 dimensions; and a constant elasticity of substitution iterated experimental design problem in behavioural economics with 6 dimensional designs.

4.1 Evaluating experimental designs

We first discuss which metrics we will use to judge the quality of the designs we obtain.

Our primary metric on designs is, of course, the \acrshortEIG. We prefer designs with high \acrshortEIGs. In some cases, we can evaluate the \acrshortEIG analytically. In other cases, we can use a sufficiently large number of samples in a \acrshortNMC (nmc) estimator to be sure that we have estimates that are sufficiently accurate to compare designs.

However, to explore the limits of our methods, we will also consider scenarios where neither of these approaches is suitable. In these cases, we pair the ACE lower bound (with ξ fixed for evaluation) with the VNMC upper bound (foster2019variational) to trap the true \acrshortEIG value—if the lower bound of one design is higher than the upper bound for another, we can be sure that the first design is superior (noting that the bounds themselves can be tractably estimated to a very high accuracy).

In some settings, when we know the true optimal design ξ*, we will also consider the design error ξ*-ξ, i.e. how close our design is to the optimal design.

In iterated experiment design, as well as designing experiments, we must also perform inference on the latent variable θ after each iteration. Here, we also investigate the quality of the final posterior. Specifically, if p(θ|y1:t,ξ1:t) is the posterior after t experiments, we use the posterior entropy, and the posterior RMSE 𝔼θp(θ|y1:t,ξ1:t)[(θ-θ*)2]1/2. We prefer low entropy and low RMSE values.

4.2 Death process

Figure 2: Optimization of EIG for the death process as a function of wall clock time. We depict the mean and ±1 standard error (s.e.) from 100 runs. The final EIG values (rightmost points) are as follows: [ACE] 0.9830±0.0001, [PCE] 0.9822±0.0001, [BA] 0.9822±0.0002, [BO] 0.9732±0.0009. See Sec. 4.2 for details.

We consider an example from epidemiology, the death process (cook2008optimal; kleinegesse2018efficient), in which a population of N=10 individuals transitions from healthy to infected states at a constant but unknown rate θ. We can measure the number of infected individuals at two different times ξ1 and ξ1+ξ2 where ξ1,ξ20. Our aim is to infer the infection rate θ from these observations. For full details of the prior and likelihood used, see Appendix B.1.

On this problem, we apply gradient methods with Rao-Blackwellization (there are 66 possible outcomes). Figure 1 shows a sample optimization trajectory with the approximate EIG surface for illustration. We compare against \acrshortBO using the Rao-Blackwellized NMC estimator of vincent2017. Figure 2 shows that, for the allowed time budget, all gradient methods perform better than BO even on this two-dimensional problem.

4.3 Regression

We now compare our one-stage gradient approaches to experimental design against a two-stage baseline on a high-dimensional design problem. We choose a general purpose Bayesian linear regression model with n observations and p features. The design ξ is an n×p matrix; the latent variables are θ=(𝐰,σ), where 𝐰 is the p dimensional regression coefficient and σ2 is the scalar variance. The n outcomes are generated using a Normal likelihood yiN(𝝃i𝐰,σ) for i=1,,n. Here 𝝃i is the ith row of ξ. To avoid trivial solutions, we enforce the constraint 𝝃i1=1 for all i. We use independent priors wjLaplace(1) for j=1,,p and σExp(1). See Appendix B.2 for complete details.

We set n=p=20 and applied five methods to this 400 dimensional design problem: BA, ACE and PCE, as well as the VNMC estimator of foster2019variational, with both \acrshortBO and random search to optimize over Ξ. The results are presented in Table 1. We note that the gradient methods strongly outperform the gradient-free baselines, with about double the final EIG.

Table 1: Regression results. We estimate lower and upper bounds on the final EIG and present the mean and ±1 s.e. from 10 runs. See Sec. 4.3 for details.

Method
EIG l.b. EIG u.b.
ACE 16.1±0.1 20.7±0.2
PCE 16.6±0.1 21.5±0.2
BA 16.4±0.2 21.1±0.2
BO + VNMC 7.3±0.1 9.6±0.1
Random Search + VNMC 7.1±0.1 9.4±0.1

4.4 Advertising

We now conduct a detailed ablation study on the effects of dimension on the quality of experimental designs produced using our gradient approaches and \acrshortBO. To further isolate the distinction between one-stage and two-stage approaches to \acrshortOED, we choose a setting in which we can compute I(ξ) analytically. We give \acrshortBO, but not the gradient methods, access to the EIG oracle when making point evaluations of I(ξ). Thus we put \acrshortBO in the best possible position and ensure any gains are due to improvements from using gradient-based optimization.

Suppose that we are given an advertising budget of B dollars that we need to allocate among D regions, i.e. we choose 𝝃𝟎 with i=1Dξi=B. After conducting an ad campaign, we observe a vector of sales 𝒚. We use this data to make inferences about the underlying market opportunities 𝜽 in each region. Our prior incorporates the knowledge that neighbouring regions are more correlated than distant ones—this leads to an interesting experimental design problem because information can be pooled between regions. We can also compute the true EIG and optimal design ξ* analytically. For full details, see Appendix B.3.

We compare the performance of four estimation and optimization methods on this problem, see Fig. 3 for the results. The three gradient-based methods (ACE, PCE, BA) perform best, with the BO baseline struggling in dimensions D6, even though the latter has access to an EIG oracle. PCE performed well in low dimensions, but degraded as the dimension increases and sampling from the prior becomes increasingly inefficient—something that ACE and BA avoid by learning adaptive proposal distributions. We note that since in this case the family of variational distributions used in ACE and BA include the true posterior, both methods yield similar performance.

Figure 3: Mean absolute EIG and design errors for the marketing model in Sec. 4.4 obtained with four different estimators and averaged over 10 runs. The EIG is normalized such that an EIG error of unity corresponds to doing no better than a uniform budget, i.e. ξi=B/D for i=1,,D.

4.5 Biomolecular docking

(a) Entropy
(b) Posterior RMSE of ρ
(c) Posterior RMSE of 𝜶
(d) Posterior RMSE of u
Figure 4: Improvement in the posterior in the sequential CES experiment. Each step took 120 seconds for each method. (a) Total entropy. (b)(c)(d) The root mean square error (RMSE) of the posterior approximations of ρ, 𝜶 and u compared to the true values. We present the mean and ±1 standard error from 10 runs. See Sec. 4.6 for details.
Figure 5: Designs for the biomolecular docking problem obtained by ACE and by lyu2019ultra. Designs consist of 100 docking scores at which to test compounds.

We now consider an experimental design problem of interest to the pharmacology community. Having demonstrated that our one-stage gradient methods compare favourably with two stage approaches, we now compare against designs crafted by domain experts.

In molecular docking, computational techniques are used to predict the binding affinity between a compound and a receptor. When synthesized in the lab, the two may bind—this is called a hit. Learning a well-calibrated hit-rate model can guide how many compounds to evaluate for additional objectives, such as drug-likeness or toxicity, before experimental testing. lyu2019ultra modelled the probability of outcome yi being a hit, given the predicted binding affinity, or docking score, ξi[-75,0], as

p(yi=1|θ,ξ)=bottom+top-bottom1+e-(ξi-ee50)×slope (21)

where θ=(top,bottom,ee50,slope) with priors given in Appendix B.4.

Of 150 million compounds, lyu2019ultra selected a batch of compounds to experimentally test to best fit the sigmoid hit-rate model. They considered 6 candidate designs and selected one that maximized the EIG estimated by NMC. Here, we instead apply gradient-based BOED to search across candidate designs which consist of 100 docking scores ξ1,,ξ100. To evaluate our final designs, we present upper and lower bounds on the final EIG: see Table 2. We see that all gradient methods are able to outperform experts in terms of \acrshortEIG, and that ACE appears the best of the gradient methods. Figure 5 shows our designs are qualitatively different to those produced by experts.

4.6 Constant elasticity of substitution

Table 2: Biomolecular docking results. We computed lower and upper bounds on the final EIG. We present the mean and ±1 s.e. from 10 runs. For the expert, we took the best design of lyu2019ultra appropriately rescaled to consist of 100 docking scores for comparison.
Method EIG lower bound EIG upper bound
ACE 1.0835±0.0003 1.0852±0.0001
PCE 1.0825±0.0002 1.0839±0.0002
BA 1.0780±0.0003 1.0794±0.0003
Expert 1.0191 1.0227

We now turn to iterated experimental design in which we produce designs, generate data and make inference repeatedly. This problem therefore captures the end-to-end-process of experimentation and inference.

We consider an experiment in behavioural economics that was previously also considered by foster2019variational. In this experiment, a participant is asked to compare baskets 𝐱,𝐱 of goods. The model assumes that their response (on a slider) is based on the difference in utility of the baskets, and the constant elasticity of substitution (CES) model (arrow1961capital) governed by latent variables (ρ,𝜶,u) is then used for this utility. The aim is to learn (ρ,𝜶,u) characterizing the participant’s utility. In the experiment, there are 20 sequential steps of experimentation with the same participant. We compare our gradient-based approach against the most successful approach of foster2019variational that approximates the marginal density to form an upper bound on EIG, and BO to optimize ξ. For full details, see Appendix B.5.

Figure 4 shows that gradient-based methods are effective on this problem; both ACE and PCE decrease the posterior entropy and RMSEs on the latent variables faster and further than the baseline, whereas BA does not do so well. We suggest that the similar performance of ACE and PCE is due to the smaller changes in the posterior at middle and late steps, after much data has been accumulated: when the posterior does not change much at each step, p(θ|y1:t-1,ξ1:t-1) forms an effective proposal for estimating p(yt|ξt).

5 CONCLUSIONS

We have introduced a new approach for Bayesian experimental design that does away with the two stages of estimating EIG and separately optimizing over Ξ. We use stochastic gradients to maximize a lower bound on I(ξ) and so find optimal designs by solving a single optimization problem. This unification leads to substantially improved performance, especially on high-dimensional design problems.

Of the three lower bounds, IBA,IACE and IPCE, we note that in all five experiments ACE generally did as well as the better of BA and PCE: we therefore recommend it as the default choice. BA performed well when the inference network could closely approximate the true posterior; PCE performed well when the prior was an adequate proposal for estimating p(y|ξ) and does not require the training of variational parameters.

Acknowledgements

AF gratefully acknowledges funding from EPSRC grant no. EP/N509711/1. YWT’s research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071. TR gratefully acknowledges funding from Tencent AI Labs and a junior research fellowship supported by Christ Church, Oxford.

References

Appendix A GRADIENT-BASED DESIGN OPTIMIZATION

We begin with the proof of Theorem 1, which we restate for convenience.

See 1

Proof.

To begin with 1., we have the error term δ=I(ξ)-IACE(ξ,ϕ,L) which can be written

δ =𝔼[log1L+1=0Lp(θ)p(y|θ,ξ)qϕ(θ|y)p(y|ξ)] (22)
=𝔼[log1L+1=0Lp(θ|y)kqϕ(θ|y)=0Lqϕ(θ|y)] (23)
=𝔼[logP(θ0:|y)=0Lqϕ(θ|y)] (24)

where the expectation is over p(y|ξ)p(θ0|y,ξ)=1Lqϕ(θ|y). Note that the integrand is symmetric under a permutation of the labels 0,,L, so its expectation will be the same under the distribution p(y|ξ)p(θ|y,ξ)kqϕ(θk|y). This in turn implies that we can consider this as an expectation under P

δ=𝔼p(y|ξ)P(θ0:L|y)[logP(θ0:L|y)=0Lqϕ(θ|y)] (25)

which is the expected KL divergence required. We therefore have δ0.

For 2., we assume that p(θ)p(y|θ,ξ)/qϕ(θ|y) is bounded. The ACE denominator is a consistent estimator of the marginal likelihood. Indeed,

1L+1p(θ0)p(y|θ0,ξ)qϕ(θ0|y)0 by boundedness (26)

and

1L+1=1Lp(θ)p(y|θ,ξ)qϕ(θ|y)p(y|ξ)a.s. (27)

as L by the Strong Law of Large Numbers, since

𝔼qϕ(θ|y)[p(θ)p(y|θ,ξ)qϕ(θ|y)]=p(y|ξ). (28)

This establishes the a.s. pointwise convergence of the ACE integrand to logp(y|θ0,ξ)/p(y|ξ). Hence by Bounded Convergence Theorem,

I^ACE(ξ,ϕ,L)I(ξ) (29)

as L.

To establish 3., we let ε=IACE(ξ,ϕ,L2)-IACE(ξ,ϕ,L1). Then

ε =𝔼[log1L1+1=0L1p(θ)p(y|θ,ξ)q(θ|y)1L2+1=0L2p(θ)p(y|θ,ξ)q(θ|y)] (30)
=𝔼[logQ(θ0:L2|y)1L2+1=0L2p(θ|y)kq(θ|y)] (31)

where the expectation is over p(y|ξ)p(θ0|y,ξ)=1L2q(θ|y) and

Q(θ0:L2|y)=1L1+1=0L1p(θ|y)kL2q(θ|y). (32)

By symmetry, we can consider this as an expectation over Q (we can permute the labels 0,,L1). We therefore recognise ε as the expectation of a KL divergence. Hence ε0 as required.

4. follows by Bayes Theorem, i.e.

p(θ)p(y|θ,ξ)p(θ|y,ξ)=p(y|ξ). (33)

which completes the proof. ∎

We also present the proof of Theorem 2.

See 2

Proof.

Initially, we note that the contrastive samples θ1,,θL do not carry additional information about θ0. We consider the mutual information between θ0 and the random variable (y,θ1,,θL). Using the Chain Rule for mutual information we have

MI(θ0;(y,θ1,,θL))=MI(θ0;y)+MI(θ0;(θ1,,θL)|y) (34)

Now MI(θ0;(θ1,,θL)|y)=0 since θ (>0) are conditionally independent of θ0 given y. Therefore

MI(θ0;(y,θ1,,θL))=MI(θ0;y)=I(ξ). (35)

We now use the Donsker-Varadhan representation of mutual information (donsker1975asymptotic). Specifically, for random variables A,B with joint distribution p(a,b) and any measurable function T(a,b) we have

MI(a;b)𝔼p(a,b)[T(a,b)]-log𝔼p(a)p(b)[eT(a,b)]. (36)

We now use this representation with a=θ0,b=(y,θ1,,θL) and T(a,b) the integrand

T(θ0,(y,θ1:L))=logfψ(θ0,y)1L+1=0Lp(θ)fψ(θ,y)qϕ(θ|y). (37)

We compute the second term in (36), Z=𝔼p(a)p(b)[eT(a,b)].

Z =𝔼p(θ0)p(y|ξ)qϕ(θ1:L|y)[fψ(θ0,y)1L+1=0Lp(θ)fψ(θ,y)qϕ(θ|y)] (38)
=𝔼p(y|ξ)qϕ(θ0:L|y)[p(θ0)fψ(θ0,y)qϕ(θ0|y)1L+1=0Lp(θ)fψ(θ,y)qϕ(θ|y)] (39)
=𝔼p(y|ξ)qϕ(θ0:L|y)[1L+1=0Lp(θ)fψ(θ,y)qϕ(θ|y)1L+1=0Lp(θ)fψ(θ,y)qϕ(θ|y)] (40)
=1 (41)

where the second to last line follows by symmetry. This establishes that logZ=0, and so (14) constitutes a valid lower bound on I(ξ). That is

I(ξ) 𝔼[logfψ(y,θ0)1L+1=0Lp(θ)fψ(y,θ)qϕ(θ,y)] (42)

which completes the proof. ∎

The following theorem establishes a condition under which the maximum of the ACE objective converges to the maximum of the EIG as L.

Theorem 3.

Consider a model p(θ)p(y|θ,ξ) such that

CsupξΞinfϕΦ𝔼p(θ)p(y|θ,ξ)[p(θ|y,ξ)qϕ(θ|y,ξ)]<. (43)

and I*supξΞI(ξ)<. Let qϕ(θ|y) be an inference network and let

IL=supξΞ,ϕΦIACE(ξ,ϕ,L). (44)

Then,

0I*-ILC-1L+1 (45)

and in particular ILI* as L.

Proof.

We have 0I*-IL since IACE is a lower bound on I(ξ) by Theorem 1.

Next, we consider Δ(ξ,ϕ,L)=I(ξ)-IACE(ξ,ϕ,L). We have

Δ=𝔼p(θ0)p(y|θ0,ξ)qϕ(θ1:L|y)[logYLp(y|ξ)] (46)

where

YL=1L+1=0Lwandw=p(θ)p(y|θ,ξ)qϕ(θ|y); (47)

we write (46) as

Δ=𝔼[log(1+YL-p(y|ξ)p(y|ξ))] (48)

and we apply the inequality log(1+x)x to give

Δ𝔼[YL-p(y|ξ)p(y|ξ)]. (49)

We now observe that for >0, 𝔼qϕ(θ|y)[w]=p(y|ξ) and hence, taking a partial expectation over θ1:L we have

Δ 𝔼p(θ0)p(y|θ0,ξ)[w0-p(y|ξ)(L+1)p(y|ξ)] (50)
1L+1(𝔼p(θ0)p(y|θ0,ξ)[p(θ0|y,ξ)qϕ(θ0|y)]-1) (51)

Hence

I*-IL =supξΞI(ξ)-supξΞ,ϕΦIACE(ξ,ϕ,L)] (52)
supξΞ[I(ξ)-supϕΦIACE(ξ,ϕ,L)] (53)
supξΞinfϕΦ[Δ(ξ,ϕ,L)] (54)
C-1L+1 (55)

as required. ∎

A.1 Double reparametrization

We have the ϕ-gradient

IACEϕ=𝔼[-ϕ𝔼q(θ1:L|y)[log(=0Lw)|θ0,y]] (56)

where

w=p(θ)p(y|θ,ξ)qϕ(θ|y) (57)

and the outer expectation is over p(θ0)p(y|θ0,ξ). If qϕ(θ|y) is reparameterizable as a function of ϕ, then we can apply double reparameterization to this gradient. Indeed, were it not for the w0 term, this would be exactly the IWAE of burda2015importance. We can exploit the double reparameterization of tucker2018doubly with a minor variation to account for w0 to obtain a low variance gradient estimator.

The doubly reparametrized gradient for ACE takes the form

IACEϕ=𝔼p(θ0)p(y|θ0,ξ)qϕ(θ1:L|y)[=0Lv] (58)

where

v0=w0m=0Lwmϕlogq(θ0|y) (59)

and for >0

v=-(wm=0Lwm)2logwθθϕ. (60)

A.2 Alternative gradient

We begin with an observation: the true integrand when computing the EIG as an expectation over p(θ0)p(y|θ0,ξ) is given by

g*(y,θ0,ξ)=logp(y|θ0,ξ)p(y|ξ). (61)

Recall the score function identity

𝔼p(x|ξ)[ξlogp(x|ξ)]=0. (62)

We have

𝔼p(θ)p(y|θ,ξ) [g*ξ] (63)
=𝔼p(θ)p(y|θ,ξ)[ξlogp(y|θ,ξ)p(y|ξ] (64)
=𝔼p(θ)(𝔼p(y|θ,ξ)[ξp(y|θ,ξ)])-𝔼p(y|ξ)[ξlogp(y|ξ)] (65)
=0 (66)

by two applications of the score function identity. This suggests that, as g becomes close to g*, the g/ξ term in (16) has expectation close to zero, and primarily contributes variance to the gradient estimator.

Theorem 2 justifies removing the g/ξ term. Removing this term is equivalent to the following gradient-coordinate algorithm. First, we choose the family fψ(θ,y) to be p(y|θ,ψ). Then at time step t we do the following

  1. 1.

    Set ψt=ξt

  2. 2.

    Take a gradient step with respect to (ξ,ϕ) to update ξt,ϕt

Importantly, the new gradient does not include a g/ξ term, but is the gradient of a valid lower bound on EIG.

Appendix B EXPERIMENTS

B.1 Death process

Table 3: Death process. We present the final EIG for each method (computed using NMC with 200000 samples).
Method EIG mean ±1 s.e.
ACE 0.9830±0.0001
PCE 0.9822±0.0001
BA 0.9822±0.0002
ACE without RB 0.9789±0.0006
PCE without RB 0.9710±0.0025
BA without RB 0.9322±0.0045
BO with NMC 0.9732±0.0009

We place the prior θLogNormal(0,1) on the infection rate and have the likelihood

I1Binomial(N,e-θξ1)I2Binomial(N-I1,e-θξ2). (67)

We also have the constraint ξ1,ξ20.

For each method, we fixed a computational budget of 120 seconds, and did 100 independent runs. For gradient methods, we used the Adam optimizer with learning rate 10-3 and the default momentum parameters. The inference network made a separate Gaussian approximation to the posterior for each of the 66 outcomes. To evaluate I(ξ) for comparison we used NMC with a large number of samples: 20000 for Figure 2 and 200000 for the final values in the caption and in Table 3. For the BO, we used a Matern52 kernel with variance 1 and lengthscale 0.25, and the GB-UCB1 algorithm for acquisition.

We used the following number of samples for our Rao-Blackwellized estimators

Method Number of samples
ACE 10 + 660
PCE 10
BA 10
NMC 2000
Figure 6: The EIG against time for the death process: comparing Rao-Blackwellization against no Rao-Blackwellization. Each method had a 120 second time budget.

B.2 Regression

We consider the following prior on θ=(𝐰,σ)

wj i.i.d.Laplace(1) for j=1,,p (68)
σ Exponential(1) (69)

with the likelihood

yiN(j=1pξijwj,σ) for i=1,n. (70)

This represents a standard regression model, although with non-Gaussian prior distributions we cannot compute the posterior or true EIG analytically. To ensure the EIG has a finite maximum, we impose the following constraint

j|ξij|=1 for i=1,,n. (71)

In practice, we set n=p=20.

For each of our five methods, we fixed the computational budget to 15 minutes and did 10 independent runs. For gradient methods, we used a learning rate of 10-3 and the Adam optimizer (kingma2014adam) with default momentum parameters. The inference network used the following variational family

𝐰 N(𝝁,sΣ0) (72)
σ Γ(α,β) (73)

and we used a neural network with the following architecture

Operation Size Activation
Input H1 64 ReLU
H1 H2 64 ReLU
H2 𝝁 20 -
H2 (α,β) 2 Softplus
H2 s 1 Softplus
Σ0 20×20 -

For the other methods, point evaluations of I(ξ) were made using VNMC. Each VNMC evaluation took 1000 steps, with the optimization as above (but with ξ fixed). We used a GP with Matern52 kernel with lengthscale 5, variance 10. We used a GP-UCB1 acquisition rule, and terminated once 15 minutes had passed. For random search, we sampled designs using a standard unit Gaussian.

We used the following number of samples

Method Inner samples L Outer samples N
ACE 10 10
PCE 10 10
BA n/a 100
VNMC 10 10

To evaluate designs, we used ACE/VNMC. We first trained ACE using the same procedure as above, for 20000 steps. Then we made the final ACE/VNMC evaluations using the fixed inference network and L=2.5×103 inner samples, N=105 outer samples.

B.3 Advertising

We introduce a LogNormal likelihood and a D-dimensional latent variable 𝜽 governed by a Normal prior, the joint density of our model is

p(𝒚,𝜽|𝝃)=𝒩(𝒚|𝜽𝝃,σ2𝝃)𝒩(𝜽|𝟎,𝚲0) (74)

where σ controls the observation noise and 𝚲0 is a non-diagonal precision matrix. Since there are correlations among the D regions, the optimal advertising budget (w.r.t. gaining information about 𝜽) allocates more money to the regions that are tightly correlated.

Throughout we assume that the number of regions D is even. We set the budget to scale with the number of dimensions, B=D2, set σ=1 and choose the prior precision matrix to be

𝚲0=(1+1D)𝕀D-1D𝒖𝒖T  𝒖T(α,,α,1,,1)

where the first (last) D2 components of 𝒖 are given by α (1), respectively, and where α=0.1 controls (as we shall see) the degree of asymmetry in the optimal design. Discarding an irrelevant constant, we can compute the exact EIG using the formula:

I(𝝃)=12logdet𝚲post  𝚲post=𝚲0+1σ2diag(𝝃)

Using the matrix determinant lemma for rank-1 matrix updates we can then compute

logdet𝚲post=i=1Dlog(1+1D+ξi)+log(1-i=1D2{α21+1D+ξi}-i=1+D2D{11+1D+ξi})

By symmetry the optimum (it is easy to check that it is a maximum) of EIG(𝝃) will satisfy ξi=ξi+1 for i=1,,D2-1,D2+1,,D. In other words 𝝃 is entirely specified by ξ1 and ξD, which must satisfy ξ1+ξD=1 because of the constraint on the budget B=D2. Thus we have reduced the EIG maximization problem to a univariate optimization problem that can easily be solved to machine precision, for example by gradient methods or brute force bisection.

For each of the four methods, we fix the computational budget to 120 seconds per design optimization. For the gradient-based methods this corresponds to 10×103, 20×103, and 18×103 gradient steps for ACE, PCE, and Posterior, respectively. For the Bayesian Optimization baseline, we run 110 steps of a GP-UCB-like algorithm (srinivas2009gaussian) in batch-mode, resulting in a total budget of 1650 function evaluations of the EIG oracle. Note that for all four methods the runtime dependence on the dimension D is negligible in the regime in which we are operating; consequently we use the same number of gradient or BO steps for all D.

For the gradient-based methods, we use the kingma2014adam optimizer with default momentum hyperparameters and an initial learning rate of 0=0.1 that is exponentially decayed towards a final learning rate f that depends on the particular method. In particular we set f=1×10-4, f=1×10-5, and f=3×10-4 for the ACE, PCE, and Posterior methods, respectively. For the BO baseline, we used a Matern kernel with a fixed length scale =0.2. These hyperparameters were chosen by running a grid search with D=16 and choosing hyperparameters that minimized the mean absolute EIG error.

Finally we note that in Fig. 3 at each dimension D we normalize the EIG by the factor

Z=EIG(𝝃*)-EIG(𝝃uniform) (75)

where 𝝃* and 𝝃uniform are the optimal and uniform budget designs, respectively. Consequently after normalization the absolute error for the uniform budget design 𝝃uniform is equal to unity.

B.4 Biomolecular docking

For the docking model, we used the following independent priors

top Beta(25,75) (76)
bottom Beta(4,96) (77)
ee50 N(-50,152) (78)
slope N(-0.15,0.12). (79)

For the design ξ=(ξ1,,ξ100) we had 100 binary responses

yiBern(bottom+top-bottom1+e-(ξi-ee50)×slope). (80)

For gradient methods, we used the Adam optimizer (kingma2014adam) with learning rate 10-3 and default momentum parameters. For each method, we took 5×105 gradient steps (each method converged within this number of steps). The inference network was mean-field with the same distributional families as the prior. We used the following neural architecture

Operation Size Activation
Input H1 64 ReLU
H1 H2 64 ReLU
H2 top 2 Softplus
H2 bottom 2 Softplus
H2 ee50 mean 1 -
H2 ee50 s.d. 1 Softplus
H2 slope mean 1 -
H2 slope s.d. 1 Softplus

We used the following number of samples

Method Inner samples L Outer samples N
ACE 10 10
PCE 10 10
BA n/a 100

For the expert method, the design of lyu2019ultra, which comprised 580 compounds, was subsampled to comprise 100 compounds for a fair comparison.

For evaluation, we used ACE/VNMC, first training ACE for 25000 steps using the same learning rate as above. With the fixed inference network, we made ACE and VNMC evaluations using L=2×103 inner samples, N=4×106 outer samples—a total of 9 billion samples.

B.5 Constant elasticity of substitution

We used the exact set-up of foster2019variational. Specifically, we take U(𝐱)=(ixiραi)1/ρ and place the following priors on ρ,𝜶,u

ρ Beta(1,1) (81)
𝜶 Dirichlet([1,1,1]) (82)
logu N(1,3) (83)
η N(u(U(𝐱)-U(𝐱),0.0052u2(1+𝐱-𝐱)2) (84)
y =f(η) (85)

where f is the censored sigmoid function. All designs ξ=(𝐱,𝐱) were constrained to [0,100]6.

For gradient methods, we used the Adam optimizer with learning rate 10-3 and default momentum parameters. To make the design process 120 seconds per step, we used the following number of gradient steps

Method Number of steps
ACE 1500
PCE 2500
BA 5000

We found that there was insufficient time to effectively train a neural network guide. Instead we used a mean-field variational family with the same distributional families as the prior, and a linear model using the following features: logit(y),log|logit(y)|,𝟏(y>0.5).

We used the following number of samples

Method Inner samples L Outer samples N
ACE 10 10
PCE 10 10
BA n/a 100

For the baseline, we used the marginal upper bound of foster2019variational with the same variational family used in that paper—an f-transformed Normal with additional point masses at the end-points. We used a GP with a Matern52 kernel, lengthscale 20, variance set from data, and a GP-UCB1 algorithm to make acquisitions which were done in batches of 8.

At each stage, the posterior was fitted using mean-field variational inference using the same distributional families as the prior.