Gradient-based Adaptive Markov Chain Monte Carlo

  • 2019-11-04 18:03:06
  • Michalis K. Titsias, Petros Dellaportas
  • 15

Abstract

We introduce a gradient-based learning method to automatically adapt Markovchain Monte Carlo (MCMC) proposal distributions to intractable targets. Wedefine a maximum entropy regularised objective function, referred to asgeneralised speed measure, which can be robustly optimised over the parametersof the proposal distribution by applying stochastic gradient optimisation. Anadvantage of our method compared to traditional adaptive MCMC methods is thatthe adaptation occurs even when candidate state values are rejected. This is ahighly desirable property of any adaptation strategy because the adaptationstarts in early iterations even if the initial proposal distribution is farfrom optimum. We apply the framework for learning multivariate random walkMetropolis and Metropolis-adjusted Langevin proposals with full covariancematrices, and provide empirical evidence that our method can outperform otherMCMC algorithms, including Hamiltonian Monte Carlo schemes.

 

Quick Read (beta)

Gradient-based Adaptive Markov Chain Monte Carlo

Michalis K. Titsias
DeepMind
London, UK
[email protected] &Petros Dellaportas
Department of Statistical Science
University College of London, UK
Department of Statistics, Athens
Univ. of Econ. and Business, Greece
and The Alan Turing Institute, UK
Abstract

We introduce a gradient-based learning method to automatically adapt Markov chain Monte Carlo (MCMC) proposal distributions to intractable targets. We define a maximum entropy regularised objective function, referred to as generalised speed measure, which can be robustly optimised over the parameters of the proposal distribution by applying stochastic gradient optimisation. An advantage of our method compared to traditional adaptive MCMC methods is that the adaptation occurs even when candidate state values are rejected. This is a highly desirable property of any adaptation strategy because the adaptation starts in early iterations even if the initial proposal distribution is far from optimum. We apply the framework for learning multivariate random walk Metropolis and Metropolis-adjusted Langevin proposals with full covariance matrices, and provide empirical evidence that our method can outperform other MCMC algorithms, including Hamiltonian Monte Carlo schemes.

 

Gradient-based Adaptive Markov Chain Monte Carlo


  Michalis K. Titsias DeepMind London, UK [email protected] Petros Dellaportas Department of Statistical Science University College of London, UK Department of Statistics, Athens Univ. of Econ. and Business, Greece and The Alan Turing Institute, UK

\@float

noticebox[b]33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\[email protected]

1 Introduction

Markov chain Monte Carlo (MCMC) is a family of algorithms that provide a mechanism for generating dependent draws from arbitrarily complex distributions. The basic set up of an MCMC algorithm in any probabilistic (e.g. Bayesian) inference problem, with an intractable target density π(x), is as follows. A discrete time Markov chain {Xt}t=0 with transition kernel Pθ, appropriately chosen from a collection of π-invariant kernels {Pθ(,)}θΘ, is generated and the ergodic averages μt(F)=t-1i=0t-1F(Xi) are used as approximations to Eπ(F) for any real-valued function F. Although in principle this sampling setup is simple, the actual implementation of any MCMC algorithm requires careful choice of Pθ because the properties of μt depend on θ. In common kernels that lead to random walk Metropolis (RWM), Metropolis-adjusted Langevin (MALA) or Hamiltonian Monte Carlo (HMC) algorithms the kernels Pθ are specified through an accept-reject mechanism in which the chain moves from time t to time t+1 by first proposing candidate values y from a density qθ(y|x) and accepting them with some probability α(xt,y) and setting xt+1=y, or rejecting them and setting xt+1=xt. Since θ directly affects this acceptance probability, it is clear that one should choose θ so that the chain does not move too slowly or rejects too many proposed values y because in both these cases convergence to the stationary distribution will be slow. This has been recognised as early as in (metropolis1953equation, ) and has initiated exciting research that has produced optimum average acceptance probabilities for a series of algorithms; see (roberts1997weak, ; roberts1998optimal, ; roberts2001optimal, ; haario2005componentwise, ; bedard2007weak, ; bedard2008optimal, ; roberts2009examples, ; bedard2008efficient, ; rosenthal2011optimal, ; beskos2013optimal, ). Such optimal average acceptance probabilities provide basic guidelines for adapting single step size parameters to achieve certain average acceptance rates.

More sophisticated adaptive MCMC algorithms that can learn a full set of parameters θ, such as a covariance matrix, borrow information from the history of the chain to optimise some criterion reflecting the performance of the Markov chain (haario2001adaptive, ; atchade2005adaptive, ; roberts2007coupling, ; giordani2010adaptive, ; andrieu2006ergodicity, ; andrieu2007efficiency, ; atchade2009adaptive, ). Such methods are typically non gradient-based and the basic strategy they use is to sequentially fit the proposal qθ(y|x) to the history of states xt-1,xt,, by ignoring also the rejected state values. This can result in very slow adaptation because the initial Markov chain simulations are based on poor initial θ and the generated states, from which θ is learnt, are highly correlated and far from the target. The authors in roberts2009examples call such adaptive strategies ‘greedy’ in the sense that they try to adapt too closely to initial information from the output and take considerable time to recover from misleading initial information.

In this paper, we develop faster and more robust gradient-based adaptive MCMC algorithms that make use of the gradient of the target, logπ(x), and they learn from both actual states of the chain and proposed (and possibly rejected) states. The key idea is to define and maximise w.r.t. θ an entropy regularised objective function that promotes high acceptance rates and high values for the entropy of the proposal distribution. This objective function, referred to as generalised speed measure, is inspired by the speed measure of the infinite-dimensional limiting diffusion process that captures the notion of speed in which a Markov chain converges to its stationary distribution (roberts2001optimal, ). We maximise this objective function by applying stochastic gradient variational inference techniques such as those based on the reparametrisation trick (Kingma2014, ; Rezende2014, ; Titsias2014_doubly, ). An advantage of our algorithm compared to traditional adaptive MCMC methods is that the adaptation occurs even when candidate state values are rejected. In fact, the adaptation can be faster when candidate values y are rejected since then we make always full use of the gradient logπ(y) evaluated at the rejected y. This allows the adaptation to start in early iterations even if the initial proposal distribution is far from optimum and the chain is not moving. We apply the method for learning multivariate RWM and MALA proposals where we adapt full covariance matrices parametrised efficiently using Cholesky factors. In the experiments we demonstrate our algorithms to multivariate Gaussian targets and Bayesian logistic regression and empirically show that they outperform several other baselines, including advanced HMC schemes.

2 Gradient-based adaptive MCMC

Assume a target distribution π(x), known up to some unknown normalising constant, where xn is the state vector. To sample from π(x) we consider the Metropolis-Hastings (M-H) algorithm that generates new states by sampling from a proposal distribution qθ(y|x), that depends on parameters θ, and accepts or rejects each proposed state by using the standard M-H acceptance probability

α(x,y;θ)=min{1,π(y)qθ(x|y)π(x)qθ(y|x)}. (1)

While the M-H algorithm defines a Markov chain that converges to the target distribution, the efficiency of the algorithm heavily depends on the choice of the proposal distribution qθ(x|y) and the setting of its parameters θ.

Here, we develop a framework for stochastic gradient-based adaptation or learning of qθ(x|y) that maximises an objective function inspired by the concept of speed measure that underlies the theoretical foundations of MCMC optimal tuning (roberts1997weak, ; roberts1998optimal, ). Given that the chain is at state x we would like: (i) to propose big jumps in the state space and (ii) accept these jumps with high probability. By assuming for now that the proposal has the standard random walk isotropic form, such that qσ(y|x)=𝒩(y|x,σ2I), the speed measure is defined as

sσ(x)=σ2×α(x;σ), (2)

where σ2 denotes the variance, also called step size, of the proposal distribution, while α(x;σ) is the average acceptance probability when starting at x, i.e. α(x;σ)=α(x,y;σ)qσ(y|x)𝑑y. To learn a good value for the step size we could maximise the speed measure in Eq. 2, which intuitively promotes high variance for the proposal distribution together with high acceptance rates. In the theory of optimal MCMC tuning, sσ(x) is averaged under the stationary distribution π(x) to obtain a global speed measure value sσ=π(x)sσ(x)𝑑x. For simple targets and with increasing dimension this measure is maximised so that σ2 is set to a value that leads to the acceptance probability 0.234 (roberts1997weak, ; roberts1998optimal, ). This subsequently leads to the popular heuristic for tuning random walk proposals: tune σ2 so that on average the proposed states are accepted with probability 1/4. Similar heuristics have been obtained for tuning the step sizes of more advanced schemes such as MALA and HMC, where 0.54 is considered optimal for MALA (roberts2001optimal, ) and 0.67 for HMC (neal2010, ; beskos2013optimal, ).

While the current notion of speed measure from Eq. 2 is intuitive, it is only suitable for tuning proposals having a single step size. Thus, in order to learn arbitrary proposal distributions qθ(y|x), where θ is a vector of parameters, we need to define suitable generalisations of the speed measure. Suppose, for instance, that we wish to tune a Gaussian with a full covariance matrix, i.e. qΣ(y|x)=𝒩(y|x,Σ). To achieve this we need to generalise the objective in Eq. 2 by replacing σ2 with some functional (Σ) that depends on the full covariance. An obvious choice is to consider the average distance ||y-x||2 given by the trace tr(Σ)=iσi2. However, this is problematic since it can lead to learning proposals with very poor mixing. To see this note that since the trace is the sum of variances it can obtain high values even when some of the components of x have very low variance, e.g. for some xi it holds σi20, which can result in very low sampling efficiency or even non-ergodicity. In order to define better functionals (Σ) we wish to exploit the intuition that for MCMC all components of x need to jointly perform (relative to their scale) big jumps, a requirement that is better captured by the determinant |Σ| or more generally by the entropy of the proposal distribution.

Therefore, we introduce a generalisation of the speed measure having the form,

sθ(x)=exp{βqθ(y|x)}×α(x;θ)=exp{βqθ(y|x)}×α(x,y;θ)qθ(y|x)𝑑y, (3)

where qθ(y|x)=-qθ(y|x)logqθ(y|x)𝑑y denotes the entropy of the proposal distribution and β>0 is an hyperparameter. Note that when the proposal distribution is a full Gaussian qΣ(y|x)=𝒩(y|x,Σ) then exp{βq(y|x)}=const×|Σ|β2 which depends on the determinant of Σ. sθ(x), referred to as generalised speed measure, trades off between high entropy of the proposal distribution and high acceptance probability. The hyperparameter β plays the crucial role of balancing the relative strengths of these terms. As discussed in the next section we can efficiently optimise β in order to achieve a desirable average acceptance rate.

In the following we make use of the above generalised speed measure to derive a variational learning algorithm for adapting the parameters θ using stochastic gradient-based optimisation.

2.1 Maximising the generalised speed measure using variational inference

During MCMC iterations we collect the pairs of vectors (xt,yt)t>0 where xt is the state of the chain at time t and yt the corresponding proposed next state, which if accepted then xt+1=yt. When the chain has converged each xt follows the stationary distribution π(x), otherwise it follows some distribution that progressively converges to π(x). In either case we view the sequence of pairs (xt,yt) as non-iid data based on which we wish to perform gradient-based updates of the parameters θ. In practice such updates can be performed with diminishing learning rates, or more safely completely stop after some number of burn-in iterations to ensure convergence. Specifically, given the current state xt we wish to take a step towards maximising sθ(xt) in Eq. 3 or equivalently its logarithm,

logsθ(xt)=logα(x,y;θ)qθ(y|xt)𝑑y+βqθ(y|xt). (4)

The second term is just the entropy of the proposal distribution, which typically will be analytically tractable, while the first term involves an intractable integral. To approximate the first term we work similarly to variational inference (Jordan1999learn, ; Bishop2006, ) and we lower bound it using Jensen’s inequality,

logsθ(xt)θ(xt) =qθ(y|xt)logmin{1,π(y)qθ(xt|y)π(xt)qθ(y|xt)}𝑑y+βqθ(y|xt) (5)
=qθ(y|xt)min{0,logπ(y)π(xt)+logqθ(xt|y)qθ(y|xt)}𝑑y+βqθ(y|xt). (6)

To take a step towards maximising θ we can apply standard stochastic variational inference techniques such as the score function method or the reparametrisation trick (Carbonetto2009, ; Paisley2012, ; Ranganath2014, ; Kingma2014, ; Rezende2014, ; Titsias2014_doubly, ; Kucukelbir2017, ). Here, we focus on the case qθ(y|xt) is a reparametrisable distribution such that y=𝒯θ(xt,ϵ) where 𝒯θ is a deterministic transformation and ϵp(ϵ). θ(xt) can be re-written as

θ(xt) =p(ϵ)min{0,logπ(𝒯θ(xt,ϵ))π(xt)+logqθ(xt|𝒯θ(xt,ϵ))qθ(𝒯θ(xt,ϵ)|xt)}𝑑ϵ+βqθ(y|xt).

Since MCMC at the t-th iteration proposes a specific state yt constructed as ϵtp(ϵt), yt=𝒯θ(xt,ϵt), an unbiased estimate of the exact gradient θθ(xt) can be obtained by

θθ(xt,ϵt)=θmin{0,logπ(𝒯θ(xt,ϵt))π(xt)+logqθ(xt|𝒯θ(xt,ϵt))qθ(𝒯θ(xt,ϵt)|xt)}+βθqθ(y|xt),

which is used to make a gradient update for the parameters θ. Note that the first term in the stochastic gradient is analogous to differentiating through a rectified linear hidden unit (ReLu) in neural networks, i.e. if logπ(yt)π(xt)+logqθ(xt|yt)qθ(yt|xt)0 the gradient is zero (this is the case when yt is accepted with probability one), while otherwise the gradient of the first term reduces to

θlogπ(𝒯θ(xt,ϵt))+θlogqθ(xt|𝒯θ(xt,ϵt))qθ(𝒯θ(xt,ϵt)|xt).

The value of the hyperparameter β can allow to trade off between large acceptance probability and large entropy of the proposal distribution. Such hyperparameter cannot be optimised by maximising the variational objective θ (this typically will set β to a very small value so that the acceptance probability becomes close to one but the chain is not moving since the entropy is very low). Thus, β needs to be updated in order to control the average acceptance probability of the chain in order to achieve a certain desired value α*. The value of α* can be determined based on the specific MCMC proposal we are using and by following standard recommendations in the literature, as discussed also in the previous section. For instance, when we use RWM α* can be set to 1/4 (see Section 2.2), while for gradient-based MALA (see Section 2.3) α* can be set to 0.55.

Pseudocode for the general procedure is given by Algorithm 1. We set the learning rate ρt using RMSProp (Tieleman2012, ); at each iteration t we set ρt=η/(1+Gt), where η is the baseline learning rate, and the updates of Gt depend on the gradient estimate θθ(xt,ϵt) as Gt=0.9Gt+0.1[θθ(xt,ϵt)]2.

Algorithm 1 Gradient-based Adaptive MCMC
  Input: target π(x); reparametrisable proposal qθ(y|x) s.t. y=𝒯θ(x,ϵ), ϵp(ϵ); initial x0; desired average acceptance probability α*.
  Initialise θ, β=1.
  for t=1,2,3,, do
     : Propose ϵtp(ϵt), yt=𝒯θ(xt,ϵt).
     : Adapt θ: θθ+ρtθθ(xt,ϵt).
     : Accept or reject yt using the standard M-H ratio to obtain xt+1.
     : Set αt=1 if yt was accepted and αt=0 otherwise.
     : Adapt hyperparameter β: ββ[1+ρβ(αt-α*)] # default value for ρβ=0.02.
  end for

2.2 Fitting a full covariance Gaussian random walk proposal

We now specialise to the case the proposal distribution is a random walk Gaussian qL(y|x)=𝒩(y|x,LL) where the parameter L is a positive definite lower triangular matrix, i.e. a Cholesky factor. This distribution is reparametrisable since y𝒯L(x,ϵ)=x+Lϵ, ϵ𝒩(ϵ|0,I). At the t-th iteration when the state is xt the lower bound becomes

L(xt) =𝒩(ϵ|0,I)min{0,logπ(xt+Lϵ)-logπ(xt)}𝑑ϵ+βi=1nlogLii+const. (7)

Here, the proposal distribution has cancelled out from the M-H ratio, since it is symmetric, while qθ(y|xt)=log|L|+const and log|L|=i=1nlogLii. By making use of the MCMC proposed state yt=xt+Lϵt we can obtain an unbiased estimate of the exact gradient LL(xt),

LL(xt,ϵt)={[ytlogπ(yt)×ϵt]lower+βdiag(1L11,,1Lnn),iflogπ(yt)<logπ(xt)βdiag(1L11,,1Lnn),otherwise

where yt=xt+Lϵt, the operation [A]lower zeros the upper triangular part (above the main diagonal) of a squared matrix and diag(1L11,,1Lnn) is a diagonal matrix with elements 1/Lii. The first case of this gradient, i.e. when logπ(yt)<logπ(xt), has a very similar structure with the stochastic reparametrisation gradient when fitting a variational Gaussian approximation (Kingma2014, ; Rezende2014, ; Titsias2014_doubly, ) with the difference that here we centre the corresponding approximation, i.e. the proposal qL(yt|xt), at the current state xt instead of having a global variational mean parameter. Interestingly, this first case when MCMC rejects many samples (or even it gets stuck at the same value for long time) is when learning can be faster since the term ytlogπ(yt)×ϵt transfers information about the curvature of the target to the covariance of the proposal. When we start getting high acceptance rates the second case, i.e. logπ(yt)logπ(xt), will often be activated so that the gradient will often reduce to only having the term βdiag(1L11,,1Lnn) that encourages the entropy of the proposal to become large. The ability to learn from rejections is in sharp contrast with the traditional non gradient-based adaptive MCMC methods which can become very slow when MCMC has high rejection rates. This is because these methods typically learn from the history of state vectors xt by ignoring the information from the rejected states. The algorithm for learning the full random walk Gaussian follows precisely the general structure of Algorithm 1. For the average acceptance rate α* we use the value 1/4.

2.3 Fitting a full covariance MALA proposal

Here, we specialise to a full covariance, also called preconditioned, MALA of the form qL(y|x)=𝒩(y|x+(1/2)LLxlogπ(x),LL) where the covariance matrix is parametrised by the Cholesky factor L. Again this distribution is reparametrisable according to y𝒯L(x,ϵ)=x+(1/2)LLlogπ(x)+Lϵ, ϵ𝒩(ϵ|0,I). At the t-th iteration when the state is xt the reparametrised lower bound simplifies significantly and reduces to,

L(xt) =𝒩(ϵ|0,I)min{0,logπ(xt+(1/2)LLlogπ(xt)+Lϵ)-logπ(xt).
. -12(||(1/2)L[logπ(xt)+logπ(y)]+ϵ||2-||ϵ||2)}dϵ+βi=1nlogLii+const,

where |||| denotes Euclidean norm and in the term logπ(y), y=xt+(1/2)LLlogπ(xt)+Lϵ. Then, based on the proposed state yt=𝒯L(xt,ϵt) we can obtain the unbiased gradient estimate L(xt,ϵt) similarly to the previous section. In general, such an estimate can be very expensive because the existence of L inside logπ(yt) means that we need to compute the matrix of second derivatives or Hessian logπ(yt). We have found that an alternative procedure which stops the gradient inside logπ(yt) (i.e. it views logπ(yt) as a constant w.r.t. L) has small bias and works well in practice. In fact, as we will show in the experiments this approximation not only is computationally much faster but remarkably also it leads to better adaptation compared to the exact Hessian procedure, presumably because by not accounting for the gradient inside logπ(yt) reduces the variance. Furthermore, the expression of the gradient w.r.t. L used by this fast approximation can be computed very efficiently with a single O(n2) operation (an outer vector product; see Supplement), while each iteration of the algorithm requires overall at most four O(n2) operations. For these gradient-based adaptive MALA schemes, β in Algorithm 1 is adapted to obtain an average acceptance rate roughly α*=0.55.

3 Related Work

Connection of our method with traditional adaptive MCMC methods has been discussed in Section 1. Here, we analyse additional related works that make use of gradient-based optimisation and specialised objective functions or algorithms to train MCMC proposal distributions.

The work in levy2018generalizing proposed a criterion to tune MCMC proposals based on maximising a modified version of the expected squared jumped distance, qθ(y|xt)||y-xt||2α(xt,y;θ)𝑑y, previously considered in pasarica2010adaptively . Specifically, the authors in levy2018generalizing firstly observe that the expected squared jumped distance may not encourage mixing across all dimensions of x11 1 Because the additive form of ||y-xt||2=i(yi-xti)2 implies that even when some dimensions might not be moving at all (the corresponding distance terms are zero), the overall sum can still be large. and then try to resolve this by including a reciprocal term (see Section 4.2 in their paper). The generalised speed measure proposed in this paper is rather different from such criteria since it encourages joint exploration of all dimensions of x by applying maximum entropy regularisation, which by construction penalises "dimensions that do not move" since the entropy becomes minus infinity in such cases. Another important difference is that in our method the optimisation is performed in the log space by propagating gradients through the logarithm of the M-H acceptance probability, i.e. through logα(xt,y;θ) and not through α(xt,y;θ). This is exactly analogous to other numerically stable objectives such as variational lower bounds and log likelihoods, and as those our method leads to numerically stable optimisation for arbitrarily large dimensionality of x and complex targets π(x).

In another related work, the authors in neklyudov2018metropolis considered minimising the KL divergence KL[π(xt)qθ(yt|xt)||π(yt)qθ(xt|yt)]. However, this loss for standard proposal schemes, such as RWM and MALA, leads to degenerate deterministic solutions where qθ(yt|xt) collapses to a delta function. Therefore, neklyudov2018metropolis maximised this objective for the independent M-H sampler where the collapsing problem does not occur. The entropy regularised objective we introduced is different and it can adapt arbitrary MCMC proposal distributions, and not just the independent M-H sampler.

There has been also work to learn flexible MCMC proposals using neural networks (song2017nice, ; levy2018generalizing, ; habib2018auxiliary, ; Salimans2015, ). For instance, song2017nice use volume preserving flows and an adversarial objective, levy2018generalizing use the modified expected jumped distance, discussed earlier, to learn neural network-based extensions of HMC, while habib2018auxiliary ; Salimans2015 use auxiliary variational inference. The need to train neural networks can add a significant computational cost, and from the end-user point of view these neural adaptive samplers might be hard to tune especially in high dimensions. Notice that the generalised speed measure we proposed in this paper could possibly be used to train neural adaptive samplers as well. However, to really obtain practical algorithms we need to ensure that training has small cost that does not overwhelm the possible benefits in terms of effective sample size.

Finally, the generalised speed measure that is based on entropy regularisation shares similarities with other used objectives for learning probability distributions, such as in variational Bayesian inference, where the variational lower bound includes an entropy term (Jordan1999learn, ; Bishop2006, ) and reinforcement learning (RL) where maximum-entropy regularised policy gradients are able to estimate more explorative policies (schulman15, ; mniha16, ). Further discussion on the resemblance of our algorithm with RL is given in the Supplement.

4 Experiments

We test the gradient-based adaptive MCMC methods in several simulated and real data. We investigate the performance of two instances of the framework: the gradient-based adaptive random walk (gadRWM) detailed in Section 2.2 and the corresponding MALA (gadMALA) detailed in Section 2.3. For gadMALA we consider the exact reparametrisation method that requires the evaluation of the Hessian (gadMALAe) and the fast approximate variant (gadMALAf). These schemes are compared against: (i) standard random walk Metropolis (RWM) with proposal 𝒩(y|x,σ2I), (ii) an adaptive MCMC (AM) that fits a proposal of the form 𝒩(y|x,Σ) (we consider a computational efficient version based on updating the Cholesky factor of Σ; see Supplement), (iii) a standard MALA proposal 𝒩(y|x+(1/2)σ2logπ(x),σ2I), (iv) an Hamiltonian Monte Carlo (HMC) with a fixed number of leap frog steps (either 5, or 10, or 20) (v) and the state of the art no-U-turn sampler (NUTS) (hoffman2014no, ) which arguably is the most efficient adaptive HMC method that automatically determines the number of leap frog steps. We provide our own MALTAB implementation22 2 https://github.com/mtitsias/gadMCMC. of all algorithms, apart from NUTS for which we consider a publicly available implementation.

4.1 Illustrative experiments

To visually illustrate the gradient-based adaptive samplers we consider a correlated 2-D Gaussian target with covariance matrix Σ=[1 0.99;0.99 1] and a 51-dimensional Gaussian target obtained by evaluating the squared exponential kernel plus small white noise, i.e. k(xi,xj)=exp{-12(xi-xj)20.16}+0.01δi,j, on the regular grid [0,4]. The first two panels in Figure 1 show the true covariance together with the adapted covariances obtained by gadRWM for two different settings of the average acceptance rate α* in Algorithm 1, which illustrates also the adaptation of the entropy-regularisation hyperparameter β that is learnt to obtain a certain α*. The remaining two plots illustrate the ability to learn a highly correlated 51-dimensional covariance matrix (with eigenvalues ranging from 0.01 to 12.07) by applying our most advanced gadMALAf scheme.

Figure 1: The green contours in the first two panels (from left to right) show the 2-D Gaussian target, while the blue contours show the learned covariance, LL, after adapting for 2×104 iterations using gadRWM and targeting acceptance rates α*=0.25 and α*=0.4, respectively. For α*=0.25 the adapted blue contours show that the proposal matches the shape of the target but it has higher entropy/variance and the hyperparameter β obtained the value 7.4. For α*=0.4 the blue contours shrink a bit and β is reduced to 2.2 (since higher acceptance rate requires smaller entropy). The third panel shows the exact 51×51 covariance matrix and the last panel shows the adapted one, after running our most efficient gadMALAf scheme for 2×105 iterations. In both experiments L was initialised to diagonal matrix with 0.1/n in the diagonal.

4.2 Quantitative results

Here, we compare all algorithms in some standard benchmark problems, such as Bayesian logistic regression, and report effective sample size (ESS) together with other quantitative scores.

Figure 2: Panels in the first row show trace plots, obtained by different schemes, across the last 2×104 sampling iterations for the most difficult to sample x100 dimension. The panels in the second row show the estimated values of the diagonal of L obtained by different adaptive schemes. Notice that the real Gaussian target has diagonal covariance matrix Σ=diag(s12,,s1002) where si are uniform in the range [0.01,1].
Table 1: Comparison in Neal’s Gaussian example (dimensionality was n=100; see panel above) and Caravan binary classification dataset where the latter consists of 5822 data points (dimensionality was n=87; see panel below). All numbers are averages across ten repeats where also one-standard deviation is given for the Min ESS/s score. From the three HMC schemes we report only the best one in each case.
Method Time(s) Accept Rate ESS (Min, Med, Max) Min ESS/s (1 st.d.)
(Neal’s Gaussian)
gadMALAf 8.7 0.556 (1413.4, 1987.4, 2580.8) 161.70 (15.07)
gadMALAe 10.0 0.541 (922.2, 2006.3, 2691.1) 92.34 (7.11)
gadRWM 7.0 0.254 (27.5, 66.9, 126.9) 3.95 (0.66)
AM 2.3 0.257 (8.7, 48.6, 829.1) 3.71 (0.87)
RWM 2.2 0.261 (2.9, 8.4, 2547.6) 1.31 (0.06)
MALA 3.1 0.530 (2.9, 10.0, 12489.2) 0.95 (0.03)
HMC-20 49.7 0.694 (306.1, 1537.8, 19732.4) 6.17 (3.35)
NUTS 360.5 >0.7 (18479.6, 20000.0, 20000.0) 51.28 (1.64)
(Caravan)
gadMALAf 23.1 0.621 (228.1, 750.3, 1114.7) 9.94 (2.64)
gadMALAe 95.1 0.494 (66.6, 508.3, 1442.7) 0.70 (0.16)
gadRWM 22.6 0.234 (5.3, 34.3, 104.5) 0.23 (0.06)
AM 20.0 0.257 (3.2, 11.8, 62.5) 0.16 (0.01)
RWM 15.3 0.242 (3.0, 9.3, 52.5) 0.20 (0.03)
MALA 22.8 0.543 (4.4, 28.3, 326.0) 0.19 (0.05)
HMC-10 225.5 0.711 (248.3, 2415.7, 19778.7) 1.10 (0.12)
NUTS 1412.1 >0.7 (7469.5, 20000.0, 20000.0) 5.29 (0.38)

Experimental settings. In all experiments for AM and gradient-based adaptive schemes the Cholesky factor L was initialised to a diagonal matrix with values 0.1/n in the diagonal where n is the dimensionality of x. For the AM algorithm the learning rate was set to 0.001/(1+t/T) where t is the number of iterations and T (the value 4000 was used in all experiments) serves to keep the learning rate nearly constant for the first T training iterations. For the gradient-based adaptive algorithms we use RMSprop (see Section 2.1) where η was set to 0.00005 for gadRWM and to 0.00015 for the gadMALA schemes. NUTS uses its own fully automatic adaptive procedure that determines both the step size and the number of leap frog steps (hoffman2014no, ). For all experiments and algorithms (apart from NUTS) we consider 2×104 burn-in iterations and 2×104 iterations for collecting samples. This adaptation of L or σ2 takes place only during the burn-in iterations and then it stops, i.e. at collection of samples stage these parameters are kept fixed. For NUTS, which has its own internal tuning procedure, 500 burn-in iterations are sufficient before collecting 2×104 samples. The computational time for all algorithms reported in the tables corresponds to the overall running time, i.e. the time for performing jointly all burn-in and collection of samples iterations.

Neal’s Gaussian target. We first consider an example used in (neal2010, ) where the target is a zero-mean multivariate Gaussian with diagonal covariance matrix Σ=diag(s12,,s1002) where the stds si take values in the linear grid 0.01,0.02,,1. This is a challenging example because the different scaling of the dimensions means that the schemes that use an isotropic step σ2 will be adapted to the smallest dimension x1 while the chain at the higher dimensions, such as x100, will be moving slowly exhibiting high autocorrelation and small effective sample size. The first row of Figure 3 shows the trace plot across iterations of the dimension x100 for some of the adaptive schemes including an HMC scheme that uses 20 leap frog steps. Clearly, the gradient-based adaptive methods show much smaller autocorrelation that AM. This is because they achieve a more efficient adaptation of the Cholesky factor L which ideally should become proportional to a diagonal matrix with the linear grid 0.01,0.02,,1 in the main diagonal. The second row of Figure 3 shows the diagonal elements of L from which we can observe that all gradient-based schemes lead to more accurate adaptation with gadMALAf being the most accurate.

Furthermore, Table 1 provides quantitative results such as minimum, median and maximum ESS computed across all dimensions of the state vector x, running times and an overall efficiency score Min ESS/s (i.e. ESS for the slowest moving component of x divided by running time – last column in the Table) which allows to rank the different algorithms. All results are averages after repeating the simulations 10 times under different random initialisations. From the table it is clear that the gadMALA algorithms give the best performance with gadMALAf being overall the most effective.

Bayesian logistic regression. We consider binary classification where given a set of training examples {yi,si}i=1n we assume a logistic regression likelihood p(y|w,s)=i=1nyilogσ(si)+(1-yi)log(1-σ(si)), where σ(si)=1/(1+exp(-wsi)), si is the input vector and w the parameters. We place a Gaussian prior on w and we wish to sample from the posterior distribution over w. We considered six binary classification datasets (Australian Credit, Heart, Pima Indian, Ripley, German Credit and Caravan) with a number of examples ranging from n=250 to n=5822 and dimensionality of the state/parameter vector ranging from 3 to 87. Table 1 shows results for the most challenging Caravan dataset where the dimensionality of w is 87, while the remaining five tables are given in the Supplement. From all tables we observe that the gadMALAf is the most effective and it outperforms all other methods. While NUTS has always very high ESS is still outperformed by gadMALAf because of the high computational cost, i.e. NUTS might need to use a very large number of leap frog steps (each requiring re-evaluating the gradient of the log target) per iteration. Further results, including a higher 785-dimensional example on MNIST, are given in the Supplement.

5 Conclusion

We have presented a new framework for gradient-based adaptive MCMC that makes use of an entropy-regularised objective function that generalises the concept of speed measure. We have applied this method for learning RWM and MALA proposals with full covariance matrices.

Some topics for future research are the following. Firstly, to deal with very high dimensional spaces it would be useful to consider low rank parametrisations of the covariance matrices in RWM and MALA proposals. Secondly, it would be interesting to investigate whether our method can be used to tune the so-called mass matrix in HMC samplers. However, in order for this to lead to practical and scalable algorithms we have to come up with schemes that avoid the computation of the Hessian, as we successfully have done for MALA. Finally, in order to reduce the variance of the stochastic gradients and speed up further the adaptation, especially in high dimensions, our framework could be possibly combined with parallel computing as used for instance in deep reinforcement learning espeholt18a .

References

  • [1] Christophe Andrieu and Yves Atchade. On the efficiency of adaptive mcmc algorithms. Electronic Communications in Probability, 12:336–349, 2007.
  • [2] Christophe Andrieu and Éric Moulines. On the ergodicity properties of some adaptive mcmc algorithms. The Annals of Applied Probability, 16(3):1462–1505, 2006.
  • [3] Christophe Andrieu and Johannes Thoms. A tutorial on adaptive mcmc. Statistics and Computing, 18(4):343–373, December 2008.
  • [4] Yves Atchade, Gersende Fort, Eric Moulines, and Pierre Priouret. Adaptive markov chain monte carlo: theory and methods. Preprint, 2009.
  • [5] Yves F Atchadé and Jeffrey S Rosenthal. On adaptive markov chain monte carlo algorithms. Bernoulli, 11(5):815–828, 2005.
  • [6] Mylène Bédard. Weak convergence of metropolis algorithms for non-iid target distributions. The Annals of Applied Probability, pages 1222–1244, 2007.
  • [7] Mylène Bédard. Efficient sampling using metropolis algorithms: Applications of optimal scaling results. Journal of Computational and Graphical Statistics, 17(2):312–332, 2008.
  • [8] Mylene Bedard. Optimal acceptance rates for metropolis algorithms: Moving beyond 0.234. Stochastic Processes and their Applications, 118(12):2198–2222, 2008.
  • [9] Alexandros Beskos, Natesh Pillai, Gareth Roberts, Jesus-Maria Sanz-Serna, and Andrew Stuart. Optimal tuning of the hybrid monte carlo algorithm. Bernoulli, 19(5A):1501–1534, 2013.
  • [10] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
  • [11] P. Carbonetto, M. King, and F. Hamze. A stochastic approximation method for inference in probabilistic graphical models. In Advances in Neural Information Processing Systems, 2009.
  • [12] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1407–1416, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  • [13] Paolo Giordani and Robert Kohn. Adaptive independent metropolis–hastings by fast estimation of mixtures of normals. Journal of Computational and Graphical Statistics, 19(2):243–259, 2010.
  • [14] Heikki Haario, Eero Saksman, and Johanna Tamminen. An adaptive metropolis algorithm. Bernoulli, 7(2):223–242, 2001.
  • [15] Heikki Haario, Eero Saksman, and Johanna Tamminen. Componentwise adaptation for high dimensional mcmc. Computational Statistics, 20(2):265–273, 2005.
  • [16] Raza Habib and David Barber. Auxiliary variational mcmc. To appear at ICLR 2019, 2019.
  • [17] Matthew D Hoffman and Andrew Gelman. The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
  • [18] M. I. Jordan, editor. Learning in Graphical Models. MIT Press, Cambridge, MA, USA, 1999.
  • [19] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
  • [20] A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1–45, 2017.
  • [21] Daniel Levy, Matt D. Hoffman, and Jascha Sohl-Dickstein. Generalizing hamiltonian monte carlo with neural networks. In International Conference on Learning Representations, 2018.
  • [22] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092, 1953.
  • [23] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.
  • [24] Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113–162, 2010.
  • [25] Kirill Neklyudov, Pavel Shvechikov, and Dmitry Vetrov. Metropolis-hastings view on variational inference and adversarial training. arXiv preprint arXiv:1810.07151, 2018.
  • [26] J. W. Paisley, D. M. Blei, and M. I. Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning, 2012.
  • [27] Cristian Pasarica and Andrew Gelman. Adaptively scaling the metropolis algorithm using expected squared jumped distance. Statistica Sinica, pages 343–364, 2010.
  • [28] R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In Artificial Intelligence and Statistics, 2014.
  • [29] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
  • [30] Gareth O Roberts, Andrew Gelman, and Walter R Gilks. Weak convergence and optimal scaling of random walk metropolis algorithms. The annals of applied probability, 7(1):110–120, 1997.
  • [31] Gareth O Roberts and Jeffrey S Rosenthal. Optimal scaling of discrete approximations to langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268, 1998.
  • [32] Gareth O Roberts and Jeffrey S Rosenthal. Optimal scaling for various metropolis-hastings algorithms. Statistical Science, pages 351–367, 2001.
  • [33] Gareth O Roberts and Jeffrey S Rosenthal. Coupling and ergodicity of adaptive markov chain monte carlo algorithms. Journal of applied probability, 44(2):458–475, 2007.
  • [34] Gareth O Roberts and Jeffrey S Rosenthal. Examples of adaptive mcmc. Journal of Computational and Graphical Statistics, 18(2):349–367, 2009.
  • [35] Jeffrey S Rosenthal. Optimal proposal distributions and adaptive mcmc. In Handbook of Markov Chain Monte Carlo, pages 114–132. Chapman and Hall/CRC, 2011.
  • [36] T. Salimans, D. P. Kingma, and M. Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, 2015.
  • [37] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
  • [38] Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-nice-mc: Adversarial training for mcmc. In Advances in Neural Information Processing Systems, pages 5140–5150, 2017.
  • [39] T. Tieleman and G. Hinton. Lecture 6.5-RMSPROP: Divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning, 4, 2012.
  • [40] M. K. Titsias and M. Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, 2014.

Supplement:

Gradient-based Adaptive Markov Chain Monte Carlo

Appendix A Gradient-based adaptive MCMC as Reinforcement Learning

A MDP is a tuple (𝒳,𝒴,𝒫,) where 𝒳 is the state space, 𝒴 is the action space, 𝒫 is the transition distribution with density p(xt+1|xt,yt) that describes how the next state xt+1 is generated given that currently we are at state xt and we take action yt. Further, the reward function R(xt,yt) provides some instantaneous or local signal about how good the action yt was when being at xt. Furthermore, in a MDP we have also a policy π(yt|xt) which is a distribution over actions given states and it fully describes the behaviour of the agent. Given that we start at x0 we wish to specify the policy so that to maximise future reward, such as the expected accumulated discounted reward

𝔼π[t=0γtR(xt,yt)],γ[0,1].

Suppose now a MCMC procedure targeting π(x), where x𝒳 is the state vector. Consider a proposal distribution qθ(y|x), such that the standard Metropolis-Hastings algorithm accepts each proposed state ytqθ(yt|xt) with probability

α(xt,yt;θ)=min(1,π(yt)π(xt)qθ(xt|yt)qθ(yt|xt)), (8)

so that xt+1=yt, while if the proposal is rejected, xt+1=xt. To reformulate MCMC as an MDP we make the following correspondences. Firstly, both the state xt and the action yt will live in the same space which will be the state space 𝒳 of the target distribution. The MCMC proposal qθ(yt|xt) will correspond to the policy π(yt|xt), while the environmental transition dynamics will be stochastic and given by the two-component mixture,

p(xt+1|xt,yt)=α(xt,yt;θ)δxt+1,yt+(1-α(xt,yt;θ))δxt+1,xt,

where δx,y denotes the delta function. This transition density simply says that the new state xt+1 with probability α(xt,yt;θ) will be equal to the proposed action yt, while with the remaining probability will be set to the previous state, i.e. xt+1=xt. Notice that the standard MCMC transition kernel Kθ(xt,xt+1) is obtained by integrating out the action yt, i.e.

Kθ(xt,xt+1) =p(xt+1|xt,yt)qθ(yt|xt)𝑑yt
=α(xt,xt+1)q(xt+1|xt)+(1-α(xt,yt)qθ(yt|xt)𝑑yt)δxt+1,xt. (9)

The final ingredient we need to reformulate MCMC as MDP is the reward function R(xt,yt). The gradient-based adaptive MCMC method essentially assumes as reward

R(yt,xt;θ)=logα(xt,yt;θ)-βlogqθ(yt|xt),

which is an entropy-regularised reward that promotes high exploration with the entropic term -βlogqθ(yt|xt). Gradient-based adaptive MCMC essentially at each step stochastically maximises the expected reward starting from state xt, i.e.

qθ(yt|xt)R(yt,xt;θ)𝑑yt.

While the above reformulates MCMC as a reinforcement learning (RL) problem, there are clearly also some differences with standard RL problems. Given that the reward R(yt,xt;θ) is very informative (we are not facing the delayed-reward problem commonly encountered in standard RL) gradient-based MCMC sets γ=0 in order to maximise immediate reward. Further, the transition dynamics p(xt+1|xt,yt) are known in MCMC, while this typically is not the case in standard RL. Finally, notice that the reward R(yt,xt;θ) as well as the transition dynamics p(xt+1|xt,yt) all depend on the parameter θ and the policy qθ(yt|xt), i.e. they depend on the MCMC proposal distribution.

Appendix B Further details about the algorithms

For the standard adaptive MCMC method (AM) we implemented a computational efficient version that requires no matrix decompositions (which are expensive due to the O(n3) scaling) by parametrising the proposal as 𝒩(y|x,LL) and updating the Cholesky factor in each iteration according to the updates

μμ+ρt(xt+1-μ),
LL+ρtL[L-1(xt+1-μ)(xt+1-μ)L--I]lower,

where μ tracks the global mean of the state vector. Further details about this scheme can be found in Section 5.1.1 in [3].

For our most efficient gadMALAf scheme the stochastic gradient in each iteration is

L(xt,ϵt) =Lmin{0,logπ(xt+(1/2)LLlogπ(xt)+Lϵt)-logπ(xt).
. -12(||(1/2)L[logπ(xt)+logπ(yt)]+ϵt||2-||ϵt||2)}+βLi=1nlogLii,

where, as discussed in the main paper, logπ(yt) is taken as constant w.r.t. L. Then the gradient of the M-H log ratio (when this log ratio is negative, since otherwise its gradient is zero) simplifies as

Llogπ(xt+(1/2)LLlogπ(xt)+Lϵt)-12L||(1/2)L[logπ(xt)+logπ(yt)]+ϵt||2
=-12(logπ(xt)-logπ(yt))((1/2)L[logπ(xt)-logπ(yt)]+ϵt)

and then take the lower triangular part. This is just an outer vector product that scales as O(n2). Overall each iteration of the algorithm can be implemented (plus the extra overhead of a single gradient evaluation logπ(yt) of the log target at the proposed state yt) by using at most four O(n2) operations during adaptation and exactly two O(n2) operations after burn-in, as shown in the released code.

Appendix C Extra results on the Neal’s Gaussian target

Figure 3 shows trace plot for the log density of the target for all different algorithms.

Figure 3: The evolution of the log-target across iterations for all algorithms in Neal’s Gaussian example.

Appendix D Extra results on the binary classification datasets

Tables 2-6 show the results for the remaining five binary classification datasets not reported in the main article. Figures 4 and 5 show trace plot of log density of the target acrorss all different algorithms for the German Credit and Caravan datasets. For the remaining datasets the corresponding plots are similar.

Table 2: Comparison of sampling methods in Australian Credit dataset consisted of 690 data points. The size of the state/parameter vector from which we draw samples was n=15.
Method Time(s) Accept Rate ESS (Min, Med, Max) Min ESS/s (1 st.d.)
gadMALAf 8.1 0.569 (3485.9, 4262.9, 4784.0) 443.97 (76.13)
gadMALAe 13.5 0.540 (3034.9, 4234.3, 4836.3) 227.61 (29.87)
gadRWM 7.7 0.253 (288.0, 423.0, 515.0) 38.68 (9.53)
AM 4.4 0.261 (310.9, 410.1, 507.2) 70.21 (6.23)
RWM 3.4 0.252 (31.3, 312.6, 495.2) 9.16 (3.12)
MALA 7.0 0.524 (138.4, 2388.1, 3818.8) 20.22 (5.14)
HMC-5 37.0 0.700 (1048.1, 3510.3, 14809.7) 28.06 (11.69)
NUTS 41.3 >0.7 (2995.2, 20000.0, 20000.0) 72.86 (7.31)
Table 3: Comparison of sampling methods in Ripley dataset consisted of 250 data points. The size of the state/parameter vector from which we draw samples was n=3.
Method Time(s) Accept Rate ESS (Min, Med, Max) Min ESS/s (1 st.d.)
gadMALAf 3.3 0.536 (8328.4, 8913.2, 9442.4) 2506.04 (143.47)
gadMALAe 4.9 0.543 (8446.7, 9006.6, 9595.6) 1713.44 (44.91)
gadRWM 3.1 0.068 (638.0, 736.9, 803.2) 205.99 (17.85)
AM 3.0 0.257 (1702.8, 1792.2, 1902.0) 570.19 (49.32)
RWM 2.1 0.252 (1129.2, 1627.8, 1979.8) 534.21 (43.54)
MALA 2.8 0.542 (2976.0, 5683.0, 9726.5) 1046.48 (54.86)
HMC-5 14.7 0.678 (9205.3, 10818.1, 16136.5) 626.55 (196.48)
NUTS 7.5 >0.7 (9436.2, 17463.5, 20000.0) 1265.99 (73.01)
Table 4: Comparison of sampling methods in Pima Indian dataset consisted of 532 data points. The size of the state/parameter vector from which we draw samples was n=8.
Method Time(s) Accept Rate ESS (Min, Med, Max) Min ESS/s (1 st.d.)
gadMALAf 4.6 0.545 (5407.6, 5810.3, 6467.6) 1176.12 (79.54)
gadMALAe 6.8 0.547 (5469.6, 5963.6, 6421.1) 801.03 (16.07)
gadRWM 4.2 0.267 (635.6, 760.0, 866.2) 150.70 (9.73)
AM 4.1 0.273 (612.7, 729.1, 854.8) 149.18 (10.40)
RWM 3.2 0.246 (354.6, 496.4, 709.6) 111.81 (6.16)
MALA 4.0 0.509 (1524.9, 2457.2, 3853.6) 377.17 (25.80)
HMC-5 20.3 0.711 (7295.7, 12798.7, 18267.4) 359.22 (103.55)
NUTS 15.2 >0.7 (15343.3, 18606.0, 20000.0) 1008.97 (42.33)
Table 5: Comparison of sampling methods in Heart dataset consisted of 270 data points. The size of the state/parameter vector from which we draw samples was n=14.
Method Time(s) Accept Rate ESS (Min, Med, Max) Min ESS/s (1 st.d.)
gadMALAf 4.1 0.551 (3892.9, 4362.7, 4784.2) 946.98 (56.10)
gadMALAe 6.4 0.560 (3832.4, 4372.3, 4845.6) 599.51 (30.00)
gadRWM 3.8 0.288 (342.5, 440.9, 536.1) 88.94 (10.29)
AM 3.2 0.238 (342.5, 425.5, 535.4) 106.97 (7.18)
RWM 2.3 0.266 (196.9, 314.3, 472.7) 86.57 (11.33)
MALA 3.5 0.530 (1429.7, 2310.6, 3260.4) 404.96 (18.57)
HMC-5 18.4 0.699 (1913.2, 5600.3, 11883.0) 103.81 (39.38)
NUTS 15.4 >0.7 (20000.0, 20000.0, 20000.0) 1295.13 (15.74)
Table 6: Comparison of sampling methods in German Credit dataset consisted of 1000 data points. The size of the state/parameter vector from which we draw samples was n=25.
Method Time(s) Accept Rate ESS (Min, Med, Max) Min ESS/s (1 st.d.)
gadMALAf 11.0 0.560 (2734.9, 3414.5, 3928.6) 252.91 (37.86)
gadMALAe 22.4 0.549 (2808.2, 3384.9, 3883.5) 126.00 (14.68)
gadRWM 10.4 0.248 (179.1, 252.5, 323.1) 17.92 (4.18)
AM 12.6 0.262 (121.9, 207.6, 308.0) 9.72 (1.25)
RWM 8.4 0.233 (45.0, 153.8, 298.7) 5.48 (1.83)
MALA 9.2 0.535 (420.1, 1313.2, 2573.7) 47.37 (10.22)
HMC-5 43.4 0.706 (3020.2, 10294.4, 20000.0) 71.62 (49.62)
NUTS 47.4 >0.7 (7737.2, 20000.0, 20000.0) 166.93 (30.57)
Figure 4: The evolution of the log-target across iterations for all algorithms in German Credit dataset.
Figure 5: The evolution of the log-target across iterations for all algorithms in Caravan dataset.

Appendix E Results on a higher dimensional example

We all tried a much larger Bayesian binary classification problem by taking all 11339 training examples of "5" and "6" MNIST digits which are 28×28 images and therefore the dimensionality of the parameter vector w was 785 (the plus one accounts for the bias term). For this larger example from the baselines we applied the gradient-based schemes, MALA, HMC and NUTS since the other methods become very inefficient. From the proposed schemes we applied the most efficient algorithm which is gadMALAf. Also because of the much higher dimensionality of this problem, which makes the stochastic optimisation over L harder, we had to decrease the baseline learning rate in the RMSprop schedule from 0.00015 to 0.00001. We also considered a larger adaptation phase consisted of 5×104 instead of 2×104. All other algorithms use the same experimental settings as described in the main paper. Figure 7 shows the evolution of the log-target densities for all sampling schemes while Table 7 gives ESS, computation times and other statistics.

We can observe that the performance of gadMALAf is reasonably good: it outperforms all methods apart form NUTS. NUTS is better in this example, but it takes around 22 hours to run (since it performs on average 550 gradient evaluations per sampling iteration). Finally, to visualise some part of the learned L found by gadMALAf, Figure 7 depicts the 784 diagonal elements of L as an 28×28 grey-scale image. Clearly, gadMALAf manages to perform a sort of feature selection, i.e. to discover that the border pixels in MNIST digits do not really take part in the classification, so it learns a much higher variance for those dimensions (close to the variance of the prior).

Figure 6: The evolution of the log-target across iterations for all algorithms in binary MNIST classification over "5" versus "6".
Table 7: Comparison of sampling methods in binary MNIST dataset, of "5" versus "6", consisted of 11339 data points. The size of the state/parameter vector from which we draw samples was n=785. All numbers are averages across five repeats where also one-standard deviation is given for the Min ESS/s score.
Method Time(s) Accept Rate ESS (Min, Med, Max) Min ESS/s (1 st.d.)
gadMALAf 779.3 0.575 (46.0, 128.7, 282.8) 0.059 (0.00)
MALA 311.8 0.530 (2.8, 5.9, 28.4) 0.009 (0.00)
HMC-5 1847.4 0.733 (4.5, 23.1, 162.7) 0.002 (0.00)
HMC-10 3381.3 0.589 (13.9, 66.5, 576.0) 0.004 (0.00)
HMC-20 6449.1 0.666 (77.8, 240.1, 2060.9) 0.012 (0.00)
NUTS 83232.1 >0.7 (18514.1, 20000.0, 20000.0) 0.223 (0.01)
Figure 7: The first 784 diagonal elements (i.e. excluding the bias component of x) of the full 785×785 Cholesky factor L found after 5×104 adapting iterations by gadMALAf. Brighter/white colour means larger values.