### Abstract

We introduce a gradient-based learning method to automatically adapt Markovchain Monte Carlo (MCMC) proposal distributions to intractable targets. Wedefine a maximum entropy regularised objective function, referred to asgeneralised speed measure, which can be robustly optimised over the parametersof the proposal distribution by applying stochastic gradient optimisation. Anadvantage of our method compared to traditional adaptive MCMC methods is thatthe adaptation occurs even when candidate state values are rejected. This is ahighly desirable property of any adaptation strategy because the adaptationstarts in early iterations even if the initial proposal distribution is farfrom optimum. We apply the framework for learning multivariate random walkMetropolis and Metropolis-adjusted Langevin proposals with full covariancematrices, and provide empirical evidence that our method can outperform otherMCMC algorithms, including Hamiltonian Monte Carlo schemes.

### Quick Read (beta)

# Gradient-based Adaptive Markov Chain Monte Carlo

###### Abstract

We introduce a gradient-based learning method to automatically adapt Markov chain Monte Carlo (MCMC) proposal distributions to intractable targets. We define a maximum entropy regularised objective function, referred to as generalised speed measure, which can be robustly optimised over the parameters of the proposal distribution by applying stochastic gradient optimisation. An advantage of our method compared to traditional adaptive MCMC methods is that the adaptation occurs even when candidate state values are rejected. This is a highly desirable property of any adaptation strategy because the adaptation starts in early iterations even if the initial proposal distribution is far from optimum. We apply the framework for learning multivariate random walk Metropolis and Metropolis-adjusted Langevin proposals with full covariance matrices, and provide empirical evidence that our method can outperform other MCMC algorithms, including Hamiltonian Monte Carlo schemes.

Gradient-based Adaptive Markov Chain Monte Carlo

Michalis K. Titsias DeepMind London, UK [email protected] Petros Dellaportas Department of Statistical Science University College of London, UK Department of Statistics, Athens Univ. of Econ. and Business, Greece and The Alan Turing Institute, UK

noticebox[b]33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\[email protected]

## 1 Introduction

Markov chain Monte Carlo (MCMC) is a family of algorithms that provide a mechanism for generating dependent draws from arbitrarily complex distributions. The basic set up of an MCMC algorithm in any probabilistic (e.g. Bayesian) inference problem, with an intractable target density $\pi (x)$, is as follows. A discrete time Markov chain ${\{{X}_{t}\}}_{t=0}^{\mathrm{\infty}}$ with transition kernel ${P}_{\theta}$, appropriately chosen from a collection of $\pi $-invariant kernels ${\{{P}_{\theta}(\cdot ,\cdot )\}}_{\theta \in \mathrm{\Theta}}$, is generated and the ergodic averages ${\mu}_{t}(F)={t}^{-1}{\sum}_{i=0}^{t-1}F({X}_{i})$ are used as approximations to ${E}_{\pi}(F)$ for any real-valued function $F$. Although in principle this sampling setup is simple, the actual implementation of any MCMC algorithm requires careful choice of ${P}_{\theta}$ because the properties of ${\mu}_{t}$ depend on $\theta $. In common kernels that lead to random walk Metropolis (RWM), Metropolis-adjusted Langevin (MALA) or Hamiltonian Monte Carlo (HMC) algorithms the kernels ${P}_{\theta}$ are specified through an accept-reject mechanism in which the chain moves from time $t$ to time $t+1$ by first proposing candidate values $y$ from a density ${q}_{\theta}(y|x)$ and accepting them with some probability $\alpha ({x}_{t},y)$ and setting ${x}_{t+1}=y$, or rejecting them and setting ${x}_{t+1}={x}_{t}$. Since $\theta $ directly affects this acceptance probability, it is clear that one should choose $\theta $ so that the chain does not move too slowly or rejects too many proposed values $y$ because in both these cases convergence to the stationary distribution will be slow. This has been recognised as early as in (metropolis1953equation, ) and has initiated exciting research that has produced optimum average acceptance probabilities for a series of algorithms; see (roberts1997weak, ; roberts1998optimal, ; roberts2001optimal, ; haario2005componentwise, ; bedard2007weak, ; bedard2008optimal, ; roberts2009examples, ; bedard2008efficient, ; rosenthal2011optimal, ; beskos2013optimal, ). Such optimal average acceptance probabilities provide basic guidelines for adapting single step size parameters to achieve certain average acceptance rates.

More sophisticated adaptive MCMC algorithms that can learn a full set of parameters $\theta $, such as a covariance matrix, borrow information from the history of the chain to optimise some criterion reflecting the performance of the Markov chain (haario2001adaptive, ; atchade2005adaptive, ; roberts2007coupling, ; giordani2010adaptive, ; andrieu2006ergodicity, ; andrieu2007efficiency, ; atchade2009adaptive, ). Such methods are typically non gradient-based and the basic strategy they use is to sequentially fit the proposal ${q}_{\theta}(y|x)$ to the history of states ${x}_{t-1},{x}_{t},\mathrm{\dots},$ by ignoring also the rejected state values. This can result in very slow adaptation because the initial Markov chain simulations are based on poor initial $\theta $ and the generated states, from which $\theta $ is learnt, are highly correlated and far from the target. The authors in roberts2009examples call such adaptive strategies ‘greedy’ in the sense that they try to adapt too closely to initial information from the output and take considerable time to recover from misleading initial information.

In this paper, we develop faster and more robust gradient-based adaptive MCMC algorithms that make use of the gradient of the target, $\nabla \mathrm{log}\pi (x)$, and they learn from both actual states of the chain and proposed (and possibly rejected) states. The key idea is to define and maximise w.r.t. $\theta $ an entropy regularised objective function that promotes high acceptance rates and high values for the entropy of the proposal distribution. This objective function, referred to as generalised speed measure, is inspired by the speed measure of the infinite-dimensional limiting diffusion process that captures the notion of speed in which a Markov chain converges to its stationary distribution (roberts2001optimal, ). We maximise this objective function by applying stochastic gradient variational inference techniques such as those based on the reparametrisation trick (Kingma2014, ; Rezende2014, ; Titsias2014_doubly, ). An advantage of our algorithm compared to traditional adaptive MCMC methods is that the adaptation occurs even when candidate state values are rejected. In fact, the adaptation can be faster when candidate values $y$ are rejected since then we make always full use of the gradient $\nabla \mathrm{log}\pi (y)$ evaluated at the rejected $y$. This allows the adaptation to start in early iterations even if the initial proposal distribution is far from optimum and the chain is not moving. We apply the method for learning multivariate RWM and MALA proposals where we adapt full covariance matrices parametrised efficiently using Cholesky factors. In the experiments we demonstrate our algorithms to multivariate Gaussian targets and Bayesian logistic regression and empirically show that they outperform several other baselines, including advanced HMC schemes.

## 2 Gradient-based adaptive MCMC

Assume a target distribution $\pi (x)$, known up to some unknown normalising constant, where $x\in {\mathbb{R}}^{n}$ is the state vector. To sample from $\pi (x)$ we consider the Metropolis-Hastings (M-H) algorithm that generates new states by sampling from a proposal distribution ${q}_{\theta}(y|x)$, that depends on parameters $\theta $, and accepts or rejects each proposed state by using the standard M-H acceptance probability

$$\alpha (x,y;\theta )=\text{min}\{1,\frac{\pi (y){q}_{\theta}(x|y)}{\pi (x){q}_{\theta}(y|x)}\}.$$ | (1) |

While the M-H algorithm defines a Markov chain that converges to the target distribution, the efficiency of the algorithm heavily depends on the choice of the proposal distribution ${q}_{\theta}(x|y)$ and the setting of its parameters $\theta $.

Here, we develop a framework for stochastic gradient-based adaptation or learning of ${q}_{\theta}(x|y)$ that maximises an objective function inspired by the concept of speed measure that underlies the theoretical foundations of MCMC optimal tuning (roberts1997weak, ; roberts1998optimal, ). Given that the chain is at state $x$ we would like: (i) to propose big jumps in the state space and (ii) accept these jumps with high probability. By assuming for now that the proposal has the standard random walk isotropic form, such that ${q}_{\sigma}(y|x)=\mathcal{N}(y|x,{\sigma}^{2}I)$, the speed measure is defined as

$${s}_{\sigma}(x)={\sigma}^{2}\times \alpha (x;\sigma ),$$ | (2) |

where ${\sigma}^{2}$ denotes the variance, also called step size, of the proposal distribution, while $\alpha (x;\sigma )$ is the average acceptance probability when starting at $x$, i.e. $\alpha (x;\sigma )=\int \alpha (x,y;\sigma ){q}_{\sigma}(y|x)\mathit{d}y$. To learn a good value for the step size we could maximise the speed measure in Eq. 2, which intuitively promotes high variance for the proposal distribution together with high acceptance rates. In the theory of optimal MCMC tuning, ${s}_{\sigma}(x)$ is averaged under the stationary distribution $\pi (x)$ to obtain a global speed measure value ${s}_{\sigma}=\int \pi (x){s}_{\sigma}(x)\mathit{d}x$. For simple targets and with increasing dimension this measure is maximised so that ${\sigma}^{2}$ is set to a value that leads to the acceptance probability $0.234$ (roberts1997weak, ; roberts1998optimal, ). This subsequently leads to the popular heuristic for tuning random walk proposals: tune ${\sigma}^{2}$ so that on average the proposed states are accepted with probability $1/4$. Similar heuristics have been obtained for tuning the step sizes of more advanced schemes such as MALA and HMC, where 0.54 is considered optimal for MALA (roberts2001optimal, ) and 0.67 for HMC (neal2010, ; beskos2013optimal, ).

While the current notion of speed measure from Eq. 2 is intuitive, it is only suitable for tuning proposals having a single step size. Thus, in order to learn arbitrary proposal distributions ${q}_{\theta}(y|x)$, where $\theta $ is a vector of parameters, we need to define suitable generalisations of the speed measure. Suppose, for instance, that we wish to tune a Gaussian with a full covariance matrix, i.e. ${q}_{\mathrm{\Sigma}}(y|x)=\mathcal{N}(y|x,\mathrm{\Sigma})$. To achieve this we need to generalise the objective in Eq. 2 by replacing ${\sigma}^{2}$ with some functional $\mathcal{F}(\mathrm{\Sigma})$ that depends on the full covariance. An obvious choice is to consider the average distance ${||y-x||}^{2}$ given by the trace $\text{tr}(\mathrm{\Sigma})={\sum}_{i}{\sigma}_{i}^{2}$. However, this is problematic since it can lead to learning proposals with very poor mixing. To see this note that since the trace is the sum of variances it can obtain high values even when some of the components of $x$ have very low variance, e.g. for some ${x}_{i}$ it holds ${\sigma}_{i}^{2}\approx 0$, which can result in very low sampling efficiency or even non-ergodicity. In order to define better functionals $\mathcal{F}(\mathrm{\Sigma})$ we wish to exploit the intuition that for MCMC all components of $x$ need to jointly perform (relative to their scale) big jumps, a requirement that is better captured by the determinant $|\mathrm{\Sigma}|$ or more generally by the entropy of the proposal distribution.

Therefore, we introduce a generalisation of the speed measure having the form,

$${s}_{\theta}(x)=\mathrm{exp}\{\beta {\mathscr{H}}_{{q}_{\theta}(y|x)}\}\times \alpha (x;\theta )=\mathrm{exp}\{\beta {\mathscr{H}}_{{q}_{\theta}(y|x)}\}\times \int \alpha (x,y;\theta ){q}_{\theta}(y|x)\mathit{d}y,$$ | (3) |

where ${\mathscr{H}}_{{q}_{\theta}(y|x)}=-\int {q}_{\theta}(y|x)\mathrm{log}{q}_{\theta}(y|x)\mathit{d}y$
denotes the entropy of the proposal distribution and $\beta >0$ is an hyperparameter. Note that when the
proposal distribution is a full Gaussian ${q}_{\mathrm{\Sigma}}(y|x)=\mathcal{N}(y|x,\mathrm{\Sigma})$
then $\mathrm{exp}\{\beta {\mathscr{H}}_{q(y|x)}\}=\text{const}\times {|\mathrm{\Sigma}|}^{\frac{\beta}{2}}$ which depends
on the determinant of $\mathrm{\Sigma}$. ${s}_{\theta}(x)$, referred to as *generalised speed measure*,
trades off between high entropy of the proposal distribution and high acceptance probability.
The hyperparameter $\beta $ plays the crucial role of balancing the relative
strengths of these terms. As discussed in the next section we can efficiently optimise
$\beta $ in order to achieve a desirable average acceptance rate.

In the following we make use of the above generalised speed measure to derive a variational learning algorithm for adapting the parameters $\theta $ using stochastic gradient-based optimisation.

### 2.1 Maximising the generalised speed measure using variational inference

During MCMC iterations we collect the pairs of vectors ${({x}_{t},{y}_{t})}_{t>0}$ where ${x}_{t}$ is the state of the chain at time $t$ and ${y}_{t}$ the corresponding proposed next state, which if accepted then ${x}_{t+1}={y}_{t}$. When the chain has converged each ${x}_{t}$ follows the stationary distribution $\pi (x)$, otherwise it follows some distribution that progressively converges to $\pi (x)$. In either case we view the sequence of pairs $({x}_{t},{y}_{t})$ as non-iid data based on which we wish to perform gradient-based updates of the parameters $\theta $. In practice such updates can be performed with diminishing learning rates, or more safely completely stop after some number of burn-in iterations to ensure convergence. Specifically, given the current state ${x}_{t}$ we wish to take a step towards maximising ${s}_{\theta}({x}_{t})$ in Eq. 3 or equivalently its logarithm,

$$\mathrm{log}{s}_{\theta}({x}_{t})=\mathrm{log}\int \alpha (x,y;\theta ){q}_{\theta}(y|{x}_{t})\mathit{d}y+\beta {\mathscr{H}}_{{q}_{\theta}(y|{x}_{t})}.$$ | (4) |

The second term is just the entropy of the proposal distribution, which typically will be analytically tractable, while the first term involves an intractable integral. To approximate the first term we work similarly to variational inference (Jordan1999learn, ; Bishop2006, ) and we lower bound it using Jensen’s inequality,

$\mathrm{log}{s}_{\theta}({x}_{t})\ge {\mathcal{F}}_{\theta}({x}_{t})$ | $={\displaystyle \int {q}_{\theta}(y|{x}_{t})\mathrm{log}\text{min}\{1,\frac{\pi (y){q}_{\theta}({x}_{t}|y)}{\pi ({x}_{t}){q}_{\theta}(y|{x}_{t})}\}\mathit{d}y}+\beta {\mathscr{H}}_{{q}_{\theta}(y|{x}_{t})}$ | (5) | ||

$={\displaystyle \int {q}_{\theta}(y|{x}_{t})\text{min}\{0,\mathrm{log}\frac{\pi (y)}{\pi ({x}_{t})}+\mathrm{log}\frac{{q}_{\theta}({x}_{t}|y)}{{q}_{\theta}(y|{x}_{t})}\}\mathit{d}y}+\beta {\mathscr{H}}_{{q}_{\theta}(y|{x}_{t})}.$ | (6) |

To take a step towards maximising ${\mathcal{F}}_{\theta}$ we can apply standard stochastic variational inference techniques such as the score function method or the reparametrisation trick (Carbonetto2009, ; Paisley2012, ; Ranganath2014, ; Kingma2014, ; Rezende2014, ; Titsias2014_doubly, ; Kucukelbir2017, ). Here, we focus on the case ${q}_{\theta}(y|{x}_{t})$ is a reparametrisable distribution such that $y={\mathcal{T}}_{\theta}({x}_{t},\u03f5)$ where ${\mathcal{T}}_{\theta}$ is a deterministic transformation and $\u03f5\sim p(\u03f5)$. ${\mathcal{F}}_{\theta}({x}_{t})$ can be re-written as

${\mathcal{F}}_{\theta}({x}_{t})$ | $={\displaystyle \int p(\u03f5)\text{min}\{0,\mathrm{log}\frac{\pi ({\mathcal{T}}_{\theta}({x}_{t},\u03f5))}{\pi ({x}_{t})}+\mathrm{log}\frac{{q}_{\theta}({x}_{t}|{\mathcal{T}}_{\theta}({x}_{t},\u03f5))}{{q}_{\theta}({\mathcal{T}}_{\theta}({x}_{t},\u03f5)|{x}_{t})}\}\mathit{d}\u03f5}+\beta {\mathscr{H}}_{{q}_{\theta}(y|{x}_{t})}.$ |

Since MCMC at the $t$-th iteration proposes a specific state ${y}_{t}$ constructed as ${\u03f5}_{t}\sim p({\u03f5}_{t})$, ${y}_{t}={\mathcal{T}}_{\theta}({x}_{t},{\u03f5}_{t})$, an unbiased estimate of the exact gradient ${\nabla}_{\theta}{\mathcal{F}}_{\theta}({x}_{t})$ can be obtained by

${\nabla}_{\theta}{\mathcal{F}}_{\theta}({x}_{t},{\u03f5}_{t})={\nabla}_{\theta}\text{min}\{0,\mathrm{log}{\displaystyle \frac{\pi ({\mathcal{T}}_{\theta}({x}_{t},{\u03f5}_{t}))}{\pi ({x}_{t})}}+\mathrm{log}{\displaystyle \frac{{q}_{\theta}({x}_{t}|{\mathcal{T}}_{\theta}({x}_{t},{\u03f5}_{t}))}{{q}_{\theta}({\mathcal{T}}_{\theta}({x}_{t},{\u03f5}_{t})|{x}_{t})}}\}+\beta {\nabla}_{\theta}{\mathscr{H}}_{{q}_{\theta}(y|{x}_{t})},$ |

which is used to make a gradient update for the parameters $\theta $. Note that the first term in the stochastic gradient is analogous to differentiating through a rectified linear hidden unit (ReLu) in neural networks, i.e. if $\mathrm{log}\frac{\pi ({y}_{t})}{\pi ({x}_{t})}+\mathrm{log}\frac{{q}_{\theta}({x}_{t}|{y}_{t})}{{q}_{\theta}({y}_{t}|{x}_{t})}\ge 0$ the gradient is zero (this is the case when ${y}_{t}$ is accepted with probability one), while otherwise the gradient of the first term reduces to

$${\nabla}_{\theta}\mathrm{log}\pi ({\mathcal{T}}_{\theta}({x}_{t},{\u03f5}_{t}))+{\nabla}_{\theta}\mathrm{log}\frac{{q}_{\theta}({x}_{t}|{\mathcal{T}}_{\theta}({x}_{t},{\u03f5}_{t}))}{{q}_{\theta}({\mathcal{T}}_{\theta}({x}_{t},{\u03f5}_{t})|{x}_{t})}.$$ |

The value of the hyperparameter $\beta $ can allow to trade off between large acceptance probability and large entropy of the proposal distribution. Such hyperparameter cannot be optimised by maximising the variational objective ${\mathcal{F}}_{\theta}$ (this typically will set $\beta $ to a very small value so that the acceptance probability becomes close to one but the chain is not moving since the entropy is very low). Thus, $\beta $ needs to be updated in order to control the average acceptance probability of the chain in order to achieve a certain desired value ${\alpha}_{*}$. The value of ${\alpha}_{*}$ can be determined based on the specific MCMC proposal we are using and by following standard recommendations in the literature, as discussed also in the previous section. For instance, when we use RWM ${\alpha}_{*}$ can be set to $1/4$ (see Section 2.2), while for gradient-based MALA (see Section 2.3) ${\alpha}_{*}$ can be set to $0.55$.

Pseudocode for the general procedure is given by Algorithm 1. We set the learning rate ${\rho}_{t}$ using RMSProp (Tieleman2012, ); at each iteration $t$ we set ${\rho}_{t}=\eta /(1+\sqrt{{G}_{t}})$, where $\eta $ is the baseline learning rate, and the updates of ${G}_{t}$ depend on the gradient estimate ${\nabla}_{\theta}{\mathcal{F}}_{\theta}({x}_{t},{\u03f5}_{t})$ as ${G}_{t}=0.9{G}_{t}+0.1{\left[{\nabla}_{\theta}{\mathcal{F}}_{\theta}({x}_{t},{\u03f5}_{t})\right]}^{2}$.

### 2.2 Fitting a full covariance Gaussian random walk proposal

We now specialise to the case the proposal distribution is a random walk Gaussian ${q}_{L}(y|x)=\mathcal{N}(y|x,L{L}^{\top})$ where the parameter $L$ is a positive definite lower triangular matrix, i.e. a Cholesky factor. This distribution is reparametrisable since $y\equiv {\mathcal{T}}_{L}(x,\u03f5)=x+L\u03f5$, $\u03f5\sim \mathcal{N}(\u03f5|0,I)$. At the $t$-th iteration when the state is ${x}_{t}$ the lower bound becomes

${\mathcal{F}}_{L}({x}_{t})$ | $={\displaystyle \int \mathcal{N}(\u03f5|0,I)\text{min}\{0,\mathrm{log}\pi ({x}_{t}+L\u03f5)-\mathrm{log}\pi ({x}_{t})\}\mathit{d}\u03f5}+\beta {\displaystyle \sum _{i=1}^{n}}\mathrm{log}{L}_{ii}+\text{const}.$ | (7) |

Here, the proposal distribution has cancelled out from the M-H ratio, since it is symmetric, while ${\mathscr{H}}_{{q}_{\theta}(y|{x}_{t})}=\mathrm{log}|L|+\text{const}$ and $\mathrm{log}|L|={\sum}_{i=1}^{n}\mathrm{log}{L}_{ii}$. By making use of the MCMC proposed state ${y}_{t}={x}_{t}+L{\u03f5}_{t}$ we can obtain an unbiased estimate of the exact gradient ${\nabla}_{L}{\mathcal{F}}_{L}({x}_{t})$,

$$ |

where ${y}_{t}={x}_{t}+L{\u03f5}_{t}$, the operation ${[A]}_{lower}$ zeros the upper triangular part (above the main diagonal) of a squared matrix and $\text{diag}(\frac{1}{{L}_{11}},\mathrm{\dots},\frac{1}{{L}_{nn}})$ is a diagonal matrix with elements $1/{L}_{ii}$. The first case of this gradient, i.e. when $$, has a very similar structure with the stochastic reparametrisation gradient when fitting a variational Gaussian approximation (Kingma2014, ; Rezende2014, ; Titsias2014_doubly, ) with the difference that here we centre the corresponding approximation, i.e. the proposal ${q}_{L}({y}_{t}|{x}_{t})$, at the current state ${x}_{t}$ instead of having a global variational mean parameter. Interestingly, this first case when MCMC rejects many samples (or even it gets stuck at the same value for long time) is when learning can be faster since the term ${\nabla}_{{y}_{t}}\mathrm{log}\pi ({y}_{t})\times {\u03f5}_{t}^{\top}$ transfers information about the curvature of the target to the covariance of the proposal. When we start getting high acceptance rates the second case, i.e. $\mathrm{log}\pi ({y}_{t})\ge \mathrm{log}\pi ({x}_{t})$, will often be activated so that the gradient will often reduce to only having the term $\beta \text{diag}(\frac{1}{{L}_{11}},\mathrm{\dots},\frac{1}{{L}_{nn}})$ that encourages the entropy of the proposal to become large. The ability to learn from rejections is in sharp contrast with the traditional non gradient-based adaptive MCMC methods which can become very slow when MCMC has high rejection rates. This is because these methods typically learn from the history of state vectors ${x}_{t}$ by ignoring the information from the rejected states. The algorithm for learning the full random walk Gaussian follows precisely the general structure of Algorithm 1. For the average acceptance rate ${\alpha}_{*}$ we use the value $1/4$.

### 2.3 Fitting a full covariance MALA proposal

Here, we specialise to a full covariance, also called preconditioned, MALA of the form ${q}_{L}(y|x)=\mathcal{N}(y|x+(1/2)L{L}^{\top}{\nabla}_{x}\mathrm{log}\pi (x),L{L}^{\top})$ where the covariance matrix is parametrised by the Cholesky factor $L$. Again this distribution is reparametrisable according to $y\equiv {\mathcal{T}}_{L}(x,\u03f5)=x+(1/2)L{L}^{\top}\nabla \mathrm{log}\pi (x)+L\u03f5$, $\u03f5\sim \mathcal{N}(\u03f5|0,I)$. At the $t$-th iteration when the state is ${x}_{t}$ the reparametrised lower bound simplifies significantly and reduces to,

${\mathcal{F}}_{L}({x}_{t})$ | $={\displaystyle \int}\mathcal{N}(\u03f5|0,I)\text{min}\{0,\mathrm{log}\pi ({x}_{t}+(1/2)L{L}^{\top}\nabla \mathrm{log}\pi ({x}_{t})+L\u03f5)-\mathrm{log}\pi ({x}_{t}).$ | ||

$.$ | $-{\displaystyle \frac{1}{2}}(||(1/2){L}^{\top}[\nabla \mathrm{log}\pi ({x}_{t})+\nabla \mathrm{log}\pi (y)]+\u03f5|{|}^{2}-||\u03f5|{|}^{2})\}d\u03f5+\beta {\displaystyle \sum}{}_{i=1}{}^{n}\mathrm{log}L{}_{ii}+\text{const},$ |

where $||\cdot ||$ denotes Euclidean norm and in the term $\nabla \mathrm{log}\pi (y)$, $y={x}_{t}+(1/2)L{L}^{\top}\nabla \mathrm{log}\pi ({x}_{t})+L\u03f5$. Then, based on the proposed state ${y}_{t}={\mathcal{T}}_{L}({x}_{t},{\u03f5}_{t})$ we can obtain the unbiased gradient estimate $\nabla {\mathcal{F}}_{L}({x}_{t},{\u03f5}_{t})$ similarly to the previous section. In general, such an estimate can be very expensive because the existence of $L$ inside $\nabla \mathrm{log}\pi ({y}_{t})$ means that we need to compute the matrix of second derivatives or Hessian $\nabla \nabla \mathrm{log}\pi ({y}_{t})$. We have found that an alternative procedure which stops the gradient inside $\nabla \mathrm{log}\pi ({y}_{t})$ (i.e. it views $\nabla \mathrm{log}\pi ({y}_{t})$ as a constant w.r.t. $L$) has small bias and works well in practice. In fact, as we will show in the experiments this approximation not only is computationally much faster but remarkably also it leads to better adaptation compared to the exact Hessian procedure, presumably because by not accounting for the gradient inside $\nabla \mathrm{log}\pi ({y}_{t})$ reduces the variance. Furthermore, the expression of the gradient w.r.t. $L$ used by this fast approximation can be computed very efficiently with a single $O({n}^{2})$ operation (an outer vector product; see Supplement), while each iteration of the algorithm requires overall at most four $O({n}^{2})$ operations. For these gradient-based adaptive MALA schemes, $\beta $ in Algorithm 1 is adapted to obtain an average acceptance rate roughly ${\alpha}_{*}=0.55$.

## 3 Related Work

Connection of our method with traditional adaptive MCMC methods has been discussed in Section 1. Here, we analyse additional related works that make use of gradient-based optimisation and specialised objective functions or algorithms to train MCMC proposal distributions.

The work in levy2018generalizing proposed a criterion to tune MCMC proposals
based on maximising a modified version of the expected squared jumped distance,
$\int {q}_{\theta}(y|{x}_{t}){||y-{x}_{t}||}^{2}\alpha ({x}_{t},y;\theta )\mathit{d}y$, previously considered in pasarica2010adaptively .
Specifically, the authors in levy2018generalizing firstly observe that the expected squared jumped distance
may not encourage mixing across all dimensions of $x$^{1}^{1}
1
Because
the additive form of ${||y-{x}_{t}||}^{2}={\sum}_{i}{({y}_{i}-{x}_{ti})}^{2}$ implies that even when some dimensions
might not be moving at all (the corresponding distance terms are zero),
the overall sum can still be large. and then try to resolve this
by including a reciprocal term (see Section 4.2 in their paper). The
generalised speed measure proposed in this paper is rather different from such criteria
since it encourages joint exploration of all dimensions of $x$ by applying maximum entropy
regularisation, which by construction penalises "dimensions that do not move" since the entropy becomes minus infinity in such cases.
Another important difference is that in our method the optimisation
is performed in the log space by propagating gradients through the logarithm of the
M-H acceptance probability, i.e. through $\mathrm{log}\alpha ({x}_{t},y;\theta )$ and not through $\alpha ({x}_{t},y;\theta )$. This is exactly
analogous to other numerically stable objectives such as variational lower bounds and log likelihoods, and as those
our method leads to numerically stable optimisation for arbitrarily large dimensionality of $x$
and complex targets $\pi (x)$.

In another related work, the authors in neklyudov2018metropolis considered minimising the KL divergence $\text{KL}[\pi ({x}_{t}){q}_{\theta}({y}_{t}|{x}_{t})||\pi ({y}_{t}){q}_{\theta}({x}_{t}|{y}_{t})]$. However, this loss for standard proposal schemes, such as RWM and MALA, leads to degenerate deterministic solutions where ${q}_{\theta}({y}_{t}|{x}_{t})$ collapses to a delta function. Therefore, neklyudov2018metropolis maximised this objective for the independent M-H sampler where the collapsing problem does not occur. The entropy regularised objective we introduced is different and it can adapt arbitrary MCMC proposal distributions, and not just the independent M-H sampler.

There has been also work to learn flexible MCMC proposals using neural networks (song2017nice, ; levy2018generalizing, ; habib2018auxiliary, ; Salimans2015, ). For instance, song2017nice use volume preserving flows and an adversarial objective, levy2018generalizing use the modified expected jumped distance, discussed earlier, to learn neural network-based extensions of HMC, while habib2018auxiliary ; Salimans2015 use auxiliary variational inference. The need to train neural networks can add a significant computational cost, and from the end-user point of view these neural adaptive samplers might be hard to tune especially in high dimensions. Notice that the generalised speed measure we proposed in this paper could possibly be used to train neural adaptive samplers as well. However, to really obtain practical algorithms we need to ensure that training has small cost that does not overwhelm the possible benefits in terms of effective sample size.

Finally, the generalised speed measure that is based on entropy regularisation shares similarities with other used objectives for learning probability distributions, such as in variational Bayesian inference, where the variational lower bound includes an entropy term (Jordan1999learn, ; Bishop2006, ) and reinforcement learning (RL) where maximum-entropy regularised policy gradients are able to estimate more explorative policies (schulman15, ; mniha16, ). Further discussion on the resemblance of our algorithm with RL is given in the Supplement.

## 4 Experiments

We test the gradient-based adaptive MCMC methods in several simulated and real data.
We investigate the performance of two instances of the framework:
the gradient-based adaptive random walk (gadRWM)
detailed in Section 2.2 and the corresponding MALA (gadMALA)
detailed in Section 2.3. For gadMALA we consider
the exact reparametrisation method that requires the evaluation of the Hessian (gadMALAe)
and the fast approximate variant (gadMALAf).
These schemes are compared against: (i) standard random walk Metropolis (RWM)
with proposal $\mathcal{N}(y|x,{\sigma}^{2}I)$, (ii) an adaptive MCMC
(AM) that fits a proposal of the form $\mathcal{N}(y|x,\mathrm{\Sigma})$ (we consider
a computational efficient version based on updating the Cholesky factor of
$\mathrm{\Sigma}$; see Supplement), (iii) a standard MALA proposal $\mathcal{N}(y|x+(1/2){\sigma}^{2}\nabla \mathrm{log}\pi (x),{\sigma}^{2}I)$,
(iv) an Hamiltonian Monte Carlo (HMC) with a fixed number of leap frog steps (either 5, or 10, or 20)
(v) and the state of the art no-U-turn sampler (NUTS) (hoffman2014no, ) which arguably is the most efficient adaptive HMC method
that automatically determines the number of leap frog steps.
We provide our own MALTAB implementation^{2}^{2}
2
https://github.com/mtitsias/gadMCMC. of all algorithms, apart from NUTS for which we consider a publicly
available implementation.

### 4.1 Illustrative experiments

To visually illustrate the gradient-based adaptive samplers we consider a correlated 2-D Gaussian target with covariance matrix $\mathrm{\Sigma}=[\mathrm{1\hspace{0.25em}0.99};\mathrm{0.99\hspace{0.25em}1}]$ and a 51-dimensional Gaussian target obtained by evaluating the squared exponential kernel plus small white noise, i.e. $k({x}_{i},{x}_{j})=\mathrm{exp}\{-\frac{1}{2}\frac{{({x}_{i}-{x}_{j})}^{2}}{0.16}\}+0.01{\delta}_{i,j}$, on the regular grid $[0,4]$. The first two panels in Figure 1 show the true covariance together with the adapted covariances obtained by gadRWM for two different settings of the average acceptance rate ${\alpha}_{*}$ in Algorithm 1, which illustrates also the adaptation of the entropy-regularisation hyperparameter $\beta $ that is learnt to obtain a certain ${\alpha}_{*}$. The remaining two plots illustrate the ability to learn a highly correlated $51$-dimensional covariance matrix (with eigenvalues ranging from $0.01$ to $12.07$) by applying our most advanced gadMALAf scheme.

### 4.2 Quantitative results

Here, we compare all algorithms in some standard benchmark problems, such as Bayesian logistic regression, and report effective sample size (ESS) together with other quantitative scores.

Method | Time(s) | Accept Rate | ESS (Min, Med, Max) | Min ESS/s (1 st.d.) |
---|---|---|---|---|

(Neal’s Gaussian) | ||||

gadMALAf | 8.7 | 0.556 | (1413.4, 1987.4, 2580.8) | 161.70 (15.07) |

gadMALAe | 10.0 | 0.541 | (922.2, 2006.3, 2691.1) | 92.34 (7.11) |

gadRWM | 7.0 | 0.254 | (27.5, 66.9, 126.9) | 3.95 (0.66) |

AM | 2.3 | 0.257 | (8.7, 48.6, 829.1) | 3.71 (0.87) |

RWM | 2.2 | 0.261 | (2.9, 8.4, 2547.6) | 1.31 (0.06) |

MALA | 3.1 | 0.530 | (2.9, 10.0, 12489.2) | 0.95 (0.03) |

HMC-20 | 49.7 | 0.694 | (306.1, 1537.8, 19732.4) | 6.17 (3.35) |

NUTS | 360.5 | >0.7 | (18479.6, 20000.0, 20000.0) | 51.28 (1.64) |

(Caravan) | ||||

gadMALAf | 23.1 | 0.621 | (228.1, 750.3, 1114.7) | 9.94 (2.64) |

gadMALAe | 95.1 | 0.494 | (66.6, 508.3, 1442.7) | 0.70 (0.16) |

gadRWM | 22.6 | 0.234 | (5.3, 34.3, 104.5) | 0.23 (0.06) |

AM | 20.0 | 0.257 | (3.2, 11.8, 62.5) | 0.16 (0.01) |

RWM | 15.3 | 0.242 | (3.0, 9.3, 52.5) | 0.20 (0.03) |

MALA | 22.8 | 0.543 | (4.4, 28.3, 326.0) | 0.19 (0.05) |

HMC-10 | 225.5 | 0.711 | (248.3, 2415.7, 19778.7) | 1.10 (0.12) |

NUTS | 1412.1 | >0.7 | (7469.5, 20000.0, 20000.0) | 5.29 (0.38) |

Experimental settings. In all experiments for AM and gradient-based adaptive schemes the Cholesky factor $L$ was initialised to a diagonal matrix with values $0.1/\sqrt{n}$ in the diagonal where $n$ is the dimensionality of $x$. For the AM algorithm the learning rate was set to $0.001/(1+t/T)$ where $t$ is the number of iterations and $T$ (the value $4000$ was used in all experiments) serves to keep the learning rate nearly constant for the first $T$ training iterations. For the gradient-based adaptive algorithms we use RMSprop (see Section 2.1) where $\eta $ was set to $0.00005$ for gadRWM and to $0.00015$ for the gadMALA schemes. NUTS uses its own fully automatic adaptive procedure that determines both the step size and the number of leap frog steps (hoffman2014no, ). For all experiments and algorithms (apart from NUTS) we consider $2\times {10}^{4}$ burn-in iterations and $2\times {10}^{4}$ iterations for collecting samples. This adaptation of $L$ or ${\sigma}^{2}$ takes place only during the burn-in iterations and then it stops, i.e. at collection of samples stage these parameters are kept fixed. For NUTS, which has its own internal tuning procedure, $500$ burn-in iterations are sufficient before collecting $2\times {10}^{4}$ samples. The computational time for all algorithms reported in the tables corresponds to the overall running time, i.e. the time for performing jointly all burn-in and collection of samples iterations.

Neal’s Gaussian target. We first consider an example used in (neal2010, ) where the target is a zero-mean multivariate Gaussian with diagonal covariance matrix $\mathrm{\Sigma}=\text{diag}({s}_{1}^{2},\mathrm{\dots},{s}_{100}^{2})$ where the stds ${s}_{i}$ take values in the linear grid $0.01,0.02,\mathrm{\dots},1$. This is a challenging example because the different scaling of the dimensions means that the schemes that use an isotropic step ${\sigma}^{2}$ will be adapted to the smallest dimension ${x}_{1}$ while the chain at the higher dimensions, such as ${x}_{100}$, will be moving slowly exhibiting high autocorrelation and small effective sample size. The first row of Figure 3 shows the trace plot across iterations of the dimension ${x}_{100}$ for some of the adaptive schemes including an HMC scheme that uses $20$ leap frog steps. Clearly, the gradient-based adaptive methods show much smaller autocorrelation that AM. This is because they achieve a more efficient adaptation of the Cholesky factor $L$ which ideally should become proportional to a diagonal matrix with the linear grid $0.01,0.02,\mathrm{\dots},1$ in the main diagonal. The second row of Figure 3 shows the diagonal elements of $L$ from which we can observe that all gradient-based schemes lead to more accurate adaptation with gadMALAf being the most accurate.

Furthermore, Table 1 provides quantitative results such as minimum, median and maximum ESS computed across all dimensions of the state vector $x$, running times and an overall efficiency score Min ESS/s (i.e. ESS for the slowest moving component of $x$ divided by running time – last column in the Table) which allows to rank the different algorithms. All results are averages after repeating the simulations $10$ times under different random initialisations. From the table it is clear that the gadMALA algorithms give the best performance with gadMALAf being overall the most effective.

Bayesian logistic regression. We consider binary classification where given a set of training examples ${\{{y}_{i},{s}_{i}\}}_{i=1}^{n}$ we assume a logistic regression likelihood $p(y|w,s)={\sum}_{i=1}^{n}{y}_{i}\mathrm{log}\sigma ({s}_{i})+(1-{y}_{i})\mathrm{log}(1-\sigma ({s}_{i}))$, where $\sigma ({s}_{i})=1/(1+\mathrm{exp}(-{w}^{\top}{s}_{i}))$, ${s}_{i}$ is the input vector and $w$ the parameters. We place a Gaussian prior on $w$ and we wish to sample from the posterior distribution over $w$. We considered six binary classification datasets (Australian Credit, Heart, Pima Indian, Ripley, German Credit and Caravan) with a number of examples ranging from $n=250$ to $n=5822$ and dimensionality of the state/parameter vector ranging from $3$ to $87$. Table 1 shows results for the most challenging Caravan dataset where the dimensionality of $w$ is $87$, while the remaining five tables are given in the Supplement. From all tables we observe that the gadMALAf is the most effective and it outperforms all other methods. While NUTS has always very high ESS is still outperformed by gadMALAf because of the high computational cost, i.e. NUTS might need to use a very large number of leap frog steps (each requiring re-evaluating the gradient of the log target) per iteration. Further results, including a higher $785$-dimensional example on MNIST, are given in the Supplement.

## 5 Conclusion

We have presented a new framework for gradient-based adaptive MCMC that makes use of an entropy-regularised objective function that generalises the concept of speed measure. We have applied this method for learning RWM and MALA proposals with full covariance matrices.

Some topics for future research are the following. Firstly, to deal with very high dimensional spaces it would be useful to consider low rank parametrisations of the covariance matrices in RWM and MALA proposals. Secondly, it would be interesting to investigate whether our method can be used to tune the so-called mass matrix in HMC samplers. However, in order for this to lead to practical and scalable algorithms we have to come up with schemes that avoid the computation of the Hessian, as we successfully have done for MALA. Finally, in order to reduce the variance of the stochastic gradients and speed up further the adaptation, especially in high dimensions, our framework could be possibly combined with parallel computing as used for instance in deep reinforcement learning espeholt18a .

## References

- [1] Christophe Andrieu and Yves Atchade. On the efficiency of adaptive mcmc algorithms. Electronic Communications in Probability, 12:336–349, 2007.
- [2] Christophe Andrieu and Éric Moulines. On the ergodicity properties of some adaptive mcmc algorithms. The Annals of Applied Probability, 16(3):1462–1505, 2006.
- [3] Christophe Andrieu and Johannes Thoms. A tutorial on adaptive mcmc. Statistics and Computing, 18(4):343–373, December 2008.
- [4] Yves Atchade, Gersende Fort, Eric Moulines, and Pierre Priouret. Adaptive markov chain monte carlo: theory and methods. Preprint, 2009.
- [5] Yves F Atchadé and Jeffrey S Rosenthal. On adaptive markov chain monte carlo algorithms. Bernoulli, 11(5):815–828, 2005.
- [6] Mylène Bédard. Weak convergence of metropolis algorithms for non-iid target distributions. The Annals of Applied Probability, pages 1222–1244, 2007.
- [7] Mylène Bédard. Efficient sampling using metropolis algorithms: Applications of optimal scaling results. Journal of Computational and Graphical Statistics, 17(2):312–332, 2008.
- [8] Mylene Bedard. Optimal acceptance rates for metropolis algorithms: Moving beyond 0.234. Stochastic Processes and their Applications, 118(12):2198–2222, 2008.
- [9] Alexandros Beskos, Natesh Pillai, Gareth Roberts, Jesus-Maria Sanz-Serna, and Andrew Stuart. Optimal tuning of the hybrid monte carlo algorithm. Bernoulli, 19(5A):1501–1534, 2013.
- [10] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
- [11] P. Carbonetto, M. King, and F. Hamze. A stochastic approximation method for inference in probabilistic graphical models. In Advances in Neural Information Processing Systems, 2009.
- [12] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1407–1416, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
- [13] Paolo Giordani and Robert Kohn. Adaptive independent metropolis–hastings by fast estimation of mixtures of normals. Journal of Computational and Graphical Statistics, 19(2):243–259, 2010.
- [14] Heikki Haario, Eero Saksman, and Johanna Tamminen. An adaptive metropolis algorithm. Bernoulli, 7(2):223–242, 2001.
- [15] Heikki Haario, Eero Saksman, and Johanna Tamminen. Componentwise adaptation for high dimensional mcmc. Computational Statistics, 20(2):265–273, 2005.
- [16] Raza Habib and David Barber. Auxiliary variational mcmc. To appear at ICLR 2019, 2019.
- [17] Matthew D Hoffman and Andrew Gelman. The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
- [18] M. I. Jordan, editor. Learning in Graphical Models. MIT Press, Cambridge, MA, USA, 1999.
- [19] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
- [20] A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1–45, 2017.
- [21] Daniel Levy, Matt D. Hoffman, and Jascha Sohl-Dickstein. Generalizing hamiltonian monte carlo with neural networks. In International Conference on Learning Representations, 2018.
- [22] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092, 1953.
- [23] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.
- [24] Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113–162, 2010.
- [25] Kirill Neklyudov, Pavel Shvechikov, and Dmitry Vetrov. Metropolis-hastings view on variational inference and adversarial training. arXiv preprint arXiv:1810.07151, 2018.
- [26] J. W. Paisley, D. M. Blei, and M. I. Jordan. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning, 2012.
- [27] Cristian Pasarica and Andrew Gelman. Adaptively scaling the metropolis algorithm using expected squared jumped distance. Statistica Sinica, pages 343–364, 2010.
- [28] R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In Artificial Intelligence and Statistics, 2014.
- [29] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
- [30] Gareth O Roberts, Andrew Gelman, and Walter R Gilks. Weak convergence and optimal scaling of random walk metropolis algorithms. The annals of applied probability, 7(1):110–120, 1997.
- [31] Gareth O Roberts and Jeffrey S Rosenthal. Optimal scaling of discrete approximations to langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268, 1998.
- [32] Gareth O Roberts and Jeffrey S Rosenthal. Optimal scaling for various metropolis-hastings algorithms. Statistical Science, pages 351–367, 2001.
- [33] Gareth O Roberts and Jeffrey S Rosenthal. Coupling and ergodicity of adaptive markov chain monte carlo algorithms. Journal of applied probability, 44(2):458–475, 2007.
- [34] Gareth O Roberts and Jeffrey S Rosenthal. Examples of adaptive mcmc. Journal of Computational and Graphical Statistics, 18(2):349–367, 2009.
- [35] Jeffrey S Rosenthal. Optimal proposal distributions and adaptive mcmc. In Handbook of Markov Chain Monte Carlo, pages 114–132. Chapman and Hall/CRC, 2011.
- [36] T. Salimans, D. P. Kingma, and M. Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, 2015.
- [37] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
- [38] Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-nice-mc: Adversarial training for mcmc. In Advances in Neural Information Processing Systems, pages 5140–5150, 2017.
- [39] T. Tieleman and G. Hinton. Lecture 6.5-RMSPROP: Divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning, 4, 2012.
- [40] M. K. Titsias and M. Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, 2014.

Supplement:

Gradient-based Adaptive Markov Chain Monte Carlo

## Appendix A Gradient-based adaptive MCMC as Reinforcement Learning

A MDP is a tuple $(\mathcal{X},\mathcal{Y},\mathcal{P},\mathcal{R})$ where $\mathcal{X}$ is the state space, $\mathcal{Y}$ is the action space, $\mathcal{P}$ is the transition distribution with density $p({x}_{t+1}|{x}_{t},{y}_{t})$ that describes how the next state ${x}_{t+1}$ is generated given that currently we are at state ${x}_{t}$ and we take action ${y}_{t}$. Further, the reward function $R({x}_{t},{y}_{t})$ provides some instantaneous or local signal about how good the action ${y}_{t}$ was when being at ${x}_{t}$. Furthermore, in a MDP we have also a policy $\pi ({y}_{t}|{x}_{t})$ which is a distribution over actions given states and it fully describes the behaviour of the agent. Given that we start at ${x}_{0}$ we wish to specify the policy so that to maximise future reward, such as the expected accumulated discounted reward

$${\mathbb{E}}_{\pi}\left[\sum _{t=0}^{\mathrm{\infty}}{\gamma}^{t}R({x}_{t},{y}_{t})\right],\gamma \in [0,1].$$ |

Suppose now a MCMC procedure targeting $\pi (x)$, where $x\in \mathcal{X}$ is the state vector. Consider a proposal distribution ${q}_{\theta}(y|x)$, such that the standard Metropolis-Hastings algorithm accepts each proposed state ${y}_{t}\sim {q}_{\theta}({y}_{t}|{x}_{t})$ with probability

$$\alpha ({x}_{t},{y}_{t};\theta )=\text{min}(1,\frac{\pi ({y}_{t})}{\pi ({x}_{t})}\frac{{q}_{\theta}({x}_{t}|{y}_{t})}{{q}_{\theta}({y}_{t}|{x}_{t})}),$$ | (8) |

so that ${x}_{t+1}={y}_{t}$, while if the proposal is rejected, ${x}_{t+1}={x}_{t}$. To reformulate MCMC as an MDP we make the following correspondences. Firstly, both the state ${x}_{t}$ and the action ${y}_{t}$ will live in the same space which will be the state space $\mathcal{X}$ of the target distribution. The MCMC proposal ${q}_{\theta}({y}_{t}|{x}_{t})$ will correspond to the policy $\pi ({y}_{t}|{x}_{t})$, while the environmental transition dynamics will be stochastic and given by the two-component mixture,

$$p({x}_{t+1}|{x}_{t},{y}_{t})=\alpha ({x}_{t},{y}_{t};\theta ){\delta}_{{x}_{t+1},{y}_{t}}+(1-\alpha ({x}_{t},{y}_{t};\theta )){\delta}_{{x}_{t+1},{x}_{t}},$$ |

where ${\delta}_{x,y}$ denotes the delta function. This transition density simply says that the new state ${x}_{t+1}$ with probability $\alpha ({x}_{t},{y}_{t};\theta )$ will be equal to the proposed action ${y}_{t}$, while with the remaining probability will be set to the previous state, i.e. ${x}_{t+1}={x}_{t}$. Notice that the standard MCMC transition kernel ${K}_{\theta}({x}_{t},{x}_{t+1})$ is obtained by integrating out the action ${y}_{t}$, i.e.

${K}_{\theta}({x}_{t},{x}_{t+1})$ | $={\displaystyle \int p({x}_{t+1}|{x}_{t},{y}_{t}){q}_{\theta}({y}_{t}|{x}_{t})\mathit{d}{y}_{t}}$ | |||

$=\alpha ({x}_{t},{x}_{t+1})q({x}_{t+1}|{x}_{t})+\left(1-{\displaystyle \int \alpha ({x}_{t},{y}_{t}){q}_{\theta}({y}_{t}|{x}_{t})\mathit{d}{y}_{t}}\right){\delta}_{{x}_{t+1},{x}_{t}}.$ | (9) |

The final ingredient we need to reformulate MCMC as MDP is the reward function $R({x}_{t},{y}_{t})$. The gradient-based adaptive MCMC method essentially assumes as reward

$$R({y}_{t},{x}_{t};\theta )=\mathrm{log}\alpha ({x}_{t},{y}_{t};\theta )-\beta \mathrm{log}{q}_{\theta}({y}_{t}|{x}_{t}),$$ |

which is an entropy-regularised reward that promotes high exploration with the entropic term $-\beta \mathrm{log}{q}_{\theta}({y}_{t}|{x}_{t})$. Gradient-based adaptive MCMC essentially at each step stochastically maximises the expected reward starting from state ${x}_{t}$, i.e.

$$\int {q}_{\theta}({y}_{t}|{x}_{t})R({y}_{t},{x}_{t};\theta )\mathit{d}{y}_{t}.$$ |

While the above reformulates MCMC as a reinforcement learning (RL) problem, there are clearly also some differences with standard RL problems. Given that the reward $R({y}_{t},{x}_{t};\theta )$ is very informative (we are not facing the delayed-reward problem commonly encountered in standard RL) gradient-based MCMC sets $\gamma =0$ in order to maximise immediate reward. Further, the transition dynamics $p({x}_{t+1}|{x}_{t},{y}_{t})$ are known in MCMC, while this typically is not the case in standard RL. Finally, notice that the reward $R({y}_{t},{x}_{t};\theta )$ as well as the transition dynamics $p({x}_{t+1}|{x}_{t},{y}_{t})$ all depend on the parameter $\theta $ and the policy ${q}_{\theta}({y}_{t}|{x}_{t})$, i.e. they depend on the MCMC proposal distribution.

## Appendix B Further details about the algorithms

For the standard adaptive MCMC method (AM) we implemented a computational efficient version that requires no matrix decompositions (which are expensive due to the $O({n}^{3})$ scaling) by parametrising the proposal as $\mathcal{N}(y|x,L{L}^{\top})$ and updating the Cholesky factor in each iteration according to the updates

$$\mu \leftarrow \mu +{\rho}_{t}({x}_{t+1}-\mu ),$$ |

$$L\leftarrow L+{\rho}_{t}L{\left[{L}^{-1}({x}_{t+1}-\mu ){({x}_{t+1}-\mu )}^{\top}{L}^{-\top}-I\right]}_{lower},$$ |

where $\mu $ tracks the global mean of the state vector. Further details about this scheme can be found in Section 5.1.1 in [3].

For our most efficient gadMALAf scheme the stochastic gradient in each iteration is

$\nabla {\mathcal{F}}_{L}({x}_{t},{\u03f5}_{t})$ | $={\nabla}_{L}\text{min}\{0,\mathrm{log}\pi ({x}_{t}+(1/2)L{L}^{\top}\nabla \mathrm{log}\pi ({x}_{t})+L{\u03f5}_{t})-\mathrm{log}\pi ({x}_{t}).$ | ||

$.$ | $-{\displaystyle \frac{1}{2}}(||(1/2){L}^{\top}[\nabla \mathrm{log}\pi ({x}_{t})+\nabla \mathrm{log}\pi ({y}_{t})]+{\u03f5}_{t}|{|}^{2}-||{\u03f5}_{t}|{|}^{2})\}+\beta \nabla {}_{L}{\displaystyle \sum}{}_{i=1}{}^{n}\mathrm{log}L{}_{ii},$ |

where, as discussed in the main paper, $\nabla \mathrm{log}\pi ({y}_{t})$ is taken as constant w.r.t. $L$. Then the gradient of the M-H log ratio (when this log ratio is negative, since otherwise its gradient is zero) simplifies as

${\nabla}_{L}\mathrm{log}\pi \left({x}_{t}+(1/2)L{L}^{\top}\nabla \mathrm{log}\pi ({x}_{t})+L{\u03f5}_{t}\right)-{\displaystyle \frac{1}{2}}{\nabla}_{L}{||(1/2){L}^{\top}[\nabla \mathrm{log}\pi ({x}_{t})+\nabla \mathrm{log}\pi ({y}_{t})]+{\u03f5}_{t}||}^{2}$ | ||

$=-{\displaystyle \frac{1}{2}}\left(\nabla \mathrm{log}\pi ({x}_{t})-\nabla \mathrm{log}\pi ({y}_{t})\right){\left((1/2){L}^{\top}[\nabla \mathrm{log}\pi ({x}_{t})-\nabla \mathrm{log}\pi ({y}_{t})]+{\u03f5}_{t}\right)}^{\top}$ |

and then take the lower triangular part. This is just an outer vector product that scales as $O({n}^{2})$. Overall each iteration of the algorithm can be implemented (plus the extra overhead of a single gradient evaluation $\nabla \mathrm{log}\pi ({y}_{t})$ of the log target at the proposed state ${y}_{t}$) by using at most four $O({n}^{2})$ operations during adaptation and exactly two $O({n}^{2})$ operations after burn-in, as shown in the released code.

## Appendix C Extra results on the Neal’s Gaussian target

Figure 3 shows trace plot for the log density of the target for all different algorithms.

## Appendix D Extra results on the binary classification datasets

Tables 2-6 show the results for the remaining five binary classification datasets not reported in the main article. Figures 4 and 5 show trace plot of log density of the target acrorss all different algorithms for the German Credit and Caravan datasets. For the remaining datasets the corresponding plots are similar.

Method | Time(s) | Accept Rate | ESS (Min, Med, Max) | Min ESS/s (1 st.d.) |
---|---|---|---|---|

gadMALAf | 8.1 | 0.569 | (3485.9, 4262.9, 4784.0) | 443.97 (76.13) |

gadMALAe | 13.5 | 0.540 | (3034.9, 4234.3, 4836.3) | 227.61 (29.87) |

gadRWM | 7.7 | 0.253 | (288.0, 423.0, 515.0) | 38.68 (9.53) |

AM | 4.4 | 0.261 | (310.9, 410.1, 507.2) | 70.21 (6.23) |

RWM | 3.4 | 0.252 | (31.3, 312.6, 495.2) | 9.16 (3.12) |

MALA | 7.0 | 0.524 | (138.4, 2388.1, 3818.8) | 20.22 (5.14) |

HMC-5 | 37.0 | 0.700 | (1048.1, 3510.3, 14809.7) | 28.06 (11.69) |

NUTS | 41.3 | >0.7 | (2995.2, 20000.0, 20000.0) | 72.86 (7.31) |

Method | Time(s) | Accept Rate | ESS (Min, Med, Max) | Min ESS/s (1 st.d.) |
---|---|---|---|---|

gadMALAf | 3.3 | 0.536 | (8328.4, 8913.2, 9442.4) | 2506.04 (143.47) |

gadMALAe | 4.9 | 0.543 | (8446.7, 9006.6, 9595.6) | 1713.44 (44.91) |

gadRWM | 3.1 | 0.068 | (638.0, 736.9, 803.2) | 205.99 (17.85) |

AM | 3.0 | 0.257 | (1702.8, 1792.2, 1902.0) | 570.19 (49.32) |

RWM | 2.1 | 0.252 | (1129.2, 1627.8, 1979.8) | 534.21 (43.54) |

MALA | 2.8 | 0.542 | (2976.0, 5683.0, 9726.5) | 1046.48 (54.86) |

HMC-5 | 14.7 | 0.678 | (9205.3, 10818.1, 16136.5) | 626.55 (196.48) |

NUTS | 7.5 | >0.7 | (9436.2, 17463.5, 20000.0) | 1265.99 (73.01) |

Method | Time(s) | Accept Rate | ESS (Min, Med, Max) | Min ESS/s (1 st.d.) |
---|---|---|---|---|

gadMALAf | 4.6 | 0.545 | (5407.6, 5810.3, 6467.6) | 1176.12 (79.54) |

gadMALAe | 6.8 | 0.547 | (5469.6, 5963.6, 6421.1) | 801.03 (16.07) |

gadRWM | 4.2 | 0.267 | (635.6, 760.0, 866.2) | 150.70 (9.73) |

AM | 4.1 | 0.273 | (612.7, 729.1, 854.8) | 149.18 (10.40) |

RWM | 3.2 | 0.246 | (354.6, 496.4, 709.6) | 111.81 (6.16) |

MALA | 4.0 | 0.509 | (1524.9, 2457.2, 3853.6) | 377.17 (25.80) |

HMC-5 | 20.3 | 0.711 | (7295.7, 12798.7, 18267.4) | 359.22 (103.55) |

NUTS | 15.2 | >0.7 | (15343.3, 18606.0, 20000.0) | 1008.97 (42.33) |

Method | Time(s) | Accept Rate | ESS (Min, Med, Max) | Min ESS/s (1 st.d.) |
---|---|---|---|---|

gadMALAf | 4.1 | 0.551 | (3892.9, 4362.7, 4784.2) | 946.98 (56.10) |

gadMALAe | 6.4 | 0.560 | (3832.4, 4372.3, 4845.6) | 599.51 (30.00) |

gadRWM | 3.8 | 0.288 | (342.5, 440.9, 536.1) | 88.94 (10.29) |

AM | 3.2 | 0.238 | (342.5, 425.5, 535.4) | 106.97 (7.18) |

RWM | 2.3 | 0.266 | (196.9, 314.3, 472.7) | 86.57 (11.33) |

MALA | 3.5 | 0.530 | (1429.7, 2310.6, 3260.4) | 404.96 (18.57) |

HMC-5 | 18.4 | 0.699 | (1913.2, 5600.3, 11883.0) | 103.81 (39.38) |

NUTS | 15.4 | >0.7 | (20000.0, 20000.0, 20000.0) | 1295.13 (15.74) |

Method | Time(s) | Accept Rate | ESS (Min, Med, Max) | Min ESS/s (1 st.d.) |
---|---|---|---|---|

gadMALAf | 11.0 | 0.560 | (2734.9, 3414.5, 3928.6) | 252.91 (37.86) |

gadMALAe | 22.4 | 0.549 | (2808.2, 3384.9, 3883.5) | 126.00 (14.68) |

gadRWM | 10.4 | 0.248 | (179.1, 252.5, 323.1) | 17.92 (4.18) |

AM | 12.6 | 0.262 | (121.9, 207.6, 308.0) | 9.72 (1.25) |

RWM | 8.4 | 0.233 | (45.0, 153.8, 298.7) | 5.48 (1.83) |

MALA | 9.2 | 0.535 | (420.1, 1313.2, 2573.7) | 47.37 (10.22) |

HMC-5 | 43.4 | 0.706 | (3020.2, 10294.4, 20000.0) | 71.62 (49.62) |

NUTS | 47.4 | >0.7 | (7737.2, 20000.0, 20000.0) | 166.93 (30.57) |

## Appendix E Results on a higher dimensional example

We all tried a much larger Bayesian binary classification problem by taking all $11339$ training examples of "5" and "6" MNIST digits which are $28\times 28$ images and therefore the dimensionality of the parameter vector $w$ was $785$ (the plus one accounts for the bias term). For this larger example from the baselines we applied the gradient-based schemes, MALA, HMC and NUTS since the other methods become very inefficient. From the proposed schemes we applied the most efficient algorithm which is gadMALAf. Also because of the much higher dimensionality of this problem, which makes the stochastic optimisation over $L$ harder, we had to decrease the baseline learning rate in the RMSprop schedule from $0.00015$ to $0.00001$. We also considered a larger adaptation phase consisted of $5\times {10}^{4}$ instead of $2\times {10}^{4}$. All other algorithms use the same experimental settings as described in the main paper. Figure 7 shows the evolution of the log-target densities for all sampling schemes while Table 7 gives ESS, computation times and other statistics.

We can observe that the performance of gadMALAf is reasonably good: it outperforms all methods apart form NUTS. NUTS is better in this example, but it takes around 22 hours to run (since it performs on average 550 gradient evaluations per sampling iteration). Finally, to visualise some part of the learned $L$ found by gadMALAf, Figure 7 depicts the $784$ diagonal elements of $L$ as an $28\times 28$ grey-scale image. Clearly, gadMALAf manages to perform a sort of feature selection, i.e. to discover that the border pixels in MNIST digits do not really take part in the classification, so it learns a much higher variance for those dimensions (close to the variance of the prior).

Method | Time(s) | Accept Rate | ESS (Min, Med, Max) | Min ESS/s (1 st.d.) |
---|---|---|---|---|

gadMALAf | 779.3 | 0.575 | (46.0, 128.7, 282.8) | 0.059 (0.00) |

MALA | 311.8 | 0.530 | (2.8, 5.9, 28.4) | 0.009 (0.00) |

HMC-5 | 1847.4 | 0.733 | (4.5, 23.1, 162.7) | 0.002 (0.00) |

HMC-10 | 3381.3 | 0.589 | (13.9, 66.5, 576.0) | 0.004 (0.00) |

HMC-20 | 6449.1 | 0.666 | (77.8, 240.1, 2060.9) | 0.012 (0.00) |

NUTS | 83232.1 | >0.7 | (18514.1, 20000.0, 20000.0) | 0.223 (0.01) |