Abstract
We develop amortized population Gibbs (APG) samplers, a new class ofautoencoding variational methods for deep probabilistic models. APG samplersconstruct highdimensional proposals by iterating over updates tolowerdimensional blocks of variables. Each conditional update is a neuralproposal, which we train by minimizing the inclusive KL divergence relative tothe conditional posterior. To appropriately account for the size of the inputdata, we develop a new parameterization in terms of neural sufficientstatistics, resulting in quasiconjugate variational approximations.Experiments demonstrate that learned proposals converge to the known analyticalconditional posterior in conjugate models, and that APG samplers can learninference networks for highlystructured deep generative models when theconditional posteriors are intractable. Here APG samplers offer a path towardscaling up stochastic variational methods to models in which standardautoencoding architectures fail to produce accurate samples.
Quick Read (beta)
Abstract
We develop amortized population Gibbs (APG) samplers, a new class of autoencoding variational methods for deep probabilistic models. APG samplers construct highdimensional proposals by iterating over updates to lowerdimensional blocks of variables. Each conditional update is a neural proposal, which we train by minimizing the inclusive KL divergence relative to the conditional posterior. To appropriately account for the size of the input data, we develop a new parameterization in terms of neural sufficient statistics, resulting in quasiconjugate variational approximations. Experiments demonstrate that learned proposals converge to the known analytical conditional posterior in conjugate models, and that APG samplers can learn inference networks for highlystructured deep generative models when the conditional posteriors are intractable. Here APG samplers offer a path toward scaling up stochastic variational methods to models in which standard autoencoding architectures fail to produce accurate samples.
References
assert \algnewcommand\Assert[1]\State \algorithmicassert(#1) \algnewcommand\algorithmicswitchswitch \algnewcommand\algorithmiccasecase \algnewcommand\algorithmicdefaultdefault \algdefSE[SWITCH]SwitchEndSwitch[1]\algorithmicswitch #1 \algorithmicdo\algorithmicend \algorithmicswitch \algdefSE[CASE]CaseEndCase[1]\algorithmiccase #1\algorithmicend \algorithmiccase \algdefSE[DEFAULT]DefaultEndDefault[1]\algorithmicdefault #1\algorithmicend \algorithmicdefault \algtext*EndSwitch \algtext*EndCase \algtext*EndDefault \algnewcommand\algorithmicmatchmatch \algdefSE[MATCH]MatchEndMatch[1]\algorithmicmatch #1\algorithmicend \algorithmicmatch \algtext*EndMatch \algnewcommand\algorithmictrytry \algnewcommand\algorithmiccatchcatch \algblockdefx[Try]TryEndTry\algorithmictry\algorithmicend \algorithmictry \algtext*EndTry \algcblockdefx[Catch]TryCatchEndTry[1]\algorithmiccatch #1\algorithmicend \algorithmictry\algtext*EndTry \algdefSE[QUERY]QueryEndQuery[2]query \textproc#1 \glsdisablehyper \newacronymSCFMscfmstochastic controlflow model \newacronymWSwswakesleep \newacronymBWSbwsbasic wakesleep \newacronymRWSrwsreweighted wakesleep \newacronymELBOelboevidence lower bound \newacronymVAEvaevariational autoencoder \newacronymIWAEiwaeimportance weighted autoencoder \newacronymKLklKullbackLeibler \newacronymSGDsgdstochastic gradient descent \newacronymVIMCOvimcovariational inference for Monte Carlo objectives \newacronymWWwwwakewake \newacronymWWSwwswakewakesleep \newacronymAIRairAttend, Infer, Repeat \newacronymESSesseffective sample size \newacronymREINFORCEreinforceReinforce gradient estimator \newacronymISisimportance sampling \newacronymGMMgmmGaussian mixture model \newacronymMNISTmnisthandwritten digit dataset \newacronymRELAXrelaxRELAX gradient estimator \newacronymREBARrebarREBAR gradient estimator \newacronymPMFpmfprobability mass function \newacronymMLPmlpmultilayer perceptron \newacronymRNNrnnrecurrent neural network \newacronymPCFGpcfgprobabilistic context free grammar \newacronymADAMadamADAM \glsunsetADAM
Amortized Population Gibbs Samplers with Neural Sufficient Statistics
Hao Wu Northeastern University [email protected] Heiko Zimmermann Northeastern University [email protected] Eli Sennesh Northeastern University [email protected]
Tuan Anh Le Massachusetts Institute of Technology [email protected] JanWillem van de Meent Northeastern University [email protected]
1 Introduction
Deep probabilistic programming libraries such as Edward [tran2016edward], Pyro [bingham2018pyro], and Probabilistic Torch [siddharth2017learning] extend deep learning frameworks with functionality for deep probabilistic models which combine a generative model with an inference model that approximates the Bayesian posterior. Both models are parameterized using neural networks, which are trained using stochastic gradient descent by optimize a lower or upper bound on the log marginal likelihood. Training an inference network to perform amortized inference can be equivalently understood as a form of variational inference or adaptive importance sampling.
At present, deep probabilistic models most commonly have the form of standard variational autoencoders (VAEs) [kingma2013autoencoding, rezende2014stochastic]. In these architectures, the generative model combines an unstructured prior (e.g. a spherical Gaussian) with a likelihood that is parameterized by an expressive neural network, often referred to as a decoder. The inference network, known as an encoder, maps input data (e.g. an image or sentence) onto an embedding vector, also known as the latent code.
Deep probabilistic programming aims to enable more general designs that incorporate structured priors for tasks such as multiple object detection [eslami2016attend], language modeling [esmaeili2019structured], or object tracking [kosiorek2018sequential]. In these domains, a prior can incorporate useful inductive biases, such as the requirement that object trajectories are smooth. These biases in turn can help guide a model to uncover patterns in the data in an unsupervised manner, and aid generalization in complex domains where the training data may not contain exemplars for all possible combinations of latent features.
However, training structured models also poses challenges that are not encountered in unstructured problems. To optimize a lower or an upper bound, we need to approximate the gradient of an expectation with a Monte Carlo estimate (see [mohamed2019monte] for a recent review). Standard VAEs rely on reparameterized estimators that can often approximate the gradient with a single sample. Unfortunately, these estimators can have a high variance in models where latent variables are highdimensional and/or strongly correlated. Owing to these limitations, models that are trained using standard VAE objectives often consider relatively smallscale problems, such as tracking $\le 2$ objects over the course of 10 frames [kosiorek2018sequential], or assigning $\le 10$ sentences in a review to distinct aspects [esmaeili2019structured].
In this paper, we develop methods for amortized inference that are designed to scale to structured models with 100s of latent variables. We are particularly interested in the frequently arising cases of models that are characterized by a combinations of local variables, such as the timedependent position of an object, and global variables, such as the shape of the object. In this type of model, it is often the case that knowledge of the local variables can help us make predictions about global variables and vice versa; If we know the shape of an object, then it should be easier to identify its location in an image. Conversely, if we know the position of an object in each frame, then we can more readily infer its shape.
The methods that we develop in this paper are similar in spirit to work by Johnson et al. [johnson2016composing], who developed methods for conjugateexponential models with a neural likelihood. In this setting, we can perform inference using variational expectation maximization (EM) algorithms [beal2003variational, bishop2006pattern, wainwright2008graphical] that exploit conjugacy and conditional independence to derive closedform updates to blocks of variables. The advantage of these approaches is that they are highly computationally efficient; variational EM can often converge in a small number of iterations and easily scales to much larger number of variables. Unfortunately, variational EM is also modelspecific, difficult to implement, and only applicable to a restricted class of conjugateexponential models.
To overcome the limitations imposed by conjugateexponential family models, we here develop a more general approach. Rather than requiring exact EM updates we develop an importance sampling method that employs conditional proposals to iterate between updates to blocks of variables. To train these proposals, we define a a variational method that minimizes the inclusive KL divergence between the proposal update and the exact conditional posterior. We refer to the resulting class of methods as amortized Gibbs samplers, since the proposals approximate Gibbs updates.
The variational objective that we derive is not computable, since the exact Gibbs updates are in general intractable. However, we can nonetheless derive a Monte Carlo estimator for its gradient. Building on a recent body of work that employs importance samplers to train variational distributions [burda2016importance, le2018autoencoding, maddison2017filtering, naesseth2018variational], we develop a sequential Monte Carlo sampler [delmoral2006sequential] that combines approximate Gibbs updates with resampling steps in order to construct high quality proposals, which serve both to compute gradient estimates at train time and to perform inference at test time. We demonstrate correctness of the proposed sampler by proving that samples are properly weighted relative to the generative model [naesseth2015nested].
One of the challenges in designing networks that parameterize conditional proposals is network outputs need to appropriately account for the amount of data on which we are conditioning; The conditional posterior on the mean for a cluster with a large number of points is more tightly peaked than that of a cluster with a small number of points. To address this difficulty, we propose a class of networks that we refer to as neural sufficient statistics, which define parameterizations of proposals in a manner that is additive in the local variables, much like the sufficient statistics in conjugateexponential families.
Our experiments show that learned proposals converge to the true conditional posteriors in Gaussian mixture models, where the Gibbs updates can be computed in closed form. Moreover we establish that amortized Gibbs methods serve can a basis for scalable inference in structured deep generative models, including mixtures with neural likelihoods and unsupervised tracking models. Both of these tasks are representative of the current stateofthe art in unsupervised approaches for learning structured deep generative models.
2 Amortized Population Gibbs Samplers
We are interested in the task of jointly training a generative model ${p}_{\theta}(x,z)$ by maximizing its marginal likelihood ${p}_{\theta}(x)$ and learning an inference model ${q}_{\varphi}(z\mid x)$ that approximates the posterior ${p}_{\theta}(z\mid x)$. Like most amortized inference approaches, we assume that we can sample from a (possibly implicit) distribution $\widehat{p}(x)$ that either takes the form of an empirical distribution over training data or a data simulator.
As a means of generating highquality samples in an incremental manner, we develop methods that are inspired by expectation maximization and classic Gibbs sampling strategies, which perform iterative updates to blocks of variables. Concretely, we will assume that the latent variables in the generative model decompose into blocks $z=\{{z}_{1},\mathrm{\dots},{z}_{B}\}$ and train proposals $\mathrm{log}{q}_{\varphi}({z}_{b}\mid x,{z}_{b})$ that update the variables in a each block ${z}_{b}$ conditioned on the variables in the remaining blocks ${z}_{b}=z\setminus \{{z}_{b}\}$.
Starting with an initial sample ${q}_{\varphi}({z}^{1}\mid x)$ from a standard encoder we will generate a sequence of samples $\{{z}^{1},\mathrm{\dots},{z}^{K}\}$ by performing conditional updates to each block ${z}_{b}$, which we refer to as a sweep
${q}_{\varphi}\left({z}^{k}\mid x,{z}^{k1}\right)$  $={\displaystyle \prod _{b=1}^{B}}{q}_{\varphi}({z}_{b}^{k}\mid x,{z}_{\prec b}^{k},{z}_{\succ b}^{k1}),$  (1) 
where $$ and ${z}_{\succ b}=\{{z}_{i}\mid i>b\}$. Repeatedly applying sweep updates then yields a proposal
$${q}_{\varphi}({z}^{1},\mathrm{\dots},{z}^{K}\mid x)={q}_{\varphi}({z}^{1}\mid x)\prod _{k=2}^{K}{q}_{\varphi}({z}^{k}\mid x,{z}^{k1}).$$ 
We want to train proposals that improve the quality of each sample ${z}^{k}$ relative to that of the preceding sample ${z}^{k1}$. There are two possible strategies for accomplishing this. One strategy is to define an objective that minimizes the discrepancy between the marginal ${q}_{\varphi}({z}^{K}\mid x)$ for the final sample and the posterior ${p}_{\theta}({z}^{K}\mid x)$. This corresponds to learning a sweep update ${q}_{\varphi}({z}^{k}\mid x,{z}^{k1})$ that transforms the initial proposal to the posterior in exactly $K$ sweeps. An example of this type of approach, albeit one that does not employ block updates, is the recent work on annealing variational objectives [huang2018improving].
In this paper, we will pursue a different approach. Instead of transforming the initial proposal in exactly $K$ steps, we learn a sweep update that leaves the target density invariant
$${p}_{\theta}({z}^{k}x)=\int {p}_{\theta}({z}^{k1}x){q}_{\varphi}({z}^{k}x,{z}^{k1})\mathit{d}{z}^{k1}.$$  (2) 
When this condition is met, the proposal ${q}_{\varphi}({z}^{1},\mathrm{\dots}{z}^{K}x)$ is a Markov Chain whose stationary distribution is the posterior. This means a sweep update learned at training time can be applied at test time to iteratively improve sample quality, without requiring a prespecified number of updates $K$.
In addition, when we require that each single block update ${q}_{\varphi}({z}_{b}^{\prime}\mid x,{z}_{b})$ also leaves the target density invariant,
${p}_{\theta}({z}_{b}^{\prime},{z}_{b}\mid x)$  $={\displaystyle \int}{p}_{\theta}({z}_{b},{z}_{b}\mid x){q}_{\varphi}({z}_{b}^{\prime},\mid x,{z}_{b})d{z}_{b},$  (3)  
$={p}_{\theta}({z}_{b}\mid x){q}_{\varphi}({z}_{b}^{\prime}\mid x,{z}_{b}),$ 
Then we see that a block update must equal the exact conditional posterior, ${q}_{\varphi}({z}_{b}^{\prime}\mid x,{z}_{b})={p}_{\theta}({z}_{b}^{\prime}\mid x,{z}_{b})$. In other words, when the condition in Equation 3 is met, the proposal ${q}_{\varphi}({z}^{1},\mathrm{\dots}{z}^{K}x)$ is a Gibbs sampler.
2.1 Variational Objective
To learn each of the block proposals ${q}_{\varphi}({z}_{b}\mid x,{z}_{b})$ we will minimize the inclusive KL divergence ${\mathcal{K}}_{b}(\varphi )$
${\mathbb{E}}_{\widehat{p}(x){p}_{\theta}({z}_{b}x)}\left[\text{KL}({p}_{\theta}({z}_{b}\mid x,{z}_{b}){q}_{\varphi}({z}_{b}\mid x,{z}_{b}))\right].$  (4) 
Unfortunately, this objective is intractable, since we are not able to evaluate the density of the true marginal ${p}_{\theta}({z}_{b}\mid x)$, nor that of the conditional ${p}_{\theta}({z}_{b}\mid {z}_{b},x)$. As we will discuss in Section 5, this has implications for the evaluation of learned proposals, since we cannot compute a lower or upper bound on the log marginal likelihood as in other variational methods. However, it nonetheless possible to approximate the gradient of the objective
${\nabla}_{\varphi}{\mathcal{K}}_{b}(\varphi )$  $={\mathbb{E}}_{\widehat{p}(x){p}_{\theta}({z}_{b},{z}_{b}x)}\left[{\nabla}_{\varphi}\mathrm{log}{q}_{\varphi}({z}_{b}x,{z}_{b})\right].$ 
We can estimate this gradient using any Monte Carlo method that generates samples $z\sim {p}_{\theta}(z\mid x)$ from the posterior. In the next section, we will use the learned proposals to define an importance sampler, which we then use to compute an selfnormalized estimator of the gradient from weighted samples ${\{({w}^{l},{z}^{l})\}}_{l=1}^{L}$,
${\nabla}_{\varphi}{\mathcal{K}}_{b}(\varphi )\simeq {\displaystyle \sum _{l=1}^{L}}{\displaystyle \frac{{w}^{l}}{{\sum}_{{l}^{\prime}}{w}^{{l}^{\prime}}}}{\nabla}_{\varphi}\mathrm{log}{q}_{\varphi}({z}_{b}^{l}x,{z}_{b}^{l}).$  (5) 
In problems where we would like to learn a deep generative model ${p}_{\theta}(x,z)$, we can apply a similar selfnormalized gradient estimator of the form
${\nabla}_{\theta}\mathrm{log}{p}_{\theta}(x)$  $={\mathbb{E}}_{{p}_{\theta}(zx)}\left[{\nabla}_{\theta}\mathrm{log}{p}_{\theta}(x,z)\right]$  (6)  
$\simeq {\displaystyle \sum _{l=1}^{L}}{\displaystyle \frac{{w}^{l}}{{\sum}_{{l}^{\prime}}{w}^{{l}^{\prime}}}}{\nabla}_{\theta}\mathrm{log}{p}_{\theta}(x,{z}^{l}).$ 
This identity holds due to the standard property ${\mathbb{E}}_{{p}_{\theta}(zx)}[{\nabla}_{\theta}\mathrm{log}{p}_{\theta}(zx)]=0$ (see Appendix A for details).
The estimator in Equation 5 is similar to the selfnormalized estimator in reweighted wakesleep methods [bornschein2014reweighted], which also minimizes an inclusive KL divergence. This estimator has a number of advantages over the estimator that is commonly used to train standard VAEs, which minimize an exclusive KL divergence [le2019revisiting]. Standard VAE objectives rely on reparameterization to compute gradient estimates. For discrete variables, reparameterization is not possible. This means that we need to compute likelihoodratio estimators (also known as REINFORCEstyle estimators [williams1992simple]), which can have very high variance. A range of approaches for variance reduction have been put forward, including continuous relaxations that are amenable to reparameterization [maddison2017concrete, jang2017categorical], credit assignment techniques (see [weber2019credit] for a review), and other control variates [mnih2016variational, tucker2017rebar, grathwohl2018backpropagation].
The estimator in Equation 5 sidesteps the need for these variance reduction techniques. To compute this gradient, we only require that the proposal density is differentiable, whereas reparameterized estimators require that the sample itself is differentiable. This is a milder condition, that holds for most distributions of interest, including those over discrete variables. Moreover, since this estimator minimizes the inclusive KL divergence, and not the exclusive KL divergence, there is smaller risk of learning a proposal that collapses to a single mode of a multimodal posterior [le2019revisiting].
2.2 Generating High Quality Samples
Approximating the gradient presents a chickenandegg problem; we need samples from the posterior to compute a Monte Carlo estimate of the gradient, but generating these samples is precisely what we are hoping to use learned proposals for in the first place. Moreover, selfnormalized importance samplers are consistent, but they are not unbiased. In the early stages of training, we will have poor quality proposals, which means that the bias of the gradient estimators in Equations 5 and 6 can be high.
Standard reweighted wakesleep methods generate proposals from an encoder $z\sim {q}_{\varphi}(z\mid x)$ and compute weights $w={p}_{\theta}(x,z)/{q}_{\varphi}(z\mid x)$. A wellknown limitation of this type of naive importance sampling strategy is that the computed weights will have a very high variance in models with highdimensional and/or correlated latent variables, which in turn implies a high bias of the estimator. There is a very broad class of importance sampling strategies that can be employed to reduce the variance of importance weights. If we replace the naive importance sampler in reweighted wakesleep with a more sophisticated sampling strategy, then this both improves the quality of gradient estimates at training, and the quality of inference results at test time.
To improve upon standard reweighted wakesleep methods, we will we use the learned proposals to define a sequential Monte Carlo (SMC) sampler [delmoral2006sequential]. SMC methods [doucet2001sequential] combine two basic ideas. The first is sequential importance sampling, which decomposes a proposal for a sequence of variables into a sequence of conditional proposals. The second is resampling, which selects partial proposals with probability proportional to their weights in order to improve the overall sample quality. Most commonly, SMC methods are used in the context of state space models to generate proposals for a sequence of variables by proposing one variable at a time. SMC samplers (see Algorithm 2.2) are a subclass of SMC methods that interleave resampling with the application of a transition kernel, which is sometimes also referred to as resamplemove SMC.
The distinction between SMC methods for state space models and SMC samplers is subtle but important. Whereas the former generate proposals for a sequence of variables ${z}_{1:t}$ by proposing ${z}_{t}\sim q({z}_{t}\mid {z}_{1:t1})$ to extend the sample space at each iteration, SMC samplers can be understood as an importance sampling analogue to Markov chain Monte Carlo (MCMC) methods [brooks2011handbook], which construct a Markov chain ${z}^{1:k}$ by generating a proposal $q({z}^{k}\mid {z}^{k1})$ from a transition kernel at each iteration.
Sequential Importance Sampling. To understand how approximate Gibbs proposals can be used in a SMC sampler, we will first explain how they can be used to define a sequential importance sampler, which decomposes the importance weight into a sequence of incremental weights. In general, SIS considers a sequence of unnormalized target densities ${\gamma}^{1}({z}^{1}),{\gamma}^{2}({z}^{1:2}),\mathrm{\dots},{\gamma}^{K}({z}^{1:K})$. If we now consider an initial proposal ${q}^{1}({z}^{1})$, along with a sequence of conditional proposals ${q}^{k}({z}^{k}\mid {z}^{1:k1})$, then we can recursively construct a sequence of weights ${w}^{k}={v}^{k}{w}^{k1}$ by assuming ${w}^{1}={\gamma}^{1}({z}^{1})/{q}^{1}({z}^{1})$ and defining the incremental weight
${v}^{k}$  $={\displaystyle \frac{{\gamma}^{k}({z}^{1:k})}{{\gamma}^{k1}({z}^{1:k1}){q}^{k}({z}^{k}\mid {z}^{1:k1})}}.$ 
This construction ensures that, at step $k$ in the sequence, we have a weight ${w}^{k}$ relative to the intermediate density ${\gamma}^{k}({z}^{1:k})$ of the form (see Appendix B)
${w}^{k}={\displaystyle \frac{{\gamma}^{k}({z}^{1:k})}{{q}^{1}({z}^{1}){\prod}_{{k}^{\prime}=2}^{k}{q}^{{k}^{\prime}}({z}^{{k}^{\prime}}\mid {z}^{1:{k}^{\prime}1})}}.$ 
[!t] \setstretch1.2 \[email protected]@algorithmic[1] \For$l=1$ to $L$ \State${z}^{1,l}\sim {q}^{1}(\cdot )$\CommentPropose \State${w}^{1,l}=\frac{{\gamma}^{1}({z}^{1,l})}{{q}^{1}({z}^{1,l})}$\CommentWeigh \EndFor\For$k=2$ to $K$ \State${z}^{k1,1:L},{w}^{k1,1:L}=\mathrm{\text{resample}}({z}^{k1,1:L},{w}^{k1,1:L})$
$l=1$ to $L$ \State${z}^{k,l}\sim {q}^{k}(\cdot \mid {z}^{k1,l})$\CommentPropose \State${w}^{k,l}=\frac{{\gamma}^{k}({z}^{k,l}){r}^{k1}({z}^{k1,l}\mid {z}^{k,l})}{{\gamma}^{k1}({z}^{k1,l}){q}^{k}({z}^{k,l}\mid {z}^{k1,l})}{w}^{k1,l}$\CommentWeigh \EndFor\EndFor We will now consider a specific sequence of intermediate densities that are defined using a reverse kernel $r({z}^{\prime}\mid z)$
${\gamma}^{k}({z}^{1:k})={p}_{\theta}(x,{z}^{k}){\displaystyle \prod _{{k}^{\prime}=2}^{k}}r({z}^{{k}^{\prime}1}\mid {z}^{{k}^{\prime}}).$ 
This defines a density on an extended space such that
${\gamma}^{k}({z}^{k})={\displaystyle \int {\gamma}^{k}({z}^{1:k})\mathit{d}{z}^{1:k1}}={p}_{\theta}(x,{z}^{k}).$ 
This means that at each step $k$, we can treat the preceding samples ${z}^{k1}$ as auxiliary variables; if we generate a proposal ${z}^{1:k}$ and simply disregard ${z}^{1:k1}$, then the pair $({w}^{k},{z}^{k})$ is a valid importance sample relative to ${p}_{\theta}({z}^{k}\mid x)$. If we additionally condition proposals on $x$, the incremental weight for this particular choice of target densities is
${v}^{k}={\displaystyle \frac{{p}_{\theta}(x,{z}^{k})r({z}^{k1}\mid {z}^{k})}{{p}_{\theta}(x,{z}^{k1})q({z}^{k}\mid {z}^{k1})}}.$  (7) 
This construction defines a valid importance sampler for any choice of proposal kernel $q({z}^{k}\mid {z}^{k1})$ and reverse kernel $r({z}^{k1}\mid {z}^{k})$. For a given choice of proposal, the optimal reverse kernel is
$r({z}^{k1}\mid {z}^{k})={\displaystyle \frac{{p}_{\theta}(x,{z}^{k1})}{{p}_{\theta}(x,{z}^{k})}}q({z}^{k}\mid {z}^{k1}).$ 
For this choice of kernel, the incremental weights are 1, which minimizes the variance of the weights ${w}^{k}$.
We will now use the approximate Gibbs kernel from Equation 1 as both the forward and the reverse kernel
$q({z}^{k}\mid {z}^{k1})=r({z}^{k1}\mid {z}^{k})={q}_{\varphi}({z}^{k}\mid x,{z}^{k1}).$  (8) 
When the approximate Gibbs kernel converges to the actual Gibbs kernel, this choice becomes optimal, since the kernel will satisfy detailed balance in this limit
${p}_{\theta}(x,{z}^{k}){q}_{\varphi}({z}^{k1}x,{z}^{k})={p}_{\theta}(x,{z}^{k1}){q}_{\varphi}({z}^{k}x,{z}^{k1}).$ 
Resampling. In general, the weights ${w}^{k}$ in the sequential importance sampling scheme defined above will have a high variance. The weights ${w}^{1}$ are just normal importance sampling weights, which themselves will have a high variance when $z$ is highdimensional, or there are correlations between variables. Moreover, we are now sampling these same variables $k$ times. When the approximate Gibbs kernel converges to the true kernel, this will not increase the variance of weights (since ${v}^{k}=1$ in this limit), but during training variance of weights ${w}^{k}$ will increase with $k$, since we are now jointly sampling an entire Markov chain. {algorithm}[!t] \setstretch1.2 \[email protected]@algorithmic[1] \StateInput: ${z}^{\mathrm{\hspace{0.25em}1}:L},{w}^{\mathrm{\hspace{0.25em}1}:L}$ \StateOutput: ${z}^{\prime \mathrm{\hspace{0.25em}1}:L},{w}^{\prime \mathrm{\hspace{0.25em}1}:L}$ \For$i=1$ to $L$ \State${a}^{i}\sim \mathrm{Disc}({\{{w}^{l}/{\sum}_{{l}^{\prime}=1}^{L}{w}^{{l}^{\prime}}\}}_{l=1}^{L})$\CommentIndex Selection \StateSet ${z}^{\prime i}={z}^{{a}^{i}}$ \StateSet ${w}^{\prime i}=\frac{1}{L}{\sum}_{l=1}^{L}{w}^{l}$\CommentReWeigh \EndFor\StateReturn ${z}^{\prime \mathrm{\hspace{0.25em}1}:L},{w}^{\prime \mathrm{\hspace{0.25em}1}:L}$ To overcome this problem, SMC samplers interleave application of the transition kernel with a resampling step. This step generates a new set of samples by selecting current samples with replacement, with probability proportional to their weight. Concretely, suppose that we have a set incoming samples ${\{({w}^{k,l},{z}^{k,l})\}}_{l=1}^{L}$, then the resampling procedure (see Algorithm 2.2) selects index $a$ with probability $P(a=l)={w}^{k,l}/{\sum}_{{l}^{\prime}}{w}^{k,{l}^{\prime}}$ and returns an outgoing sample ${z}^{\prime k,l}={z}^{k,a}$ whose ${w}^{\prime k,l}=\frac{1}{L}{\sum}_{l}{w}^{k,l}$ is equal to the average weight. The reduces the variance of the importance weights at the expense of also reducing the diversity of the sample set; highweight samples are selected frequently, whereas lowweight samples are selected infrequently or not at all.
When we perform resampling after each sweep, we reduce the variance of importance weights to an extent. However we will likely still have highvariance weights, since each sample from the approximate Gibbs kernel still constitutes a highdimensional proposal over all variables in the model. To further reduce the variance, we will employ resampling after each block update, rather than after each sweep. Because the incoming weights are now equal at each block update, we can compute gradient estimates using incremental weights $v$ of the form
$v={\displaystyle \frac{{p}_{\theta}(x,{z}_{b}^{\prime},{z}_{b}){q}_{\varphi}({z}_{b}\mid x,{z}_{b})}{{p}_{\theta}(x,{z}_{b},{z}_{b}){q}_{\varphi}({z}_{b}^{\prime}\mid x,{z}_{b})}}.$  (9) 
These incremental weights will have a much lower variance than the incremental weights for a full sweep, since we are now able to decompose a sampling problem for all the variables in a model into sampling problems for individual blocks. In models with many latent variables, such as the ones that we will consider in our experiments, this has the potential to greatly increase the tractability of the gradient estimation problem.
We refer to this implementation of a SMC sampler as an amortized population Gibbs (APG) sampler, and summarize all the steps of the computation in Algorithm 2.2. In Appendix D, we prove that this algorithm is correct using an argument based on proper weighting. More informally this property holds due to the fact that this sampler is a specific instance of a SMC sampler.
[!tb] \setstretch1.2 \[email protected]@algorithmic[1] \State${g}_{\varphi}=0,{g}_{\theta}=0$\CommentInitialize gradient to 0 \For$l=1$ to $L$\CommentInitial proposal \State${z}^{1,l}\sim {q}_{\varphi}(\cdot \mid x)$\CommentPropose \State${w}^{1,l}=\frac{{p}_{\theta}(x,{z}^{1,l})}{{q}_{\varphi}({z}^{1,l}\mid x)}$\CommentWeigh \EndFor\State${g}_{\varphi}={g}_{\varphi}+{\sum}_{l=1}^{L}\frac{{w}^{1,l}}{{\sum}_{{l}^{\prime}=1}^{L}{w}^{1,{l}^{\prime}}}{\nabla}_{\varphi}\mathrm{log}{q}_{\varphi}({z}^{1,l}\mid x)$ \State${g}_{\theta}={g}_{\theta}+{\sum}_{l=1}^{L}\frac{{w}^{1,l}}{{\sum}_{{l}^{\prime}=1}^{L}{w}^{1,{l}^{\prime}}}{\nabla}_{\theta}\mathrm{log}{p}_{\theta}(x,{z}^{1,l})$ \For$k=2$ to $K$\CommentGibbs sweeps \State${\stackrel{~}{z}}^{1:L},{\stackrel{~}{w}}^{1:L}={z}^{k1,1:L},{w}^{k1,1:L}$ \For$b=1$ to $B$\CommentBlock updates \State${\stackrel{~}{z}}^{1:L},{\stackrel{~}{w}}^{1:L}=\mathrm{\text{resample}}({\stackrel{~}{z}}^{1:L},{\stackrel{~}{w}}^{1:L})$ \For$l=1$ to $L$ \State${\stackrel{~}{z}}_{b}^{\prime l}\sim {q}_{\varphi}(\cdot \mid x,{\stackrel{~}{z}}_{b}^{l})$\CommentPropose \State ${\stackrel{~}{w}}^{l}=\frac{{p}_{\theta}(x,{\stackrel{~}{z}}_{b}^{\prime l},{\stackrel{~}{z}}_{b}^{l}){q}_{\varphi}({\stackrel{~}{z}}_{b}^{l}\mid x,{\stackrel{~}{z}}_{b}^{l})}{{p}_{\theta}(x,{\stackrel{~}{z}}_{b}^{l},{\stackrel{~}{z}}_{b}^{l}){q}_{\varphi}({\stackrel{~}{z}}_{b}^{\prime l}\mid x,{\stackrel{~}{z}}_{b}^{l})}{\stackrel{~}{w}}^{l}$ \CommentWeigh \State${\stackrel{~}{z}}_{b}^{l}={\stackrel{~}{z}}_{b}^{\prime l}$ \CommentReassign \EndFor\State${g}_{\varphi}={g}_{\varphi}+{\sum}_{l=1}^{L}\frac{{\stackrel{~}{w}}^{l}}{{\sum}_{{l}^{\prime}=1}^{L}{\stackrel{~}{w}}^{{l}^{\prime}}}{\nabla}_{\varphi}\mathrm{log}{q}_{\varphi}({\stackrel{~}{z}}_{b}^{l}\mid x,{\stackrel{~}{z}}_{b}^{l})$ \State${g}_{\theta}={g}_{\theta}+{\sum}_{l=1}^{L}\frac{{\stackrel{~}{w}}^{l}}{{\sum}_{{l}^{\prime}=1}^{L}{\stackrel{~}{w}}^{{l}^{\prime}}}{\nabla}_{\theta}\mathrm{log}{p}_{\theta}(x,{\stackrel{~}{z}}^{l})$ \EndFor\State${z}^{k,\mathrm{\hspace{0.25em}1}:L},{w}^{k,\mathrm{\hspace{0.25em}1}:L}={\stackrel{~}{z}}^{1:L},{\stackrel{~}{w}}^{1:L}$ \EndFor\Return${g}_{\varphi}$, ${g}_{\theta}$\CommentOutput: gradient
3 Neural Sufficient Statistics
Gibbs sampling strategies that sample from exact conditionals rely on conjugacy relationships. Typically, we assume a prior and likelihood that can both be expressed as exponential families
$p(x\mid z)$  $=h(x)\mathrm{exp}\{\eta {(z)}^{\top}T(x)\mathrm{log}A(\eta (z))\},$  
$p(z)$  $=h(z)\mathrm{exp}\{{\lambda}^{\top}T(z)\mathrm{log}A(\lambda )\}.$ 
In these densities $h(\cdot )$ is a base measure, $T(\cdot )$ is a vector of sufficient statistics, and $A(\cdot )$ is a log normalizer. The two densities are jointly conjugate when
$T(z)=(\eta (z),\mathrm{log}A(\eta (z)))$ 
In this case, the posterior distribution lies in the same exponential family as the prior
$p(z\mid x)\propto h(z)\mathrm{exp}\{$  ${({\lambda}_{1}+T(x))}^{\top}T(z)$  
$({\lambda}_{2}+1)\mathrm{log}A(\eta (z))\}.$ 
Typically, the prior $p(z\mid \lambda )$ and likelihood $p(x\mid z)$ are not jointly conjugate, but it is possible to identify conjugacy relationships at the level of individual blocks of variables,
$p({z}_{b}\mid {z}_{b},x)\propto h({z}_{b})\mathrm{exp}\{$  ${({\lambda}_{b,1}+T(x,{z}_{b}))}^{\top}T({z}_{b})$  
$({\lambda}_{b,2}+1)\mathrm{log}A(\eta ({z}_{b}))\}.$ 
In the more general setting we consider here, these conjugacy relationships will typically not hold. However, we can still take inspiration to design variational distributions that make use of conditional independencies in a model. We will assume that each of the approximate Gibbs updates ${q}_{\varphi}({z}_{b}\mid x,{z}_{b})$ is an exponential family, whose parameters are computed from a vector of prior parameters $\lambda $ and a vector of neural sufficient statistics ${T}_{\varphi}(x,{z}_{b})$
${q}_{\varphi}({z}_{b}\mid x,{z}_{b})=p({z}_{b}\mid \lambda +{T}_{\varphi}(x,{z}_{b})).$  (10) 
This parameterization has a number of desirable properties. Exponential families are the largestentropy distributions that match the moments defined by the sufficient statistics (see e.g. [wainwright2008graphical]), which is helpful when minimizing the inclusive KL divergence. In exponential families it is also more straightforward to control the entropy of the variational distribution. In particular, we can initialize ${T}_{\varphi}(x,{z}_{b})$ to output values close to zero in order to ensure that we initially propose from a prior and/or regularize ${T}_{\varphi}(x,{z}_{b})$ to help avoid local optima.
A particularly useful case arises in models where the data $x=\{{x}_{1},\mathrm{\dots},{x}_{N}\}$ are independent conditioned on $z$. In these models it is often possible to partition the latent variables $z=\{{z}^{\mathrm{\text{g}}},{z}^{\mathrm{\text{l}}}\}$ into global and local variables ${z}^{\mathrm{\text{g}}}$ and local variables ${z}^{\mathrm{\text{l}}}$. The dimensionality of global variables is typically constant, whereas local variables ${z}^{\mathrm{\text{l}}}=\{{z}_{1}^{\mathrm{\text{l}}},\mathrm{\dots},{z}_{N}^{\mathrm{\text{l}}}\}$ have a dimensionality that increases with the data $N$. For models with this structure, the local variables are typically conditionally independent ${z}_{n}^{\mathrm{\text{L}}}\perp {z}_{n}^{\mathrm{\text{L}}}\mid x,{z}^{\mathrm{\text{g}}}$, which means that we can parameterize the sufficient statistics as
${\stackrel{~}{\lambda}}^{\mathrm{\text{g}}}$  $={\lambda}^{\mathrm{\text{g}}}+{\displaystyle \sum _{n=1}^{N}}{T}_{\varphi}^{\mathrm{\text{g}}}({x}_{n},{z}_{n}^{\mathrm{\text{l}}}),$  ${\stackrel{~}{\lambda}}_{n}^{\mathrm{\text{l}}}$  $={\lambda}_{n}^{\mathrm{\text{l}}}+{T}_{\varphi}^{\mathrm{\text{l}}}({x}_{n},{z}^{\mathrm{\text{g}}}).$ 
The advantage of this parameterization is it allows us to train approximate Gibbs updates for global variables in a manner that scales dynamically with the size of the dataset, and appropriately adjusts the posterior variance according to the amount of available data.
4 Related Work
Our work fits into a line of recent methods for deep generative modeling that seek to improve inference quality, either by introducing auxiliary variables [maaloe2016auxiliary, ranganath2016hierarchical], or by performing iterative updates [marino2018iterative]. Our specific approach to learning block proposals is related to a number of methods that, in some way or other, combine transition kernels with variational inference. Work by Hoffman [hoffman2017learning] applies Hamiltonian Monte Carlo to samples that are generated from the encoder, which serves to improve the gradient estimate w.r.t. $\theta $ (Equation 6), while learning the inference network using a standard reparameterized ELBO objective. Li et al. [li2017approximate] similarly use MCMC to improve the quality of samples that are generated by an encoder, but additionally use these samples to train the encoder by minimizing the inclusive KL divergence relative to the filtering distribution of the Markov chain. As in our work, the filtering distribution after multiple MCMC steps is intractable. Li et al. therefore use an adversarial objective to minimize the inclusive KL. Neither of these two lines of work consider block decomposition of the latent variable space, nor do they learn transition kernels.
Work by Salimans et al. [salimans2015markov] uses transition kernels in variational inference. The authors use an importance weight to define (stochastic) lower bound, which is defined using a forward and reverse kernel in the same manner as in Equation 7. Huang et al. [huang2018improving] extend the work by Salimans et al. by learning a sequence of transition kernels that performs annealing from the initial encoder to the posterior. Since both these methods minimize an exclusive KL, rather than an inclusive KL, gradient estimates must be computed using reparameterization, which means that these methods are not applicable to models that contain discrete variables. Moreover, these methods perform a joint update on all variables at each iteration, and do not consider the task of learning conditional proposals as we do here.
Work by Wang et al. [wang2018meta] develops a metalearning approach to learning Gibbs block conditionals. This work assumes a setup in which it is possible to sample $x,z\sim p(x,z)$ from the true generative model $p(x,z)$, which means gradients can be estimated using sleepphase Monte Carlo estimators. This circumvents the need for selfnormalized estimators of the form in Equation 5, which are necessary when we additionally wish to learn the generative model. Like in our work, the approach by Wang et al. minimizes the inclusive KL, but uses the learned conditionals to directly define an (approximate) MCMC sampler, rather than using them as proposals in an SMC sampler. This work also has a somewhat different focus from ours, in that it primarily seeks to learn block conditionals that have the potential to generalize to previously unseen graphical models.
5 Experiments
We evaluate APG methods in 3 tasks. We begin by considering a Gaussian mixture model (GMM) as an exemplar of a model in the conjugateexponential family. Here we verify that the learned block updates converge the analytical conditional posteriors as predicted by our analysis in Section 2. We next consider a deep generative mixture model (DGMM) that incorporates a neural likelihood to parameterize ringshaped clusters. We show that we can train both the generative model and inference model in an endtoend manner using APG methods, and that inference scales to datasets containing up to 600 points. For both models we quantify performance in terms of the effective sample size (ESS) and the relative magnitude of the log joint. In our third experiment, we consider an unsupervised model for multiple bouncing MNIST data. We extend the task proposed by Srivastava et al. [srivastava2015unsupervised] to consider up to 5 individual digits, and learn both a deep generative model for videos and an inference model that performs tracking.


Results on each of these tasks constitute a significant advance relative to the state of the art. Standard VAEs perform poorly at Gaussian mixture modeling tasks, and to our knowledge there are no existing methods that scale to a problem of the complexity of the DGMM for rings. In the context of the unsupervised tracking model, APG easily scales beyond previously reported results for a specialized recurrent architecture [kosiorek2018sequential]. APG is not only is able to scale to models with higher complexity in these settings, but also provides a general framework for performing inference in models with global and local variables, which can be adapted to a variety of model classes with comparative ease.
5.1 Gaussian Mixture Model
To evaluate whether APG samplers can learn the exact Gibbs updates in conditionally conjugate models, we consider a Gaussian mixture model
${\mu}_{i},{\tau}_{i}\sim \text{NormGamma}({\mu}_{0},{\nu}_{0},{\alpha}_{0},{\beta}_{0}),i$  $=1,2..,I$  
${c}_{n}\sim \mathrm{Cat}(\pi ),{x}_{n}{c}_{n}=i\sim \text{Norm}({\mu}_{i},1/{\tau}_{i}),n$  $=1,2,..,N$ 
In this model, the global variables ${z}^{\mathrm{\text{g}}}=\{{\mu}_{1:I},{\tau}_{1:I}\}$ are the mean an precision for each mixture component, whereas the local variables are the cluster assignments ${z}^{\mathrm{\text{l}}}=\{{c}_{1:N}\}$. Conditioned on cluster assignments, the Gaussian likelihood $p({x}_{1:N}\mid {z}_{1:N},{\mu}_{1:I},{\tau}_{1:I})$ is conjugate to a normalgamma prior $p({\mu}_{1:I},{\tau}_{1:I})$ with sufficient statistics $T({x}_{n},{c}_{n})$
$\{\mathrm{I}[{c}_{n}=i],\mathrm{I}[{c}_{n}=i]{x}_{n},\mathrm{I}[{c}_{n}=i]{x}_{n}^{2}i=1,2,\mathrm{\dots},I\},$ 
where $\mathrm{I}[{z}_{n}=i]$ is an indicator function that evaluates to 1 if the equality holds, and 0 otherwise.
We employ a variational distribution that updates the global variables ${q}_{\varphi}(\mu ,\tau \mid x,c)$ and the local variables ${q}_{\varphi}(c\mid x,\mu ,\tau )$, using pointwise neural sufficient statistics modeled after the ones in the analytical updates (see Appendix E for architecture details).
We train our models on 20,000 datasets with $I=3$ clusters and $N=60$ data points with fixed hyperparameters (${\mu}_{0}=0$, ${\nu}_{0}=0.3$, ${\alpha}_{0}=2$, ${\beta}_{0}=2$). We use $20$ GMM datasets per batch, $K=10$ sweeps, $L=10$ particles, and Adam ($\mathrm{lr}={10}^{4},{\beta}_{1}=0.9,{\beta}_{2}=0.99$) for 200,000 iterations.
We compare the APG sampler to samples from a standard encoder with MLP and LSTM architectures, which is trained using reweighted wakesleep (RWS). Both architectures are parameterized using the same neural sufficient statistics as the APG sampler.
Figure 0(a) shows sequences of single samples from the variational distribution, where the first sample is drawn using RWS. Even when using a parameterization that employs neural sufficient statistics, the RWS encoder fails to propose reasonable clusters, whereas the APG sampler typically converges within 12 iterations across a range of dataset sizes.
Furthermore, we would like to quantify how similar learned proposals ${q}_{\varphi}({z}_{b}\mid x,{z}_{b})$ are to the conditional posteriors ${p}_{\theta}({z}_{b}\mid x,{z}_{b})$. With the case of GMM where the exact conditional posterior is tractable, we verify the convergence of the learned proposals by computing the inclusive KL divergence ${\mathcal{K}}_{b}(\varphi )$ defined in equation 4 (see Table 1(a)). We can see that the APG samplers of the both $\{\tau ,\mu \}$ and $\{c\}$ converge to the true conditional posterior.
5.2 Deep Generative Mixture Model
We next consider the task of training a deep generative model ${p}_{\theta}(x,z)$ is jointly with the APG sampler. Our dataset consists of ringshaped clusters. The true generative model (which we assume is unknown) takes the form
${\mu}_{i}\sim \text{Norm}(0,{\sigma}_{0}^{2}I),i=1,2,\mathrm{\dots},I$  
${c}_{n}\sim \mathrm{Disc}(\pi ),{\alpha}_{n}\sim \mathrm{Unif}[0,2\pi ],$  
${x}_{n}{c}_{n}=i\sim \text{Norm}({g}_{\theta}({\alpha}_{n})+{\mu}_{i},{\mathrm{\Sigma}}_{\u03f5}).$ 
Here ${\mu}_{i}$ is center of the $i$th ring. Given a cluster assignment ${c}_{n}$ and an angle ${\alpha}_{n}$ we define a position on a ring, from which We sample a data point ${x}_{n}$ with 2D Gaussian noise.
We train our model on 20,000 datasets with $N=200$ data points and $I=4$ clusters with fixed hyperparameters (${\sigma}_{0}=3.5$, ${\mathrm{\Sigma}}_{\u03f5}=0.2$). We use $20$ datasets per batch, $K=10$ sweeps, $L=10$ particles, and Adam ($\mathrm{lr}={10}^{4},{\beta}_{1}=0.9,{\beta}_{2}=0.99$) for 200,000 iterations (see Appendix E for architecture details).
5.3 Sample Quality Evaluation
In both mixture models, we compute the logjoint distribution $\mathrm{log}{p}_{\theta}(x,z)$ (see Table 1) as a function of sweep iteration to measure the convergence and the effective sample size (see Table 1) to assess proposal quality
$\frac{\text{ESS}}{L}}={\displaystyle \frac{{({\sum}_{l=1}^{L}{w}^{k,l})}^{2}}{L{\sum}_{l=1}^{L}{({w}^{k,l})}^{2}}}.$  (11) 
Log joint $\mathrm{log}{p}_{\theta}(x,z)$. Because the marginal ${q}_{\varphi}({z}^{k}x)$ is intractable, it is difficult to compute an lower bound or upper bound at each sweep. Here we compute the log joint in each test dataset for both the APG sampler with different number of sweeps and the RWS baselines and report the differences on average to see how much more is achieved by the APG sampler. In both models, the APG sampler gains a higher log joint compared with the encoder trained by RWS.
ESS. One advantage of the APG sampler that it decomposes a high dimensional sampling problem into a sequence of lower dimensional sampling problems. To show that, we compute the ESS when 1) we resample only after one sweep and 2) we resample after each block update. WE can see that the granular sampling strategy significantly improves the ESS in both cases.
5.4 Fixed Computation Budget Analysis
As a mean of comparing the performance of APG samplers for varying numbers of sweeps $K$, we perform an experiment in which the computation budget is fixed at $K\cdot L=1000$ samples. Figure 2 shows $\mathrm{log}{p}_{\theta}(x,z)$ and ESS/L. The shaded area denotes the standard deviation over 10 runs that each comprise 5 datasets that were chosen at random. We can see that it in general, it is more effective to perform more APG sweeps $K$ with a smaller number of particles $L$, that it is to increase the particle budget.
5.5 Time Series Model – Bouncing MNIST
Finally, we apply the APG sampler to a time series model that is trained with short timescales, and evaluate its performance with longer timescales and larger numbers of latent variables. The data ${x}_{1:T}$ is a sequence of images of $D$ moving MNIST digits. Our generative model consists of global variables ${z}_{1:D}^{\mathrm{what}}$ corresponding to digit latent variables and local variables ${z}_{1:D,1:T}^{\mathrm{where}}$ corresponding to the digit trajectories. The deep generative model is a state space model that factorizes across digits of the form
${z}_{d}^{\mathrm{what}}$  $\sim \text{Norm}(0,I),{z}_{d,1}^{\mathrm{where}}\sim \text{Norm}(0,I),$  
${z}_{d,t}^{\mathrm{where}}$  $\sim \text{Norm}({z}_{d,t1}^{\mathrm{where}},{\sigma}_{0}^{2}I)$  
${x}_{t}$  $\sim \mathrm{Bern}\left(\sigma \left({\displaystyle \sum _{d}}\mathrm{ST}({\mu}_{\theta}({z}_{d}^{\mathrm{what}}),{z}_{d,t}^{\mathrm{where}})\right)\right)$ 
Here, ST is a spatial transformer [jaderberg2015spatial] that maps the output of a feedforward decoder ${\mu}_{\theta}$ that maps logits for a 28$\times $28 MNIST image onto a 96$\times $96 canvas based on the location variable ${z}_{d,t}^{\mathrm{where}}$.
Our amortized Gibbs updates employ $T+1$ blocks $({z}_{1:D}^{\mathrm{what}},{z}_{1:D,1}^{\mathrm{where}},{z}_{1:D,2}^{\mathrm{where}},\mathrm{\dots},{z}_{1:D,T}^{\mathrm{where}})$. Empirically this works better than splitting the latent variables into global and local variables, since resampling at each time step $t$ helps disentangle the digit locations if they overlap.
We train our model on 60000 bouncing MNIST sequences, each of which contains $D=3$ digits and $T=10$ frame images. We use $10$ sequences per batch, $K=5$ sweeps, $L=10$ particles, and Adam ($\mathrm{lr}={10}^{4},{\beta}_{1}=0.9,{\beta}_{2}=0.99$) for 200,000 iterations (see Appendix E for architecture details).
We show that APG sampler can scale to larger number of variables by testing the model on datasets with $T\in \{20,100\}$ time steps and $D\in \{3,4,5\}$ digits. Figure 3 shows the inference and reconstruction using single samples from the variational distribution. (plots are truncated by the first $15$ time steps due to limited space, see Appendix F for more examples with full time steps). Qualitatively, we see that the digit trajectories ${z}_{1:D,1:T}^{\mathrm{where}}$ and latent variables ${z}_{1:D}^{\mathrm{what}}$ are inferred well. In Figure 4, we show the mean squared error between the video and its reconstruction for different $T$ and $D$. The results confirm that performance improves with increasing number of Gibbs sweeps $K$. In certain cases, a larger number of time points $T$ in fact improves convergence as a function of the number of sweeps $K$.
6 Conclusion
One of the challenges in amortized inference for deep generative models is learning highquality proposals for models with a structured prior over a highdimensional set of latent variables. These priors arise naturally when, rather than encoding a single data point (e.g. an image), we wish to encode a dataset (e.g. a sequence of images). Even for apparently simple problems, such as inferring the cluster parameters and assignments in a mixture model, standard encoders often fail to produce good samples. One of the reasons for this is that it is fundamentally difficult to jointly generate proposals for a highdimensional set of latent variables.
APG samplers are very general, and offer a path towards the development of deep generative models that incorporate structured priors to provide meaningful inductive biases in settings where we have little or no supervision. These methods have particular strengths in problems with global variables, but more generally make it possible to design amortized approaches that exploit conditional independence. Moreover, our parameterization in terms of neural sufficient statistics makes it comparatively easy to design models that scale to much larger number of latent variables and thus generalize to datasets that vary in size.
Immediate lines of future work are to compare the approach in this paper, which learns kernels that leave the target density invariant, with approaches that perform annealing, in which the learned kernels are assymmetric in the sense that they gradually transform the initial encoder distribution to the target density.
7 Acknowledgements
This work was supported by the Intel Corporation, NSF award 1835309, the DARPA LwLL program, and startup funds from Northeastern University. Tuan Anh Le was supported by AFOSR award FA955018S0003.
Appendix A Gradient of the generative model
This is actually a known (although indeed not obvious) identity. Briefly, we can express the expected gradient of the log joint as
${\mathbb{E}}_{{p}_{\theta}(zx)}\left[{\nabla}_{\theta}\mathrm{log}{p}_{\theta}(x,z)\right]{\mathbb{E}}_{{p}_{\theta}(zx)}\left[{\nabla}_{\theta}\mathrm{log}{p}_{\theta}(x)+{\nabla}_{\theta}\mathrm{log}{p}_{\theta}(zx)\right]{\mathbb{E}}_{{p}_{\theta}(zx)}\left[{\nabla}_{\theta}\mathrm{log}{p}_{\theta}(x)\right]{\nabla}_{\theta}\mathrm{log}{p}_{\theta}(x)$ 
Here we make use of a standard identity that is also used in, e.g., likelihoodratio estimators
${\mathbb{E}}_{{p}_{\theta}(zx)}\left[{\nabla}_{\theta}\mathrm{log}{p}_{\theta}(zx)\right]={\displaystyle \int {p}_{\theta}(zx){\nabla}_{\theta}\mathrm{log}{p}_{\theta}(zx)\mathit{d}z}={\displaystyle \int {\nabla}_{\theta}{p}_{\theta}(zx)\mathit{d}z}={\nabla}_{\theta}{\displaystyle \int {p}_{\theta}(zx)\mathit{d}z}={\nabla}_{\theta}1=0$ 
Therefore, we have the the following equality
${\nabla}_{\theta}\mathrm{log}{p}_{\theta}(x)={\mathbb{E}}_{{p}_{\theta}(zx)}\left[{\nabla}_{\theta}\mathrm{log}{p}_{\theta}(x,z)\right].$ 
which is Equation 6. As a result, we can then use selfnormalized importance sampling to approximate ${\mathbb{E}}_{{p}_{\theta}(zx)}\left[{\nabla}_{\theta}\mathrm{log}{p}_{\theta}(x,z)\right]$.
Appendix B Importance weights in sequential importance sampling
At step $k=1$, we use exactly the standard importance sampler, thus it is obvious that the following is a valid importance weight
${w}^{1}={\displaystyle \frac{{\gamma}^{1}({z}^{1})}{{q}^{1}({z}^{1})}}.$ 
When step $k>2$, we are going to prove that the importance weight relative to the intermediate densities has the form
${w}^{k}={\displaystyle \frac{{\gamma}^{k}({z}^{1:k})}{{q}^{1}({z}^{1}){\prod}_{{k}^{\prime}=2}^{k}{q}^{{k}^{\prime}}({z}^{{k}^{\prime}}\mid {z}^{1:{k}^{\prime}1})}}.$  (12) 
At step $k=2$, the importance weight is defined as
${w}^{k}$  $={v}^{2}{w}^{1}={\displaystyle \frac{{\gamma}^{2}({z}^{1:2})}{{\gamma}^{1}({z}^{1}){q}^{2}({z}^{2}\mid {z}^{1})}}{\displaystyle \frac{{\gamma}^{1}({z}^{1})}{{q}^{1}({z}^{1})}}={\displaystyle \frac{{\gamma}^{2}({z}^{1:2})}{{q}^{1}({z}^{1}){q}^{2}({z}^{2}\mid {z}^{1})}}.$ 
which is exactly that form. Now we prove weights in future steps by induction. At step $k\ge 2$, assume the weight has the form in Equation 12, i.e.
${w}^{k}={\displaystyle \frac{{\gamma}^{k}({z}^{1:k})}{{q}^{1}({z}^{1}){\prod}_{{k}^{\prime}=2}^{k}{q}^{{k}^{\prime}}({z}^{{k}^{\prime}}\mid {z}^{1:{k}^{\prime}1})}}.$ 
, then at step $k+1$, the importance weight is the product of incremental weight and incoming weight
${w}^{k+1}={v}^{k+1}{w}^{k}={\displaystyle \frac{{\gamma}^{k+1}({z}^{1:k+1})}{{\gamma}^{k}({z}^{1:k}){q}^{k+1}({z}^{k+1}\mid {z}^{1:k})}}{\displaystyle \frac{{\gamma}^{k}({z}^{1:k})}{{q}^{1}({z}^{1}){\prod}_{{k}^{\prime}=2}^{k}{q}^{{k}^{\prime}}({z}^{{k}^{\prime}}\mid {z}^{1:{k}^{\prime}1})}}={\displaystyle \frac{{\gamma}^{k+1}({z}^{1:k+1})}{{q}^{1}({z}^{1}){\prod}_{{k}^{\prime}=2}^{k+1}{q}^{{k}^{\prime}}({z}^{{k}^{\prime}}\mid {z}^{1:{k}^{\prime}1})}}.$ 
Thus the importance weight ${w}^{k}$ has the form of Equation 12 at each step $k>2$ in sequential importance sampling.
Appendix C Derivation of Posterior Invariance
We can see that individual block updates leave the posterior invariant by proposing variables ${z}_{\u2aafb}^{k}$ from a partial kernel $\kappa ({z}_{\u2aafb}^{k}\mid x,{z}^{k1})$ and then marginalize over the corresponding variables from the previous step ${z}_{\u2aafb}^{k1}$,
$\int \mathit{d}{z}_{\u2aafb}^{k1}{p}_{\theta}({z}^{k1}\mid x)\kappa ({z}_{\u2aafb}^{k}\mid x,{z}^{k1})$  $={\displaystyle \int \mathit{d}{z}_{\u2aafb}^{k1}{p}_{\theta}({z}^{k1}\mid x)\int \mathit{d}{z}_{\succ b}^{k}\kappa ({z}^{k}\mid x,{z}^{k1})}$  
$={\displaystyle \int \mathit{d}{z}_{\u2aafb}^{k1}{p}_{\theta}({z}^{k1}\mid x)\prod _{m=1}^{b}{p}_{\theta}({z}_{m}^{k}\mid x,{z}_{\prec m}^{k},{z}_{\succ m}^{k1})}$  
$={\displaystyle \int \mathit{d}{z}_{\u2aafb}^{k1}{p}_{\theta}({z}^{k1}\mid x){p}_{\theta}({z}_{\u2aafb}^{k}\mid x,{z}_{\succ 1}^{k1})}$  
$={p}_{\theta}({z}_{\u2aafb}^{k},{z}_{\succ b}^{k1}\mid x).$ 
Appendix D Proof of the amortized population Gibbs algorithm
Here, we provide an alternative proof of correctness of the APG algorithm given in Algorithm 2.2, based on the construction of proper weights [naesseth2015nested] which was introduced after SMC samplers [delmoral2006sequential]. We first introduce proper weights, and then present several operations that preserve the proper weighting property and finally we apply these properties in proving correctness of APG.
D.1 Proper weights
Definition 1 (Proper weights).
Given an unnormalized density $\stackrel{~}{p}(z)$, with corresponding normalizing constant ${Z}_{p}:=\int \stackrel{~}{p}(z)dz$ and normalized density $p\equiv \stackrel{~}{p}/{Z}_{p}$, the random variables $z,w\sim P(z,w)$ are properly weighted with respect to $\stackrel{~}{p}(z)$ if and only if for any measurable function $f$
${\mathbb{E}}_{P(z,w)}\left[wf(z)\right]={Z}_{p}{\mathbb{E}}_{p(z)}[f(z)].$  (13) 
We will also denote this as
$z,w\stackrel{\text{p.w.}}{\sim}\stackrel{~}{p}.$ 
Using proper weights.
Given independent samples ${z}^{l},{w}^{l}\sim P$, we can estimate ${Z}_{p}$ by setting $f\equiv 1$:
${Z}_{p}\approx {\displaystyle \frac{1}{L}}{\displaystyle \sum _{l=1}^{L}}{w}^{l}.$ 
This estimator is unbiased because it is a Monte Carlo estimator of the left hand side of (13). We can also estimate ${\mathbb{E}}_{p(z)}[f(z)]$ as
${\mathbb{E}}_{p(z)}[f(z)]\approx {\displaystyle \frac{\frac{1}{L}{\sum}_{l=1}^{L}{w}^{l}f({z}^{l})}{\frac{1}{L}{\sum}_{l=1}^{L}{w}^{l}}}.$ 
While the numerator and the denominator are unbiased estimators of ${Z}_{p}{\mathbb{E}}_{p(z)}[f(z)]$ and ${Z}_{p}$ respectively, their fraction is biased. We often write this estimator as
${\mathbb{E}}_{p(z)}[f(z)]\approx {\displaystyle \sum _{l=1}^{L}}{\overline{w}}^{l}f({z}^{l}),$  (14) 
where ${\overline{w}}^{l}:={w}^{l}/{\sum}_{{l}^{\prime}=1}^{L}{w}^{{l}^{\prime}}$ is the normalized weight.
D.2 Operations that preserve proper weights
Proposition 1 (Nested importance sampling).
Adapted from [naesseth2015nested, Algorithm 1]. Given unnormalized densities $\stackrel{\mathrm{~}}{q}\mathit{}\mathrm{(}z\mathrm{)}\mathrm{,}\stackrel{\mathrm{~}}{p}\mathit{}\mathrm{(}z\mathrm{)}$ with the normalizing constants ${Z}_{q}\mathrm{,}{Z}_{p}$ and normalized densities $q\mathit{}\mathrm{(}z\mathrm{)}\mathrm{,}p\mathit{}\mathrm{(}z\mathrm{)}$, if
$z,w\stackrel{\mathit{\text{p.w.}}}{\sim}\stackrel{~}{q},$  (15) 
then
$z,{\displaystyle \frac{w\stackrel{~}{p}(z)}{\stackrel{~}{q}(z)}}\stackrel{\mathit{\text{p.w.}}}{\sim}\stackrel{~}{p}.$ 
Proof.
First define the distribution of $z,w$ as $Q$. For measurable $f(z)$
${\mathbb{E}}_{Q(z,w)}\left[{\displaystyle \frac{w\stackrel{~}{p}(z)}{\stackrel{~}{q}(z)}}f(z)\right]={Z}_{q}{\mathbb{E}}_{q(z)}\left[{\displaystyle \frac{\stackrel{~}{p}(z)f(z)}{\stackrel{~}{q}(z)}}\right]={Z}_{q}{\displaystyle \int q(z)\frac{\stackrel{~}{p}(z)f(z)}{\stackrel{~}{q}(z)}dz}={\displaystyle \int \stackrel{~}{p}(z)f(z)dz}={Z}_{p}{\mathbb{E}}_{p(z)}[f(z)].$ 
∎
Proposition 2 (Resampling).
Adapted from [naesseth2015nested, Section 3.1]. Given an unnormalized density $\stackrel{\mathrm{~}}{p}\mathit{}\mathrm{(}z\mathrm{)}$ (normalizing constant ${Z}_{p}$, normalized density $p\mathit{}\mathrm{(}z\mathrm{)}$), if we have a set of properly weighted samples
${z}^{l},{w}^{l}\stackrel{\mathit{\text{p.w.}}}{\sim}\stackrel{~}{p},l=1,\mathrm{\dots},L$  (16) 
then the resampling operation preserves the proper weighting, i.e.
${z}^{\prime l},{w}^{\prime l}\stackrel{\mathit{\text{p.w.}}}{\sim}\stackrel{~}{p},l=1,\mathrm{\dots},L$ 
where ${z}^{\mathrm{\prime}\mathit{}l}\mathrm{=}{z}^{a}$ with probability $P\mathrm{(}a\mathrm{=}i\mathrm{)}\mathrm{=}{w}^{i}\mathrm{/}{\mathrm{\sum}}_{l\mathrm{=}\mathrm{1}}^{L}{w}^{l}$ and ${w}^{\mathrm{\prime}\mathit{}l}\mathrm{:=}\frac{\mathrm{1}}{L}\mathit{}{\mathrm{\sum}}_{l\mathrm{=}\mathrm{1}}^{L}{w}^{l}$.
Proof.
Define the distribution of ${z}^{l},{w}^{l}$ as $\widehat{P}$. We show that for any $f$, $\mathbb{E}[f({z}^{a}){w}^{\prime l}]={Z}_{p}{\mathbb{E}}_{p(z)}[f(z)]$.
${\mathbb{E}}_{\left({\prod}_{l=1}^{L}\widehat{P}({z}^{l},{w}^{l})\right)p(a\mid {w}^{1:L})}\left[f({z}^{a}){w}^{\prime l}\right]$  
$={\mathbb{E}}_{{\prod}_{l=1}^{L}\widehat{P}({z}^{l},{w}^{l})}\left[{\displaystyle \sum _{i=1}^{L}}f({z}^{i}){w}^{\prime}P(a=i)\right]$  
$={\mathbb{E}}_{{\prod}_{l=1}^{L}\widehat{P}({z}^{l},{w}^{l})}\left[{\displaystyle \sum _{i=1}^{L}}f({z}^{i}){w}^{\prime}{\displaystyle \frac{{w}^{i}}{{\sum}_{{l}^{\prime}=1}^{L}{w}^{{l}^{\prime}}}}\right]$  
$={\mathbb{E}}_{{\prod}_{l=1}^{L}\widehat{P}({z}^{l},{w}^{l})}\left[{\displaystyle \frac{1}{L}}{\displaystyle \sum _{i=1}^{L}}f({z}^{i}){w}^{i}\right]$  
$={\displaystyle \frac{1}{L}}{\displaystyle \sum _{i=1}^{L}}{\mathbb{E}}_{\widehat{P}({z}^{i},{w}^{i})}\left[f({z}^{i}){w}^{i}\right]={\displaystyle \frac{1}{L}}{\displaystyle \sum _{i=1}^{L}}{Z}_{p}{\mathbb{E}}_{p(z)}[f(z)]={Z}_{p}{\mathbb{E}}_{p(z)}[f(z)].$ 
∎
Therefore, the resampling will return a new set of samples that are still properly weighted relative to the target distribution in the APG sampler (Algorithm 2.2).
Proposition 3 (Move).
Given an unnormalized density $\stackrel{\mathrm{~}}{p}\mathit{}\mathrm{(}z\mathrm{)}$ (normalizing constant ${Z}_{p}$, normalized density $p\mathit{}\mathrm{(}z\mathrm{)}$) and normalized conditional densities $q\mathrm{(}{z}^{\mathrm{\prime}}\mathrm{}z\mathrm{)}$ and $r\mathrm{(}z\mathrm{}{z}^{\mathrm{\prime}}\mathrm{)}$, the proper weighting is preserved if we apply the transition kernel to a properly weighted sample, i.e. if we have
${z}^{l},{w}^{l}\stackrel{\mathit{\text{p.w.}}}{\sim}\stackrel{~}{p},$  (17)  
${z}^{\prime l}\sim q({z}^{\prime l}{z}^{l}),$  (18)  
${w}^{\prime l}={\displaystyle \frac{\stackrel{~}{p}({z}^{\prime l})r({z}^{l}{z}^{\prime l})}{\stackrel{~}{p}({z}^{l})q({z}^{\prime l}{z}^{l})}}{w}^{l},l=1,\mathrm{\dots},L$  (19) 
then we have
${z}^{\prime l},{w}^{\prime l}\stackrel{\mathit{\text{p.w.}}}{\sim}\stackrel{~}{p},l=1,\mathrm{\dots},L$  (20) 
Proof.
Firstly we simplify the notation by dropping the superscript $l$ without loss of generality. Define the distribution of $z,w$ as $\widehat{P}$. Then, due to (17), for any measurable $f(z)$, we have
${\mathbb{E}}_{P}[wf(z)]={Z}_{p}{E}_{p}[f(z)].$ 
To prove (20), we show ${\mathbb{E}}_{\widehat{P}(z,w)q({z}^{\prime}z)}[{w}^{\prime}f({z}^{\prime})]={Z}_{p}{\mathbb{E}}_{p({z}^{\prime})}[f({z}^{\prime})]$ for any $f$ as follows:
${\mathbb{E}}_{\widehat{P}(z,w)q({z}^{\prime}z)}[{w}^{\prime}f({z}^{\prime})]$  $={\mathbb{E}}_{\widehat{P}(z,w)q({z}^{\prime}z)}\left[{\displaystyle \frac{\stackrel{~}{p}({z}^{\prime})r(z{z}^{\prime})}{\stackrel{~}{p}(z)q({z}^{\prime}z)}}wf({z}^{\prime})\right]$  
$={\displaystyle \int}\widehat{P}(z,w)q({z}^{\prime}z){\displaystyle \frac{\stackrel{~}{p}({z}^{\prime})r(z{z}^{\prime})}{\stackrel{~}{p}(z)q({z}^{\prime}z)}}wf({z}^{\prime})\mathrm{d}z\mathrm{d}w\mathrm{d}{z}^{\prime}$  
$={\displaystyle \int \widehat{P}(z,w)\frac{\stackrel{~}{p}({z}^{\prime})r(z{z}^{\prime})}{\stackrel{~}{p}(z)}wf({z}^{\prime})dzdwd{z}^{\prime}}$  
$={\displaystyle \int \stackrel{~}{p}({z}^{\prime})f({z}^{\prime})\left(\int \widehat{P}(z,w)w\frac{r(z{z}^{\prime})}{\stackrel{~}{p}(z)}dzdw\right)d{z}^{\prime}}$  
$={\displaystyle \int \stackrel{~}{p}({z}^{\prime})f({z}^{\prime}){Z}_{p}{\mathbb{E}}_{p(z)}\left[\frac{r(z{z}^{\prime})}{\stackrel{~}{p}(z)}\right]d{z}^{\prime}}.$  (21) 
Using the fact that ${\mathbb{E}}_{p(z)}\left[\frac{r(z{z}^{\prime})}{\stackrel{~}{p}(z)}\right]=\int p(z)\frac{r(z{z}^{\prime})}{\stackrel{~}{p}(z)}\mathrm{d}z=\int r(z{z}^{\prime})\mathrm{d}z/{Z}_{p}=1/{Z}_{p}$. Equation 21 simplifies to
$\int \stackrel{~}{p}({z}^{\prime})f({z}^{\prime})d{z}^{\prime}}={Z}_{p}{\mathbb{E}}_{p({z}^{\prime})}[f({z}^{\prime})].$ 
∎
D.3 Correctness of APG Sampler
We provide the proof by performing 2 steps in the APG sampler (Algorithm 2.2), i.e, we prove the correctness when we initialize samples at step $k=1$ (line 2.2  line 2.2) and then do one Gibbs sweep at step $k=2$ (line 2.2  line 2.2). In fact, its correctness still holds if we perform more Gibbs sweeps by induction.
Step $k\mathrm{=}\mathrm{1}$. We initialize the proposal $z\sim {q}_{\varphi}(zx)$ (line 2.2) and train that encoder using the wake$\varphi $ phase objective in the standard reweighted wakesleep[le2019revisiting] ${\mathbb{E}}_{p(x)}[\mathrm{\text{kl}}({p}_{\theta}(zx){q}_{\varphi}(zx))]$. Then we estimate its gradient w.r.t. parameter $\varphi $ (line 2.2) as
${g}_{\varphi}:$  $={\nabla}_{\varphi}{\mathbb{E}}_{p(x)}[\mathrm{\text{kl}}({p}_{\theta}(zx){q}_{\varphi}(zx))]$  (22)  
$={\mathbb{E}}_{p(x)}\left[{\mathbb{E}}_{{p}_{\theta}(zx)}\left[{\nabla}_{\varphi}\mathrm{log}{q}_{\varphi}(z\mid x)\right]\right].$  (23) 
Step $k\mathrm{=}\mathrm{2}$. After one full sweep, we have the following objective
${\mathbb{E}}_{p(x)}\left[{\displaystyle \sum _{b=1}^{B}}{\mathbb{E}}_{{p}_{\theta}({z}_{b}\mid x)}\left[\mathrm{\text{kl}}({p}_{\theta}({z}_{b}\mid {z}_{b},x){q}_{\varphi}({z}_{b}\mid x,{z}_{b}))\right]\right]$ 
And we will prove that we correctly estimate the following gradient w.r.t. parameter $\varphi $ at each block update (line 2.2)
${g}_{\varphi}^{b}:$  $={\nabla}_{\varphi}{\mathbb{E}}_{p(x)}[{\mathbb{E}}_{{p}_{\theta}({z}_{b}x)}[\mathrm{\text{kl}}\left({p}_{\theta}({z}_{b}{z}_{b},x){q}_{\varphi}({z}_{b}{z}_{b},x))]\right]$  (24)  
$={\mathbb{E}}_{p(x)}[{\mathbb{E}}_{{p}_{\theta}({z}_{1:B}x)}\left[{\nabla}_{\varphi}\mathrm{log}{q}_{\varphi}({z}_{b}{z}_{b},x)]\right],b=1,\mathrm{\dots},B.$  (25) 
At each step, as long as we show that samples are properly weighted
${z}_{1:B}^{l},{w}^{l}\stackrel{\text{p.w.}}{\sim}{p}_{\theta}({z}_{1:B},x),l=1,\mathrm{\dots},L.$  (26) 
Equation 14 will guarantee the validity of both gradient estimations (line 2.2 and line 2.2).
At step $k=1$, samples are properly weighted because ${z}^{l}$ and ${w}^{l}$ are proposed using importance sampling (line 2.2) where ${q}_{\varphi}(zx)$ is the proposal density and ${p}_{\theta}({z}^{l},x)$ is the unnormalized target density. The resampling step (line 2.2) will preserve the proper weighting because of Proposition 2.
To prove that Gibbs sweep (line 2.2  line 2.2) in the APG sampler also preserves proper weighting, we show that each block update satisfies all the 3 conditions (Equation 17, 19 and 26) in Proposition 3, by which we can conclude the samples are still properly weighted after each block update. Without loss of generality, we drop all $l$ superscripts in the rest of the proof. Before we start any block update (before line 2.2), we already know that samples are properly weighted, i.e.
$z,w\stackrel{\text{p.w.}}{\sim}{p}_{\theta}(z,x).$  (27) 
which corresponds Equation 17. Next we define a conditional distribution $q({z}^{\prime}\mid z):={q}_{\varphi}({z}_{b}^{\prime}x,{z}_{b}){\delta}_{{z}_{b}}({z}_{b}^{\prime})$, from which we propose a new sample
${z}^{\prime}\sim {q}_{\varphi}({z}_{b}^{\prime}x,{z}_{b}){\delta}_{{z}_{b}}({z}_{b}^{\prime}),$  (28) 
where the density of ${z}_{b}^{\prime}$ is a delta mass on ${z}_{b}$ defined as ${\delta}_{{z}_{b}}({z}_{b}^{\prime})=1$ if ${z}_{b}={z}_{b}^{\prime}$ and $0$ otherwise. In fact, this sampling step is equivalent to firstly sampling ${z}_{b}^{\prime}\sim {q}_{\varphi}(\cdot x,{z}_{b})$ (line 2.2) and let ${z}_{b}^{\prime}={z}_{b}$, which is exactly what the APG sampler assumes procedurally. This condition corresponds to Equation 18.
Finally, we define the weight ${w}^{\prime}$
${w}^{\prime}={\displaystyle \frac{{{p}}_{{\theta}}{(}{x}{,}{{z}}_{{b}}^{{\prime}}{,}{{z}}_{{}{b}}^{{\prime}}{)}{r}{(}{{z}}_{{b}}{}{x}{,}{{z}}_{{}{b}}{)}{{\delta}}_{{{z}}_{{}{b}}}{(}{{z}}_{{}{b}}{)}}{{{p}}_{{\theta}}{(}{x}{,}{{z}}_{{b}}{,}{{z}}_{{}{b}}{)}{{q}}_{{\varphi}}{(}{{z}}_{{b}}^{{\prime}}{}{x}{,}{{z}}_{{}{b}}{)}{{\delta}}_{{{z}}_{{}{b}}}{(}{{z}}_{{}{b}}^{{\prime}}{)}}}w,$  (29) 
where the terms in blue are treated as densities (normalized or unnormalized) of ${z}_{1:B}^{\prime}$ and the terms in red are treated as densities of ${z}_{1:B}$. Since both delta mass densities evaluate to one, this weight is equal to the weight computed after each block update (line 2.2). This condition corresponds to Equation 19.
Now we can apply the conclusion (20) in Proposition 3 and claim
${z}_{1:B}^{\prime},{w}^{\prime}\stackrel{\text{p.w.}}{\sim}{p}_{\theta}({z}_{1:B}^{\prime},x).$ 
since ${z}_{b}={z}_{b}^{\prime}$ and ${z}_{b}={z}_{b}^{\prime}$ due to the reassignment (line 2.2). Based on the fact that proper weighting is preserves at both initial step $k=1$ and the Gibbs sweep $k=2$, we have proved that both gradient estimations (line 2.2 and line 2.2) are correct.
Appendix E Architecture of the Amortized Population Gibbs samplers
GMM
Layer  ${q}_{\varphi}(\mu ,\tau x)$  

Input  $\mathrm{Concat}[{x}_{n}\in {\mathbb{R}}^{2}]$  
1 
FC 2 
FC 3 Softmax 
Layer  ${q}_{\varphi}(\mu ,\tau x,c)$  

Input  $\mathrm{Concat}[{x}_{n}\in {\mathbb{R}}^{2},{c}_{n}\in {\mathbb{R}}^{3}]$  
1 
FC 2 
FC 3 Softmax 
Layer  ${q}_{\varphi}(cx,\mu ,\tau )$ 

Input  $\mathrm{Concat}[{x}_{n}\in {\mathbb{R}}^{2},{\mu}_{i}\in {\mathbb{R}}^{2}]$ 
1 
FC 32 Tanh 
2  FC 1, Intermediate Variable ${o}_{i}\in \mathbb{R}$ 
3  $\mathrm{Concat}[{o}_{i}\in \mathbb{R}]$, Softmax (${c}_{n}$) 
DGMM
Layer  ${q}_{\varphi}(\mu x)$  

Input  ${x}_{n}\in {\mathbb{R}}^{2}$  
1 
FC 32 Tanh 
FC 32 Tanh 
2 
FC 16 Tanh, ${v}_{n}\in \mathbb{R}$ 
FC 4 Softmax, ${\gamma}_{n}\in {\mathbb{R}}^{3}$ 
3  ${T}_{n}:={\gamma}_{n}\otimes {v}_{n}\in {\mathbb{R}}^{3\times 16}$  
4  $\mathrm{Concat}[{\sum}_{n}^{N}{T}_{n}[i],{\mu}_{0}\in {\mathbb{R}}^{2},\mathrm{Diag}({\sigma}_{0}^{2}I)\in {\mathbb{R}}^{2}],i=1,2,3,4$  
5  FC 2$\times $32 Tanh  
6  FC 2$\times $8 (${\mu}_{1:I}$) 
Layer  ${q}_{\varphi}(\mu z,c)$  

Input  $\mathrm{Concat}[{x}_{n}\in {\mathbb{R}}^{2},{c}_{n}\in {\mathbb{R}}^{3}]$  
1 
FC 32 Tanh 
FC 32 Tanh 
2 
FC 16, ${v}_{n}\in \mathbb{R}$ 
FC 4 Softmax, ${\gamma}_{n}\in {\mathbb{R}}^{3}$ 
3  ${T}_{n}:={\gamma}_{n}\otimes {v}_{n}\in {\mathbb{R}}^{3\times 16}$  
4  $\mathrm{Concat}[{\sum}_{n}^{N}{T}_{n}[i],{\mu}_{0}\in {\mathbb{R}}^{2},\mathrm{Diag}({\sigma}_{0}^{2}I)\in {\mathbb{R}}^{2}],i=1,2,3,4$  
5  FC 2$\times $32 Tanh  
6  FC 2$\times $2 (${\mu}_{i}$) 
Layer  ${q}_{\varphi}(cz,\mu )$ 

Input  $\mathrm{Concat}[{x}_{n}\in {\mathbb{R}}^{2},{\mu}_{i}\in {\mathbb{R}}^{2}]$ 
1  FC 32 Tanh 
2 
FC 1, Intermediate Variable ${o}_{i}\in \mathbb{R}$ 
3 
$\mathrm{Concat}[{o}_{i}\in \mathbb{R}]$, Softmax (${c}_{n}$) 
Layer  ${q}_{\varphi}(\alpha x,z,\mu )$ 

Input  ${x}_{n}{\mu}_{i}\in {\mathbb{R}}^{2}{z}_{n}=i$ 
1  FC 32 Tanh 
2 
FC 1 Tanh 
Layer  ${p}_{q}(x\mu ,c,\alpha )$ 

Input  $\mathrm{Concat}[{\alpha}_{n},{c}_{n}]\in {\mathbb{R}}^{5}$ 
1  FC 32 Tanh 
2  FC 2 Tanh (${\mu}_{n}$, fixed ${\sigma}_{\u03f5}$) 
Bouncing MNIST
Layer  ${p}_{\theta}(x{z}^{\mathrm{what}},{z}^{\mathrm{where}})$ 

Input  ${z}_{i}^{\mathrm{what}}\in {\mathbb{R}}^{10}$ 
1  FC 200 ReLU 
2  FC 400 ReLU 
3  digit ${d}_{i}\in {\mathbb{R}}^{784}$ 
4  ST(${d}_{i},{z}_{i,t}^{\mathrm{where}})\in \mathbb{R}{}^{9276},i=1\mathrm{\dots},i,t=1\mathrm{\dots},T$ 
Layer  ${q}_{\varphi}({z}^{\mathrm{what}}{z}^{\mathrm{where}})$ 

Input  ${x}_{t}\in {\mathbb{R}}^{9276},{z}_{i,t}^{\mathrm{where}}\in {\mathbb{R}}^{2},i=1\mathrm{\dots},I,t=1\mathrm{\dots},T$ 
1  ST(${x}_{t}$, ${z}_{i,t}^{\mathrm{where}}$) $\in {\mathbb{R}}^{784},i=1,..,I,t=1,\mathrm{\dots},T$ 
2  FC 400 ReLU 
3  FC 200 ReLU 
4  ${z}_{i,t}^{\mathrm{what}}\in {\mathbb{R}}^{10},i=1,..,I,t=1,\mathrm{\dots},T$ 
5  Mean(${z}_{1,t}^{\mathrm{what}}$, $1:T$) $\in {\mathbb{R}}^{10},i=1,\mathrm{\dots},I$ 
Layer  ${q}_{\varphi}({z}^{\mathrm{where}}{z}^{\mathrm{what}})$ 

Input  ${x}_{t}\in {\mathbb{R}}^{9276}$, ${z}_{i,t}^{\mathrm{what}},i=1,..,t=1,..,T$ 
1  Conv2d(${x}_{t}$, ${z}_{i,t}^{\mathrm{what}}$)$\in {\mathbb{R}}^{4638}$, $i=1,..,t=1,..,T$ 
2  FC 400 Tanh 
3  $2\times $ FC 200 Tanh 
4  $2\times 2$ Tanh 
Appendix F More Qualitative Results of Bouncing MNIST
The following are full reconstructions on test sets where time steps $T=100$ and number of digits $D=3,4,5$, respectively. In each figure, the 1st, 3rd, 5th, 7th, 9th rows show the inference results, while the other rows show the reconstruction of the series above.