Abstract
Variational autoencoders provide a principled framework for learning deeplatentvariable models and corresponding inference models. In this work, weprovide an introduction to variational autoencoders and some importantextensions.
Quick Read (beta)
An Introduction to
Variational Autoencoders
December 2017
Abstract
Variational autoencoders provide a principled framework for learning deep latentvariable models and corresponding inference models. In this work, we provide an introduction to variational autoencoders and some important extensions.
Contents

1 Introduction
 1.1 Motivation
 1.2 Aim
 1.3 Probabilistic Models and Variational Inference
 1.4 Parameterizing Conditional Distributions with Neural Networks
 1.5 Directed Graphical Models and Neural Networks
 1.6 Learning in Fully Observed Models with Neural Nets
 1.7 Learning and Inference in Deep Latent Variable Models
 1.8 Intractabilities

2 Variational Autoencoders
 2.1 Encoder or Approximate Posterior
 2.2 Evidence Lower Bound (ELBO)
 2.3 Stochastic GradientBased Optimization of the ELBO
 2.4 Reparameterization Trick
 2.5 Factorized Gaussian posteriors
 2.6 Estimation of the Marginal Likelihood
 2.7 Marginal Likelihood and ELBO as KL Divergences
 2.8 Challenges
 2.9 Related prior and concurrent work
 3 Beyond Gaussian Posteriors
 4 Deeper Generative Models
 5 Conclusion
 6 Appendix
Chapter 1 Introduction
1.1 Motivation
One major division in machine learning is generative versus discriminative modeling. While in discriminative modeling one aims to learn a predictor given the observations, in generative modeling one aims to solve the more general problem of learning a joint distribution over all the variables. A generative model simulates how the data is generated in the real world. “Modeling” is understood in almost every science as unveiling this generating process by hypothesizing theories and testing these theories through observations. For instance, when meteorologists model the weather they use highly complex partial differential equations to express the underlying physics of the weather. Or when an astronomer models the formation of galaxies s/he encodes in his/her equations of motion the physical laws under which stellar bodies interact. The same is true for biologists, chemists, economists and so on. Modeling in the sciences is in fact almost always generative modeling.
There are many reasons why generative modeling is attractive. First, we can express physical laws and constraints into the generative process while details that we don’t know or care about, i.e. nuisance variables, are treated as noise. The resulting models are usually highly intuitive and interpretable and by testing them against observations we can confirm or reject our theories about how the world works.
Another reason for trying to understand the generative process of data is that it naturally expresses causal relations of the world. Causal relations have the great advantage that they generalize much better to new situations than mere correlations. For instance, once we understand the generative process of an earthquake, we can use that knowledge both in California and in Chile.
To turn a generative model into a discriminator, we need to use Bayes rule. For instance, we have a generative model for an earthquake of type A and another for type B, then seeing which of the two describes the data best we can compute a probability for whether earthquake A or B happened. Applying Bayes rule is however often computationally expensive.
In discriminative methods we directly learn a map in the same direction as we intend to make future predictions in. This is in the opposite direction than the generative model. For instance, one can argue that an image is generated in the world by first identifying the object, then generating the object in 3D and then projecting it onto an pixel grid. A discriminative model takes these pixel values directly as input and maps them to the labels. While generative models can learn efficiently from data, they also tend to make stronger assumptions on the data than their purely discriminative counterparts, often leading to higher asymptotic bias (Banerjee, 2008) when the model is wrong. For this reason, if the model is wrong (and it almost always is to some degree!), if one is solely interested in learning to discriminate, and one is in a regime with a sufficiently large amount of data, then purely discriminative models typically will lead to fewer errors in discriminative tasks. Nevertheless, depending on how much data is around, it may pay off to study the data generating process as a way to guide the training of the discriminator, such as a classifier. For instance, one may have few labeled examples and many more unlabeled examples. In this semisupervised learning setting, one can use the generative model of the data to improve classification (Kingma et al., 2014; Kaae Sønderby et al., 2016).
Generative modeling can be useful more generally. One can think of it as an auxiliary task. For instance, predicting the immediate future may help us build useful abstractions of the world that can be used for multiple prediction tasks downstream. This quest for disentangled, semantically meaningful, statistically independent and causal factors of variation in data is generally known as unsupervised representation learning and indeed the VAE has been extensively employed for that purpose. Alternatively, one may view this as an implicit form of regularization: by forcing the representations to be meaningful for data generation, we bias the inverse of that process, which maps from input to representation, into a certain mould. The auxiliary task of predicting the world is used to better understand the world at an abstract level and thus to better make downstream predictions.
The VAE can be viewed as two coupled, but independently parameterized models: the encoder or recognition model, and the decoder or generative model. These two models support each other. The recognition model delivers to the generative model an approximation to its posterior over latent random variables, which it needs to update its parameters inside an iteration of “expectation maximization” learning. Reversely, the generative model is a scaffolding of sorts for the recognition model to learn meaningful representations of the data, including possibly classlabels. The recognition model is the approximate inverse of the generative model according to Bayes rule.
One advantage of the VAE framework, relative to ordinary Variational Inference (VI), is that the recognition model (also called inference model) is now a (stochastic) function of the input variables. This in contrast to VI where each datacase has a separate variational distribution, which is inefficient for large datasets. The recognition model uses one set of parameters to model the relation between input and latent variables and as such is called “amortized inference”. This recognition model can be arbitrary complex but is still reasonably fast because by construction it can be done using a single feedforward pass from input to latent variables. However the price we pay is that this sampling induces sampling noise in the gradients required for learning. Perhaps the greatest contribution of the VAE framework is the realization that we can counteract this variance by using what is now known as the “reparameterization trick”, a simple procedure to reorganize our gradient computation that reduces variance in the gradients.
The VAE is inspired by the Helmholtz Machine Dayan et al. (1995) which was perhaps the first model that employed a recognition model. However, its wakesleep algorithm was inefficient and didn’t optimize a single objective. The VAE learning rules instead follow from a single approximation to the maximum likelihood objective.
VAEs marry graphical models and deep learning. The generative model is a Bayesian network of the form $p(\mathbf{x}\mathbf{z})p(\mathbf{z})$, or, if there are multiple stochastic latent layers, a hierarchy such as $p(\mathbf{x}{\mathbf{z}}_{L})p({\mathbf{z}}_{L}{\mathbf{z}}_{L1})\mathrm{\dots}p({\mathbf{z}}_{1}{\mathbf{z}}_{0})$. Similarly, the recognition model is also a conditional Bayesian network of the form $q(\mathbf{z}\mathbf{x})$ or as a hierarchy, such as $q({\mathbf{z}}_{0}{\mathbf{z}}_{1})\mathrm{\dots}q({\mathbf{z}}_{L}X)$. But inside each conditional may hide a complex (deep) neural network, e.g. $\mathbf{z}\mathbf{x}\sim f(\mathbf{x},\mathit{\u03f5})$, with $f$ a neural network mapping and $\mathit{\u03f5}$ a noise random variable. Its learning algorithm is a mix of classical (amortized, variational) expectation maximization but through the reparameterization trick ends up backpropagating through the many layers of the deep neural networks embedded inside of it.
Since its inception, the VAE framework has been extended in many directions, e.g. to dynamical models (Johnson et al., 2016), models with attention (Gregor et al., 2015), models with multiple levels of stochastic latent variables Kingma et al. (2016), and many more. It has proven itself as a fertile framework to build new models in. More recently, another generative modeling paradigm has gained significant attention: the generative adversarial network (GAN) (Goodfellow et al., 2014). VAEs and GANs seem to have complementary properties: while GANs can generate images of high subjective perceptual quality, they tend to lack full support over the data (Grover et al., 2017), as opposed to likelihoodbased generative models. VAEs, like other likelihoodbased models, generate more dispersed samples, but are better density models in terms of the likelihood criterion. As such many hybrid models have been proposed to try to represent the best of both worlds (Dumoulin et al., 2016; Grover et al., 2017; Rosca et al., 2018).
As a community we seem to have embraced the fact that generative models and unsupervised learning play an important role in building intelligent machines. We hope that the VAE provides a useful piece of that puzzle.
1.2 Aim
The framework of variational autoencoders (VAEs) (Kingma and Welling, 2013; Rezende et al., 2014) provides a principled method for jointly learning deep latentvariable models and corresponding inference models using stochastic gradient descent. The framework has a wide array of applications from generative modeling, semisupervised learning to representation learning.
This work is meant as an expanded version of our earlier work (Kingma and Welling, 2013), allowing us to explain the topic in finer detail and to discuss a selection of important followup work. This is not aimed to be a comprehensive review of all related work. We assume that the reader has basic knowledge of algebra, calculus and probability theory.
In this chapter we discuss background material: probabilistic models, directed graphical models, the marriage of directed graphical models with neural networks, learning in fully observed models and deep latentvariable models (DLVMs). In chapter 2 we explain the basics of VAEs. In chapter 3 we explain advanced inference techniques, followed by an explanation of advanced generative models in chapter 4. Please refer to section 6.1 for more information on mathematical notation.
1.3 Probabilistic Models and Variational Inference
In the field of machine learning, we are often interested in learning probabilistic models of various natural and artificial phenomena from data. Probabilistic models are mathematical descriptions of such phenomena. They are useful for understanding such phenomena, for prediction of unknowns in the future, and for various forms of assisted or automated decision making. As such, probabilistic models formalize the notion of knowledge and skill, and are central constructs in the field of machine learning and AI.
As probabilistic models contain unknowns and the data rarely paints a complete picture of the unknowns, we typically need to assume some level of uncertainty over aspects of the model. The degree and nature of this uncertainty is specified in terms of (conditional) probability distributions. Models may consist of both continuousvalued variables and discretevalued variables. The, in some sense, most complete forms of probabilistic models specify all correlations and higherorder dependencies between the variables in the model, in the form of a joint probability distribution over those variables.
Let’s use $\mathbf{x}$ as the vector representing the set of all observed variables whose joint distribution we would like to model. Note that for notational simplicity and to avoid clutter, we use lower case bold (e.g. $\mathbf{x}$) to denote the underlying set of observed random variables, i.e. flattened and concatenated such that the set is represented as a single vector. See section 6.1 for more on notation.
We assume the observed variable $\mathbf{x}$ is a random sample from an unknown underlying process, whose true (probability) distribution ${p}^{*}(\mathbf{x})$ is unknown. We attempt to approximate this underlying process with a chosen model ${p}_{\bm{\theta}}(\mathbf{x})$, with parameters $\bm{\theta}$:
$\mathbf{x}$  $\sim {p}_{\bm{\theta}}(\mathbf{x})$  (1.1) 
Learning is, most commonly, the process of searching for a value of the parameters $\bm{\theta}$ such that the probability distribution function given by the model, ${p}_{\bm{\theta}}(\mathbf{x})$, approximates the true distribution of the data, denoted by ${p}^{*}(\mathbf{x})$, such that for any observed $\mathbf{x}$:
${p}_{\bm{\theta}}(\mathbf{x})\approx {p}^{*}(\mathbf{x})$  (1.2) 
Naturally, we wish ${p}_{\bm{\theta}}(\mathbf{x})$ to be sufficiently flexible to be able to adapt to the data, such that we have a chance of obtaining a sufficiently accurate model. At the same time, we wish to be able to incorporate knowledge about the distribution of data into the model that is known a priori.
1.3.1 Conditional Models
Often, such as in case of classification or regression problems, we are not interested in learning an unconditional model ${p}_{\bm{\theta}}(\mathbf{x})$, but a conditional model ${p}_{\bm{\theta}}(\mathbf{y}\mathbf{x})$ that approximates the underlying conditional distribution ${p}^{*}(\mathbf{y}\mathbf{x})$: a distribution over the values of variable $\mathbf{y}$, conditioned on the value of an observed variable $\mathbf{x}$. In this case, $\mathbf{x}$ is often called the input of the model. Like in the unconditional case, a model ${p}_{\bm{\theta}}(\mathbf{y}\mathbf{x})$ is chosen, and optimized to be close to the unknown underlying distribution, such that for any $\mathbf{x}$ and $\mathbf{y}$:
${p}_{\bm{\theta}}(\mathbf{y}\mathbf{x})\approx {p}^{*}(\mathbf{y}\mathbf{x})$  (1.3) 
A relatively common and simple example of conditional modeling is image classification, where $\mathbf{x}$ is an image, and $\mathbf{y}$ is the image’s class, as labeled by a human, which we wish to predict. In this case, ${p}_{\bm{\theta}}(\mathbf{y}\mathbf{x})$ is typically chosen to be a categorical distribution, whose parameters are computed from $\mathbf{x}$.
Conditional models become more difficult to learn when the predicted variables are very highdimensional, such as images, video or sound. One example is the reverse of the image classification problem: prediction of a distribution over images, conditioned on the class label. Another example with both highdimensional input, and highdimensional output, is time series prediction, such as text or video prediction.
To avoid notational clutter we will often assume unconditional modeling, but one should always keep in mind that the methods introduced in this work are, in almost all cases, equally applicable to conditional models. The data on which the model is conditioned, can be treated as inputs to the model, similar to the parameters of the model, with the obvious difference that one doesn’t optimize over their value.
1.4 Parameterizing Conditional Distributions with Neural Networks
Differentiable feedforward neural networks, from here just called neural networks, are a particularly flexible and computationally scalable type of function approximator. Learning of models based on neural networks with multiple ’hidden’ layers of artificial neurons is often referred to as deep learning (Goodfellow et al., 2016; LeCun et al., 2015). A particularly interesting application is probabilistic models, i.e. the use of neural networks for probability density functions (PDFs) or probability mass functions (PMFs) in probabilistic models. Probabilistic models based on neural networks are computationally scalable since they allow for stochastic gradientbased optimization which, as we will explain, allows scaling to large models and large datasets. We will denote a deep neural network as a vector function: $\text{NeuralNet}(.)$.
At the time of writing, deep learning has been shown to work well for a large variety of classification and regression problems, as summarized in (LeCun et al., 2015; Goodfellow et al., 2016). In case of neuralnetwork based image classification LeCun et al. (1998), for example, neural networks parameterize a categorical distribution ${p}_{\bm{\theta}}(y\mathbf{x})$ over a class label $y$, conditioned on an image $\mathbf{x}$.
$\mathbf{p}$  $=\text{NeuralNet}(\mathbf{x})$  (1.4)  
${p}_{\bm{\theta}}(y\mathbf{x})$  $=\text{Categorical}(y;\mathbf{p})$  (1.5) 
where the last operation of $\text{NeuralNet}(.)$ is typically a softmax() function such that ${\sum}_{i}{p}_{i}=1$.
1.5 Directed Graphical Models and Neural Networks
We work with directed probabilistic models, also called directed probabilistic graphical models (PGMs), or Bayesian networks. Directed graphical models are a type of probabilistic models where all the variables are topologically organized into a directed acyclic graph. The joint distribution over the variables of such models factorizes as a product of prior and conditional distributions:
${p}_{\bm{\theta}}({\mathbf{x}}_{1},\mathrm{\dots},{\mathbf{x}}_{M})={\displaystyle \prod _{j=1}^{M}}{p}_{\bm{\theta}}({\mathbf{x}}_{j}Pa({\mathbf{x}}_{j}))$  (1.6) 
where $Pa({\mathbf{x}}_{j})$ is the set of parent variables of node $j$ in the directed graph. For nonrootnodes, we condition on the parents. For root nodes, the set of parents is the empty set, such that the distribution is unconditional.
Traditionally, each conditional probability distribution ${p}_{\bm{\theta}}({\mathbf{x}}_{j}Pa({\mathbf{x}}_{j}))$ is parameterized as a lookup table or a linear model (Koller and Friedman, 2009). As we explained above, a more flexible way to parameterize such conditional distributions is with neural networks. In this case, neural networks take as input the parents of a variable in a directed graph, and produce the distributional parameters $\bm{\eta}$ over that variable:
$\bm{\eta}$  $=\text{NeuralNet}(Pa(\mathbf{x}))$  (1.7)  
${p}_{\bm{\theta}}(\mathbf{x}Pa(\mathbf{x}))$  $={p}_{\bm{\theta}}(\mathbf{x}\bm{\eta})$  (1.8) 
We will now discuss how to learn the parameters of such models, if all the variables are observed in the data.
1.6 Learning in Fully Observed Models with Neural Nets
If all variables in the directed graphical model are observed in the data, then we can compute and differentiate the logprobability of the data under the model, leading to relatively straightforward optimization.
1.6.1 Dataset
We often collect a dataset $\mathcal{D}$ consisting of $N\ge 1$ datapoints:
$\mathcal{D}=\{{\mathbf{x}}^{(1)},{\mathbf{x}}^{(2)},\mathrm{\dots},{\mathbf{x}}^{(N)}\}\equiv {\{{\mathbf{x}}^{(i)}\}}_{i=1}^{N}\equiv {\mathbf{x}}^{(1:N)}$  (1.9) 
The datapoints are assumed to be independent samples from an unchanging underlying distribution. In other words, the dataset is assumed to consist of distinct, independent measurements from the same (unchanging) system. In this case, the observations $\mathcal{D}={\{{\mathbf{x}}^{(i)}\}}_{i=1}^{N}$ are said to be i.i.d., for independently and identically distributed. Under the i.i.d. assumption, the probability of the datapoints given the parameters factorizes as a product of individual datapoint probabilities. The logprobability assigned to the data by the model is therefore given by:
$\mathrm{log}{p}_{\bm{\theta}}(\mathcal{D})$  $={\displaystyle \sum _{\mathbf{x}\in \mathcal{D}}}\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x})$  (1.10) 
1.6.2 Maximum Likelihood and Minibatch SGD
The most common criterion for probabilistic models is maximum loglikelihood (ML). As we will explain, maximization of the loglikelihood criterion is equivalent to minimization of a Kullback Leibler divergence between the data and model distributions.
Under the ML criterion, we attempt to find the parameters $\bm{\theta}$ that maximize the sum, or equivalently the average, of the logprobabilities assigned to the data by the model. With i.i.d. dataset $\mathcal{D}$ of size ${N}_{\mathcal{D}}$, the maximum likelihood objective is to maximize the logprobability given by equation (1.10).
Using calculus’ chain rule and automatic differentiation tools, we can efficiently compute gradients of this objective, i.e. the first derivatives the objective w.r.t. its parameters $\bm{\theta}$. We can use such gradients to iteratively hillclimb to a local optimum of the ML objective. If we compute such gradients using all datapoints, ${\nabla}_{\bm{\theta}}\mathrm{log}{p}_{\bm{\theta}}(\mathcal{D})$, then this is known as batch gradient descent. Computation of this derivative is, however, an expensive operation for large dataset size ${N}_{\mathcal{D}}$, since it scales linearly with ${N}_{\mathcal{D}}$.
A more efficient method for optimization is stochastic gradient descent (SGD) (section 6.3), which uses randomly drawn minibatches of data $\mathcal{M}\subset \mathcal{D}$ of size ${N}_{\mathcal{M}}$. With such minibatches we can form an unbiased estimator of the ML criterion:
$\frac{1}{{N}_{\mathcal{D}}}}\mathrm{log}{p}_{\bm{\theta}}(\mathcal{D})\simeq {\displaystyle \frac{1}{{N}_{\mathcal{M}}}}\mathrm{log}{p}_{\bm{\theta}}(\mathcal{M})={\displaystyle \frac{1}{{N}_{\mathcal{M}}}}{\displaystyle \sum _{\mathbf{x}\in \mathcal{M}}}\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x})$  (1.11) 
The $\simeq $ symbol means that one of the two sides is an unbiased estimator of the other side. So one side (in this case the righthand side) is a random variable due to some noise source, and the two sides are equal when averaged over the noise distribution. The noise source, in this case, is the randomly drawn minibatch of data $\mathcal{M}$. The unbiased estimator $\mathrm{log}{p}_{\bm{\theta}}(\mathcal{M})$ is differentiable, yielding the unbiased stochastic gradients:
$\frac{1}{{N}_{\mathcal{D}}}}{\nabla}_{\bm{\theta}}\mathrm{log}{p}_{\bm{\theta}}(\mathcal{D})\simeq {\displaystyle \frac{1}{{N}_{\mathcal{M}}}}{\nabla}_{\bm{\theta}}\mathrm{log}{p}_{\bm{\theta}}(\mathcal{M})={\displaystyle \frac{1}{{N}_{\mathcal{M}}}}{\displaystyle \sum _{\mathbf{x}\in \mathcal{M}}}{\nabla}_{\bm{\theta}}\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x})$  (1.12) 
These gradients can be plugged into stochastic gradientbased optimizers; see section 6.3 for further discussion. In a nutshell, we can optimize the objective function by repeatedly taking small steps in the direction of the stochastic gradient.
1.6.3 Bayesian inference
1.7 Learning and Inference in Deep Latent Variable Models
1.7.1 Latent Variables
We can extend fullyobserved directed models, discussed in the previous section, into directed models with latent variables. Latent variables are variables that are part of the model, but which we don’t observe, and are therefore not part of the dataset. We typically use $\mathbf{z}$ to denote such latent variables. In case of unconditional modeling of observed variable $\mathbf{x}$, the directed graphical model would then represent a joint distribution ${p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})$ over both the observed variables $\mathbf{x}$ and the latent variables $\mathbf{z}$. The marginal distribution over the observed variables ${p}_{\bm{\theta}}(\mathbf{x})$, is given by:
${p}_{\bm{\theta}}(\mathbf{x})={\displaystyle \int {p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})\mathit{d}\mathbf{z}}$  (1.13) 
This is also called the (single datapoint) marginal likelihood or the model evidence, when taking as a function of $\bm{\theta}$.
Such an implicit distribution over $\mathbf{x}$ can be quite flexible. If $\mathbf{z}$ is discrete and ${p}_{\bm{\theta}}(\mathbf{x}\mathbf{z})$ is a Gaussian distribution, then ${p}_{\bm{\theta}}(\mathbf{x})$ is a mixtureofGaussians distribution. For continuous $\mathbf{z}$ (which is generally more efficient to work with due to the reparameterization trick), ${p}_{\bm{\theta}}(\mathbf{x})$ can be seen as an infinite mixture, which are potentially more powerful than discrete mixtures. Such marginal distributions ares also called compound probability distributions.
1.7.2 Deep Latent Variable Models
We use the term deep latent variable model (DLVM) to denote a latent variable model ${p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})$ whose distributions are parameterized by neural networks. Such a model can be conditioned on some context, like ${p}_{\bm{\theta}}(\mathbf{x},\mathbf{z}\mathbf{y})$. DLVMs are especially useful when inference is computationally affordable, such as representation learning and artificial creativity
One important advantage of DLVMs, is that even when each factor (prior or conditional distribution) in the directed model is relatively simple (such as conditional Gaussian), the marginal distribution ${p}_{\bm{\theta}}(\mathbf{x})$ can be very complex, i.e. contain almost arbitrary dependencies. This expressivity makes deep latentvariable models attractive for approximating complicated underlying distributions ${p}^{*}(\mathbf{x})$.
Perhaps the simplest, and most common, graphical model with latent variables is one that is specified as factorization with the following structure:
${p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})={p}_{\bm{\theta}}(\mathbf{z}){p}_{\bm{\theta}}(\mathbf{x}\mathbf{z})$  (1.14) 
where ${p}_{\bm{\theta}}(\mathbf{z})$ and/or ${p}_{\bm{\theta}}(\mathbf{x}\mathbf{z})$ are specified. The distribution $p(\mathbf{z})$ is often called the prior distribution over $\mathbf{z}$, since it is not conditioned on any observations.
1.7.3 Example DLVM for multivariate Bernoulli data
A simple example DLVM, used in (Kingma and Welling, 2013) for binary data $\mathbf{x}$, is with a spherical Gaussian latent space, and a factorized Bernoulli observation model:
$p(\mathbf{z})$  $=\mathcal{N}(\mathbf{z};0,\mathbf{I})$  (1.15)  
$\mathbf{p}$  $={\text{DecoderNeuralNet}}_{\bm{\theta}}(\mathbf{z})$  (1.16)  
$\mathrm{log}p(\mathbf{x}\mathbf{z})$  $={\displaystyle \sum _{j=1}^{D}}\mathrm{log}p({x}_{j}\mathbf{z})={\displaystyle \sum _{j=1}^{D}}\mathrm{log}\text{Bernoulli}({x}_{j};{p}_{j})$  (1.17)  
$={\displaystyle \sum _{j=1}^{D}}{x}_{j}\mathrm{log}{p}_{j}+(1{x}_{j})\mathrm{log}(1{p}_{j})$  (1.18) 
where $\forall {p}_{j}\in \mathbf{p}:0\le {p}_{j}\le 1$ (e.g. implemented through a sigmoid nonlinearity as the last layer of the ${\text{DecoderNeuralNet}}_{\bm{\theta}}(.)$), where $D$ is the dimensionality of $\mathbf{x}$, and $\text{Bernoulli}(.;p)$ is the probability mass function (PMF) of the Bernoulli distribution.
1.8 Intractabilities
The main difficulty of maximum likelihood learning in DLVMs is that the marginal probability of data under the model is typically intractable. This is due to the integral in equation (1.13) for computing the marginal likelihood (or model evidence), ${p}_{\bm{\theta}}(\mathbf{x})=\int {p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})\mathit{d}\mathbf{z}$, not having an analytic solution or efficient estimator. Due to this intractability, we cannot differentiate it w.r.t. its parameters and optimize it, as we can with fully observed models.
The intractability of ${p}_{\bm{\theta}}(\mathbf{x})$, is related to the intractability of the posterior distribution ${p}_{\bm{\theta}}(\mathbf{z}\mathbf{x})$. Note that the joint distribution ${p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})$ is efficient to compute, and that the densities are related through the basic identity:
${p}_{\bm{\theta}}(\mathbf{z}\mathbf{x})={\displaystyle \frac{{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})}{{p}_{\bm{\theta}}(\mathbf{x})}}$  (1.19) 
Since ${p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})$ is tractable to compute, a tractable marginal likelihood ${p}_{\bm{\theta}}(\mathbf{x})$ leads to a tractable posterior ${p}_{\bm{\theta}}(\mathbf{z}\mathbf{x})$, and vice versa. Both are intractable in DLVMs.
Approximate inference techniques (see also section 6.2) allow us to approximate the posterior ${p}_{\bm{\theta}}(\mathbf{z}\mathbf{x})$ and the marginal likelihood ${p}_{\bm{\theta}}(\mathbf{x})$ in DLVMs. Traditional inference methods are relatively expensive. Such methods, for example, often require a perdatapoint optimization loop, or yield bad posterior approximations. We would like to avoid such expensive procedures.
Likewise, the posterior over the parameters of (directed models parameterized with) neural networks, $p(\bm{\theta}\mathcal{D})$, is generally intractable to compute exactly, and requires approximate inference techniques.
Chapter 2 Variational Autoencoders
In this chapter we explain the basics of variational autoencoders (VAEs).
2.1 Encoder or Approximate Posterior
In the previous chapter, we introduced deep latentvariable models (DLVMs), and the problem of estimating the loglikelihood and posterior distributions in such models. The framework of variational autoencoders (VAEs) provides a computationally efficient way for optimizing DLVMs jointly with a corresponding inference model using SGD.
To turn the DLVM’s intractable posterior inference and learning problems into tractable problems, we introduce a parametric inference model ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$. This model is also called an encoder or recognition model. With $\mathit{\varphi}$ we indicate the parameters of this inference model, also called the variational parameters. We optimize the variational parameters $\mathit{\varphi}$ such that:
${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})\approx {p}_{\bm{\theta}}(\mathbf{z}\mathbf{x})$  (2.1) 
As we will explain, this approximation to the posterior help us optimize the marginal likelihood.
Like a DLVM, the inference model can be (almost) any directed graphical model:
${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})={q}_{\mathit{\varphi}}({\mathbf{z}}_{1},\mathrm{\dots},{\mathbf{z}}_{M}\mathbf{x})={\displaystyle \prod _{j=1}^{M}}{q}_{\mathit{\varphi}}({\mathbf{z}}_{j}Pa({\mathbf{z}}_{j}),\mathbf{x})$  (2.2) 
where $Pa({\mathbf{z}}_{j})$ is the set of parent variables of variable ${\mathbf{z}}_{j}$ in the directed graph. And also similar to a DLVM, the distribution ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$ can be parameterized using deep neural networks. In this case, the variational parameters $\mathit{\varphi}$ include the weights and biases of the neural network. For example:
$(\bm{\mu},\mathrm{log}\bm{\sigma})$  $={\text{EncoderNeuralNet}}_{\mathit{\varphi}}(\mathbf{x})$  (2.3)  
${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$  $=\mathcal{N}(\mathbf{z};\bm{\mu},\text{diag}(\bm{\sigma}))$  (2.4) 
Typically, we use a single encoder neural network to perform posterior inference over all of the datapoints in our dataset. This can be contrasted to more traditional variational inference methods where the variational parameters are not shared, but instead separately and iteratively optimized per datapoint. The strategy used in VAEs of sharing variational parameters across datapoints is also called amortized variational inference (Gershman and Goodman, 2014). With amortized inference we can avoid a perdatapoint optimization loop, and leverage the efficiency of SGD.
2.2 Evidence Lower Bound (ELBO)
The optimization objective of the variational autoencoder, like in other variational methods, is the evidence lower bound, abbreviated as ELBO. An alternative term for this objective is variational lower bound. Typically, the ELBO is derived through Jensen’s inequality. Here we will use an alternative derivation that avoids Jensen’s inequality, providing greater insight about its tightness.
For any choice of inference model ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$, including the choice of variational parameters $\mathit{\varphi}$, we have:
$\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x})$  $={\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}\left[\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x})\right]$  (2.5)  
$={\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}\left[\mathrm{log}\left[{\displaystyle \frac{{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})}{{p}_{\bm{\theta}}(\mathbf{z}\mathbf{x})}}\right]\right]$  (2.6)  
$={\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}\left[\mathrm{log}\left[{\displaystyle \frac{{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})}{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}}{\displaystyle \frac{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}{{p}_{\bm{\theta}}(\mathbf{z}\mathbf{x})}}\right]\right]$  (2.7)  
$=\underset{\begin{array}{c}={\mathcal{L}}_{\bm{\theta},\mathit{\varphi}}(\mathbf{x})\\ \text{(ELBO)}\end{array}}{\underset{\u23df}{{\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}\left[\mathrm{log}\left[{\displaystyle \frac{{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})}{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}}\right]\right]}}+\underset{={D}_{KL}({q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x}){p}_{\bm{\theta}}(\mathbf{z}\mathbf{x}))}{\underset{\u23df}{{\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}\left[\mathrm{log}\left[{\displaystyle \frac{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}{{p}_{\bm{\theta}}(\mathbf{z}\mathbf{x})}}\right]\right]}}$  (2.8) 
The second term in eq. (2.8) is the KullbackLeibler (KL) divergence between ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$ and ${p}_{\bm{\theta}}(\mathbf{z}\mathbf{x})$, which is nonnegative:
${D}_{KL}({q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x}){p}_{\bm{\theta}}(\mathbf{z}\mathbf{x}))\ge 0$  (2.9) 
and zero if, and only if, ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$ equals the true posterior distribution.
The first term in eq. (2.8) is the variational lower bound, also called the evidence lower bound (ELBO):
${\mathcal{L}}_{\bm{\theta},\mathit{\varphi}}(\mathbf{x})={\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}\left[\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})\right]$  (2.10) 
Due to the nonnegativity of the KL divergence, the ELBO is a lower bound on the loglikelihood of the data.
${\mathcal{L}}_{\bm{\theta},\mathit{\varphi}}(\mathbf{x})$  $=\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x}){D}_{KL}({q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x}){p}_{\bm{\theta}}(\mathbf{z}\mathbf{x}))$  (2.11)  
$\le \mathrm{log}{p}_{\bm{\theta}}(\mathbf{x})$  (2.12) 
So, interestingly, the KL divergence ${D}_{KL}({q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x}){p}_{\bm{\theta}}(\mathbf{z}\mathbf{x}))$ determines two ’distances’:

1.
By definition, the KL divergence of the approximate posterior from the true posterior;

2.
The gap between the ELBO ${\mathcal{L}}_{\bm{\theta},\mathit{\varphi}}(\mathbf{x})$ and the marginal likelihood $\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x})$; this is also called the tightness of the bound. The better ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$ approximates the true (posterior) distribution ${p}_{\bm{\theta}}(\mathbf{z}\mathbf{x})$, in terms of the KL divergence, the smaller the gap.
2.2.1 Two for One
By looking at equation 2.11, it can be understood that maximization of the ELBO ${\mathcal{L}}_{\bm{\theta},\mathit{\varphi}}(\mathbf{x})$ w.r.t. the parameters $\bm{\theta}$ and $\mathit{\varphi}$, will concurrently optimize the two things we care about:

1.
It will approximately maximize the marginal likelihood ${p}_{\bm{\theta}}(\mathbf{x})$. This means that our generative model will become better.

2.
It will minimize the KL divergence of the approximation ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$ from the true posterior ${p}_{\bm{\theta}}(\mathbf{z}\mathbf{x})$, so ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$ becomes better.
2.3 Stochastic GradientBased Optimization of the ELBO
An important property of the ELBO, is that it allows joint optimization w.r.t. all parameters ($\mathit{\varphi}$ and $\bm{\theta}$) using stochastic gradient descent (SGD). We can start out with random initial values of $\mathit{\varphi}$ and $\bm{\theta}$, and stochastically optimize their values until convergence.
Given a dataset with i.i.d. data, the ELBO objective is the sum (or average) of individualdatapoint ELBO’s:
${\mathcal{L}}_{\bm{\theta},\mathit{\varphi}}(\mathcal{D})={\displaystyle \sum _{\mathbf{x}\in \mathcal{D}}}{\mathcal{L}}_{\bm{\theta},\mathit{\varphi}}(\mathbf{x})$  (2.13) 
The individualdatapoint ELBO, and its gradient ${\nabla}_{\bm{\theta},\mathit{\varphi}}{\mathcal{L}}_{\bm{\theta},\mathit{\varphi}}(\mathbf{x})$ is, in general, intractable. However, good unbiased estimators ${\stackrel{~}{\nabla}}_{\bm{\theta},\mathit{\varphi}}{\mathcal{L}}_{\bm{\theta},\mathit{\varphi}}(\mathbf{x})$ exist, as we will show, such that we can still perform minibatch SGD.
Unbiased gradients of the ELBO w.r.t. the generative model parameters are simple to obtain:
${\nabla}_{\bm{\theta}}{\mathcal{L}}_{\bm{\theta},\mathit{\varphi}}(\mathbf{x})$  $={\nabla}_{\bm{\theta}}{\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}\left[\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})\right]$  (2.14)  
$={\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}\left[{\nabla}_{\bm{\theta}}(\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x}))\right]$  (2.15)  
$\simeq {\nabla}_{\bm{\theta}}(\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x}))$  (2.16)  
$={\nabla}_{\bm{\theta}}(\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z}))$  (2.17) 
The last line (eq. (2.17)) is a simple Monte Carlo estimator of the second line (eq. (2.15)), where $\mathbf{z}$ in (2.17), the last line, is a random sample from ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$.
Unbiased gradients w.r.t. the variational parameters $\mathit{\varphi}$ are more difficult to obtain, since the ELBO’s expectation is taken w.r.t. the distribution ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$, which is a function of $\mathit{\varphi}$. I.e., in general:
${\nabla}_{\mathit{\varphi}}{\mathcal{L}}_{\bm{\theta},\mathit{\varphi}}(\mathbf{x})$  $={\nabla}_{\mathit{\varphi}}{\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}\left[\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})\right]$  (2.18)  
$\ne {\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}\left[{\nabla}_{\mathit{\varphi}}(\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x}))\right]$  (2.19) 
In case of continuous latent variables, we can use a reparameterization trick for computing unbiased estimated of ${\nabla}_{\bm{\theta},\mathit{\varphi}}{\mathcal{L}}_{\bm{\theta},\mathit{\varphi}}(\mathbf{x})$, as we will now discuss. This stochastic estimate allows us to optimize the ELBO using SGD; see algorithm 1. See section 2.9.1 for a discussion of variational methods for discrete latent variables.
2.4 Reparameterization Trick
For continuous latent variables and a differentiable encoder and generative model, the ELBO can be straightforwardly differentiated w.r.t. both $\mathit{\varphi}$ and $\bm{\theta}$ through a change of variables, also called the reparameterization trick (Kingma and Welling (2013) and Rezende et al. (2014)).
2.4.1 Change of variables
First, we express the random variable $\mathbf{z}\sim {q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$ as some differentiable (and invertible) transformation of another random variable $\mathit{\u03f5}$, given $\mathbf{z}$ and $\mathit{\varphi}$:
$\mathbf{z}=\mathbf{g}(\mathit{\u03f5},\mathit{\varphi},\mathbf{x})$  (2.20) 
where the distribution of random variable $\mathit{\u03f5}$ is independent of $\mathbf{x}$ or $\mathit{\varphi}$.
2.4.2 Gradient of expectation under change of variable
Given such a change of variable, expectations can be rewritten in terms of $\mathit{\u03f5}$,
${\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}\left[f(\mathbf{z})\right]={\mathbb{E}}_{p(\mathit{\u03f5})}\left[f(\mathbf{z})\right]$  (2.21) 
where $\mathbf{z}=\mathbf{g}(\mathit{\u03f5},\mathit{\varphi},\mathbf{x})$. and the expectation and gradient operators become commutative, and we can form a simple Monte Carlo estimator:
${\nabla}_{\mathit{\varphi}}{\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}\left[f(\mathbf{z})\right]$  $={\nabla}_{\mathit{\varphi}}{\mathbb{E}}_{p(\mathit{\u03f5})}\left[f(\mathbf{z})\right]$  (2.22)  
$={\mathbb{E}}_{p(\mathit{\u03f5})}\left[{\nabla}_{\mathit{\varphi}}f(\mathbf{z})\right]$  (2.23)  
$\simeq {\nabla}_{\mathit{\varphi}}f(\mathbf{z})$  (2.24) 
where in the last line, $\mathbf{z}=\mathbf{g}(\mathit{\varphi},\mathbf{x},\mathit{\u03f5})$ with random noise sample $\mathit{\u03f5}\sim p(\mathit{\u03f5})$. See figure 2.3 for an illustration and further clarification.
2.4.3 Gradient of ELBO
Under the reparameterization, we can replace an expectation w.r.t. ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$ with one w.r.t. $p(\mathit{\u03f5})$. The ELBO can be rewritten as:
${\mathcal{L}}_{\bm{\theta},\mathit{\varphi}}(\mathbf{x})$  $={\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}\left[\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})\right]$  (2.25)  
$={\mathbb{E}}_{p(\mathit{\u03f5})}\left[\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})\right]$  (2.26) 
where $\mathbf{z}=g(\mathit{\u03f5},\mathit{\varphi},\mathbf{x})$.
As a result we can form a simple Monte Carlo estimator ${\stackrel{~}{\mathcal{L}}}_{\bm{\theta},\mathit{\varphi}}(\mathbf{x})$ of the individualdatapoint ELBO where we use a single noise sample $\mathit{\u03f5}$ from $p(\mathit{\u03f5})$:
$\mathit{\u03f5}$  $\sim p(\mathit{\u03f5})$  (2.27)  
$\mathbf{z}$  $=\mathbf{g}(\mathit{\varphi},\mathbf{x},\mathit{\u03f5})$  (2.28)  
${\stackrel{~}{\mathcal{L}}}_{\bm{\theta},\mathit{\varphi}}(\mathbf{x})$  $=\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$  (2.29) 
This series of operations can be expressed as a symbolic graph in software like Tensorflow, and effortlessly differentiated w.r.t. the parameters $\bm{\theta}$ and $\mathit{\varphi}$. The resulting gradient ${\nabla}_{\mathit{\varphi}}{\stackrel{~}{\mathcal{L}}}_{\bm{\theta},\mathit{\varphi}}(\mathbf{x})$ is used to optimize the ELBO using minibatch SGD. See algorithm 1. This algorithm was originally referred to as the AutoEncoding Variational Bayes (AEVB) algorithm by Kingma and Welling (2013). More generally, the reparameterized ELBO estimator was referred to as the Stochastic Gradient Variational Bayes (SGVB) estimator. This estimator can also be used to estimate a posterior over the model parameters, as explained in the appendix of (Kingma and Welling, 2013).
2.4.3.1 Unbiasedness
This gradient is an unbiased estimator of the exact singledatapoint ELBO gradient; when averaged over noise $\mathit{\u03f5}\sim p(\mathit{\u03f5})$, this gradient equals the singledatapoint ELBO gradient:
${\mathbb{E}}_{p(\mathit{\u03f5})}\left[{\nabla}_{\bm{\theta},\mathit{\varphi}}{\stackrel{~}{\mathcal{L}}}_{\bm{\theta},\mathit{\varphi}}(\mathbf{x};\mathit{\u03f5})\right]$  $={\mathbb{E}}_{p(\mathit{\u03f5})}\left[{\nabla}_{\bm{\theta},\mathit{\varphi}}(\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x}))\right]$  (2.30)  
$={\nabla}_{\bm{\theta},\mathit{\varphi}}({\mathbb{E}}_{p(\mathit{\u03f5})}\left[\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})\right])$  (2.31)  
$={\nabla}_{\bm{\theta},\mathit{\varphi}}{\mathcal{L}}_{\bm{\theta},\mathit{\varphi}}(\mathbf{x})$  (2.32) 
2.4.4 Computation of $\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$
Computation of the (estimator of) the ELBO requires computation of the density $\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$, given a value of $\mathbf{x}$, and given a value of $\mathbf{z}$ or equivalently $\mathit{\u03f5}$. This logdensity is a simple computation, as long as we choose the right transformation $\mathbf{g}()$.
Note that we typically know the density $p(\mathit{\u03f5})$, since this is the density of the chosen noise distribution. As long as $\mathbf{g}(.)$ is an invertible function, the densities of $\mathit{\u03f5}$ and $\mathbf{z}$ are related as:
$\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})=\mathrm{log}p(\mathit{\u03f5})\mathrm{log}{d}_{\mathit{\varphi}}(\mathbf{x},\mathit{\u03f5})$  (2.33) 
where the second term is the log of the absolute value of the determinant of the Jacobian matrix $(\partial \mathbf{z}/\partial \mathit{\u03f5})$:
$\mathrm{log}{d}_{\mathit{\varphi}}(\mathbf{x},\mathit{\u03f5})=\mathrm{log}\leftdet\left({\displaystyle \frac{\partial \mathbf{z}}{\partial \mathit{\u03f5}}}\right)\right$  (2.34) 
We call this the logdeterminant of the transformation from $\mathit{\u03f5}$ to $\mathbf{z}$. We use the notation $\mathrm{log}{d}_{\mathit{\varphi}}(\mathbf{x},\mathit{\u03f5})$ to make explicit that this logdeterminant, similar to $\mathbf{g}()$, is a function of $\mathbf{x}$, $\mathit{\u03f5}$ and $\mathit{\varphi}$. The Jacobian matrix contains all first derivatives of the transformation from $\mathit{\u03f5}$ to $\mathbf{z}$:
$\frac{\partial \mathbf{z}}{\partial \mathit{\u03f5}}}={\displaystyle \frac{\partial ({z}_{1},\mathrm{\dots},{z}_{k})}{\partial ({\u03f5}_{1},\mathrm{\dots},{\u03f5}_{k})}}=\left(\begin{array}{ccc}\hfill \frac{\partial {z}_{1}}{\partial {\u03f5}_{1}}\hfill & \hfill \mathrm{\cdots}\hfill & \hfill \frac{\partial {z}_{1}}{\partial {\u03f5}_{k}}\hfill \\ \hfill \mathrm{\vdots}\hfill & \hfill \mathrm{\ddots}\hfill & \hfill \mathrm{\vdots}\hfill \\ \hfill \frac{\partial {z}_{k}}{\partial {\u03f5}_{1}}\hfill & \hfill \mathrm{\cdots}\hfill & \hfill \frac{\partial {z}_{k}}{\partial {\u03f5}_{k}}\hfill \end{array}\right)$  (2.35) 
As we will show, we can build very flexible transformations $\mathbf{g}()$ for which $\mathrm{log}{d}_{\mathit{\varphi}}(\mathbf{x},\mathit{\u03f5})$ is simple to compute, resulting in highly flexible inference models ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$.
2.5 Factorized Gaussian posteriors
A common choice is a simple factorized Gaussian encoder
${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})=\mathcal{N}(\mathbf{z};\bm{\mu},\text{diag}({\bm{\sigma}}^{2}))$:
$(\bm{\mu},\mathrm{log}\bm{\sigma})$  $={\text{EncoderNeuralNet}}_{\mathit{\varphi}}(\mathbf{x})$  (2.36)  
${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$  $={\displaystyle \prod _{i}}{q}_{\mathit{\varphi}}({z}_{i}\mathbf{x})={\displaystyle \prod _{i}}\mathcal{N}({z}_{i};{\mu}_{i},{\sigma}_{i}^{2})$  (2.37) 
where $\mathcal{N}({z}_{i};{\mu}_{i},{\sigma}_{i}^{2})$ is the PDF of the univariate Gaussian distribution. After reparameterization, we can write:
$\mathit{\u03f5}$  $\sim \mathcal{N}(0,\mathbf{I})$  (2.38)  
$(\bm{\mu},\mathrm{log}\bm{\sigma})$  $={\text{EncoderNeuralNet}}_{\mathit{\varphi}}(\mathbf{x})$  (2.39)  
$\mathbf{z}$  $=\bm{\mu}+\bm{\sigma}\odot \mathit{\u03f5}$  (2.40) 
where $\odot $ is the elementwise product. The Jacobian of the transformation from $\mathit{\u03f5}$ to $\mathbf{z}$ is:
$\frac{\partial \mathbf{z}}{\partial \mathit{\u03f5}}}=\text{diag}(\bm{\sigma}),$  (2.41) 
i.e. a diagonal matrix with the elements of $\bm{\sigma}$ on the diagonal. The determinant of a diagonal (or more generally, triangular) matrix is the product of its diagonal terms. The log determinant of the Jacobian is therefore:
$\mathrm{log}{d}_{\mathit{\varphi}}(\mathbf{x},\mathit{\u03f5})=\mathrm{log}\leftdet\left({\displaystyle \frac{\partial \mathbf{z}}{\partial \mathit{\u03f5}}}\right)\right={\displaystyle \sum _{i}}\mathrm{log}{\sigma}_{i}$  (2.42) 
and the posterior density is:
$\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$  $=\mathrm{log}p(\mathit{\u03f5})\mathrm{log}{d}_{\mathit{\varphi}}(\mathbf{x},\mathit{\u03f5})$  (2.43)  
$={\displaystyle \sum _{i}}\mathrm{log}\mathcal{N}({\u03f5}_{i};0,1)\mathrm{log}{\sigma}_{i}$  (2.44) 
when $z=g(\mathit{\u03f5},\varphi ,\mathbf{x})$.
2.5.1 Fullcovariance Gaussian posterior
The factorized Gaussian posterior can be extended to a Gaussian with full covariance:
${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})=\mathcal{N}(\mathbf{z};\bm{\mu},\mathbf{\Sigma})$  (2.45) 
A reparameterization of this distribution is given by:
$\mathit{\u03f5}$  $\sim \mathcal{N}(0,\mathbf{I})$  (2.46)  
$\mathbf{z}$  $=\bm{\mu}+\mathbf{L}\mathit{\u03f5}$  (2.47) 
where $\mathbf{L}$ is a lower (or upper) triangular matrix, with nonzero entries on the diagonal. The offdiagonal elements define the correlations (covariances) of the elements in $\mathbf{z}$.
The reason for this parameterization of the fullcovariance Gaussian, is that the Jacobian determinant is remarkably simple. The Jacobian in this case is trivial: $\frac{\partial \mathbf{z}}{\partial \mathit{\u03f5}}=\mathbf{L}$. Note that the determinant of a triangular matrix is the product of its diagonal elements. Therefore, in this parameterization:
$\mathrm{log}det({\displaystyle \frac{\partial \mathbf{z}}{\partial \mathit{\u03f5}}})$  $={\displaystyle \sum _{i}}\mathrm{log}{L}_{ii}$  (2.48) 
And the logdensity of the posterior is:
$\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$  $=\mathrm{log}p(\mathit{\u03f5}){\displaystyle \sum _{i}}\mathrm{log}{L}_{ii}$  (2.49) 
This parameterization corresponds to the Cholesky decomposition $\mathbf{\Sigma}={\mathrm{\mathbf{L}\mathbf{L}}}^{T}$ of the covariance of $\mathbf{z}$:
$\mathbf{\Sigma}$  $=\mathbb{E}\left[(\mathbf{z}\mathbb{E}[\mathbf{z}]){(\mathbf{z}\mathbb{E}[\mathbf{z}])}^{T}\right]$  (2.50)  
$=\mathbb{E}\left[\mathbf{L}\mathit{\u03f5}{(\mathbf{L}\mathit{\u03f5})}^{T}\right]=\mathbf{L}\mathbb{E}\left[\mathit{\u03f5}{\mathit{\u03f5}}^{T}\right]{\mathbf{L}}^{T}$  (2.51)  
$={\mathrm{\mathbf{L}\mathbf{L}}}^{T}$  (2.52) 
Note that $\mathbb{E}\left[\mathit{\u03f5}{\mathit{\u03f5}}^{T}\right]=\mathbf{I}$ since $\mathit{\u03f5}\sim \mathcal{N}(0,\mathbf{I})$.
One way to build a matrix $\mathbf{L}$ with the desired properties, namely triangularity and nonzero diagonal entries, is by constructing it as follows:
$(\bm{\mu},\mathrm{log}\bm{\sigma},{\mathbf{L}}^{\prime})\leftarrow {\text{EncoderNeuralNet}}_{\mathit{\varphi}}(\mathbf{x})$  (2.53)  
$\mathbf{L}\leftarrow {\mathbf{L}}_{mask}\odot {\mathbf{L}}^{\prime}+\text{diag}(\bm{\sigma})$  (2.54) 
and then proceeding with $\mathbf{z}=\bm{\mu}+\mathbf{L}\mathit{\u03f5}$ as described above. ${\mathbf{L}}_{mask}$ is a masking matrix with zeros on and above the diagonal, and ones below the diagonal. Note that due to the masking $\mathbf{L}$, the Jacobian matrix $(\partial \mathbf{z}/\partial \mathit{\u03f5})$ is triangular with the values of $\bm{\sigma}$ on the diagonal. The logdeterminant is therefore identical to the factorized Gaussian case:
$\mathrm{log}\leftdet\left({\displaystyle \frac{\partial \mathbf{z}}{\partial \mathit{\u03f5}}}\right)\right={\displaystyle \sum _{i}}\mathrm{log}{\sigma}_{i}$  (2.55) 
More generally, we can replace $\mathbf{z}=\mathbf{L}\mathit{\u03f5}+\bm{\mu}$ with a chain of (differentiable and nonlinear) transformations; as long as the Jacobian of each step in the chain is triangular with nonzero diagonal entries, the log determinant remains simple. This principle is used by inverse autoregressive flow (IAF) as explored by Kingma et al. (2016) and discussed in chapter 3.
2.6 Estimation of the Marginal Likelihood
After training a VAE, we can estimate the probability of data under the model using an importance sampling technique, as originally proposed by Rezende et al. (2014). The marginal likelhood of a datapoint can be written as:
$\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x})=\mathrm{log}{\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}\left[{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})/{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})\right]$  (2.56) 
Taking random samples from ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$, a Monte Carlo estimator of this is:
$\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x})\approx \mathrm{log}{\displaystyle \frac{1}{L}}{\displaystyle \sum _{l=1}^{L}}{p}_{\bm{\theta}}(\mathbf{x},{\mathbf{z}}^{(l)})/{q}_{\mathit{\varphi}}({\mathbf{z}}^{(l)}\mathbf{x})$  (2.57) 
where each ${\mathbf{z}}^{(l)}\sim {q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$ is a random sample from the inference model. By making $L$ large, the approximation becomes a better estimate of the marginal likelihood, and in fact since this is a Monte Carlo estimator, for $L\to \mathrm{\infty}$ this converges to the actual marginal likelihood.
Notice that when setting $L=1$, this equals the ELBO estimator of the VAE. We can also use the estimator of 2.57 as our objective function; this is the objective used in importance weighted autoencoders (Burda et al., 2015) (IWAE). In that paper, it was also shown that the objective has increasing tightness for increasing value of $L$. It was later shown by Cremer and Duvenaud (2017) that the IWAE objective can be reinterpreted as an ELBO objective with a particular inference model. The downside of these approaches for optimizing a tighter bound, is that importance weighted estimates have notoriously bad scaling properties to highdimensional latent spaces.
2.7 Marginal Likelihood and ELBO as KL Divergences
One way to improve the potential tightness of the bound, is increasing the flexibility of the generative model. This can be understood through a connection between the ELBO and the KL divergence.
With i.i.d. dataset $\mathcal{D}$ of size ${N}_{\mathcal{D}}$, the maximum likelihood criterion is:
$\mathrm{log}{p}_{\bm{\theta}}(\mathcal{D})$  $={\displaystyle \frac{1}{{N}_{\mathcal{D}}}}{\displaystyle \sum _{\mathbf{x}\in \mathcal{D}}}\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x})$  (2.58)  
$={\mathbb{E}}_{{q}_{\mathcal{D}}(\mathbf{x})}\left[\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x})\right]$  (2.59) 
where ${q}_{\mathcal{D}}(\mathbf{x})$ is the empirical (data) distribution, which is a mixture distribution:
${q}_{\mathcal{D}}(\mathbf{x})={\displaystyle \frac{1}{N}}{\displaystyle \sum _{i=1}^{N}}{q}_{\mathcal{D}}^{(i)}(\mathbf{x})$  (2.60) 
where each component ${q}_{\mathcal{D}}^{(i)}(\mathbf{x})$ typically corresponds to a Dirac delta distribution centered at value ${\mathbf{x}}^{(i)}$ in case of continuous data, or a discrete distribution with all probability mass concentrated at value ${\mathbf{x}}^{(i)}$ in case of discrete data. The Kullback Leibler (KL) divergence between the data and model distributions, can be rewritten as the negative loglikelihood, plus a constant:
${D}_{KL}({q}_{\mathcal{D}}(\mathbf{x}){p}_{\bm{\theta}}(\mathbf{x}))$  $={\mathbb{E}}_{{q}_{\mathcal{D}}(\mathbf{x})}\left[\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x})\right]+{\mathbb{E}}_{{q}_{\mathcal{D}}(\mathbf{x})}\left[\mathrm{log}{q}_{\mathcal{D}}(\mathbf{x})\right]$  (2.61)  
$=\mathrm{log}{p}_{\bm{\theta}}(\mathcal{D})+\text{constant}$  (2.62) 
where $\text{constant}=\mathscr{H}({q}_{\mathcal{D}}(\mathbf{x}))$. So minimization of the KL divergence above is equivalent to maximization of the data loglikelihood $\mathrm{log}{p}_{\bm{\theta}}(\mathcal{D})$.
Taking the combination of the empirical data distribution ${q}_{\mathcal{D}}(\mathbf{x})$ and the inference model, we get a joint distribution over data $\mathbf{x}$ and latent variables $\mathbf{z}$: ${q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{x},\mathbf{z})={q}_{\mathcal{D}}(\mathbf{x})q(\mathbf{z}\mathbf{x})$.
The KL divergence of this joint distribution, can be rewritten as the negative ELBO, plus a constant:
${D}_{KL}({q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{x},\mathbf{z}){p}_{\bm{\theta}}(\mathbf{x},\mathbf{z}))$  (2.63)  
$={\mathbb{E}}_{{q}_{\mathcal{D}}(\mathbf{x})}\left[{\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}\left[\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})\right]\mathrm{log}{q}_{\mathcal{D}}(\mathbf{x})\right]$  (2.64)  
$={\mathcal{L}}_{\bm{\theta},\mathit{\varphi}}(\mathcal{D})+\text{constant}$  (2.65) 
where $\text{constant}=\mathscr{H}({q}_{\mathcal{D}}(\mathbf{x}))$. So maximization of the ELBO, is equivalent to the minimization of this KL divergence ${D}_{KL}({q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{x},\mathbf{z}){p}_{\bm{\theta}}(\mathbf{x},\mathbf{z}))$. The relationship between the ML and ELBO objectives can be summarized in the following simple equation:
${D}_{KL}({q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{x},\mathbf{z}){p}_{\bm{\theta}}(\mathbf{x},\mathbf{z}))$  (2.66)  
$={D}_{KL}({q}_{\mathcal{D}}(\mathbf{x}){p}_{\bm{\theta}}(\mathbf{x}))+{\mathbb{E}}_{{q}_{\mathcal{D}}(\mathbf{x})}\left[{D}_{KL}({q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{z}\mathbf{x}){p}_{\bm{\theta}}(\mathbf{z}\mathbf{x}))\right]$  (2.67)  
$\ge {D}_{KL}({q}_{\mathcal{D}}(\mathbf{x}){p}_{\bm{\theta}}(\mathbf{x}))$  (2.68) 
One additional perspective is that the ELBO can be viewed as a maximum likelihood objective in an augmented space. For some fixed choice of encoder ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$, we can view the joint distribution ${p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})$ as an augmented empirical distribution over the original data $\mathbf{x}$ and (stochastic) auxiliary features $\mathbf{z}$ associated with each datapoint. The model ${p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})$ then defines a joint model over the original data, and the auxiliary features.
2.8 Challenges
2.8.1 Optimization issues
In our work, consistent with findings in (Bowman et al., 2015) and (Kaae Sønderby et al., 2016), we found that stochastic optimization with the unmodified lower bound objective can gets stuck in an undesirable stable equilibrium. At the start of training, the likelihood term $\mathrm{log}p(\mathbf{x}\mathbf{z})$ is relatively weak, such that an initially attractive state is where $q(\mathbf{z}\mathbf{x})\approx p(\mathbf{z})$. In this state, encoder gradients have a relatively low signaltonoise ratio, resulting in a stable equilibrium from which it is difficult to escape. The solution proposed in (Bowman et al., 2015) and (Kaae Sønderby et al., 2016) is to use an optimization schedule where the weights of the latent cost ${D}_{KL}(q(\mathbf{z}\mathbf{x})p(\mathbf{z}))$ is slowly annealed from $0$ to $1$ over many epochs.
An alternative proposed in (Kingma et al., 2016) is the method of free bits: a modification of the ELBO objective, that ensures that on average, a certain minimum number of bits of information are encoded per latent variable, or per group of latent variables.
The latent dimensions are divided into the $K$ groups. We then use the following minibatch objective, which ensures that using less than $\lambda $ nats of information per subset $j$ (on average per minibatch $\mathcal{M}$) is not advantageous:
${\stackrel{~}{\mathcal{L}}}_{\lambda}$  $={\mathbb{E}}_{\mathbf{x}\sim \mathcal{M}}\left[{\mathbb{E}}_{q(\mathbf{z}\mathbf{x})}\left[\mathrm{log}p(\mathbf{x}\mathbf{z})\right]\right]$  (2.69)  
${\displaystyle \sum _{j=1}^{K}}\text{maximum}(\lambda ,{\mathbb{E}}_{\mathbf{x}\sim \mathcal{M}}\left[{D}_{KL}(q({z}_{j}\mathbf{x})p({z}_{j}))\right]$  (2.70) 
Since increasing the latent information is generally advantageous for the first (unaffected) term of the objective (often called the negative reconstruction error), this results in ${\mathbb{E}}_{\mathbf{x}\sim \mathcal{M}}\left[{D}_{KL}(q({\mathbf{z}}_{j}\mathbf{x})p({\mathbf{z}}_{j}))\right]\ge \lambda $ for all $j$, in practice. In Kingma et al. (2016) it was found that the method worked well for a fairly wide range of values ($\lambda \in [0.125,0.25,0.5,1,2]$), resulting in significant improvement in the resulting loglikelihood on a benchmark result.
2.8.2 Blurriness of generative model
In section 2.7 we saw that optimizing the ELBO is equivalent to minimizing ${D}_{KL}({q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{x},\mathbf{z}){p}_{\bm{\theta}}(\mathbf{x},\mathbf{z}))$. If a perfect fit between ${q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{x},\mathbf{z})$ and ${p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})$ is not possible, then the variance of ${p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})$ and ${p}_{\bm{\theta}}(\mathbf{x})$ will end up larger than the variance ${q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{x},\mathbf{z})$ and the data ${q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{x})$. This is due to the direction of the KL divergence; if there are values of $(\mathbf{x},\mathbf{z})$ which are likely under ${q}_{\mathcal{D},\mathit{\varphi}}$ but not under ${p}_{\bm{\theta}}$, the term ${\mathbb{E}}_{{q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{x},\mathbf{z})}\left[\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})\right]$ will go to infinity. However, the reverse is not true: the generative model is only slightly penalized when putting probability mass on values of $(\mathbf{x},\mathbf{z})$ with no support under ${q}_{\mathcal{D},\mathit{\varphi}}$.
Issues with ’blurriness’ can thus can be countered by choosing a sufficiently flexible inference model, and/or a sufficiently flexible generative model. In the next two chapters we will discuss techniques for constructing flexible inference models and flexible generative models.
2.9 Related prior and concurrent work
Here we briefly discuss relevant literature prior to and concurrent with work in (Kingma and Welling, 2013).
The wakesleep algorithm (Hinton et al., 1995) is, to the best of our knowledge, the only other online learning method in the literature that is applicable to the same general class of continuous latent variable models. Like our method, the wakesleep algorithm employs a recognition model that approximates the true posterior. A drawback of the wakesleep algorithm is that it requires a concurrent optimization of two objective functions, which together do not correspond to optimization of (a bound of) the marginal likelihood. An advantage of wakesleep is that it also applies to models with discrete latent variables. WakeSleep has the same computational complexity as AEVB per datapoint.
Stochastic variational inference Hoffman et al. (2013) has received increasing interest. Blei et al. (2012) introduced a control variate schemes to reduce the variance of the score function gradient estimator, and applied the estimator to exponential family approximations of the posterior. In (Ranganath et al., 2014) some general methods, e.g. a control variate scheme, were introduced for reducing the variance of the original gradient estimator. In Salimans and Knowles (2013), a similar reparameterization as in this work was used in an efficient version of a stochastic variational inference algorithm for learning the natural parameters of exponentialfamily approximating distributions.
In Graves (2011) a similar estimator of the gradient is introduced; however the estimator of the variance is not an unbiased estimator w.r.t. the ELBO gradient.
The VAE training algorithm exposes a connection between directed probabilistic models (trained with a variational objective) and autoencoders. A connection between linear autoencoders and a certain class of generative linearGaussian models has long been known. In (Roweis, 1998) it was shown that PCA corresponds to the maximumlikelihood (ML) solution of a special case of the linearGaussian model with a prior $p(\mathbf{z})=\mathcal{N}(0,\mathbf{I})$ and a conditional distribution $p(\mathbf{x}\mathbf{z})=\mathcal{N}(\mathbf{x};\mathrm{\mathbf{W}\mathbf{z}},\u03f5\mathbf{I})$, specifically the case with infinitesimally small $\u03f5$. In this limiting case, the posterior over the latent variables $p(\mathbf{z}\mathbf{x})$ is a Dirac delta distribution: $p(\mathbf{z}\mathbf{x})=\delta (\mathbf{z}{\mathbf{W}}^{\prime}\mathbf{x})$ where ${\mathbf{W}}^{\prime}={({\mathbf{W}}^{T}\mathbf{W})}^{1}{\mathbf{W}}^{T}$, i.e., given $\mathbf{W}$ and $\mathbf{x}$ there is no uncertainty about latent variable $\mathbf{z}$. Roweis (1998) then introduces an EMtype approach to learning $\mathbf{W}$. Much earlier work (Bourlard and Kamp, 1988) showed that optimization of linear autoencoders retrieves the principal components of data, from which it follows that learning linear autoencoders correspond to a specific method for learning the above case of linearGaussian probabilistic model of the data. However, this approach using linear autoencoders is limited to linearGaussian models, while our approach applies to a much broader class of continuous latent variable models.
When using neural networks for both the inference model and the generative model, the combination forms a type of autoencoder (Goodfellow et al., 2016) with a specific regularization term:
${\stackrel{~}{\mathcal{L}}}_{\bm{\theta},\mathit{\varphi}}(\mathbf{x};\mathit{\u03f5})=\underset{\text{Negative reconstruction error}}{\underset{\u23df}{\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x}\mathbf{z})}}+\underset{\text{Regularization terms}}{\underset{\u23df}{\mathrm{log}{p}_{\bm{\theta}}(\mathbf{z})\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}}$  (2.71) 
In an analysis of plain autoencoders (Vincent et al., 2010) it was shown that the training criterion of unregularized autoencoders corresponds to maximization of a lower bound (see the infomax principle (Linsker, 1989)) of the mutual information between input $X$ and latent representation $Z$. Maximizing (w.r.t. parameters) of the mutual information is equivalent to maximizing the conditional entropy, which is lower bounded by the expected loglikelihood of the data under the autoencoding model (Vincent et al., 2010), i.e. the negative reconstruction error. However, it is well known that this reconstruction criterion is in itself not sufficient for learning useful representations Bengio et al. (2013). Regularization techniques have been proposed to make autoencoders learn useful representations, such as denoising, contractive and sparse autoencoder variants (Bengio et al., 2013). The VAE objective contains a regularization term dictated by the variational bound, lacking the usual nuisance regularization hyperparameter required to learn useful representations. Related are also encoderdecoder architectures such as the predictive sparse decomposition (PSD) Kavukcuoglu et al. (2008), from which we drew some inspiration. Also relevant are the recently introduced Generative Stochastic Networks Bengio and ThibodeauLaufer (2013) where noisy autoencoders learn the transition operator of a Markov chain that samples from the data distribution. In Salakhutdinov and Larochelle (2010) a recognition model was employed for efficient learning with Deep Boltzmann Machines. These methods are targeted at either unnormalized models (i.e. undirected models like Boltzmann machines) or limited to sparse coding models, in contrast to our proposed algorithm for learning a general class of directed probabilistic models.
The proposed DARN method (Gregor et al., 2013), also learns a directed probabilistic model using an autoencoding structure, however their method applies to binary latent variables. In concurrent work, Rezende et al. (2014) also make the connection between autoencoders, directed probabilistic models and stochastic variational inference using the reparameterization trick we describe in (Kingma and Welling, 2013). Their work was developed independently of ours and provides an additional perspective on the VAE.
2.9.1 Score function estimator
An alternative unbiased stochastic gradient estimator of the ELBO is the score function estimator (Kleijnen and Rubinstein, 1996):
${\nabla}_{\mathit{\varphi}}{\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}\left[f(\mathbf{z})\right]$  $={\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})}\left[f(\mathbf{z}){\nabla}_{\mathit{\varphi}}\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})\right]$  (2.72)  
$\simeq f(\mathbf{z}){\nabla}_{\mathit{\varphi}}\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$  (2.73) 
where $\mathbf{z}\sim {q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$.
This is also known as the likelihood ratio estimator (Glynn, 1990; Fu, 2006) and the REINFORCE gradient estimator (Williams, 1992). The method has been successfully used in various methods like neural variational inference (Mnih and Gregor, 2014), blackbox variational inference (Ranganath et al., 2014), automated variational inference (Wingate and Weber, 2013), and variational stochastic search (Paisley et al., 2012), often in combination with various novel control variate techniques (Glasserman, 2013) for variance reduction. An advantage of the likelihood ratio estimator is its applicability to discrete latent variables.
We do not directly compare to these techniques, since we concern ourselves with continuous latent variables, in which case we have (computationally cheap) access to gradient information ${\nabla}_{\mathbf{z}}\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})$, courtesy of the backpropagation algorithm. The score function estimator solely uses the scalarvalued $\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})$, ignoring the gradient information about the function $\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})$, generally leading to much higher variance. This has been experimentally confirmed by e.g. (Kucukelbir et al., 2016), which finds that a sophisticated score function estimator requires two orders of magnitude more samples to arrive at the same variance as a reparameterization based estimator.
The difference in efficiency of our proposed reparameterizationbased gradient estimator, compared to score function estimators, can intuitively be understood as removing an information bottleneck during the computation of gradients of the ELBO w.r.t. $\mathit{\varphi}$ from current parameters $\bm{\theta}$: in the latter case, this computation is bottlenecked by the scalar value $\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})$, while in the former case it is bottlenecked by the much wider vector ${\nabla}_{\mathbf{z}}\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})$.
Chapter 3 Beyond Gaussian Posteriors
In this chapter we discuss techniques for improving the flexibility of the inference model ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$. Increasing the flexibility and accuracy of the inference model wel generally improve the tightness of the variational bound (ELBO), bringing it closer the true marginal likelihood objective.
3.1 Requirements for Computational Tractability
Requirements for the inference model, in order to be able to efficiently optimize the ELBO, are that it is (1) computationally efficient to compute and differentiate its probability density ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$, and (2) computationally efficient to sample from, since both these operations need to be performed for each datapoint in a minibatch at every iteration of optimization. If $\mathbf{z}$ is highdimensional and we want to make efficient use of parallel computational resources like GPUs, then parallelizability of these operations across dimensions of $\mathbf{z}$ is a large factor towards efficiency. This requirement restricts the class of approximate posteriors $q(\mathbf{z}\mathbf{x})$ that are practical to use. In practice this often leads to the use of simple Gaussian posteriors. However, as explained, we also need the density $q(\mathbf{z}\mathbf{x})$ to be sufficiently flexible to match the true posterior $p(\mathbf{z}\mathbf{x})$, in order to arrive at a tight bound.
3.2 Improving the Flexibility of Inference Models
Here we will review two general techniques for improving the flexibility of approximate posteriors in the context of gradientbased variational inference: auxiliary latent variables, and normalizing flows.
3.2.1 Auxiliary Latent Variables
One method for improving the flexibility of inference models, is through the introduction of auxiliary latent variables, as explored by Salimans et al. (2015), (Ranganath et al., 2015) and Maaløe et al. (2016).
The methods work by augmenting both the inference model and the generative model with a continuous auxiliary variable, here denoted with $\mathbf{u}$.
The inference model defines a distribution over both $\mathbf{u}$ and and $\mathbf{z}$, which can, for example, factorize as:
${q}_{\mathit{\varphi}}(\mathbf{u},\mathbf{z}\mathbf{x})={q}_{\mathit{\varphi}}(\mathbf{u}\mathbf{x}){q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{u},\mathbf{x})$  (3.1) 
This inference model augmented with $\mathbf{u}$, implicitly defines a potentially powerful implicit marginal distribution:
${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$  $={\displaystyle \int {q}_{\mathit{\varphi}}(\mathbf{u},\mathbf{z}\mathbf{x})\mathit{d}\mathbf{u}}$  (3.2) 
Likewise, we introduce an additional distribution in the generative model: such that our generative model is now over the joint distribution ${p}_{\bm{\theta}}(\mathbf{x},\mathbf{z},\mathbf{u})$. This can, for example, factorize as:
${p}_{\bm{\theta}}(\mathbf{x},\mathbf{z},\mathbf{u})={p}_{\bm{\theta}}(\mathbf{u}\mathbf{x},\mathbf{z}){p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})$  (3.3) 
The ELBO objective with auxiliary variables, given empirical distribution ${q}_{\mathcal{D}}(\mathbf{x})$, is then (again) equivalent to minimization of a KL divergence:
${\mathbb{E}}_{{q}_{\mathcal{D}}(\mathbf{x})}\left[{\mathbb{E}}_{{q}_{\mathit{\varphi}}(\mathbf{u},\mathbf{z}\mathbf{x})}\left[\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z},\mathbf{u})\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{u},\mathbf{z}\mathbf{x})\right]\right]$  (3.4)  
$={D}_{KL}({q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{x},\mathbf{z},\mathbf{u}){p}_{\bm{\theta}}(\mathbf{x},\mathbf{z},\mathbf{u})$  (3.5) 
Recall that maximization of the original ELBO objective, without auxiliary variables, is equivalent to minimization of ${D}_{KL}({q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{x},\mathbf{z}){p}_{\bm{\theta}}(\mathbf{x},\mathbf{z}))$, and that maximization of the expected marginal likelihood is equivalent to minimization of ${D}_{KL}({q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{x}){p}_{\bm{\theta}}(\mathbf{x}))$.
We can gain additional understanding into the relationship between the objectives, through the following equation:
${D}_{KL}({q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{x},\mathbf{z},\mathbf{u}){p}_{\bm{\theta}}(\mathbf{x},\mathbf{z},\mathbf{u}))$  (3.6)  
$={D}_{KL}({q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{x},\mathbf{z}){p}_{\bm{\theta}}(\mathbf{x},\mathbf{z}))+{\mathbb{E}}_{{q}_{\mathcal{D}}(\mathbf{x},\mathbf{z})}\left[{D}_{KL}({q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{u}\mathbf{x},\mathbf{z}){p}_{\bm{\theta}}(\mathbf{u}\mathbf{x},\mathbf{z}))\right]$  
$\ge {D}_{KL}({q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{x},\mathbf{z}){p}_{\bm{\theta}}(\mathbf{x},\mathbf{z}))$  (3.7)  
$={D}_{KL}({q}_{\mathcal{D}}(\mathbf{x}){p}_{\bm{\theta}}(\mathbf{x}))+{\mathbb{E}}_{{q}_{\mathcal{D}}(\mathbf{x})}\left[{D}_{KL}({q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{z}\mathbf{x}){p}_{\bm{\theta}}(\mathbf{z}\mathbf{x}))\right]$  (3.8)  
$\ge {D}_{KL}({q}_{\mathcal{D}}(\mathbf{x}){p}_{\bm{\theta}}(\mathbf{x}))$  (3.9) 
From this equation it can be seen that in principle, the ELBO gets worse by augmenting the VAE with an auxiliary variable $\mathbf{u}$:
${D}_{KL}({q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{x},\mathbf{z},\mathbf{u}){p}_{\bm{\theta}}(\mathbf{x},\mathbf{z},\mathbf{u}))\ge {D}_{KL}({q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{x},\mathbf{z}){p}_{\bm{\theta}}(\mathbf{x},\mathbf{z}))$ 
But because we now have access to a much more flexible class of inference distributions, ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$,
the original ELBO objective ${D}_{KL}({q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{x},\mathbf{z}){p}_{\bm{\theta}}(\mathbf{x},\mathbf{z}))$ can improve, potentially outweighing the additional cost of
${\mathbb{E}}_{{q}_{\mathcal{D}}(\mathbf{x},\mathbf{z})}\left[{D}_{KL}({q}_{\mathcal{D},\mathit{\varphi}}(\mathbf{u}\mathbf{x},\mathbf{z}){p}_{\bm{\theta}}(\mathbf{u}\mathbf{x},\mathbf{z}))\right]$. In Salimans et al. (2015), (Ranganath et al., 2015) and Maaløe et al. (2016) it was shown that auxiliary variables can indeed lead to significant improvements in models.
The introduction of auxiliary latent variables in the graph, are a special case of VAEs with multiple layers of latent variables, which are discussed in the chapter 4. In our experiment with CIFAR10, we make use of multiple layers of stochastic variables.
3.2.2 Normalizing Flows
An alternative approach towards flexible approximate posteriors is Normalizing Flow (NF), introduced by (Rezende and Mohamed, 2015) in the context of stochastic gradient variational inference. In normalizing flows, we build flexible posterior distributions through an iterative procedure. The general idea is to start off with an initial random variable with a relatively simple distribution with a known (and computationally cheap) probability density function, and then apply a chain of invertible parameterized transformations ${\mathbf{f}}_{t}$, such that the last iterate ${\mathbf{z}}_{T}$ has a more flexible distribution^{1}^{1} 1 where $\mathbf{x}$ is the context, such as the value of the datapoint. In case of models with multiple levels of latent variables, the context also includes the value of the previously sampled latent variables.:
${\mathit{\u03f5}}_{0}\sim p(\mathit{\u03f5})$  (3.10)  
$\text{for}t=1\mathrm{\dots}T:$  (3.11)  
$\mathrm{\hspace{1em}}{\mathit{\u03f5}}_{t}={\mathbf{f}}_{t}({\mathit{\u03f5}}_{t1},\mathbf{x})$  (3.12)  
$\mathbf{z}={\mathit{\u03f5}}_{T}$  (3.13) 
The Jacobian of the transformation factorizes:
$\frac{d\mathbf{z}}{d{\mathit{\u03f5}}_{0}}}={\displaystyle \prod _{t=1}^{T}}{\displaystyle \frac{d{\mathit{\u03f5}}_{t}}{d{\mathit{\u03f5}}_{t1}}$  (3.14) 
So its determinant also factorizes as well:
$\mathrm{log}\leftdet\left({\displaystyle \frac{d\mathbf{z}}{d{\mathit{\u03f5}}_{0}}}\right)\right={\displaystyle \sum _{t=1}^{T}}\mathrm{log}\leftdet\left({\displaystyle \frac{d{\mathit{\u03f5}}_{t}}{d{\mathit{\u03f5}}_{t1}}}\right)\right$  (3.15) 
As long as the Jacobian determinant of each of the transformations ${\mathbf{f}}_{t}$ can be computed, we can still compute the probability density function of the last iterate:
$\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})=\mathrm{log}p({\mathit{\u03f5}}_{0}){\displaystyle \sum _{t=1}^{T}}\mathrm{log}det\left{\displaystyle \frac{d{\mathit{\u03f5}}_{t}}{d{\mathit{\u03f5}}_{t1}}}\right$  (3.16) 
Rezende and Mohamed (2015) experimented with a transformation of the form:
${\mathbf{f}}_{t}({\mathit{\u03f5}}_{t1})={\mathit{\u03f5}}_{t1}+\mathbf{u}h({\mathbf{w}}^{T}{\mathit{\u03f5}}_{t1}+b)$  (3.17) 
where $\mathbf{u}$ and $\mathbf{w}$ are vectors, ${\mathbf{w}}^{T}$ is $\mathbf{w}$ transposed, $b$ is a scalar and $h(.)$ is a nonlinearity, such that $\mathbf{u}h({\mathbf{w}}^{T}{\mathbf{z}}_{t1}+b)$ can be interpreted as a MLP with a bottleneck hidden layer with a single unit. This flow does not scale well to a highdimensional latent space: since information goes through the single bottleneck, a long chain of transformations is required to capture highdimensional dependencies.
3.3 Inverse Autoregressive Transformations
In order to find a type of normalizing flow that scales well to a highdimensional space, Kingma et al. (2016) consider Gaussian versions of autoregressive autoencoders such as MADE (Germain et al., 2015) and the PixelCNN (van den Oord et al., 2016b). Let $\mathbf{y}$ be a variable modeled by such a model, with some chosen ordering on its elements $\mathbf{y}={\{{y}_{i}\}}_{i=1}^{D}$. We will use $[\bm{\mu}(\mathbf{y}),\bm{\sigma}(\mathbf{y})]$ to denote the function of the vector $\mathbf{y}$, to the vectors $\bm{\mu}$ and $\bm{\sigma}$. Due to the autoregressive structure, the Jacobian matrix is triangular with zeros on the diagonal: $\partial [{\bm{\mu}}_{i},{\bm{\sigma}}_{i}]/\partial {\mathbf{y}}_{j}=[0,0]$ for $j\ge i$. The elements $[{\mu}_{i}({\mathbf{y}}_{1:i1}),{\sigma}_{i}({\mathbf{y}}_{1:i1})]$ are the predicted mean and standard deviation of the $i$th element of $\mathbf{y}$, which are functions of only the previous elements in $\mathbf{y}$.
Sampling from such a model is a sequential transformation from a noise vector $\mathit{\u03f5}\sim \mathcal{N}(0,\mathbf{I})$ to the corresponding vector $\mathbf{y}$: ${y}_{0}={\mu}_{0}+{\sigma}_{0}\odot {\u03f5}_{0}$, and for $i>0$, ${y}_{i}={\mu}_{i}({\mathbf{y}}_{1:i1})+{\sigma}_{i}({\mathbf{y}}_{1:i1})\cdot {\u03f5}_{i}$. The computation involved in this transformation is clearly proportional to the dimensionality $D$. Since variational inference requires sampling from the posterior, such models are not interesting for direct use in such applications. However, the inverse transformation is interesting for normalizing flows. As long as we have ${\sigma}_{i}>0$ for all $i$, the sampling transformation above is a onetoone transformation, and can be inverted:
${\u03f5}_{i}={\displaystyle \frac{{y}_{i}{\mu}_{i}({\mathbf{y}}_{1:i1})}{{\sigma}_{i}({\mathbf{y}}_{1:i1})}}$  (3.18) 
Kingma et al. (2016) make two key observations, important for normalizing flows. The first is that this inverse transformation can be parallelized, since (in case of autoregressive autoencoders) computations of the individual elements ${\u03f5}_{i}$ do not depend on each other. The vectorized transformation is:
$\mathit{\u03f5}=(\mathbf{y}\bm{\mu}(\mathbf{y}))/\bm{\sigma}(\mathbf{y})$  (3.19) 
where the subtraction and division are elementwise.
The second key observation, is that this inverse autoregressive operation has a simple Jacobian determinant. Note that due to the autoregressive structure, $\partial [{\mu}_{i},{\sigma}_{i}]/\partial {y}_{j}=[0,0]$ for $j\ge i$. As a result, the transformation has a lower triangular Jacobian ($\partial {\u03f5}_{i}/\partial {y}_{j}=0$ for $j>i$), with a simple diagonal: $\partial {\u03f5}_{i}/\partial {y}_{i}=\frac{1}{{\sigma}_{i}}$. The determinant of a lower triangular matrix equals the product of the diagonal terms. As a result, the logdeterminant of the Jacobian of the transformation is remarkably simple and straightforward to compute:
$\mathrm{log}det\left{\displaystyle \frac{d\mathit{\u03f5}}{d\mathbf{y}}}\right={\displaystyle \sum _{i=1}^{D}}\mathrm{log}{\sigma}_{i}(\mathbf{y})$  (3.20) 
The combination of model flexibility, parallelizability across dimensions, and simple logdeterminant, makes this transformation interesting for use as a normalizing flow over highdimensional latent space.
For the following section we will use a slightly different, but equivalently flexible, transformation of the type:
$\mathit{\u03f5}=\bm{\sigma}(\mathbf{y})\odot \mathbf{y}+\bm{\mu}(\mathbf{y})$  (3.21) 
With corresponding logdeterminant:
$\mathrm{log}det\left{\displaystyle \frac{d\mathit{\u03f5}}{d\mathbf{y}}}\right={\displaystyle \sum _{i=1}^{D}}\mathrm{log}{\sigma}_{i}(\mathbf{y})$  (3.22) 
3.4 Inverse Autoregressive Flow (IAF)
Kingma et al. (2016) propose inverse autoregressive flow (IAF) based on a chain of transformations that are each equivalent to an inverse autoregressive transformation of eq. (3.19) and eq. (3.21). See algorithm 3 for pseudocode of an approximate posterior with the proposed flow. We let an initial encoder neural network output ${\bm{\mu}}_{0}$ and ${\bm{\sigma}}_{0}$, in addition to an extra output $\mathbf{h}$, which serves as an additional input to each subsequent step in the flow. The chain is initialized with a factorized Gaussian ${q}_{\mathit{\varphi}}({\mathbf{z}}_{0}\mathbf{x})=\mathcal{N}(0,\text{diag}{(\bm{\sigma})}^{2})$:
${\mathit{\u03f5}}_{0}$  $\sim \mathcal{N}(0,I)$  (3.23)  
$({\bm{\mu}}_{0},\mathrm{log}{\bm{\sigma}}_{0},\mathbf{h})$  $=\text{EncoderNeuralNet}(\mathbf{x};\bm{\theta})$  (3.24)  
${\mathbf{z}}_{0}$  $={\bm{\mu}}_{0}+{\bm{\sigma}}_{0}\odot {\mathit{\u03f5}}_{0}$  (3.25) 
IAF then consists of a chain of $T$ of the following transformations:
$({\bm{\mu}}_{t},{\bm{\sigma}}_{t})$  $={\text{AutoregressiveNeuralNet}}_{t}({\mathit{\u03f5}}_{t1},\mathbf{h};\bm{\theta})$  (3.26)  
${\mathit{\u03f5}}_{t}$  $={\bm{\mu}}_{t}+{\bm{\sigma}}_{t}\odot {\mathit{\u03f5}}_{t1}$  (3.27) 
Each step of this flow is an inverse autoregressive transformation of the type of eq. (3.19) and eq. (3.21), and each step uses a separate autoregressive neural network. Following eq. (3.16), the density under the final iterate is:
$\mathbf{z}$  $\equiv {\mathit{\u03f5}}_{T}$  (3.28)  
$\mathrm{log}q(\mathbf{z}\mathbf{x})$  $={\displaystyle \sum _{i=1}^{D}}\left(\frac{1}{2}{\u03f5}_{i}^{2}+\frac{1}{2}\mathrm{log}(2\pi )+{\displaystyle \sum _{t=0}^{T}}\mathrm{log}{\sigma}_{t,i}\right)$  (3.29) 
The flexibility of the distribution of the final iterate ${\mathit{\u03f5}}_{T}$, and its ability to closely fit to the true posterior, increases with the expressivity of the autoregressive models and the depth of the chain. See figure 3.1 for an illustration of the computation.
A numerically stable version, inspired by the LSTMtype update, is where we let the autoregressive network output $({\mathbf{m}}_{t},{\mathbf{s}}_{t})$, two unconstrained realvalued vectors, and compute ${\mathit{\u03f5}}_{t}$ as:
$({\mathbf{m}}_{t},{\mathbf{s}}_{t})$  $={\text{AutoregressiveNeuralNet}}_{t}({\mathit{\u03f5}}_{t1},\mathbf{h};\bm{\theta})$  (3.30)  
${\bm{\sigma}}_{t}$  $=\text{sigmoid}({\mathbf{s}}_{t})$  (3.31)  
${\mathit{\u03f5}}_{t}$  $={\bm{\sigma}}_{t}\odot {\mathit{\u03f5}}_{t1}+(1{\bm{\sigma}}_{t})\odot {\mathbf{m}}_{t}$  (3.32) 
This version is shown in algorithm 3. Note that this is just a particular version of the update of eq. (3.27), so the simple computation of the final logdensity of eq. (3.29) still applies.
It was found beneficial for results to parameterize or initialize the parameters of each ${\text{AutoregressiveNeuralNet}}_{t}$ such that its outputs ${\mathbf{s}}_{t}$ are, before optimization, sufficiently positive, such as close to +1 or +2. This leads to an initial behavior that updates $\mathit{\u03f5}$ only slightly with each step of IAF. Such a parameterization is known as a ’forget gate bias’ in LSTMs, as investigated by Jozefowicz et al. (2015).
It is straightforward to see that a special case of IAF with one step, and a linear autoregressive model, is the fully Gaussian posterior discussed earlier. This transforms a Gaussian variable with diagonal covariance, to one with linear dependencies, i.e. a Gaussian distribution with full covariance.
Autoregressive neural networks form a rich family of nonlinear transformations for IAF. For nonconvolutional models, the family of masked autoregressive network introduced in (Germain et al., 2015) was used as the autoregressive neural networks. For CIFAR10 experiments, which benefits more from scaling to high dimensional latent space, the family of convolutional autoregressive autoencoders introduced by (van den Oord et al., 2016b, a) was used.
It was found that results improved when reversing the ordering of the variables after each step in the IAF chain. This is a volumepreserving transformation, so the simple form of eq. (3.29) remains unchanged.
3.5 Related work
As we explained, inverse autoregressive flow (IAF) is a member of the family of normalizing flows, first discussed in (Rezende and Mohamed, 2015) in the context of stochastic variational inference. In (Rezende and Mohamed, 2015) two specific types of flows are introduced: planar flows and radial flows. These flows are shown to be effective to problems relatively lowdimensional latent space (at most a few hundred dimensions). It is not clear, however, how to scale such flows to much higherdimensional latent spaces, such as latent spaces of generative models of /larger images, and how planar and radial flows can leverage the topology of latent space, as is possible with IAF. Volumeconserving neural architectures were first presented in in (Deco and Brauer, 1995), as a form of nonlinear independent component analysis.
Another type of normalizing flow, introduced by (Dinh et al., 2014) (NICE), uses similar transformations as IAF. In contrast with IAF, NICE was directly applied to the observed variables in a generative model. NICE is type of transformations that updates only half of the variables ${\mathbf{z}}_{1:D/2}$ per step, adding a vector $f({\mathbf{z}}_{D/2+1:D})$ which is a neural network based function of the remaining latent variables ${\mathbf{z}}_{D/2+1:D}$. Such large blocks have the advantage of computationally cheap inverse transformation, and the disadvantage of typically requiring longer chains. In experiments, (Rezende and Mohamed, 2015) found that this type of transformation is generally less powerful than other types of normalizing flow, in experiments with a lowdimensional latent space. Concurrently to our work, NICE was extended to highdimensional spaces in (Dinh et al., 2016) (Real NVP). An empirical comparison would be an interesting subject of future research.
A potentially powerful transformation is the Hamiltonian flow used in Hamiltonian Variational Inference (Salimans et al., 2015). Here, a transformation is generated by simulating the flow of a Hamiltonian system consisting of the latent variables $\mathbf{z}$, and a set of auxiliary momentum variables. This type of transformation has the additional benefit that it is guided by the exact posterior distribution, and that it leaves this distribution invariant for small step sizes. Such as transformation could thus take us arbitrarily close to the exact posterior distribution if we can apply it for a sufficient number of times. In practice, however, Hamiltonian Variational Inference is very demanding computationally. Also, it requires an auxiliary variational bound to account for the auxiliary variables, which can impede progress if the bound is not sufficiently tight.
An alternative method for increasing the flexibility of variational inference, is the introduction of auxiliary latent variables (Salimans et al., 2015; Ranganath et al., 2015; Tran et al., 2015), discussed in 3.2.1, and corresponding auxiliary inference models. Latent variable models with multiple layers of stochastic variables, such as the one used in our experiments, are often equivalent to such auxiliaryvariable methods. We combine deep latent variable models with IAF in our experiments, benefiting from both techniques.
Chapter 4 Deeper Generative Models
In the previous chapter we explain advanced strategies for improving inference models. In this chapter, we review strategies for learning deeper generative model, such as inference and learning with multiple latent variables or observed variables, and techniques for improving the flexibility of the generative models ${p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})$.
4.1 Inference and Learning with Multiple Latent Variables
The generative model ${p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})$, and corresponding inference model ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$ can be parameterized as any directed graph. Both $\mathbf{x}$ and $\mathbf{z}$ can be composed of multiple variables with some topological ordering. It may not be immediately obvious how to optimize such models in the VAE framework; it is, however, quite straightforward, as we will now explain.
Let $\mathbf{z}=\{{\mathbf{z}}_{1},\mathrm{\dots},{\mathbf{z}}_{K}\}$, and ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})={q}_{\mathit{\varphi}}({\mathbf{z}}_{1},\mathrm{\dots},{\mathbf{z}}_{K}\mathbf{x})$ where subscript corresponds with the topological ordering of each variable. Given a datapoint $\mathbf{x}$, computation of the ELBO estimator consists of two steps:

1.
Sampling $\mathbf{z}\sim {q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$. In case of multiple latent variables, this means ancestral sampling the latent variables one by one, in topological ordering defined by the inference model’s directed graph. In pseudocode, the ancestral sampling step looks like:
$\text{for}i=1\mathrm{\dots}K:$ (4.1) $\mathrm{\hspace{1em}}{\mathbf{z}}_{i}\sim {q}_{\mathit{\varphi}}({\mathbf{z}}_{i}Pa({\mathbf{z}}_{i}))$ (4.2) where $Pa({\mathbf{z}}_{i})$ are the parents of variable ${\mathbf{z}}_{i}$ in the inference model, which may include $\mathbf{x}$. In reparameterized (and differentiable) form, this is:
$\text{for}i=1\mathrm{\dots}K:$ (4.3) $\mathrm{\hspace{1em}}{\mathit{\u03f5}}_{i}\sim p({\mathit{\u03f5}}_{i})$ (4.4) $\mathrm{\hspace{1em}}{\mathbf{z}}_{i}={\mathbf{g}}_{i}({\mathit{\u03f5}}_{i},Pa({\mathbf{z}}_{i}),\mathit{\varphi})$ (4.5) 
2.
Evaluating the scalar value $(\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})\mathrm{log}{q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x}))$ at the resulting sample $\mathbf{z}$ and datapoint $\mathbf{x}$. This scalar is the unbiased stochastic estimate lower bound on $\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x})$. It is also differentiable and optimizable with SGD.
4.1.1 Choice of ordering
It should be noted that the choice of latent variables’ topological ordering for the inference model, can be different from the choice of ordering for the generative model.
Since the inference model has the data as root node, while the generative model has the data as leaf node, one (in some sense) logical choice would be to let the topological ordering of the latent variables in the inference model, be the reverse of the ordering in the generative model.
In multiple works (Salimans, 2016; Kaae Sønderby et al., 2016; Kingma et al., 2016) it has been shown that it can be advantageous to let the generative model and inference model share the topological ordering of latent variables. The two choices of ordering are illustrated in figure 4.1. One advantage of shared ordering, as explained in these works, is that this allows us to easily share parameters between the inference and generative models, leading to faster learning and better solutions.
To see why this might be a good idea, one should realize that the true posterior over the latent variables, is a function of the prior:
${p}_{\bm{\theta}}(\mathbf{z}\mathbf{x})\propto {p}_{\bm{\theta}}(\mathbf{z}){p}_{\bm{\theta}}(\mathbf{x}\mathbf{z})$  (4.6) 
Likewise, the posterior of a latent variable given its parents (in the generative model), is:
${p}_{\bm{\theta}}({\mathbf{z}}_{i}\mathbf{x},Pa({\mathbf{z}}_{i}))\propto {p}_{\bm{\theta}}({\mathbf{z}}_{i}Pa({\mathbf{z}}_{i})){p}_{\bm{\theta}}(\mathbf{x}{\mathbf{z}}_{i},Pa({\mathbf{z}}_{i}))$  (4.7) 
Optimization of the generative model changes both ${p}_{\bm{\theta}}({\mathbf{z}}_{i}Pa({\mathbf{z}}_{i}))$ and ${p}_{\bm{\theta}}(\mathbf{x}{\mathbf{z}}_{i},Pa({\mathbf{z}}_{i}))$. By coupling the inference model ${q}_{\mathit{\varphi}}({\mathbf{z}}_{i}\mathbf{x},Pa({\mathbf{z}}_{i}))$ and prior ${p}_{\bm{\theta}}({\mathbf{z}}_{i}Pa({\mathbf{z}}_{i}))$, changes in ${p}_{\bm{\theta}}({\mathbf{z}}_{i}Pa({\mathbf{z}}_{i}))$ can be directly reflected in changes in ${q}_{\mathit{\varphi}}({\mathbf{z}}_{i}Pa({\mathbf{z}}_{i}))$.
This coupling is especially straightforward when ${p}_{\bm{\theta}}({\mathbf{z}}_{i}Pa({\mathbf{z}}_{i}))$ is Gaussian distributed. The inference model can be directly specified as the product of this Gaussian distribution, with a learned quadratic pseudolikelihood term: ${q}_{\mathit{\varphi}}({\mathbf{z}}_{i}Pa({\mathbf{z}}_{i}),\mathbf{x})={p}_{\bm{\theta}}({\mathbf{z}}_{i}Pa({\mathbf{z}}_{i}))\stackrel{~}{l}({\mathbf{z}}_{i};\mathbf{x},Pa({\mathbf{z}}_{i}))/Z$, where $Z$ is tractable to compute. This idea explored by (Salimans, 2016) and (Kaae Sønderby et al., 2016). In principle this idea could be extended to a more general class of conjugate priors, by no work on this is known at the time of writing.
A less constraining variant, explored by Kingma et al. (2016), is to simply let the neural network that parameterizes ${q}_{\mathit{\varphi}}({\mathbf{z}}_{i}Pa({\mathbf{z}}_{i}),\mathbf{x})$ be partially specified by a part of the neural network that parameterizes ${p}_{\bm{\theta}}({\mathbf{z}}_{i}Pa({\mathbf{z}}_{i}))$. In general, we can let the two distributions share parameters. This allows for more complicated posteriors, like normalizing flows or IAF.
4.2 Alternative methods for increasing expressivity of generative models
Typically, especially with large data sets, we wish to choose an expressive class of directed models, such that it can feasibly approximate the true distribution. Popular strategies for specifying expressive models are:

•
Introduction of latent variables into the directed models, and optimization through (amortized) variational inference, as explained in this work.

•
Full autoregression: factorization of distributions into univariate (onedimensional) conditionals, or at least very lowdimensional conditionals (section 4.3).

•
Specification of distributions through invertible transformations with tractable Jacobian determinant (section 4.4).
Synthesis from fully autoregressive models models is relatively slow, since the length of computation for synthesis from such models is linear in the dimensionality of the data. The length of computation of the loglikelihood of fully autoregressive models does not necesarilly scale with the dimensionality of the data In this respect, introduction of latent variables for improving expressivity is especially interesting when $\mathbf{x}$ is very highdimensional. It is relatively straightforward and computationally attractive, due to parallelizability, to specify directed models over highdimensional variables where each conditional factorizes into independent distributions. For example, if we let ${p}_{\bm{\theta}}({\mathbf{x}}_{j}Pa({\mathbf{x}}_{j}))={\prod}_{k}{p}_{\bm{\theta}}({x}_{j,k}Pa({\mathbf{x}}_{j})$, where each factor is a univariate Gaussian whose means and variance are nonlinear functions (specified by a neural network) of the parents $Pa({\mathbf{x}}_{j})$, then computations for both synthesis and evaluation of loglikelihood can be fully parallelized across dimensions $k$. See (Kingma et al., 2016) for experiments demonstrating a 100x improvement in speed of synthesis.
The best models to date, in terms of attained loglikelihood on test data, employ a combination of the three approaches listed above.
4.3 Autoregressive Models
A powerful strategy for modeling highdimensional data is to divide up the highdimensional observed variables into small constituents (often single dimensional parts, or otherwise just parts with a small number of dimensions), impose a certain ordering, and to model their dependencies as a directed graphical model. The resulting directed graphical model breaks up the joint distribution into a product of a factors:
${p}_{\bm{\theta}}(\mathbf{x})={p}_{\bm{\theta}}({x}_{1},\mathrm{\dots},{x}_{D})={p}_{\bm{\theta}}({x}_{1}){\displaystyle \prod _{j=2}^{T}}{p}_{\bm{\theta}}({x}_{j}Pa({\mathbf{x}}_{j}))$  (4.8) 
where $D$ is the dimensionality of the data. This is known as an autoregressive (AR) model. In case of neural network based autoregressive models, we let the conditional distributions be parameterized with a neural network:
$$  (4.9) 
In case of continuous data, autoregressive models can be interpreted as a special case of a more general approach: learning an invertible transformation from the data to a simpler, known distribution such as a Gaussian or Uniform distribution; this approach with invertible transformations is discussed in section 4.4. The techniques of autoregressive models and invertible transformations can be naturally combined with variational autoencoders, and at the time of writing, the best systems use a combination Rezende and Mohamed (2015); Kingma et al. (2016); Gulrajani et al. (2016).
A disadvantage of autoregressive models, compared to latentvariable models, is that ancestral sampling from autoregressive models is a sequential operation computation of $\mathcal{O}(D)$ length, i.e. proportional to the dimensionality of the data. Autoregressive models also requires one to choose a specific ordering of input elements (equation (4.8)). When no single natural onedimensional ordering exists, like in twodimensional images, this leads to a model with a somewhat awkward inductive bias.
4.4 Invertible transformations with tractable Jacobian determinant
In case of continuous data, autoregressive models can be interpreted as a special case of a more general approach: learning an invertible transformation with tractable Jacobian determinant (also called normalizing flow) from the data to a simpler, known distribution such as a Gaussian or Uniform distribution. If we use neural networks for such invertible mappings, this is a powerful and flexible approach towards probabilistic modeling of continuous data and nonlinear independent component analysis (Deco and Brauer, 1995).
Such normalizing flows iteratively update a variable, which is constrained to be of the same dimensionality as the data, to a target distribution. This constraint on the dimensionality if intermediate states of the mapping, can make such transformations more challenging to optimize than methods without such constraint. An obvious advantage, on the other hand, is that the likelihood and its gradient are tractable. In (Dinh et al., 2014, 2016), particularly interesting flows (NICE and Real NVP) were introduced, with equal computational cost and depth in both directions, making it both relatively cheap to optimize and to sample from such models. At the time of writing, no such models has not yet been demonstrated to lead to the similar performance as purely autoregressive or VAEbased models in terms of data loglikelihood, but this remains an active area of research.
4.5 FollowUp Work
Some important applications and motivations for deep generative models and variational autoencoders are:

•
Representation learning: learning better representations of the data. Some uses of this are:

–
Dataefficient learning, such as semisupervised learning

–
Visualisation of data as lowdimensional manifolds

–

•
Artificial creativity: plausible interpolation between data and extrapolation from data.
Here we will now highlight some concrete applications to representation learning and artificial creativity.
4.5.1 Representation Learning
In case of supervised learning, we typically aim to learn a conditional distribution: to predict the distribution over the possible values of a variable, given the value of some another variable. One such problem is that of image classification: given an image, the prediction of a distribution over the possible class labels. Through the yearly ImageNet competion (Russakovsky et al., 2015), it has become clear that deep convolutional neural networks (LeCun et al., 1998; Goodfellow et al., 2016) (CNNs), given a large amount of labeled images, are extraordinarily good at solving the image classification task. Modern versions of CNNs based on residual networks, which is a variant of LSTMtype neural networks (Hochreiter and Schmidhuber, 1997), now arguably achieves humanlevel classification accuracy on this task (He et al., 2015b, a).
When the number of labeled examples is low, solutions found with purely supervised approaches tend to exhibit poor generalization to new data. In such cases, generative models can be employed as an effective type of regularization. One particular strategy, presented in Kingma et al. (2014), is to optimize the classification model jointly with a variational autoencoder over the input variables, sharing parameters between the two. The variational autoencoder, in this case, provides an auxiliary objective, improving the data efficiency of the classification solution. Through sharing of statistical strength between modeling problems, this can greatly improve upon the supervised classification error. Techniques based on VAEs are now among state of the art for semisupervised classification (Maaløe et al., 2016), with on average under 1% classification error in the MNIST classification problem, when trained with only 10 labeled images per class, i.e. when more than 99.8% of the labels in the training set were removed. In concurrent work (Rezende et al., 2016b), it was shown that VAEbased semisupervised learning can even do well when only a single sample per class is presented.
A standard supervised approach, GoogLeNet (Szegedy et al., 2015), which normally achieves near stateoftheart performance on the ImageNet validation set, achieves only around 5% top1 classification accuracy when trained with only 1% of the labeled images, as shown by Pu et al. (2016). In contrast, they show that a semisupervised approach with VAEs achieves around 45% classification accuracy on the same task, when modeling the labels jointly with the labeled and unlabeled input images.
4.5.2 Understanding of data, and artificial creativity
Generative models with latent spaces allow us to transform the data into a simpler latent space, explore it in that space, and understand it better. A related branch of applications of deep generative models is the synthesis of plausible pseudodata with certain desirable properties, sometimes coined as artificial creativity.
4.5.2.1 Chemical Design
One example of a recent scientific application of artificial creativity, is shown in GómezBombarelli et al. (2016). In this paper, a fairly straightforward VAE is trained on hundreds of thousands of existing chemical structures. The resulting continuous representation (latent space) is subsequently used to perform gradientbased optimization towards certain properties; the method is demonstrated on the design of druglike molecules and organic lightemitting diodes. See figure 4.2.
4.5.2.2 Natural Language Synthesis
A similar approach was used to generating naturallanguage sentences from a continuous space by Bowman et al. (2015). In this paper, it is shown how a VAE can be successfully trained on text. The model is shown to succesfully interpolate between sentences, and for imputation of missing words. See figure 4.3.
4.5.2.3 Astronomy
In (Ravanbakhsh et al., 2016), VAEs are applied to simulate observations of distant galaxies. This helps with the calibration of systems that need to indirectly detect the shearing of observations of distant galaxies, caused by weak gravitational lensing in the presence of dark matter between earth and those galaxies. Since the lensing effects are so weak, such systems need to be calibrated with groundtruth images with a known amount of shearing. Since real data is still limited, the proposed solution is to use deep generative models for synthesis of pseudodata.
4.5.2.4 Image (Re)Synthesis
A popular application is image (re)synthesis. One can optimize a VAE to form a generative model over images. One can synthesize images from the generative model, but the inference model (or encoder) also allows one to encode real images into a latent space. One can modify the encoding in this latent space, then decode the image back into the observed space. Relatively simply transformations in the observed space, such as linear transformations, often translate into semantically meaningful modifications of the original image. One example, as demonstrated by White (2016), is the modification of images in latent space along a "smile vector" in order to make them more happy, or more sad looking. See figure 4.4 for an example.
4.5.3 Other relevant followup work
We unfortunately do not have space to discuss all followup work in depth, but will here highlight a selection of relevant recent work.
In addition to our original publication (Kingma and Welling, 2013), two later papers have proposed equivalent algorithms Rezende et al. (2014); LázaroGredilla (2014), where the latter work applies the same reparameterization gradient method to the estimation of parameter posteriors, rather than amortized latentvariable inference.
In the appendix of (Kingma and Welling, 2013) we proposed to apply the reparameterization gradients to estimation of parameter posteriors. In (Blundell et al., 2015) this method, with a mixtureofGaussians prior and named Bayes by Backprop, was used in experiments with some promising early results. In (Kingma et al., 2015) we describe a refined method, the local reparameterization trick, for further decreasing the variance of the gradient estimator, and applied it to estimation of Gaussian parameter posteriors. Further results were presented in (Louizos et al., 2017; Louizos and Welling, 2017, 2016) with increasingly sophisticated choices of priors and approximate posteriors. In (Kingma et al., 2015; Gal and Ghahramani, 2016), a similar reparameterization was used to analyze Dropout as a Bayesian method, coined Variational Dropout. In (Molchanov et al., 2017) this method was further analyzed and refined. Various papers have applied reparameterization gradients for estimating parameter posteriors, including (Fortunato et al., 2017) in the context of recurrent neural networks and (Kucukelbir et al., 2016) more generally for Bayesian models and in (Tran et al., 2017) for deep probabilistic programming. A Bayesian nonparametric variational family based in the Gaussian Process using reparameterization gradients was proposed in (Tran et al., 2015).
Normalizing flows (Rezende and Mohamed, 2015) were proposed as a framework for improving the flexibility of inference models. In Kingma et al. (2016), the first normalizing flow was proposed that scales well to highdimensional latent spaces. The same principle was later applied in (Papamakarios et al., 2017) for density estimation, and further refined in (Huang et al., 2018). Various other flows were proposed in (Tomczak and Welling, 2016, 2017) and (Berg et al., 2018).
As an alternative to (or in conjunction with) normalizing flows, one can use auxiliary variables to improve posterior flexibility. This principle was, to the best of our knowledge, first proposed in Salimans et al. (2015). In this paper, the principle was used in a combination of variational inference with Hamiltonian Monte Carlo (HMC), with the momentum variables of HMC as auxiliary variables. Auxiliary variables were more elaborately discussed in in (Maaløe et al., 2016) as Auxiliary Deep Generative Models. Similarly, one can use deep models with multiple stochastic layers to improve the variational bound, as demonstrated in (Sønderby et al., 2016a) and (Sønderby et al., 2016b) as Ladder VAEs.
There has been plenty of followup work on gradient variance reduction for the variational parameters of discrete latent variables, as opposed to continuous latent variables for which reparameterization gradients apply. These proposals include NVIL (Mnih and Gregor, 2014), MuProp (Gu et al., 2015), Variational inference for Monte Carlo objectives (Mnih and Rezende, 2016), the Concrete distribution (Maddison et al., 2016) and Categorical Reparameterization with GumbelSoftmax (Jang et al., 2016).
The ELBO objective can be generalized into an importanceweighted objective, as proposed in (Burda et al., 2015) (ImportanceWeighted Autoencoders). This potentially reduces the variance in the gradient, but has not been discussed indepth here since (as often the case with importanceweighted estimators) it can be difficult to scale to highdimensional latent spaces. Other objectives have been proposed such as Rényi divergence variational inference (Li and Turner, 2016), Generative Moment Matching Networks (Li et al., 2015), objectives based on normalizing such as NICE and RealNVP flows (SohlDickstein et al., 2015; Dinh et al., 2014), blackbox $\alpha $divergence minimization (HernándezLobato et al., 2016) and Bidirectional Helmholtz Machines (Bornschein et al., 2016).
Various combinations with adversarial objectives have been proposed. In (Makhzani et al., 2015), the "adversarial autoencoder" (AAE) was proposed, a probabilistic autoencoder that uses a generative adversarial networks (GAN) to perform variational inference. In (Dumoulin et al., 2016) Adversarially Learned Inference (ALI) was proposed, which aims to minimize a GAN objective between the joint distributions ${q}_{\mathit{\varphi}}(\mathbf{x},\mathbf{z})$ and ${p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})$. Other hybrids have been proposed as well (Larsen et al., 2015; Brock et al., 2016; Hsu et al., 2017).
One of the most prominent, and most difficult, applications of generative models is image modeling. In (Kulkarni et al., 2015) (Deep convolutional inverse graphics network), a convolutional VAE was applied to modeling images with some success, building on work by (Dosovitskiy et al., 2014) proposing convolutional networks for image synthesis. In (Gregor et al., 2015) (DRAW), an attention mechanism was combined with a recurrent inference model and recurrent generative model for image synthesis. This approach was further extended in (Gregor et al., 2016) (Towards Conceptual Compression) with convolutional networks, scalable to larger images, and applied to image compression. In (Kingma et al., 2016), deep convolutional inference models and generative models were also applied to images. Furthermore, (Gulrajani et al., 2016) (PixelVAE) and (Chen et al., 2016) (Variational Lossy Autoencoder) combined convolutional VAEs with the PixelCNN model (van den Oord et al., 2016b, a). Methods and VAE architectures for controlled image generation from attributes or text were studied in (Kingma et al., 2014; Yan et al., 2016; Mansimov et al., 2015; Brock et al., 2016; Yeh et al., 2016; White, 2016). Predicting the color of pixels based on a grayscale image is another promising application (Deshpande et al., 2016). The application to semisupervised learning has been studied in (Kingma et al., 2014; Pu et al., 2016; Xu et al., 2017) among other work.
Another prominent application of VAEs is modeling of text and or sequential data (Bayer and Osendorfer, 2014; Fabius and van Amersfoort, 2014; Krishnan et al., 2015; Bowman et al., 2015; Serban et al., 2016; Johnson et al., 2016; Karl et al., 2016; Fraccaro et al., 2016; Miao et al., 2016; Semeniuta et al., 2017; Zhao et al., 2017; Yang et al., 2017; Hu et al., 2017). VAEs have also been applied to speech and handwriting Chung et al. (2015). Sequential models typically use recurrent neural networks, such as LSTMs (Hochreiter and Schmidhuber, 1997), as encoder and/or decoder. When modeling sequences, the validity of a sequence can sometimes by constrained by a contextfree grammar. In this case, incorporation of the grammar in VAEs can lead to better models, as shown in (Kusner et al., 2017) (Grammar VAEs), and applied to modeling molecules in textual representations.
Since VAEs can transform discrete observation spaces to continuous latentvariable spaces with approximately known marginals, they are interesting for use in modelbased control (Watter et al., 2015; Pritzel et al., 2017). In (Heess et al., 2015b) (Stochastic Value Gradients) it was shown that the reparameterization of the observed variables, together with an observation model, can be used to compute novel forms of policy gradients. In (Heess et al., 2015a), the reparameterization trick was used for memorybased control. Variational inference and reparameterization gradients have also been used for variational information maximisation for intrinsically motivated reinforcement learning (Mohamed and Rezende, 2015) and VIME (Houthooft et al., 2016) for improved exploration. Variational autoencoders have also been used as components in models that perform iterative reasoning about objects in a scene (Eslami et al., 2016).
In (Higgins et al., 2016) ($\beta $VAE) it was proposed to strengthen the contribution of ${D}_{KL}({q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x}){p}_{\bm{\theta}}(\mathbf{z}))$, thus restricting the information flow through the latent space, which was shown to improve disentanglement of latent factors, further studied in (Chen et al., 2018).
Other applications include modeling of graphs (Kipf and Welling, 2016) (Variational Graph Autoencoders), learning of 3D structure from images (Rezende et al., 2016a), oneshot learning (Rezende et al., 2016b), learning nonlinear state space models (Krishnan et al., 2017), voice conversion from nonparallel corpora (Hsu et al., 2016), discriminationaware (fair) representations (Louizos et al., 2015) and transfer learning (Edwards and Storkey, 2016).
The reparameterization gradient estimator discussed in this work has been extended in various directions (Ruiz et al., 2016), including acceptancerejection sampling algorithms (Naesseth et al., 2017). The gradient variance can in some cases be reduced by ’carving up the ELBO’ (Hoffman and Johnson, 2016; Roeder et al., 2017) and using a modified gradient estimator. A secondorder gradient estimator has also been proposed in (Fan et al., 2015).
All in all, this remains an actively researched area with frequently exciting developments.
Chapter 5 Conclusion
Directed probabilistic models form an important aspect of modern artificial intelligence. Such models can be made incredibly flexible by parameterizing the conditional distributions with differentiable deep neural networks.
Optimization of such models towards the maximum likelihood objective is straightforward in the fullyobserved case. However, one is often more interested in flexible models with latent variables, such as deep latentvariable models, or Bayesian models with random parameters. In both cases one needs to perform approximate posterior estimation for which variational inference (VI) methods are suitable. In VI, inference is cast as an optimization problem over newly introduced variational parameters, typically optimized towards the ELBO, a lower bound on the model evidence, or marginal likelihood of the data. Existing methods for such posterior inference were either relatively inefficient, or not applicable to models with neural networks as components. Our main contribution is a framework for efficient and scalable gradientbased variational posterior inference and approximate maximum likelihood learning.
In this paper we describe the variational autoencoder and its extensions. It is a combination of a deep latentvariable model (DLVM) with continuous latent variables, and an associated inference model. The DLVM is a type of generative model over the data. The inference model, also called encoder or recognition model, approximates the posterior distribution of the latent variables of the generative model. Both the generative model and the inference model are directed graphical models that are wholly or partially parameterized by deep neural networks. The parameters of the models, including the parameters of the neural networks such as the weights and biases, are jointly optimized by performing stochastic gradient ascent on the socalled evidence lower bound (ELBO). The ELBO is a lower bound on the marginal likelihood of the data, also called the variational lower bound. Stochastic gradients, necessary for performing SGD, are obtained through a basic reparameterization trick. The VAE framework is now a commonly used tool for various applications of probabilistic modeling and artificial creativity, and basic implementations are available in most major deep learning software libraries.
For learning flexible inference models, we proposed inverse autoregressive flows (IAF), a type of normalizing flow that allows scaling to highdimensional latent spaces. The VAE framework is compatible with almost arbitrary differentiable neural networks. We showed that a VAE with an IAF posterior and ResNets, a novel type of neural network, can learn models of natural images that are close to stateoftheart in terms of loglikelihood, while allowing for orders of magnitude faster sampling. An interesting direction for further exploration is comparison with transformations with computationally cheap inverses, such as NICE (Dinh et al., 2014) and Real NVP (Dinh et al., 2016). Application of such transformations in the VAE framework can potentially lead to relatively simple VAEs with a combination of powerful posteriors, priors and decoders. Such architectures can potentially rival or surpass purely autoregressive architectures (van den Oord et al., 2016a), while allowing much faster synthesis.
The proposed VAE framework remains the only framework in the literature that allows for both discrete and continuous observed variables, allows for efficient amortized latentvariable inference and fast synthesis, and which can produce close to stateoftheart performance in terms of the loglikelihood of data.
Chapter 6 Appendix
6.1 Notation and definitions
6.1.1 Notation
Example(s)  Description 
$\mathbf{x},\mathbf{y}\mathbf{z}$  With characters in bold we typically denote random vectors. We also use this notation for collections of random variables variables. 
$x,y,z$  With characters in italic we typically denote random scalars, i.e. single realvalued numbers. 
$\mathbf{X},\mathbf{Y},\mathbf{Z}$  With bold and capitalized letters we typically denote random matrices. 
$Pa(\mathbf{z})$  The parents of random variable $\mathbf{z}$ in a directed graph. 
$\text{diag}(\mathbf{x})$  Diagonal matrix, with the values of vector $\mathbf{x}$ on the diagonal. 
$\mathbf{x}\odot \mathbf{y}$  Elementwise multiplication of two vectors. The resulting vector is ${({x}_{1}{y}_{1},\mathrm{\dots},{x}_{K}{y}_{K})}^{T}$. 
$\theta $  Parameters of a (generative) model are typically denoted with the Greek lowercase letter $\theta $ (theta). 
$\mathit{\varphi}$  Variational parameters are typically denoted with the bold Greek letter $\mathit{\varphi}$ (phi). 
$p(\mathbf{x}),p(\mathbf{z})$  Probability density functions (PDFs) and probability mass functions (PMFs), also simply called distributions, are denoted by $p(.)$, $q(.)$ or $r(.)$. 
$p(\mathbf{x},\mathbf{y},\mathbf{z})$  Joint distributions are denoted by $p(.,.)$ 
$p(\mathbf{x}\mathbf{z})$  Conditional distributions are denoted by $p(..)$ 
$p(.;\theta ),{p}_{\theta}(\mathbf{x})$  The parameters of a distribution are denoted with $p(.;\theta )$ or equivalently with subscript ${p}_{\theta}(.)$. 
$p(\mathbf{x}=\mathbf{a}),p(\mathbf{x}\le \mathbf{a})$  We may use an (in)equality sign within a probability distribution to distinguish between function arguments and value at which to evaluate. So $p(\mathbf{x}=\mathbf{a})$ denotes a PDF or PMF over variable $\mathbf{x}$ evaluated at the value of variable $\mathbf{a}$. Likewise, $p(\mathbf{x}\le \mathbf{a})$ denotes a CDF evaluated at the value of $\mathbf{a}$. 
$p(.),q(.)$  We use different letters to refer to different probabilistic models, such as $p(.)$ or $q(.)$. Conversely, we use the same letter across different marginals/conditionals to indicate they relate to the same probabilistic model. 
6.1.2 Definitions
Term  Description 
Probability density function (PDF)  A function that assigns a probability density to each possible value of given continuous random variables. 
Cumulative distribution function (CDF)  A function that assigns a cumulative probability density to each possible value of given univariate continuous random variables. 
Probability mass function (PMF)  A function that assigns a probability mass to given discrete random variable. 
6.1.3 Distributions
We overload the notation of distributions (e.g. $p(\mathbf{x})=\mathcal{N}(\mathbf{x};\bm{\mu},\mathbf{\Sigma})$) with two meanings: (1) a distribution from which we can sample, and (2) the probability density function (PDF) of that distribution.
Term  Description 
$\text{Categorical}(x;\mathbf{p})$  Categorical distribution, with parameter $\mathbf{p}$ such that ${\sum}_{i}{p}_{i}=1$. 
$\text{Bernoulli}(\mathbf{x};\mathbf{p})$  Multivariate distribution of independent Bernoulli. $\text{Bernoulli}(\mathbf{x};\mathbf{p})={\prod}_{i}\text{Bernoulli}({x}_{i};{p}_{i})$ with $\forall i:0\le {p}_{i}\le 1$. 
$\text{Normal}(\mathbf{x};\bm{\mu},\mathbf{\Sigma})=\mathcal{N}(\mathbf{x};\bm{\mu},\mathbf{\Sigma})$  Multivariate Normal distribution with mean $\bm{\mu}$ and covariance $\mathbf{\Sigma}$. 
6.1.3.1 Chain rule of probability
$p(\mathbf{a},\mathbf{b})=p(\mathbf{a})p(\mathbf{b}\mathbf{a})$  (6.1) 
6.1.3.2 Bayes’ Rule
$p(\mathbf{a}\mathbf{b})=p(\mathbf{b}\mathbf{a})p(\mathbf{a})/p(\mathbf{b})$  (6.2) 
6.1.4 Bayesian Inference
Let $p(\theta )$ be a chosen marginal distribution over its parameters $\theta $, called a prior distribution. Let $\mathcal{D}$ be observed data, $p(\mathcal{D}\theta )\equiv {p}_{\bm{\theta}}(\mathcal{D})$ be the probability assigned to the data under the model with parameters $\theta $. Recall the chain rule in probability:
$p(\theta ,\mathcal{D})=p(\theta \mathcal{D})p(\mathcal{D})=p(\theta )p(\mathcal{D}\theta )$ 
Simply rearranging terms above, the posterior distribution over the parameters $\theta $, taking into account the data $\mathcal{D}$, is:
$p(\theta \mathcal{D})={\displaystyle \frac{p(\mathcal{D}\theta )p(\theta )}{p(\mathcal{D})}}\propto p(\mathcal{D}\theta )p(\theta )$  (6.3) 
where the proportionality ($\propto $) holds since $p(\mathcal{D})$ is a constant that is not dependent on parameters $\theta $. The formula above is known as Bayes’ rule, a fundamental formula in machine learning and statistics, and is of special importance to this work.
A principal application of Bayes’ rule is that it allows us to make predictions about future data ${\mathbf{x}}^{\prime}$, that are optimal as long as the prior $p(\theta )$ and model class ${p}_{\bm{\theta}}(\mathbf{x})$ are correct:
$p(\mathbf{x}={\mathbf{x}}^{\prime}\mathcal{D})={\displaystyle \int}{p}_{\bm{\theta}}(\mathbf{x}={\mathbf{x}}^{\prime})p(\theta \mathcal{D})d\theta $ 
6.2 Alternative methods for learning in DLVMs
6.2.1 Maximum A Posteriori
From a Bayesian perspective, we can improve upon the maximum likelihood objective through maximum a posteriori (MAP) estimation, which maximizes the logposterior w.r.t. $\theta $. With i.i.d. data $\mathcal{D}$, this is:
${L}^{MAP}(\theta )$  $=\mathrm{log}p(\theta \mathcal{D})$  (6.4)  
$=\mathrm{log}p(\theta )+{L}^{ML}(\theta )+\text{constant}$  (6.5) 
The prior $p(\theta )$ in equation (6.5) has diminishing effect for increasingly large $N$. For this reason, in case of optimization with large datasets, we often choose to simply use the maximum likelihood criterion by omitting the prior from the objective, which is numerically equivalent to setting $p(\theta )=\text{constant}$.
6.2.2 Variational EM with local variational parameters
Expectation Maximization (EM) is a general strategy for learning parameters in partially observed models (Dempster et al., 1977). See section 6.2.3 for a discussion of EM using MCMC. The method can be explained as coordinate ascent on the ELBO (Neal and Hinton, 1998). In case of of i.i.d. data, traditional variational EM methods estimate local variational parameters ${\mathit{\varphi}}^{(i)}$, i.e. a separate set of variational parameters per datapoint $i$ in the dataset. In contrast, VAEs employ a strategy with global variational parameters.
EM starts out with some (random) initial choice of $\bm{\theta}$ and ${\mathit{\varphi}}^{(1:N)}$. It then iteratively applies updates:
$\forall i=1,\mathrm{\dots},N:{\mathit{\varphi}}^{(i)}\leftarrow \underset{\mathit{\varphi}}{argmax}\mathcal{L}({\mathbf{x}}^{(i)};\bm{\theta},\mathit{\varphi})$  (Estep)  (6.6)  
$\bm{\theta}\leftarrow \underset{\bm{\theta}}{argmax}{\displaystyle \sum _{i=1}^{N}}\mathcal{L}({\mathbf{x}}^{(i)};\bm{\theta},\mathit{\varphi})$  (Mstep)  (6.7) 
until convergence. Why does this work? Note that at the Estep:
$\underset{\mathit{\varphi}}{argmax}\mathcal{L}(\mathbf{x};\bm{\theta},\mathit{\varphi})$  (6.8)  
$=\underset{\mathit{\varphi}}{argmax}[\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x}){D}_{KL}({q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x}){p}_{\bm{\theta}}(\mathbf{z}\mathbf{x}))]$  (6.9)  
$=\underset{\mathit{\varphi}}{argmin}{D}_{KL}({q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x}){p}_{\bm{\theta}}(\mathbf{z}\mathbf{x}))$  (6.10) 
so the $E$step, sensibly, minimizes the KL divergence of ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$ from the true posterior.
Secondly, note that if ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$ equals ${p}_{\bm{\theta}}(\mathbf{z}\mathbf{x})$, the ELBO equals the marginal likelihood, but that for any choice of ${q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x})$, the $M$step optimizes a bound on the marginal likelihood. The tightness of this bound is defined by ${D}_{KL}({q}_{\mathit{\varphi}}(\mathbf{z}\mathbf{x}){p}_{\bm{\theta}}(\mathbf{z}\mathbf{x}))$.
6.2.3 MCMCEM
Another Bayesian approach towards optimizing the likelihood ${p}_{\bm{\theta}}(\mathbf{x})$ with DLVMs is Expectation Maximization (EM) with Markov Chain Monte Carlo (MCMC). In case of MCMC, the posterior is approximated by a mixture of a set of approximately i.i.d. samples from the posterior, acquired by running a Markov chain. Note that posterior gradients in DLVMs are relatively affordable to compute by differentiating the logjoint distribution w.r.t. $\mathbf{z}$:
${\nabla}_{\mathbf{z}}\mathrm{log}{p}_{\bm{\theta}}(\mathbf{z}\mathbf{x})$  $={\nabla}_{\mathbf{z}}\mathrm{log}[{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})/{p}_{\bm{\theta}}(\mathbf{x})]$  (6.11)  
$={\nabla}_{\mathbf{z}}[\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x})]$  (6.12)  
$={\nabla}_{\mathbf{z}}\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z}){\nabla}_{\mathbf{z}}\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x})$  (6.13)  
$={\nabla}_{\mathbf{z}}\mathrm{log}{p}_{\bm{\theta}}(\mathbf{x},\mathbf{z})$  (6.14) 
One version of MCMC which uses such posterior for relatively fast convergence, is Hamiltonian MCMC (Neal, 2011). A disadvantage of this approach is the requirement for running an independent MCMC chain per datapoint.
6.3 Stochastic Gradient Descent
We work with directed models where the objective per datapoint is scalar, and due to the differentiability of neural networks that compose them, the objective is differentiable w.r.t. its parameters $\theta $. Due to the remarkable efficiency of reversemode automatic differentiation (also known as the backpropagation algorithm (Rumelhart et al., 1988)), the value and gradient (i.e. the vector of partial derivatives) of differentiable scalar objectives can be computed with equal time complexity. In SGD, we iteratively update parameters $\theta $:
${\theta}_{t+1}\leftarrow {\theta}_{t}+{\alpha}_{t}\cdot {\nabla}_{\theta}\stackrel{~}{L}(\theta ,\xi )$  (6.15) 
where ${\alpha}_{t}$ is a learning rate or preconditioner, and $\stackrel{~}{L}(\theta ,\xi )$ is an unbiased estimate of the objective $L(\theta )$, i.e. ${\mathbb{E}}_{\xi \sim p(\xi )}\left[\stackrel{~}{L}(\theta ,\xi )\right]=L(\theta )$. The random variable $\xi $ could e.g. be a datapoint index, uniformly sampled from $\{1,\mathrm{\dots},N\}$, but can also include different types of noise such posterior sampling noise in VAEs. In experiments, we have typically used the Adam and Adamax optimization methods for choosing ${\alpha}_{t}$ (Kingma and Ba, 2015); these methods are invariant to constant rescaling of the objective, and invariant to constant rescalings of the individual gradients. As a result, $\stackrel{~}{L}(\theta ,\xi )$ only needs to be unbiased up to proportionality. We iteratively apply eq. (6.15) until a stopping criterion is met. A simple but effective criterion is to stop optimization as soon as the probability of a holdout set of data starts decreasing; this criterion is called early stopping.
References
 Banerjee [2008] Arindam Banerjee. An Analysis of Logistic Models: Exponential Family Connections and Online Performance. 2008.
 Bayer and Osendorfer [2014] Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.
 Bengio and ThibodeauLaufer [2013] Yoshua Bengio and Éric ThibodeauLaufer. Deep generative stochastic networks trainable by backprop. arXiv preprint arXiv:1306.1091, 2013.
 Bengio et al. [2013] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE, 2013.
 Berg et al. [2018] Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.
 Blei et al. [2012] David M Blei, Michael I Jordan, and John W Paisley. Variational Bayesian inference with Stochastic Search. In Proceedings of the 29th International Conference on Machine Learning (ICML12), pages 1367–1374, 2012.
 Blundell et al. [2015] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
 Bornschein et al. [2016] Jorg Bornschein, Samira Shabanian, Asja Fischer, and Yoshua Bengio. Bidirectional helmholtz machines. In International Conference on Machine Learning, pages 2511–2519, 2016.
 Bourlard and Kamp [1988] Hervé Bourlard and Yves Kamp. Autoassociation by multilayer perceptrons and singular value decomposition. Biological cybernetics, 59(45):291–294, 1988.
 Bowman et al. [2015] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
 Brock et al. [2016] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093, 2016.
 Burda et al. [2015] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
 Chen et al. [2018] Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.
 Chen et al. [2016] Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.
 Chung et al. [2015] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2980–2988, 2015.
 Cremer and Duvenaud [2017] Quaid Cremer, Chris ad Morris and David Duvenaud. Reinterpreting importance weighted autoencoders. ICLR 2017, 2017.
 Dayan et al. [1995] Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The helmholtz machine. Neural computation, 7(5):889–904, 1995.
 Deco and Brauer [1995] Gustavo Deco and Wilfried Brauer. Higher order statistical decorrelation without information loss. Advances in Neural Information Processing Systems, pages 247–254, 1995.
 Dempster et al. [1977] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38, 1977.
 Deshpande et al. [2016] Aditya Deshpande, Jiajun Lu, MaoChuang Yeh, and David A Forsyth. Learning diverse image colorization. CoRR, abs/1612.01958, 1, 2016.
 Dinh et al. [2014] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: nonlinear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
 Dinh et al. [2016] Laurent Dinh, Jascha SohlDickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
 Dosovitskiy et al. [2014] Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. arXiv preprint arXiv:1411.5928, 2014.
 Dumoulin et al. [2016] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
 Edwards and Storkey [2016] Harrison Edwards and Amos Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.
 Eslami et al. [2016] SM Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Geoffrey E Hinton, et al. Attend, infer, repeat: Fast scene understanding with generative models. In Advances In Neural Information Processing Systems, pages 3225–3233, 2016.
 Fabius and van Amersfoort [2014] Otto Fabius and Joost R van Amersfoort. Variational recurrent autoencoders. arXiv preprint arXiv:1412.6581, 2014.
 Fan et al. [2015] Kai Fan, Ziteng Wang, Jeff Beck, James Kwok, and Katherine A Heller. Fast second order stochastic backpropagation for variational inference. In Advances in Neural Information Processing Systems, pages 1387–1395, 2015.
 Fortunato et al. [2017] Meire Fortunato, Charles Blundell, and Oriol Vinyals. Bayesian recurrent neural networks. arXiv preprint arXiv:1704.02798, 2017.
 Fraccaro et al. [2016] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Advances in neural information processing systems, pages 2199–2207, 2016.
 Fu [2006] Michael C Fu. Gradient estimation. Handbooks in operations research and management science, 13:575–616, 2006.
 Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, pages 1019–1027, 2016.
 Germain et al. [2015] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for distribution estimation. arXiv preprint arXiv:1502.03509, 2015.
 Gershman and Goodman [2014] Samuel Gershman and Noah Goodman. Amortized inference in probabilistic reasoning. In CogSci, 2014.
 Glasserman [2013] Paul Glasserman. Monte Carlo methods in financial engineering, volume 53. Springer Science & Business Media, 2013.
 Glynn [1990] Peter W Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
 GómezBombarelli et al. [2016] Rafael GómezBombarelli, David Duvenaud, José Miguel HernándezLobato, Jorge AguileraIparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán AspuruGuzik. Automatic chemical design using a datadriven continuous representation of molecules. arXiv preprint arXiv:1610.02415, 2016.
 Goodfellow et al. [2014] Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. Book in preparation for MIT Press, 2016. URL http://www.deeplearningbook.org.
 Graves [2011] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
 Gregor et al. [2013] Karol Gregor, Andriy Mnih, and Daan Wierstra. Deep AutoRegressive Networks. arXiv preprint arXiv:1310.8499, 2013.
 Gregor et al. [2015] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
 Gregor et al. [2016] Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. In Advances In Neural Information Processing Systems, pages 3549–3557, 2016.
 Grover et al. [2017] Aditya Grover, Manik Dhar, and Stefano Ermon. Flowgan: Bridging implicit and prescribed learning in generative models. arXiv preprint arXiv:1705.08868, 2017.
 Gu et al. [2015] Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. arXiv preprint arXiv:1511.05176, 2015.
 Gulrajani et al. [2016] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. PixelVAE: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
 He et al. [2015a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015a.
 He et al. [2015b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015b.
 Heess et al. [2015a] Nicolas Heess, Jonathan J Hunt, Timothy P Lillicrap, and David Silver. Memorybased control with recurrent neural networks. arXiv preprint arXiv:1512.04455, 2015a.
 Heess et al. [2015b] Nicolas Heess, Gregory Wayne, David Silver, Tim Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015b.
 HernándezLobato et al. [2016] José Miguel HernándezLobato, Yingzhen Li, Mark Rowland, Daniel HernándezLobato, Thang Bui, and Richard Eric Turner. Blackbox $\alpha $divergence minimization. 2016.
 Higgins et al. [2016] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. betavae: Learning basic visual concepts with a constrained variational framework. 2016.
 Hinton et al. [1995] Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The "wakesleep" algorithm for unsupervised neural networks. SCIENCE, pages 1158–1158, 1995.
 Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long ShortTerm Memory. Neural computation, 9(8):1735–1780, 1997.
 Hoffman and Johnson [2016] Matthew D Hoffman and Matthew J Johnson. Elbo surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016.
 Hoffman et al. [2013] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 Houthooft et al. [2016] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.
 Hsu et al. [2016] ChinCheng Hsu, HsinTe Hwang, YiChiao Wu, Yu Tsao, and HsinMin Wang. Voice conversion from nonparallel corpora using variational autoencoder. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016 AsiaPacific, pages 1–6. IEEE, 2016.
 Hsu et al. [2017] ChinCheng Hsu, HsinTe Hwang, YiChiao Wu, Yu Tsao, and HsinMin Wang. Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks. arXiv preprint arXiv:1704.00849, 2017.
 Hu et al. [2017] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. Controllable text generation. arXiv preprint arXiv:1703.00955, 2017.
 Huang et al. [2018] ChinWei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. arXiv preprint arXiv:1804.00779, 2018.
 Jang et al. [2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with GumbelSoftmax. arXiv preprint arXiv:1611.01144, 2016.
 Johnson et al. [2016] Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pages 2946–2954, 2016.
 Jozefowicz et al. [2015] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML15), pages 2342–2350, 2015.
 Kaae Sønderby et al. [2016] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. How to train deep variational autoencoders and probabilistic ladder networks. arXiv preprint arXiv:1602.02282, 2016.
 Karl et al. [2016] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. Deep variational bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432, 2016.
 Kavukcuoglu et al. [2008] Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. Fast inference in sparse coding algorithms with applications to object recognition. Technical Report CBLLTR20081201, Computational and Biological Learning Lab, Courant Institute, NYU, 2008.
 Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Proceedings of the International Conference on Learning Representations 2015, 2015.
 Kingma and Welling [2013] Diederik P Kingma and Max Welling. Autoencoding variational Bayes. Proceedings of the 2nd International Conference on Learning Representations, 2013.
 Kingma et al. [2014] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semisupervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
 Kingma et al. [2015] Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
 Kingma et al. [2016] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
 Kipf and Welling [2016] Thomas N Kipf and Max Welling. Variational graph autoencoders. arXiv preprint arXiv:1611.07308, 2016.
 Kleijnen and Rubinstein [1996] Jack PC Kleijnen and Reuven Y Rubinstein. Optimization and sensitivity analysis of computer simulation models by the score function method. European Journal of Operational Research, 88(3):413–427, 1996.
 Koller and Friedman [2009] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
 Krishnan et al. [2015] Rahul G Krishnan, Uri Shalit, and David Sontag. Deep Kalman filters. arXiv preprint arXiv:1511.05121, 2015.
 Krishnan et al. [2017] Rahul G Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In AAAI, pages 2101–2109, 2017.
 Kucukelbir et al. [2016] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M Blei. Automatic differentiation variational inference. arXiv preprint arXiv:1603.00788, 2016.
 Kulkarni et al. [2015] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.
 Kusner et al. [2017] Matt J Kusner, Brooks Paige, and José Miguel HernándezLobato. Grammar variational autoencoder. arXiv preprint arXiv:1703.01925, 2017.
 Larsen et al. [2015] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
 LázaroGredilla [2014] Miguel LázaroGredilla. Doubly stochastic variational Bayes for nonconjugate inference. ICML, 2014.
 LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 LeCun et al. [2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 Li and Turner [2016] Yingzhen Li and Richard E Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, pages 1073–1081, 2016.
 Li et al. [2015] Yujia Li, Kevin Swersky, and Richard S Zemel. Generative moment matching networks. In ICML, pages 1718–1727, 2015.
 Linsker [1989] Ralph Linsker. An application of the principle of maximum information preservation to linear systems. Morgan Kaufmann Publishers Inc., 1989.
 Louizos and Welling [2016] Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix gaussian posteriors. In International Conference on Machine Learning, pages 1708–1716, 2016.
 Louizos and Welling [2017] Christos Louizos and Max Welling. Multiplicative normalizing flows for variational bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017.
 Louizos et al. [2015] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoencoder. arXiv preprint arXiv:1511.00830, 2015.
 Louizos et al. [2017] Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. arXiv preprint arXiv:1705.08665, 2017.
 Maaløe et al. [2016] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.
 Maddison et al. [2016] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Makhzani et al. [2015] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 Mansimov et al. [2015] Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Generating images from captions with attention. arXiv preprint arXiv:1511.02793, 2015.
 Miao et al. [2016] Yishu Miao, Lei Yu, and Phil Blunsom. Neural variational inference for text processing. In International Conference on Machine Learning, pages 1727–1736, 2016.
 Mnih and Gregor [2014] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In The 31st International Conference on Machine Learning (ICML), 2014.
 Mnih and Rezende [2016] Andriy Mnih and Danilo J Rezende. Variational inference for Monte Carlo objectives. arXiv preprint arXiv:1602.06725, 2016.
 Mohamed and Rezende [2015] Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pages 2125–2133, 2015.
 Molchanov et al. [2017] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.
 Naesseth et al. [2017] Christian Naesseth, Francisco Ruiz, Scott Linderman, and David Blei. Reparameterization gradients through acceptancerejection sampling algorithms. In Artificial Intelligence and Statistics, pages 489–498, 2017.
 Neal [2011] R Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pages 113–162, 2011.
 Neal and Hinton [1998] Radford M Neal and Geoffrey E Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Springer, 1998.
 Paisley et al. [2012] John Paisley, David Blei, and Michael Jordan. Variational Bayesian inference with stochastic search. In Proceedings of the 29th International Conference on Machine Learning (ICML12), pages 1367–1374, 2012.
 Papamakarios et al. [2017] George Papamakarios, Iain Murray, and Theo Pavlakou. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2335–2344, 2017.
 Pritzel et al. [2017] Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. arXiv preprint arXiv:1703.01988, 2017.
 Pu et al. [2016] Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin. Variational autoencoder for deep learning of images, labels and captions. In Advances in Neural Information Processing Systems, pages 2352–2360, 2016.
 Ranganath et al. [2014] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pages 814–822, 2014.
 Ranganath et al. [2015] Rajesh Ranganath, Dustin Tran, and David M Blei. Hierarchical variational models. arXiv preprint arXiv:1511.02386, 2015.
 Ravanbakhsh et al. [2016] Siamak Ravanbakhsh, Francois Lanusse, Rachel Mandelbaum, Jeff Schneider, and Barnabas Poczos. Enabling dark energy science with deep generative models of galaxy images. arXiv preprint arXiv:1609.05796, 2016.
 Rezende and Mohamed [2015] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of The 32nd International Conference on Machine Learning, pages 1530–1538, 2015.
 Rezende et al. [2014] Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML14), pages 1278–1286, 2014.
 Rezende et al. [2016a] Danilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3d structure from images. In Advances In Neural Information Processing Systems, pages 4997–5005, 2016a.
 Rezende et al. [2016b] Danilo Jimenez Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. Oneshot generalization in deep generative models. arXiv preprint arXiv:1603.05106, 2016b.
 Roeder et al. [2017] Geoffrey Roeder, Yuhuai Wu, and David Duvenaud. Sticking the landing: An asymptotically zerovariance gradient estimator for variational inference. arXiv preprint arXiv:1703.09194, 2017.
 Rosca et al. [2018] Mihaela Rosca, Balaji Lakshminarayanan, and Shakir Mohamed. Distribution matching in variational inference. arXiv preprint arXiv:1802.06847, 2018.
 Roweis [1998] Sam Roweis. EM algorithms for PCA and SPCA. Advances in neural information processing systems, pages 626–632, 1998.
 Ruiz et al. [2016] Francisco R Ruiz, Michalis Titsias RC AUEB, and David Blei. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems, pages 460–468, 2016.
 Rumelhart et al. [1988] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by backpropagating errors. Cognitive modeling, 5(3):1, 1988.
 Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 Salakhutdinov and Larochelle [2010] Ruslan Salakhutdinov and Hugo Larochelle. Efficient learning of deep boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pages 693–700, 2010.
 Salimans [2016] Tim Salimans. A structured variational autoencoder for learning deep hierarchies of sparse features. arXiv preprint arXiv:1602.08734, 2016.
 Salimans and Knowles [2013] Tim Salimans and David A Knowles. Fixedform variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4), 2013.
 Salimans et al. [2015] Tim Salimans, Diederik P Kingma, and Max Welling. Markov Chain Monte Carlo and variational inference: Bridging the gap. In ICML, volume 37, pages 1218–1226, 2015.
 Semeniuta et al. [2017] Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. A hybrid convolutional variational autoencoder for text generation. arXiv preprint arXiv:1702.02390, 2017.
 Serban et al. [2016] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. A hierarchical latent variable encoderdecoder model for generating dialogues. arXiv preprint arXiv:1605.06069, 2016.
 SohlDickstein et al. [2015] Jascha SohlDickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585, 2015.
 Sønderby et al. [2016a] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, S Kaae Sønderby, and Ole Winther. How to train deep variational autoencoders and probabilistic ladder networks. arXiv preprint arXiv:1602.02282, 2016a.
 Sønderby et al. [2016b] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in neural information processing systems, pages 3738–3746, 2016b.
 Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
 Tomczak and Welling [2016] Jakub M Tomczak and Max Welling. Improving variational autoencoders using householder flow. arXiv preprint arXiv:1611.09630, 2016.
 Tomczak and Welling [2017] Jakub M Tomczak and Max Welling. Improving variational autoencoders using convex combination linear inverse autoregressive flow. arXiv preprint arXiv:1706.02326, 2017.
 Tran et al. [2015] Dustin Tran, Rajesh Ranganath, and David M Blei. The variational Gaussian process. arXiv preprint arXiv:1511.06499, 2015.
 Tran et al. [2017] Dustin Tran, Matthew D Hoffman, Rif A Saurous, Eugene Brevdo, Kevin Murphy, and David M Blei. Deep probabilistic programming. arXiv preprint arXiv:1701.03757, 2017.
 van den Oord et al. [2016a] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016a.
 van den Oord et al. [2016b] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016b.
 Vincent et al. [2010] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and PierreAntoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.
 Watter et al. [2015] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pages 2746–2754, 2015.
 White [2016] Tom White. Sampling generative networks: Notes on a few effective techniques. arXiv preprint arXiv:1609.04468, 2016.
 Williams [1992] Ronald J Williams. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine learning, 8(34):229–256, 1992.
 Wingate and Weber [2013] David Wingate and Theophane Weber. Automated variational inference in probabilistic programming. arXiv preprint arXiv:1301.1299, 2013.
 Xu et al. [2017] Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. Variational autoencoder for semisupervised text classification. In AAAI, pages 3358–3364, 2017.
 Yan et al. [2016] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.
 Yang et al. [2017] Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor BergKirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. arXiv preprint arXiv:1702.08139, 2017.
 Yeh et al. [2016] Raymond Yeh, Ziwei Liu, Dan B Goldman, and Aseem Agarwala. Semantic facial expression editing using autoencoded flow. arXiv preprint arXiv:1611.09961, 2016.
 Zhao et al. [2017] Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. Learning discourselevel diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960, 2017.