An Introduction to Variational Autoencoders

  • 2019-06-06 16:35:38
  • Diederik P. Kingma, Max Welling
  • 136


Variational autoencoders provide a principled framework for learning deeplatent-variable models and corresponding inference models. In this work, weprovide an introduction to variational autoencoders and some importantextensions.


Quick Read (beta)

An Introduction to
Variational Autoencoders
December 2017

Diederik P. Kingma
   Max Welling

Variational autoencoders provide a principled framework for learning deep latent-variable models and corresponding inference models. In this work, we provide an introduction to variational autoencoders and some important extensions.


Chapter 1 Introduction

1.1 Motivation

One major division in machine learning is generative versus discriminative modeling. While in discriminative modeling one aims to learn a predictor given the observations, in generative modeling one aims to solve the more general problem of learning a joint distribution over all the variables. A generative model simulates how the data is generated in the real world. “Modeling” is understood in almost every science as unveiling this generating process by hypothesizing theories and testing these theories through observations. For instance, when meteorologists model the weather they use highly complex partial differential equations to express the underlying physics of the weather. Or when an astronomer models the formation of galaxies s/he encodes in his/her equations of motion the physical laws under which stellar bodies interact. The same is true for biologists, chemists, economists and so on. Modeling in the sciences is in fact almost always generative modeling.

There are many reasons why generative modeling is attractive. First, we can express physical laws and constraints into the generative process while details that we don’t know or care about, i.e. nuisance variables, are treated as noise. The resulting models are usually highly intuitive and interpretable and by testing them against observations we can confirm or reject our theories about how the world works.

Another reason for trying to understand the generative process of data is that it naturally expresses causal relations of the world. Causal relations have the great advantage that they generalize much better to new situations than mere correlations. For instance, once we understand the generative process of an earthquake, we can use that knowledge both in California and in Chile.

To turn a generative model into a discriminator, we need to use Bayes rule. For instance, we have a generative model for an earthquake of type A and another for type B, then seeing which of the two describes the data best we can compute a probability for whether earthquake A or B happened. Applying Bayes rule is however often computationally expensive.

In discriminative methods we directly learn a map in the same direction as we intend to make future predictions in. This is in the opposite direction than the generative model. For instance, one can argue that an image is generated in the world by first identifying the object, then generating the object in 3D and then projecting it onto an pixel grid. A discriminative model takes these pixel values directly as input and maps them to the labels. While generative models can learn efficiently from data, they also tend to make stronger assumptions on the data than their purely discriminative counterparts, often leading to higher asymptotic bias (Banerjee, 2008) when the model is wrong. For this reason, if the model is wrong (and it almost always is to some degree!), if one is solely interested in learning to discriminate, and one is in a regime with a sufficiently large amount of data, then purely discriminative models typically will lead to fewer errors in discriminative tasks. Nevertheless, depending on how much data is around, it may pay off to study the data generating process as a way to guide the training of the discriminator, such as a classifier. For instance, one may have few labeled examples and many more unlabeled examples. In this semi-supervised learning setting, one can use the generative model of the data to improve classification (Kingma et al., 2014; Kaae Sønderby et al., 2016).

Generative modeling can be useful more generally. One can think of it as an auxiliary task. For instance, predicting the immediate future may help us build useful abstractions of the world that can be used for multiple prediction tasks downstream. This quest for disentangled, semantically meaningful, statistically independent and causal factors of variation in data is generally known as unsupervised representation learning and indeed the VAE has been extensively employed for that purpose. Alternatively, one may view this as an implicit form of regularization: by forcing the representations to be meaningful for data generation, we bias the inverse of that process, which maps from input to representation, into a certain mould. The auxiliary task of predicting the world is used to better understand the world at an abstract level and thus to better make downstream predictions.

The VAE can be viewed as two coupled, but independently parameterized models: the encoder or recognition model, and the decoder or generative model. These two models support each other. The recognition model delivers to the generative model an approximation to its posterior over latent random variables, which it needs to update its parameters inside an iteration of “expectation maximization” learning. Reversely, the generative model is a scaffolding of sorts for the recognition model to learn meaningful representations of the data, including possibly class-labels. The recognition model is the approximate inverse of the generative model according to Bayes rule.

One advantage of the VAE framework, relative to ordinary Variational Inference (VI), is that the recognition model (also called inference model) is now a (stochastic) function of the input variables. This in contrast to VI where each data-case has a separate variational distribution, which is inefficient for large data-sets. The recognition model uses one set of parameters to model the relation between input and latent variables and as such is called “amortized inference”. This recognition model can be arbitrary complex but is still reasonably fast because by construction it can be done using a single feedforward pass from input to latent variables. However the price we pay is that this sampling induces sampling noise in the gradients required for learning. Perhaps the greatest contribution of the VAE framework is the realization that we can counteract this variance by using what is now known as the “reparameterization trick”, a simple procedure to reorganize our gradient computation that reduces variance in the gradients.

The VAE is inspired by the Helmholtz Machine Dayan et al. (1995) which was perhaps the first model that employed a recognition model. However, its wake-sleep algorithm was inefficient and didn’t optimize a single objective. The VAE learning rules instead follow from a single approximation to the maximum likelihood objective.

VAEs marry graphical models and deep learning. The generative model is a Bayesian network of the form p(𝐱|𝐳)p(𝐳), or, if there are multiple stochastic latent layers, a hierarchy such as p(𝐱|𝐳L)p(𝐳L|𝐳L-1)p(𝐳1|𝐳0). Similarly, the recognition model is also a conditional Bayesian network of the form q(𝐳|𝐱) or as a hierarchy, such as q(𝐳0|𝐳1)q(𝐳L|X). But inside each conditional may hide a complex (deep) neural network, e.g. 𝐳|𝐱f(𝐱,ϵ), with f a neural network mapping and ϵ a noise random variable. Its learning algorithm is a mix of classical (amortized, variational) expectation maximization but through the reparameterization trick ends up backpropagating through the many layers of the deep neural networks embedded inside of it.

Since its inception, the VAE framework has been extended in many directions, e.g. to dynamical models (Johnson et al., 2016), models with attention (Gregor et al., 2015), models with multiple levels of stochastic latent variables Kingma et al. (2016), and many more. It has proven itself as a fertile framework to build new models in. More recently, another generative modeling paradigm has gained significant attention: the generative adversarial network (GAN) (Goodfellow et al., 2014). VAEs and GANs seem to have complementary properties: while GANs can generate images of high subjective perceptual quality, they tend to lack full support over the data (Grover et al., 2017), as opposed to likelihood-based generative models. VAEs, like other likelihood-based models, generate more dispersed samples, but are better density models in terms of the likelihood criterion. As such many hybrid models have been proposed to try to represent the best of both worlds (Dumoulin et al., 2016; Grover et al., 2017; Rosca et al., 2018).

As a community we seem to have embraced the fact that generative models and unsupervised learning play an important role in building intelligent machines. We hope that the VAE provides a useful piece of that puzzle.

1.2 Aim

The framework of variational autoencoders (VAEs) (Kingma and Welling, 2013; Rezende et al., 2014) provides a principled method for jointly learning deep latent-variable models and corresponding inference models using stochastic gradient descent. The framework has a wide array of applications from generative modeling, semi-supervised learning to representation learning.

This work is meant as an expanded version of our earlier work (Kingma and Welling, 2013), allowing us to explain the topic in finer detail and to discuss a selection of important follow-up work. This is not aimed to be a comprehensive review of all related work. We assume that the reader has basic knowledge of algebra, calculus and probability theory.

In this chapter we discuss background material: probabilistic models, directed graphical models, the marriage of directed graphical models with neural networks, learning in fully observed models and deep latent-variable models (DLVMs). In chapter 2 we explain the basics of VAEs. In chapter 3 we explain advanced inference techniques, followed by an explanation of advanced generative models in chapter 4. Please refer to section 6.1 for more information on mathematical notation.

1.3 Probabilistic Models and Variational Inference

In the field of machine learning, we are often interested in learning probabilistic models of various natural and artificial phenomena from data. Probabilistic models are mathematical descriptions of such phenomena. They are useful for understanding such phenomena, for prediction of unknowns in the future, and for various forms of assisted or automated decision making. As such, probabilistic models formalize the notion of knowledge and skill, and are central constructs in the field of machine learning and AI.

As probabilistic models contain unknowns and the data rarely paints a complete picture of the unknowns, we typically need to assume some level of uncertainty over aspects of the model. The degree and nature of this uncertainty is specified in terms of (conditional) probability distributions. Models may consist of both continuous-valued variables and discrete-valued variables. The, in some sense, most complete forms of probabilistic models specify all correlations and higher-order dependencies between the variables in the model, in the form of a joint probability distribution over those variables.

Let’s use 𝐱 as the vector representing the set of all observed variables whose joint distribution we would like to model. Note that for notational simplicity and to avoid clutter, we use lower case bold (e.g. 𝐱) to denote the underlying set of observed random variables, i.e. flattened and concatenated such that the set is represented as a single vector. See section 6.1 for more on notation.

We assume the observed variable 𝐱 is a random sample from an unknown underlying process, whose true (probability) distribution p*(𝐱) is unknown. We attempt to approximate this underlying process with a chosen model p𝜽(𝐱), with parameters 𝜽:

𝐱 p𝜽(𝐱) (1.1)

Learning is, most commonly, the process of searching for a value of the parameters 𝜽 such that the probability distribution function given by the model, p𝜽(𝐱), approximates the true distribution of the data, denoted by p*(𝐱), such that for any observed 𝐱:

p𝜽(𝐱)p*(𝐱) (1.2)

Naturally, we wish p𝜽(𝐱) to be sufficiently flexible to be able to adapt to the data, such that we have a chance of obtaining a sufficiently accurate model. At the same time, we wish to be able to incorporate knowledge about the distribution of data into the model that is known a priori.

1.3.1 Conditional Models

Often, such as in case of classification or regression problems, we are not interested in learning an unconditional model p𝜽(𝐱), but a conditional model p𝜽(𝐲|𝐱) that approximates the underlying conditional distribution p*(𝐲|𝐱): a distribution over the values of variable 𝐲, conditioned on the value of an observed variable 𝐱. In this case, 𝐱 is often called the input of the model. Like in the unconditional case, a model p𝜽(𝐲|𝐱) is chosen, and optimized to be close to the unknown underlying distribution, such that for any 𝐱 and 𝐲:

p𝜽(𝐲|𝐱)p*(𝐲|𝐱) (1.3)

A relatively common and simple example of conditional modeling is image classification, where 𝐱 is an image, and 𝐲 is the image’s class, as labeled by a human, which we wish to predict. In this case, p𝜽(𝐲|𝐱) is typically chosen to be a categorical distribution, whose parameters are computed from 𝐱.

Conditional models become more difficult to learn when the predicted variables are very high-dimensional, such as images, video or sound. One example is the reverse of the image classification problem: prediction of a distribution over images, conditioned on the class label. Another example with both high-dimensional input, and high-dimensional output, is time series prediction, such as text or video prediction.

To avoid notational clutter we will often assume unconditional modeling, but one should always keep in mind that the methods introduced in this work are, in almost all cases, equally applicable to conditional models. The data on which the model is conditioned, can be treated as inputs to the model, similar to the parameters of the model, with the obvious difference that one doesn’t optimize over their value.

1.4 Parameterizing Conditional Distributions with Neural Networks

Differentiable feed-forward neural networks, from here just called neural networks, are a particularly flexible and computationally scalable type of function approximator. Learning of models based on neural networks with multiple ’hidden’ layers of artificial neurons is often referred to as deep learning  (Goodfellow et al., 2016; LeCun et al., 2015). A particularly interesting application is probabilistic models, i.e. the use of neural networks for probability density functions (PDFs) or probability mass functions (PMFs) in probabilistic models. Probabilistic models based on neural networks are computationally scalable since they allow for stochastic gradient-based optimization which, as we will explain, allows scaling to large models and large datasets. We will denote a deep neural network as a vector function: NeuralNet(.).

At the time of writing, deep learning has been shown to work well for a large variety of classification and regression problems, as summarized in (LeCun et al., 2015; Goodfellow et al., 2016). In case of neural-network based image classification LeCun et al. (1998), for example, neural networks parameterize a categorical distribution p𝜽(y|𝐱) over a class label y, conditioned on an image 𝐱.

𝐩 =NeuralNet(𝐱) (1.4)
p𝜽(y|𝐱) =Categorical(y;𝐩) (1.5)

where the last operation of NeuralNet(.) is typically a softmax() function such that ipi=1.

1.5 Directed Graphical Models and Neural Networks

We work with directed probabilistic models, also called directed probabilistic graphical models (PGMs), or Bayesian networks. Directed graphical models are a type of probabilistic models where all the variables are topologically organized into a directed acyclic graph. The joint distribution over the variables of such models factorizes as a product of prior and conditional distributions:

p𝜽(𝐱1,,𝐱M)=j=1Mp𝜽(𝐱j|Pa(𝐱j)) (1.6)

where Pa(𝐱j) is the set of parent variables of node j in the directed graph. For non-root-nodes, we condition on the parents. For root nodes, the set of parents is the empty set, such that the distribution is unconditional.

Traditionally, each conditional probability distribution p𝜽(𝐱j|Pa(𝐱j)) is parameterized as a lookup table or a linear model (Koller and Friedman, 2009). As we explained above, a more flexible way to parameterize such conditional distributions is with neural networks. In this case, neural networks take as input the parents of a variable in a directed graph, and produce the distributional parameters 𝜼 over that variable:

𝜼 =NeuralNet(Pa(𝐱)) (1.7)
p𝜽(𝐱|Pa(𝐱)) =p𝜽(𝐱|𝜼) (1.8)

We will now discuss how to learn the parameters of such models, if all the variables are observed in the data.

1.6 Learning in Fully Observed Models with Neural Nets

If all variables in the directed graphical model are observed in the data, then we can compute and differentiate the log-probability of the data under the model, leading to relatively straightforward optimization.

1.6.1 Dataset

We often collect a dataset 𝒟 consisting of N1 datapoints:

𝒟={𝐱(1),𝐱(2),,𝐱(N)}{𝐱(i)}i=1N𝐱(1:N) (1.9)

The datapoints are assumed to be independent samples from an unchanging underlying distribution. In other words, the dataset is assumed to consist of distinct, independent measurements from the same (unchanging) system. In this case, the observations 𝒟={𝐱(i)}i=1N are said to be i.i.d., for independently and identically distributed. Under the i.i.d. assumption, the probability of the datapoints given the parameters factorizes as a product of individual datapoint probabilities. The log-probability assigned to the data by the model is therefore given by:

logp𝜽(𝒟) =𝐱𝒟logp𝜽(𝐱) (1.10)

1.6.2 Maximum Likelihood and Minibatch SGD

The most common criterion for probabilistic models is maximum log-likelihood (ML). As we will explain, maximization of the log-likelihood criterion is equivalent to minimization of a Kullback Leibler divergence between the data and model distributions.

Under the ML criterion, we attempt to find the parameters 𝜽 that maximize the sum, or equivalently the average, of the log-probabilities assigned to the data by the model. With i.i.d. dataset 𝒟 of size N𝒟, the maximum likelihood objective is to maximize the log-probability given by equation (1.10).

Using calculus’ chain rule and automatic differentiation tools, we can efficiently compute gradients of this objective, i.e. the first derivatives the objective w.r.t. its parameters 𝜽. We can use such gradients to iteratively hill-climb to a local optimum of the ML objective. If we compute such gradients using all datapoints, 𝜽logp𝜽(𝒟), then this is known as batch gradient descent. Computation of this derivative is, however, an expensive operation for large dataset size N𝒟, since it scales linearly with N𝒟.

A more efficient method for optimization is stochastic gradient descent (SGD) (section 6.3), which uses randomly drawn minibatches of data 𝒟 of size N. With such minibatches we can form an unbiased estimator of the ML criterion:

1N𝒟logp𝜽(𝒟)1Nlogp𝜽()=1N𝐱logp𝜽(𝐱) (1.11)

The symbol means that one of the two sides is an unbiased estimator of the other side. So one side (in this case the right-hand side) is a random variable due to some noise source, and the two sides are equal when averaged over the noise distribution. The noise source, in this case, is the randomly drawn minibatch of data . The unbiased estimator logp𝜽() is differentiable, yielding the unbiased stochastic gradients:

1N𝒟𝜽logp𝜽(𝒟)1N𝜽logp𝜽()=1N𝐱𝜽logp𝜽(𝐱) (1.12)

These gradients can be plugged into stochastic gradient-based optimizers; see section 6.3 for further discussion. In a nutshell, we can optimize the objective function by repeatedly taking small steps in the direction of the stochastic gradient.

1.6.3 Bayesian inference

From a Bayesian perspective, we can improve upon ML through maximum a posteriori (MAP) estimation (section section  6.2.1), or, going even further, inference of a full approximate posterior distribution over the parameters (see section  6.1.4).

1.7 Learning and Inference in Deep Latent Variable Models

1.7.1 Latent Variables

We can extend fully-observed directed models, discussed in the previous section, into directed models with latent variables. Latent variables are variables that are part of the model, but which we don’t observe, and are therefore not part of the dataset. We typically use 𝐳 to denote such latent variables. In case of unconditional modeling of observed variable 𝐱, the directed graphical model would then represent a joint distribution p𝜽(𝐱,𝐳) over both the observed variables 𝐱 and the latent variables 𝐳. The marginal distribution over the observed variables p𝜽(𝐱), is given by:

p𝜽(𝐱)=p𝜽(𝐱,𝐳)𝑑𝐳 (1.13)

This is also called the (single datapoint) marginal likelihood or the model evidence, when taking as a function of 𝜽.

Such an implicit distribution over 𝐱 can be quite flexible. If 𝐳 is discrete and p𝜽(𝐱|𝐳) is a Gaussian distribution, then p𝜽(𝐱) is a mixture-of-Gaussians distribution. For continuous 𝐳 (which is generally more efficient to work with due to the reparameterization trick), p𝜽(𝐱) can be seen as an infinite mixture, which are potentially more powerful than discrete mixtures. Such marginal distributions ares also called compound probability distributions.

1.7.2 Deep Latent Variable Models

We use the term deep latent variable model (DLVM) to denote a latent variable model p𝜽(𝐱,𝐳) whose distributions are parameterized by neural networks. Such a model can be conditioned on some context, like p𝜽(𝐱,𝐳|𝐲). DLVMs are especially useful when inference is computationally affordable, such as representation learning and artificial creativity

One important advantage of DLVMs, is that even when each factor (prior or conditional distribution) in the directed model is relatively simple (such as conditional Gaussian), the marginal distribution p𝜽(𝐱) can be very complex, i.e. contain almost arbitrary dependencies. This expressivity makes deep latent-variable models attractive for approximating complicated underlying distributions p*(𝐱).

Perhaps the simplest, and most common, graphical model with latent variables is one that is specified as factorization with the following structure:

p𝜽(𝐱,𝐳)=p𝜽(𝐳)p𝜽(𝐱|𝐳) (1.14)

where p𝜽(𝐳) and/or p𝜽(𝐱|𝐳) are specified. The distribution p(𝐳) is often called the prior distribution over 𝐳, since it is not conditioned on any observations.

1.7.3 Example DLVM for multivariate Bernoulli data

A simple example DLVM, used in  (Kingma and Welling, 2013) for binary data 𝐱, is with a spherical Gaussian latent space, and a factorized Bernoulli observation model:

p(𝐳) =𝒩(𝐳;0,𝐈) (1.15)
𝐩 =DecoderNeuralNet𝜽(𝐳) (1.16)
logp(𝐱|𝐳) =j=1Dlogp(xj|𝐳)=j=1DlogBernoulli(xj;pj) (1.17)
=j=1Dxjlogpj+(1-xj)log(1-pj) (1.18)

where pj𝐩:0pj1 (e.g. implemented through a sigmoid nonlinearity as the last layer of the DecoderNeuralNet𝜽(.)), where D is the dimensionality of 𝐱, and Bernoulli(.;p) is the probability mass function (PMF) of the Bernoulli distribution.

1.8 Intractabilities

The main difficulty of maximum likelihood learning in DLVMs is that the marginal probability of data under the model is typically intractable. This is due to the integral in equation (1.13) for computing the marginal likelihood (or model evidence), p𝜽(𝐱)=p𝜽(𝐱,𝐳)𝑑𝐳, not having an analytic solution or efficient estimator. Due to this intractability, we cannot differentiate it w.r.t. its parameters and optimize it, as we can with fully observed models.

The intractability of p𝜽(𝐱), is related to the intractability of the posterior distribution p𝜽(𝐳|𝐱). Note that the joint distribution p𝜽(𝐱,𝐳) is efficient to compute, and that the densities are related through the basic identity:

p𝜽(𝐳|𝐱)=p𝜽(𝐱,𝐳)p𝜽(𝐱) (1.19)

Since p𝜽(𝐱,𝐳) is tractable to compute, a tractable marginal likelihood p𝜽(𝐱) leads to a tractable posterior p𝜽(𝐳|𝐱), and vice versa. Both are intractable in DLVMs.

Approximate inference techniques (see also section  6.2) allow us to approximate the posterior p𝜽(𝐳|𝐱) and the marginal likelihood p𝜽(𝐱) in DLVMs. Traditional inference methods are relatively expensive. Such methods, for example, often require a per-datapoint optimization loop, or yield bad posterior approximations. We would like to avoid such expensive procedures.

Likewise, the posterior over the parameters of (directed models parameterized with) neural networks, p(𝜽|𝒟), is generally intractable to compute exactly, and requires approximate inference techniques.

Chapter 2 Variational Autoencoders

In this chapter we explain the basics of variational autoencoders (VAEs).

2.1 Encoder or Approximate Posterior

In the previous chapter, we introduced deep latent-variable models (DLVMs), and the problem of estimating the log-likelihood and posterior distributions in such models. The framework of variational autoencoders (VAEs) provides a computationally efficient way for optimizing DLVMs jointly with a corresponding inference model using SGD.

To turn the DLVM’s intractable posterior inference and learning problems into tractable problems, we introduce a parametric inference model qϕ(𝐳|𝐱). This model is also called an encoder or recognition model. With ϕ we indicate the parameters of this inference model, also called the variational parameters. We optimize the variational parameters ϕ such that:

qϕ(𝐳|𝐱)p𝜽(𝐳|𝐱) (2.1)

As we will explain, this approximation to the posterior help us optimize the marginal likelihood.

Like a DLVM, the inference model can be (almost) any directed graphical model:

qϕ(𝐳|𝐱)=qϕ(𝐳1,,𝐳M|𝐱)=j=1Mqϕ(𝐳j|Pa(𝐳j),𝐱) (2.2)

where Pa(𝐳j) is the set of parent variables of variable 𝐳j in the directed graph. And also similar to a DLVM, the distribution qϕ(𝐳|𝐱) can be parameterized using deep neural networks. In this case, the variational parameters ϕ include the weights and biases of the neural network. For example:

(𝝁,log𝝈) =EncoderNeuralNetϕ(𝐱) (2.3)
qϕ(𝐳|𝐱) =𝒩(𝐳;𝝁,diag(𝝈)) (2.4)

Typically, we use a single encoder neural network to perform posterior inference over all of the datapoints in our dataset. This can be contrasted to more traditional variational inference methods where the variational parameters are not shared, but instead separately and iteratively optimized per datapoint. The strategy used in VAEs of sharing variational parameters across datapoints is also called amortized variational inference (Gershman and Goodman, 2014). With amortized inference we can avoid a per-datapoint optimization loop, and leverage the efficiency of SGD.

2.2 Evidence Lower Bound (ELBO)

Figure 2.1: A VAE learns stochastic mappings between an observed 𝐱-space, whose empirical distribution q𝒟(𝐱) is typically complicated, and a latent 𝐳-space, whose distribution can be relatively simple (such as spherical, as in this figure). The generative model learns a joint distribution p𝜽(𝐱,𝐳) that is often (but not always) factorized as p𝜽(𝐱,𝐳)=p𝜽(𝐳)p𝜽(𝐱|𝐳), with a prior distribution over latent space p𝜽(𝐳), and a stochastic decoder p𝜽(𝐱|𝐳). The stochastic encoder qϕ(𝐳|𝐱), also called inference model, approximates the true but intractable posterior p𝜽(𝐳|𝐱) of the generative model.

The optimization objective of the variational autoencoder, like in other variational methods, is the evidence lower bound, abbreviated as ELBO. An alternative term for this objective is variational lower bound. Typically, the ELBO is derived through Jensen’s inequality. Here we will use an alternative derivation that avoids Jensen’s inequality, providing greater insight about its tightness.

For any choice of inference model qϕ(𝐳|𝐱), including the choice of variational parameters ϕ, we have:

logp𝜽(𝐱) =𝔼qϕ(𝐳|𝐱)[logp𝜽(𝐱)] (2.5)
=𝔼qϕ(𝐳|𝐱)[log[p𝜽(𝐱,𝐳)p𝜽(𝐳|𝐱)]] (2.6)
=𝔼qϕ(𝐳|𝐱)[log[p𝜽(𝐱,𝐳)qϕ(𝐳|𝐱)qϕ(𝐳|𝐱)p𝜽(𝐳|𝐱)]] (2.7)
=𝔼qϕ(𝐳|𝐱)[log[p𝜽(𝐱,𝐳)qϕ(𝐳|𝐱)]]=𝜽,ϕ(𝐱) (ELBO)+𝔼qϕ(𝐳|𝐱)[log[qϕ(𝐳|𝐱)p𝜽(𝐳|𝐱)]]=DKL(qϕ(𝐳|𝐱)||p𝜽(𝐳|𝐱)) (2.8)

The second term in eq. (2.8) is the Kullback-Leibler (KL) divergence between qϕ(𝐳|𝐱) and p𝜽(𝐳|𝐱), which is non-negative:

DKL(qϕ(𝐳|𝐱)||p𝜽(𝐳|𝐱))0 (2.9)

and zero if, and only if, qϕ(𝐳|𝐱) equals the true posterior distribution.

The first term in eq. (2.8) is the variational lower bound, also called the evidence lower bound (ELBO):

𝜽,ϕ(𝐱)=𝔼qϕ(𝐳|𝐱)[logp𝜽(𝐱,𝐳)-logqϕ(𝐳|𝐱)] (2.10)

Due to the non-negativity of the KL divergence, the ELBO is a lower bound on the log-likelihood of the data.

𝜽,ϕ(𝐱) =logp𝜽(𝐱)-DKL(qϕ(𝐳|𝐱)||p𝜽(𝐳|𝐱)) (2.11)
logp𝜽(𝐱) (2.12)

So, interestingly, the KL divergence DKL(qϕ(𝐳|𝐱)||p𝜽(𝐳|𝐱)) determines two ’distances’:

  1. 1.

    By definition, the KL divergence of the approximate posterior from the true posterior;

  2. 2.

    The gap between the ELBO 𝜽,ϕ(𝐱) and the marginal likelihood logp𝜽(𝐱); this is also called the tightness of the bound. The better qϕ(𝐳|𝐱) approximates the true (posterior) distribution p𝜽(𝐳|𝐱), in terms of the KL divergence, the smaller the gap.

2.2.1 Two for One

By looking at equation 2.11, it can be understood that maximization of the ELBO 𝜽,ϕ(𝐱) w.r.t. the parameters 𝜽 and ϕ, will concurrently optimize the two things we care about:

  1. 1.

    It will approximately maximize the marginal likelihood p𝜽(𝐱). This means that our generative model will become better.

  2. 2.

    It will minimize the KL divergence of the approximation qϕ(𝐳|𝐱) from the true posterior p𝜽(𝐳|𝐱), so qϕ(𝐳|𝐱) becomes better.

Figure 2.2: Simple schematic of computational flow in a variational autoencoder.

2.3 Stochastic Gradient-Based Optimization of the ELBO

An important property of the ELBO, is that it allows joint optimization w.r.t. all parameters (ϕ and 𝜽) using stochastic gradient descent (SGD). We can start out with random initial values of ϕ and 𝜽, and stochastically optimize their values until convergence.

Given a dataset with i.i.d. data, the ELBO objective is the sum (or average) of individual-datapoint ELBO’s:

𝜽,ϕ(𝒟)=𝐱𝒟𝜽,ϕ(𝐱) (2.13)

The individual-datapoint ELBO, and its gradient 𝜽,ϕ𝜽,ϕ(𝐱) is, in general, intractable. However, good unbiased estimators ~𝜽,ϕ𝜽,ϕ(𝐱) exist, as we will show, such that we can still perform minibatch SGD.

Unbiased gradients of the ELBO w.r.t. the generative model parameters are simple to obtain:

𝜽𝜽,ϕ(𝐱) =𝜽𝔼qϕ(𝐳|𝐱)[logp𝜽(𝐱,𝐳)-logqϕ(𝐳|𝐱)] (2.14)
=𝔼qϕ(𝐳|𝐱)[𝜽(logp𝜽(𝐱,𝐳)-logqϕ(𝐳|𝐱))] (2.15)
𝜽(logp𝜽(𝐱,𝐳)-logqϕ(𝐳|𝐱)) (2.16)
=𝜽(logp𝜽(𝐱,𝐳)) (2.17)

The last line (eq. (2.17)) is a simple Monte Carlo estimator of the second line (eq. (2.15)), where 𝐳 in (2.17), the last line, is a random sample from qϕ(𝐳|𝐱).

Unbiased gradients w.r.t. the variational parameters ϕ are more difficult to obtain, since the ELBO’s expectation is taken w.r.t. the distribution qϕ(𝐳|𝐱), which is a function of ϕ. I.e., in general:

ϕ𝜽,ϕ(𝐱) =ϕ𝔼qϕ(𝐳|𝐱)[logp𝜽(𝐱,𝐳)-logqϕ(𝐳|𝐱)] (2.18)
𝔼qϕ(𝐳|𝐱)[ϕ(logp𝜽(𝐱,𝐳)-logqϕ(𝐳|𝐱))] (2.19)

In case of continuous latent variables, we can use a reparameterization trick for computing unbiased estimated of 𝜽,ϕ𝜽,ϕ(𝐱), as we will now discuss. This stochastic estimate allows us to optimize the ELBO using SGD; see algorithm 1. See section 2.9.1 for a discussion of variational methods for discrete latent variables.

\SetKwInOutInputinput \SetKwInOutOutputoutput \KwData
     𝒟: Dataset
     qϕ(𝐳|𝐱): Inference model
     p𝜽(𝐱,𝐳): Generative model
     𝜽,ϕ: Learned parameters
\BlankLine(𝜽,ϕ)Initialize parameters
\WhileSGD not converged 𝒟 (Random minibatch of data)
ϵp(ϵ) (Random noise for every datapoint in )
Compute ~𝜽,ϕ(,ϵ) and its gradients 𝜽,ϕ~𝜽,ϕ(,ϵ)
Update 𝜽 and ϕ using SGD optimizer
\algorithmcfname 1 Stochastic optimization of the ELBO. Since noise originates from both the minibatch sampling and sampling of p(ϵ), this is a doubly stochastic optimization procedure. We also refer to this procedure as the Auto-Encoding Variational Bayes (AEVB) algorithm.

2.4 Reparameterization Trick

Figure 2.3: Illustration of the reparameterization trick. The variational parameters ϕ affect the objective f through the random variable 𝐳qϕ(𝐳|𝐱). We wish to compute gradients ϕf to optimize the objective with SGD. In the original form (left), we cannot differentiate f w.r.t. ϕ, because we cannot directly backpropagate gradients through the random variable 𝐳. We can ’externalize’ the randomness in 𝐳 by re-parameterizing the variable as a deterministic and differentiable function of ϕ, 𝐱, and a newly introduced random variable ϵ. This allows us to ’backprop through 𝐳’, and compute gradients ϕf.

For continuous latent variables and a differentiable encoder and generative model, the ELBO can be straightforwardly differentiated w.r.t. both ϕ and 𝜽 through a change of variables, also called the reparameterization trick (Kingma and Welling (2013) and Rezende et al. (2014)).

2.4.1 Change of variables

First, we express the random variable 𝐳qϕ(𝐳|𝐱) as some differentiable (and invertible) transformation of another random variable ϵ, given 𝐳 and ϕ:

𝐳=𝐠(ϵ,ϕ,𝐱) (2.20)

where the distribution of random variable ϵ is independent of 𝐱 or ϕ.

2.4.2 Gradient of expectation under change of variable

Given such a change of variable, expectations can be rewritten in terms of ϵ,

𝔼qϕ(𝐳|𝐱)[f(𝐳)]=𝔼p(ϵ)[f(𝐳)] (2.21)

where 𝐳=𝐠(ϵ,ϕ,𝐱). and the expectation and gradient operators become commutative, and we can form a simple Monte Carlo estimator:

ϕ𝔼qϕ(𝐳|𝐱)[f(𝐳)] =ϕ𝔼p(ϵ)[f(𝐳)] (2.22)
=𝔼p(ϵ)[ϕf(𝐳)] (2.23)
ϕf(𝐳) (2.24)

where in the last line, 𝐳=𝐠(ϕ,𝐱,ϵ) with random noise sample ϵp(ϵ). See figure 2.3 for an illustration and further clarification.

2.4.3 Gradient of ELBO

Under the reparameterization, we can replace an expectation w.r.t. qϕ(𝐳|𝐱) with one w.r.t. p(ϵ). The ELBO can be rewritten as:

𝜽,ϕ(𝐱) =𝔼qϕ(𝐳|𝐱)[logp𝜽(𝐱,𝐳)-logqϕ(𝐳|𝐱)] (2.25)
=𝔼p(ϵ)[logp𝜽(𝐱,𝐳)-logqϕ(𝐳|𝐱)] (2.26)

where 𝐳=g(ϵ,ϕ,𝐱).

As a result we can form a simple Monte Carlo estimator ~𝜽,ϕ(𝐱) of the individual-datapoint ELBO where we use a single noise sample ϵ from p(ϵ):

ϵ p(ϵ) (2.27)
𝐳 =𝐠(ϕ,𝐱,ϵ) (2.28)
~𝜽,ϕ(𝐱) =logp𝜽(𝐱,𝐳)-logqϕ(𝐳|𝐱) (2.29)

This series of operations can be expressed as a symbolic graph in software like Tensorflow, and effortlessly differentiated w.r.t. the parameters 𝜽 and ϕ. The resulting gradient ϕ~𝜽,ϕ(𝐱) is used to optimize the ELBO using minibatch SGD. See algorithm 1. This algorithm was originally referred to as the Auto-Encoding Variational Bayes (AEVB) algorithm by  Kingma and Welling (2013). More generally, the reparameterized ELBO estimator was referred to as the Stochastic Gradient Variational Bayes (SGVB) estimator. This estimator can also be used to estimate a posterior over the model parameters, as explained in the appendix of  (Kingma and Welling, 2013). Unbiasedness

This gradient is an unbiased estimator of the exact single-datapoint ELBO gradient; when averaged over noise ϵp(ϵ), this gradient equals the single-datapoint ELBO gradient:

𝔼p(ϵ)[𝜽,ϕ~𝜽,ϕ(𝐱;ϵ)] =𝔼p(ϵ)[𝜽,ϕ(logp𝜽(𝐱,𝐳)-logqϕ(𝐳|𝐱))] (2.30)
=𝜽,ϕ(𝔼p(ϵ)[logp𝜽(𝐱,𝐳)-logqϕ(𝐳|𝐱)]) (2.31)
=𝜽,ϕ𝜽,ϕ(𝐱) (2.32)

2.4.4 Computation of logqϕ(𝐳|𝐱)

Computation of the (estimator of) the ELBO requires computation of the density logqϕ(𝐳|𝐱), given a value of 𝐱, and given a value of 𝐳 or equivalently ϵ. This log-density is a simple computation, as long as we choose the right transformation 𝐠().

Note that we typically know the density p(ϵ), since this is the density of the chosen noise distribution. As long as 𝐠(.) is an invertible function, the densities of ϵ and 𝐳 are related as:

logqϕ(𝐳|𝐱)=logp(ϵ)-logdϕ(𝐱,ϵ) (2.33)

where the second term is the log of the absolute value of the determinant of the Jacobian matrix (𝐳/ϵ):

logdϕ(𝐱,ϵ)=log|det(𝐳ϵ)| (2.34)

We call this the log-determinant of the transformation from ϵ to 𝐳. We use the notation logdϕ(𝐱,ϵ) to make explicit that this log-determinant, similar to 𝐠(), is a function of 𝐱, ϵ and ϕ. The Jacobian matrix contains all first derivatives of the transformation from ϵ to 𝐳:

𝐳ϵ=(z1,,zk)(ϵ1,,ϵk)=(z1ϵ1z1ϵkzkϵ1zkϵk) (2.35)

As we will show, we can build very flexible transformations 𝐠() for which logdϕ(𝐱,ϵ) is simple to compute, resulting in highly flexible inference models qϕ(𝐳|𝐱).

2.5 Factorized Gaussian posteriors

A common choice is a simple factorized Gaussian encoder

(𝝁,log𝝈) =EncoderNeuralNetϕ(𝐱) (2.36)
qϕ(𝐳|𝐱) =iqϕ(zi|𝐱)=i𝒩(zi;μi,σi2) (2.37)

where 𝒩(zi;μi,σi2) is the PDF of the univariate Gaussian distribution. After reparameterization, we can write:

ϵ 𝒩(0,𝐈) (2.38)
(𝝁,log𝝈) =EncoderNeuralNetϕ(𝐱) (2.39)
𝐳 =𝝁+𝝈ϵ (2.40)

where is the element-wise product. The Jacobian of the transformation from ϵ to 𝐳 is:

𝐳ϵ=diag(𝝈), (2.41)

i.e. a diagonal matrix with the elements of 𝝈 on the diagonal. The determinant of a diagonal (or more generally, triangular) matrix is the product of its diagonal terms. The log determinant of the Jacobian is therefore:

logdϕ(𝐱,ϵ)=log|det(𝐳ϵ)|=ilogσi (2.42)

and the posterior density is:

logqϕ(𝐳|𝐱) =logp(ϵ)-logdϕ(𝐱,ϵ) (2.43)
=ilog𝒩(ϵi;0,1)-logσi (2.44)

when z=g(ϵ,ϕ,𝐱).

2.5.1 Full-covariance Gaussian posterior

The factorized Gaussian posterior can be extended to a Gaussian with full covariance:

qϕ(𝐳|𝐱)=𝒩(𝐳;𝝁,𝚺) (2.45)

A reparameterization of this distribution is given by:

ϵ 𝒩(0,𝐈) (2.46)
𝐳 =𝝁+𝐋ϵ (2.47)

where 𝐋 is a lower (or upper) triangular matrix, with non-zero entries on the diagonal. The off-diagonal elements define the correlations (covariances) of the elements in 𝐳.

The reason for this parameterization of the full-covariance Gaussian, is that the Jacobian determinant is remarkably simple. The Jacobian in this case is trivial: 𝐳ϵ=𝐋. Note that the determinant of a triangular matrix is the product of its diagonal elements. Therefore, in this parameterization:

log|det(𝐳ϵ)| =ilog|Lii| (2.48)

And the log-density of the posterior is:

logqϕ(𝐳|𝐱) =logp(ϵ)-ilog|Lii| (2.49)

This parameterization corresponds to the Cholesky decomposition 𝚺=𝐋𝐋T of the covariance of 𝐳:

𝚺 =𝔼[(𝐳-𝔼[𝐳|])(𝐳-𝔼[𝐳|])T] (2.50)
=𝔼[𝐋ϵ(𝐋ϵ)T]=𝐋𝔼[ϵϵT]𝐋T (2.51)
=𝐋𝐋T (2.52)

Note that 𝔼[ϵϵT]=𝐈 since ϵ𝒩(0,𝐈).

One way to build a matrix 𝐋 with the desired properties, namely triangularity and non-zero diagonal entries, is by constructing it as follows:

(𝝁,log𝝈,𝐋)EncoderNeuralNetϕ(𝐱) (2.53)
𝐋𝐋mask𝐋+diag(𝝈) (2.54)

and then proceeding with 𝐳=𝝁+𝐋ϵ as described above. 𝐋mask is a masking matrix with zeros on and above the diagonal, and ones below the diagonal. Note that due to the masking 𝐋, the Jacobian matrix (𝐳/ϵ) is triangular with the values of 𝝈 on the diagonal. The log-determinant is therefore identical to the factorized Gaussian case:

log|det(𝐳ϵ)|=ilogσi (2.55)

More generally, we can replace 𝐳=𝐋ϵ+𝝁 with a chain of (differentiable and nonlinear) transformations; as long as the Jacobian of each step in the chain is triangular with non-zero diagonal entries, the log determinant remains simple. This principle is used by inverse autoregressive flow (IAF) as explored by  Kingma et al. (2016) and discussed in chapter  3.

\SetKwFunctionEncoderNNEncoderNN \SetKwFunctionAutoregressiveNNAutoregressiveNN \SetKwFunctionsigmoidsigmoid \SetKwFunctiondosumsum \SetKwFunctionsigmoidsigmoid \SetKwInOutInputinput \SetKwInOutOutputoutput \KwData
     𝐱: a datapoint, and optionally other conditioning information
     ϵ: a random sample from p(ϵ)=𝒩(0,𝐈)
     𝜽: Generative model parameters
     ϕ: Inference model parameters
     qϕ(𝐳|𝐱): Inference model
     p𝜽(𝐱,𝐳): Generative model
     ~: unbiased estimate of the single-datapoint ELBO 𝜽,ϕ(𝐱)
logqz-\dosum(12(ϵ2+log(2π)+log𝝈)) \Comment=qϕ(𝐳|𝐱)
logpx\dosum(𝐱log𝐩+(1-𝐱)log(1-𝐩)) \Comment=p𝜽(𝐱|𝐳)
\algorithmcfname 2 Computation of unbiased estimate of single-datapoint ELBO for example VAE with a full-covariance Gaussian inference model and a factorized Bernoulli generative model. 𝐋mask is a masking matrix with zeros on and above the diagonal, and ones below the diagonal. Note that 𝐋 must be a triangular matrix with ones on the diagonal.

2.6 Estimation of the Marginal Likelihood

After training a VAE, we can estimate the probability of data under the model using an importance sampling technique, as originally proposed by  Rezende et al. (2014). The marginal likelhood of a datapoint can be written as:

logp𝜽(𝐱)=log𝔼qϕ(𝐳|𝐱)[p𝜽(𝐱,𝐳)/qϕ(𝐳|𝐱)] (2.56)

Taking random samples from qϕ(𝐳|𝐱), a Monte Carlo estimator of this is:

logp𝜽(𝐱)log1Ll=1Lp𝜽(𝐱,𝐳(l))/qϕ(𝐳(l)|𝐱) (2.57)

where each 𝐳(l)qϕ(𝐳|𝐱) is a random sample from the inference model. By making L large, the approximation becomes a better estimate of the marginal likelihood, and in fact since this is a Monte Carlo estimator, for L this converges to the actual marginal likelihood.

Notice that when setting L=1, this equals the ELBO estimator of the VAE. We can also use the estimator of 2.57 as our objective function; this is the objective used in importance weighted autoencoders (Burda et al., 2015) (IWAE). In that paper, it was also shown that the objective has increasing tightness for increasing value of L. It was later shown by  Cremer and Duvenaud (2017) that the IWAE objective can be re-interpreted as an ELBO objective with a particular inference model. The downside of these approaches for optimizing a tighter bound, is that importance weighted estimates have notoriously bad scaling properties to high-dimensional latent spaces.

2.7 Marginal Likelihood and ELBO as KL Divergences

Figure 2.4: The maximum likelihood (ML) objective can be viewed as the minimization of DKL(q𝒟,ϕ(𝐱)||p𝜽(𝐱)), while the ELBO objective can be viewed as the minimization of DKL(q𝒟,ϕ(𝐱,𝐳)||p𝜽(𝐱,𝐳), which upper bounds DKL(q𝒟,ϕ(𝐱)||p𝜽(𝐱)). If a perfect fit is not possible, then p𝜽(𝐱,𝐳) will typically end up with higher variance than q𝒟,ϕ(𝐱,𝐳), because of the direction of the KL divergence.

One way to improve the potential tightness of the bound, is increasing the flexibility of the generative model. This can be understood through a connection between the ELBO and the KL divergence.

With i.i.d. dataset 𝒟 of size N𝒟, the maximum likelihood criterion is:

logp𝜽(𝒟) =1N𝒟𝐱𝒟logp𝜽(𝐱) (2.58)
=𝔼q𝒟(𝐱)[logp𝜽(𝐱)] (2.59)

where q𝒟(𝐱) is the empirical (data) distribution, which is a mixture distribution:

q𝒟(𝐱)=1Ni=1Nq𝒟(i)(𝐱) (2.60)

where each component q𝒟(i)(𝐱) typically corresponds to a Dirac delta distribution centered at value 𝐱(i) in case of continuous data, or a discrete distribution with all probability mass concentrated at value 𝐱(i) in case of discrete data. The Kullback Leibler (KL) divergence between the data and model distributions, can be rewritten as the negative log-likelihood, plus a constant:

DKL(q𝒟(𝐱)||p𝜽(𝐱)) =-𝔼q𝒟(𝐱)[logp𝜽(𝐱)]+𝔼q𝒟(𝐱)[logq𝒟(𝐱)] (2.61)
=-logp𝜽(𝒟)+constant (2.62)

where constant=-(q𝒟(𝐱)). So minimization of the KL divergence above is equivalent to maximization of the data log-likelihood logp𝜽(𝒟).

Taking the combination of the empirical data distribution q𝒟(𝐱) and the inference model, we get a joint distribution over data 𝐱 and latent variables 𝐳: q𝒟,ϕ(𝐱,𝐳)=q𝒟(𝐱)q(𝐳|𝐱).

The KL divergence of this joint distribution, can be rewritten as the negative ELBO, plus a constant:

DKL(q𝒟,ϕ(𝐱,𝐳)||p𝜽(𝐱,𝐳)) (2.63)
=-𝔼q𝒟(𝐱)[𝔼qϕ(𝐳|𝐱)[logp𝜽(𝐱,𝐳)-logqϕ(𝐳|𝐱)]-logq𝒟(𝐱)] (2.64)
=-𝜽,ϕ(𝒟)+constant (2.65)

where constant=-(q𝒟(𝐱)). So maximization of the ELBO, is equivalent to the minimization of this KL divergence DKL(q𝒟,ϕ(𝐱,𝐳)||p𝜽(𝐱,𝐳)). The relationship between the ML and ELBO objectives can be summarized in the following simple equation:

DKL(q𝒟,ϕ(𝐱,𝐳)||p𝜽(𝐱,𝐳)) (2.66)
=DKL(q𝒟(𝐱)||p𝜽(𝐱))+𝔼q𝒟(𝐱)[DKL(q𝒟,ϕ(𝐳|𝐱)||p𝜽(𝐳|𝐱))] (2.67)
DKL(q𝒟(𝐱)||p𝜽(𝐱)) (2.68)

One additional perspective is that the ELBO can be viewed as a maximum likelihood objective in an augmented space. For some fixed choice of encoder qϕ(𝐳|𝐱), we can view the joint distribution p𝜽(𝐱,𝐳) as an augmented empirical distribution over the original data 𝐱 and (stochastic) auxiliary features 𝐳 associated with each datapoint. The model p𝜽(𝐱,𝐳) then defines a joint model over the original data, and the auxiliary features.

2.8 Challenges

2.8.1 Optimization issues

In our work, consistent with findings in (Bowman et al., 2015) and (Kaae Sønderby et al., 2016), we found that stochastic optimization with the unmodified lower bound objective can gets stuck in an undesirable stable equilibrium. At the start of training, the likelihood term logp(𝐱|𝐳) is relatively weak, such that an initially attractive state is where q(𝐳|𝐱)p(𝐳). In this state, encoder gradients have a relatively low signal-to-noise ratio, resulting in a stable equilibrium from which it is difficult to escape. The solution proposed in (Bowman et al., 2015) and (Kaae Sønderby et al., 2016) is to use an optimization schedule where the weights of the latent cost DKL(q(𝐳|𝐱)||p(𝐳)) is slowly annealed from 0 to 1 over many epochs.

An alternative proposed in (Kingma et al., 2016) is the method of free bits: a modification of the ELBO objective, that ensures that on average, a certain minimum number of bits of information are encoded per latent variable, or per group of latent variables.

The latent dimensions are divided into the K groups. We then use the following minibatch objective, which ensures that using less than λ nats of information per subset j (on average per minibatch ) is not advantageous:

~λ =𝔼𝐱[𝔼q(𝐳|𝐱)[logp(𝐱|𝐳)]] (2.69)
-j=1Kmaximum(λ,𝔼𝐱[DKL(q(zj|𝐱)||p(zj))] (2.70)

Since increasing the latent information is generally advantageous for the first (unaffected) term of the objective (often called the negative reconstruction error), this results in 𝔼𝐱[DKL(q(𝐳j|𝐱)||p(𝐳j))]λ for all j, in practice. In  Kingma et al. (2016) it was found that the method worked well for a fairly wide range of values (λ[0.125,0.25,0.5,1,2]), resulting in significant improvement in the resulting log-likelihood on a benchmark result.

2.8.2 Blurriness of generative model

In section 2.7 we saw that optimizing the ELBO is equivalent to minimizing DKL(q𝒟,ϕ(𝐱,𝐳)||p𝜽(𝐱,𝐳)). If a perfect fit between q𝒟,ϕ(𝐱,𝐳) and p𝜽(𝐱,𝐳) is not possible, then the variance of p𝜽(𝐱,𝐳) and p𝜽(𝐱) will end up larger than the variance q𝒟,ϕ(𝐱,𝐳) and the data q𝒟,ϕ(𝐱). This is due to the direction of the KL divergence; if there are values of (𝐱,𝐳) which are likely under q𝒟,ϕ but not under p𝜽, the term 𝔼q𝒟,ϕ(𝐱,𝐳)[logp𝜽(𝐱,𝐳)] will go to infinity. However, the reverse is not true: the generative model is only slightly penalized when putting probability mass on values of (𝐱,𝐳) with no support under q𝒟,ϕ.

Issues with ’blurriness’ can thus can be countered by choosing a sufficiently flexible inference model, and/or a sufficiently flexible generative model. In the next two chapters we will discuss techniques for constructing flexible inference models and flexible generative models.

2.9 Related prior and concurrent work

Here we briefly discuss relevant literature prior to and concurrent with work in  (Kingma and Welling, 2013).

The wake-sleep algorithm (Hinton et al., 1995) is, to the best of our knowledge, the only other on-line learning method in the literature that is applicable to the same general class of continuous latent variable models. Like our method, the wake-sleep algorithm employs a recognition model that approximates the true posterior. A drawback of the wake-sleep algorithm is that it requires a concurrent optimization of two objective functions, which together do not correspond to optimization of (a bound of) the marginal likelihood. An advantage of wake-sleep is that it also applies to models with discrete latent variables. Wake-Sleep has the same computational complexity as AEVB per datapoint.

Stochastic variational inference Hoffman et al. (2013) has received increasing interest. Blei et al. (2012) introduced a control variate schemes to reduce the variance of the score function gradient estimator, and applied the estimator to exponential family approximations of the posterior. In (Ranganath et al., 2014) some general methods, e.g. a control variate scheme, were introduced for reducing the variance of the original gradient estimator. In Salimans and Knowles (2013), a similar reparameterization as in this work was used in an efficient version of a stochastic variational inference algorithm for learning the natural parameters of exponential-family approximating distributions.

In  Graves (2011) a similar estimator of the gradient is introduced; however the estimator of the variance is not an unbiased estimator w.r.t. the ELBO gradient.

The VAE training algorithm exposes a connection between directed probabilistic models (trained with a variational objective) and auto-encoders. A connection between linear auto-encoders and a certain class of generative linear-Gaussian models has long been known. In  (Roweis, 1998) it was shown that PCA corresponds to the maximum-likelihood (ML) solution of a special case of the linear-Gaussian model with a prior p(𝐳)=𝒩(0,𝐈) and a conditional distribution p(𝐱|𝐳)=𝒩(𝐱;𝐖𝐳,ϵ𝐈), specifically the case with infinitesimally small ϵ. In this limiting case, the posterior over the latent variables p(𝐳|𝐱) is a Dirac delta distribution: p(𝐳|𝐱)=δ(𝐳-𝐖𝐱) where 𝐖=(𝐖T𝐖)-1𝐖T, i.e., given 𝐖 and 𝐱 there is no uncertainty about latent variable 𝐳.  Roweis (1998) then introduces an EM-type approach to learning 𝐖. Much earlier work (Bourlard and Kamp, 1988) showed that optimization of linear autoencoders retrieves the principal components of data, from which it follows that learning linear autoencoders correspond to a specific method for learning the above case of linear-Gaussian probabilistic model of the data. However, this approach using linear autoencoders is limited to linear-Gaussian models, while our approach applies to a much broader class of continuous latent variable models.

When using neural networks for both the inference model and the generative model, the combination forms a type of autoencoder (Goodfellow et al., 2016) with a specific regularization term:

~𝜽,ϕ(𝐱;ϵ)=logp𝜽(𝐱|𝐳)Negative reconstruction error+logp𝜽(𝐳)-logqϕ(𝐳|𝐱)Regularization terms (2.71)

In an analysis of plain autoencoders (Vincent et al., 2010) it was shown that the training criterion of unregularized autoencoders corresponds to maximization of a lower bound (see the infomax principle (Linsker, 1989)) of the mutual information between input X and latent representation Z. Maximizing (w.r.t. parameters) of the mutual information is equivalent to maximizing the conditional entropy, which is lower bounded by the expected log-likelihood of the data under the autoencoding model (Vincent et al., 2010), i.e. the negative reconstruction error. However, it is well known that this reconstruction criterion is in itself not sufficient for learning useful representations Bengio et al. (2013). Regularization techniques have been proposed to make autoencoders learn useful representations, such as denoising, contractive and sparse autoencoder variants  (Bengio et al., 2013). The VAE objective contains a regularization term dictated by the variational bound, lacking the usual nuisance regularization hyper-parameter required to learn useful representations. Related are also encoder-decoder architectures such as the predictive sparse decomposition (PSD) Kavukcuoglu et al. (2008), from which we drew some inspiration. Also relevant are the recently introduced Generative Stochastic Networks Bengio and Thibodeau-Laufer (2013) where noisy auto-encoders learn the transition operator of a Markov chain that samples from the data distribution. In Salakhutdinov and Larochelle (2010) a recognition model was employed for efficient learning with Deep Boltzmann Machines. These methods are targeted at either unnormalized models (i.e. undirected models like Boltzmann machines) or limited to sparse coding models, in contrast to our proposed algorithm for learning a general class of directed probabilistic models.

The proposed DARN method  (Gregor et al., 2013), also learns a directed probabilistic model using an auto-encoding structure, however their method applies to binary latent variables. In concurrent work,  Rezende et al. (2014) also make the connection between auto-encoders, directed probabilistic models and stochastic variational inference using the reparameterization trick we describe in  (Kingma and Welling, 2013). Their work was developed independently of ours and provides an additional perspective on the VAE.

2.9.1 Score function estimator

An alternative unbiased stochastic gradient estimator of the ELBO is the score function estimator (Kleijnen and Rubinstein, 1996):

ϕ𝔼qϕ(𝐳|𝐱)[f(𝐳)] =𝔼qϕ(𝐳|𝐱)[f(𝐳)ϕlogqϕ(𝐳|𝐱)] (2.72)
f(𝐳)ϕlogqϕ(𝐳|𝐱) (2.73)

where 𝐳qϕ(𝐳|𝐱).

This is also known as the likelihood ratio estimator (Glynn, 1990; Fu, 2006) and the REINFORCE gradient estimator  (Williams, 1992). The method has been successfully used in various methods like neural variational inference (Mnih and Gregor, 2014), black-box variational inference (Ranganath et al., 2014), automated variational inference (Wingate and Weber, 2013), and variational stochastic search (Paisley et al., 2012), often in combination with various novel control variate techniques (Glasserman, 2013) for variance reduction. An advantage of the likelihood ratio estimator is its applicability to discrete latent variables.

We do not directly compare to these techniques, since we concern ourselves with continuous latent variables, in which case we have (computationally cheap) access to gradient information 𝐳logp𝜽(𝐱,𝐳), courtesy of the backpropagation algorithm. The score function estimator solely uses the scalar-valued logp𝜽(𝐱,𝐳), ignoring the gradient information about the function logp𝜽(𝐱,𝐳), generally leading to much higher variance. This has been experimentally confirmed by e.g. (Kucukelbir et al., 2016), which finds that a sophisticated score function estimator requires two orders of magnitude more samples to arrive at the same variance as a reparameterization based estimator.

The difference in efficiency of our proposed reparameterization-based gradient estimator, compared to score function estimators, can intuitively be understood as removing an information bottleneck during the computation of gradients of the ELBO w.r.t. ϕ from current parameters 𝜽: in the latter case, this computation is bottlenecked by the scalar value logp𝜽(𝐱,𝐳), while in the former case it is bottlenecked by the much wider vector 𝐳logp𝜽(𝐱,𝐳).

Chapter 3 Beyond Gaussian Posteriors

In this chapter we discuss techniques for improving the flexibility of the inference model qϕ(𝐳|𝐱). Increasing the flexibility and accuracy of the inference model wel generally improve the tightness of the variational bound (ELBO), bringing it closer the true marginal likelihood objective.

3.1 Requirements for Computational Tractability

Requirements for the inference model, in order to be able to efficiently optimize the ELBO, are that it is (1) computationally efficient to compute and differentiate its probability density qϕ(𝐳|𝐱), and (2) computationally efficient to sample from, since both these operations need to be performed for each datapoint in a minibatch at every iteration of optimization. If 𝐳 is high-dimensional and we want to make efficient use of parallel computational resources like GPUs, then parallelizability of these operations across dimensions of 𝐳 is a large factor towards efficiency. This requirement restricts the class of approximate posteriors q(𝐳|𝐱) that are practical to use. In practice this often leads to the use of simple Gaussian posteriors. However, as explained, we also need the density q(𝐳|𝐱) to be sufficiently flexible to match the true posterior p(𝐳|𝐱), in order to arrive at a tight bound.

3.2 Improving the Flexibility of Inference Models

Here we will review two general techniques for improving the flexibility of approximate posteriors in the context of gradient-based variational inference: auxiliary latent variables, and normalizing flows.

3.2.1 Auxiliary Latent Variables

One method for improving the flexibility of inference models, is through the introduction of auxiliary latent variables, as explored by  Salimans et al. (2015)(Ranganath et al., 2015) and  Maaløe et al. (2016).

The methods work by augmenting both the inference model and the generative model with a continuous auxiliary variable, here denoted with 𝐮.

The inference model defines a distribution over both 𝐮 and and 𝐳, which can, for example, factorize as:

qϕ(𝐮,𝐳|𝐱)=qϕ(𝐮|𝐱)qϕ(𝐳|𝐮,𝐱) (3.1)

This inference model augmented with 𝐮, implicitly defines a potentially powerful implicit marginal distribution:

qϕ(𝐳|𝐱) =qϕ(𝐮,𝐳|𝐱)𝑑𝐮 (3.2)

Likewise, we introduce an additional distribution in the generative model: such that our generative model is now over the joint distribution p𝜽(𝐱,𝐳,𝐮). This can, for example, factorize as:

p𝜽(𝐱,𝐳,𝐮)=p𝜽(𝐮|𝐱,𝐳)p𝜽(𝐱,𝐳) (3.3)

The ELBO objective with auxiliary variables, given empirical distribution q𝒟(𝐱), is then (again) equivalent to minimization of a KL divergence:

𝔼q𝒟(𝐱)[𝔼qϕ(𝐮,𝐳|𝐱)[logp𝜽(𝐱,𝐳,𝐮)-logqϕ(𝐮,𝐳|𝐱)]] (3.4)
=DKL(q𝒟,ϕ(𝐱,𝐳,𝐮)||p𝜽(𝐱,𝐳,𝐮) (3.5)

Recall that maximization of the original ELBO objective, without auxiliary variables, is equivalent to minimization of DKL(q𝒟,ϕ(𝐱,𝐳)||p𝜽(𝐱,𝐳)), and that maximization of the expected marginal likelihood is equivalent to minimization of DKL(q𝒟,ϕ(𝐱)||p𝜽(𝐱)).

We can gain additional understanding into the relationship between the objectives, through the following equation:

DKL(q𝒟,ϕ(𝐱,𝐳,𝐮)||p𝜽(𝐱,𝐳,𝐮)) (3.6)
DKL(q𝒟,ϕ(𝐱,𝐳)||p𝜽(𝐱,𝐳)) (3.7)
=DKL(q𝒟(𝐱)||p𝜽(𝐱))+𝔼q𝒟(𝐱)[DKL(q𝒟,ϕ(𝐳|𝐱)||p𝜽(𝐳|𝐱))] (3.8)
DKL(q𝒟(𝐱)||p𝜽(𝐱)) (3.9)

From this equation it can be seen that in principle, the ELBO gets worse by augmenting the VAE with an auxiliary variable 𝐮:


But because we now have access to a much more flexible class of inference distributions, qϕ(𝐳|𝐱), the original ELBO objective DKL(q𝒟,ϕ(𝐱,𝐳)||p𝜽(𝐱,𝐳)) can improve, potentially outweighing the additional cost of
𝔼q𝒟(𝐱,𝐳)[DKL(q𝒟,ϕ(𝐮|𝐱,𝐳)||p𝜽(𝐮|𝐱,𝐳))]. In  Salimans et al. (2015),  (Ranganath et al., 2015) and  Maaløe et al. (2016) it was shown that auxiliary variables can indeed lead to significant improvements in models.

The introduction of auxiliary latent variables in the graph, are a special case of VAEs with multiple layers of latent variables, which are discussed in the chapter 4. In our experiment with CIFAR-10, we make use of multiple layers of stochastic variables.

3.2.2 Normalizing Flows

An alternative approach towards flexible approximate posteriors is Normalizing Flow (NF), introduced by (Rezende and Mohamed, 2015) in the context of stochastic gradient variational inference. In normalizing flows, we build flexible posterior distributions through an iterative procedure. The general idea is to start off with an initial random variable with a relatively simple distribution with a known (and computationally cheap) probability density function, and then apply a chain of invertible parameterized transformations 𝐟t, such that the last iterate 𝐳T has a more flexible distribution11 1 where 𝐱 is the context, such as the value of the datapoint. In case of models with multiple levels of latent variables, the context also includes the value of the previously sampled latent variables.:

ϵ0p(ϵ) (3.10)
for t=1T: (3.11)
ϵt=𝐟t(ϵt-1,𝐱) (3.12)
𝐳=ϵT (3.13)

The Jacobian of the transformation factorizes:

d𝐳dϵ0=t=1Tdϵtdϵt-1 (3.14)

So its determinant also factorizes as well:

log|det(d𝐳dϵ0)|=t=1Tlog|det(dϵtdϵt-1)| (3.15)

As long as the Jacobian determinant of each of the transformations 𝐟t can be computed, we can still compute the probability density function of the last iterate:

logqϕ(𝐳|𝐱)=logp(ϵ0)-t=1Tlogdet|dϵtdϵt-1| (3.16)

Rezende and Mohamed (2015) experimented with a transformation of the form:

𝐟t(ϵt-1)=ϵt-1+𝐮h(𝐰Tϵt-1+b) (3.17)

where 𝐮 and 𝐰 are vectors, 𝐰T is 𝐰 transposed, b is a scalar and h(.) is a nonlinearity, such that 𝐮h(𝐰T𝐳t-1+b) can be interpreted as a MLP with a bottleneck hidden layer with a single unit. This flow does not scale well to a high-dimensional latent space: since information goes through the single bottleneck, a long chain of transformations is required to capture high-dimensional dependencies.

3.3 Inverse Autoregressive Transformations

In order to find a type of normalizing flow that scales well to a high-dimensional space,  Kingma et al. (2016) consider Gaussian versions of autoregressive autoencoders such as MADE  (Germain et al., 2015) and the PixelCNN (van den Oord et al., 2016b). Let 𝐲 be a variable modeled by such a model, with some chosen ordering on its elements 𝐲={yi}i=1D. We will use [𝝁(𝐲),𝝈(𝐲)] to denote the function of the vector 𝐲, to the vectors 𝝁 and 𝝈. Due to the autoregressive structure, the Jacobian matrix is triangular with zeros on the diagonal: [𝝁i,𝝈i]/𝐲j=[0,0] for ji. The elements [μi(𝐲1:i-1),σi(𝐲1:i-1)] are the predicted mean and standard deviation of the i-th element of 𝐲, which are functions of only the previous elements in 𝐲.

Sampling from such a model is a sequential transformation from a noise vector ϵ𝒩(0,𝐈) to the corresponding vector 𝐲: y0=μ0+σ0ϵ0, and for i>0, yi=μi(𝐲1:i-1)+σi(𝐲1:i-1)ϵi. The computation involved in this transformation is clearly proportional to the dimensionality D. Since variational inference requires sampling from the posterior, such models are not interesting for direct use in such applications. However, the inverse transformation is interesting for normalizing flows. As long as we have σi>0 for all i, the sampling transformation above is a one-to-one transformation, and can be inverted:

ϵi=yi-μi(𝐲1:i-1)σi(𝐲1:i-1) (3.18)

Kingma et al. (2016) make two key observations, important for normalizing flows. The first is that this inverse transformation can be parallelized, since (in case of autoregressive autoencoders) computations of the individual elements ϵi do not depend on each other. The vectorized transformation is:

ϵ=(𝐲-𝝁(𝐲))/𝝈(𝐲) (3.19)

where the subtraction and division are element-wise.

The second key observation, is that this inverse autoregressive operation has a simple Jacobian determinant. Note that due to the autoregressive structure, [μi,σi]/yj=[0,0] for ji. As a result, the transformation has a lower triangular Jacobian (ϵi/yj=0 for j>i), with a simple diagonal: ϵi/yi=1σi. The determinant of a lower triangular matrix equals the product of the diagonal terms. As a result, the log-determinant of the Jacobian of the transformation is remarkably simple and straightforward to compute:

logdet|dϵd𝐲|=i=1D-logσi(𝐲) (3.20)

The combination of model flexibility, parallelizability across dimensions, and simple log-determinant, makes this transformation interesting for use as a normalizing flow over high-dimensional latent space.

For the following section we will use a slightly different, but equivalently flexible, transformation of the type:

ϵ=𝝈(𝐲)𝐲+𝝁(𝐲) (3.21)

With corresponding log-determinant:

logdet|dϵd𝐲|=i=1Dlogσi(𝐲) (3.22)

3.4 Inverse Autoregressive Flow (IAF)

\SetKwFunctionEncoderNNEncoderNN \SetKwFunctionAutoregressiveNNAutoregressiveNN \SetKwFunctionsigmoidsigmoid \SetKwFunctiondosumsum \SetKwFunctionsigmoidsigmoid \SetKwInOutInputinput \SetKwInOutOutputoutput \KwData
     𝐱: a datapoint, and optionally other conditioning information
     𝜽: neural network parameters
     \EncoderNN(𝐱;𝜽): encoder neural network, with additional output 𝐡
     \AutoregressiveNN[*](𝐳,𝐡;𝜽): autoregressive neural networks, with additional input 𝐡
     \dosum(.): sum over vector elements
     \sigmoid(.): element-wise sigmoid function
     𝐳: a random sample from q(𝐳|𝐱), the approximate posterior distribution
     l: the scalar value of logq(𝐳|𝐱), evaluated at sample ’𝐳\BlankLine[𝝁,𝝈,𝐡]\EncoderNN(𝐱;𝜽)
\Fort1 \KwToT [𝐦,𝐬]\AutoregressiveNN[t](𝐳,𝐡;𝜽)
\algorithmcfname 3 Pseudo-code of an approximate posterior with Inverse Autoregressive Flow (IAF)
Figure 3.1: Like other normalizing flows, drawing samples from an approximate posterior with Inverse Autoregressive Flow (IAF) (Kingma et al., 2016) starts with a distribution with tractable density, such as a Gaussian with diagonal covariance, followed by a chain of nonlinear invertible transformations of 𝐳, each with a simple Jacobian determinant. The final iterate has a flexible distribution.
(a) Prior distribution
(b) Factorized posteriors
(c) IAF posteriors
Figure 3.2: Best viewed in color. We fitted a variational auto-encoder (VAE) with a spherical Gaussian prior, and with factorized Gaussian posteriors (b) or inverse autoregressive flow (IAF) posteriors (c) to a toy dataset with four datapoints. Each colored cluster corresponds to the posterior distribution of one datapoint. IAF greatly improves the flexibility of the posterior distributions, and allows for a much better fit between the posteriors and the prior.

Kingma et al. (2016) propose inverse autoregressive flow (IAF) based on a chain of transformations that are each equivalent to an inverse autoregressive transformation of eq. (3.19) and eq. (3.21). See algorithm  3 for pseudo-code of an approximate posterior with the proposed flow. We let an initial encoder neural network output 𝝁0 and 𝝈0, in addition to an extra output 𝐡, which serves as an additional input to each subsequent step in the flow. The chain is initialized with a factorized Gaussian qϕ(𝐳0|𝐱)=𝒩(0,diag(𝝈)2):

ϵ0 𝒩(0,I) (3.23)
(𝝁0,log𝝈0,𝐡) =EncoderNeuralNet(𝐱;𝜽) (3.24)
𝐳0 =𝝁0+𝝈0ϵ0 (3.25)

IAF then consists of a chain of T of the following transformations:

(𝝁t,𝝈t) =AutoregressiveNeuralNett(ϵt-1,𝐡;𝜽) (3.26)
ϵt =𝝁t+𝝈tϵt-1 (3.27)

Each step of this flow is an inverse autoregressive transformation of the type of eq. (3.19) and eq. (3.21), and each step uses a separate autoregressive neural network. Following eq. (3.16), the density under the final iterate is:

𝐳 ϵT (3.28)
logq(𝐳|𝐱) =-i=1D(12ϵi2+12log(2π)+t=0Tlogσt,i) (3.29)

The flexibility of the distribution of the final iterate ϵT, and its ability to closely fit to the true posterior, increases with the expressivity of the autoregressive models and the depth of the chain. See figure 3.1 for an illustration of the computation.

A numerically stable version, inspired by the LSTM-type update, is where we let the autoregressive network output (𝐦t,𝐬t), two unconstrained real-valued vectors, and compute ϵt as:

(𝐦t,𝐬t) =AutoregressiveNeuralNett(ϵt-1,𝐡;𝜽) (3.30)
𝝈t =sigmoid(𝐬t) (3.31)
ϵt =𝝈tϵt-1+(1-𝝈t)𝐦t (3.32)

This version is shown in algorithm 3. Note that this is just a particular version of the update of eq. (3.27), so the simple computation of the final log-density of eq. (3.29) still applies.

It was found beneficial for results to parameterize or initialize the parameters of each AutoregressiveNeuralNett such that its outputs 𝐬t are, before optimization, sufficiently positive, such as close to +1 or +2. This leads to an initial behavior that updates ϵ only slightly with each step of IAF. Such a parameterization is known as a ’forget gate bias’ in LSTMs, as investigated by Jozefowicz et al. (2015).

It is straightforward to see that a special case of IAF with one step, and a linear autoregressive model, is the fully Gaussian posterior discussed earlier. This transforms a Gaussian variable with diagonal covariance, to one with linear dependencies, i.e. a Gaussian distribution with full covariance.

Autoregressive neural networks form a rich family of nonlinear transformations for IAF. For non-convolutional models, the family of masked autoregressive network introduced in (Germain et al., 2015) was used as the autoregressive neural networks. For CIFAR-10 experiments, which benefits more from scaling to high dimensional latent space, the family of convolutional autoregressive autoencoders introduced by  (van den Oord et al., 2016b, a) was used.

It was found that results improved when reversing the ordering of the variables after each step in the IAF chain. This is a volume-preserving transformation, so the simple form of eq. (3.29) remains unchanged.

3.5 Related work

As we explained, inverse autoregressive flow (IAF) is a member of the family of normalizing flows, first discussed in (Rezende and Mohamed, 2015) in the context of stochastic variational inference. In (Rezende and Mohamed, 2015) two specific types of flows are introduced: planar flows and radial flows. These flows are shown to be effective to problems relatively low-dimensional latent space (at most a few hundred dimensions). It is not clear, however, how to scale such flows to much higher-dimensional latent spaces, such as latent spaces of generative models of /larger images, and how planar and radial flows can leverage the topology of latent space, as is possible with IAF. Volume-conserving neural architectures were first presented in in (Deco and Brauer, 1995), as a form of nonlinear independent component analysis.

Another type of normalizing flow, introduced by (Dinh et al., 2014) (NICE), uses similar transformations as IAF. In contrast with IAF, NICE was directly applied to the observed variables in a generative model. NICE is type of transformations that updates only half of the variables 𝐳1:D/2 per step, adding a vector f(𝐳D/2+1:D) which is a neural network based function of the remaining latent variables 𝐳D/2+1:D. Such large blocks have the advantage of computationally cheap inverse transformation, and the disadvantage of typically requiring longer chains. In experiments, (Rezende and Mohamed, 2015) found that this type of transformation is generally less powerful than other types of normalizing flow, in experiments with a low-dimensional latent space. Concurrently to our work, NICE was extended to high-dimensional spaces in (Dinh et al., 2016) (Real NVP). An empirical comparison would be an interesting subject of future research.

A potentially powerful transformation is the Hamiltonian flow used in Hamiltonian Variational Inference (Salimans et al., 2015). Here, a transformation is generated by simulating the flow of a Hamiltonian system consisting of the latent variables 𝐳, and a set of auxiliary momentum variables. This type of transformation has the additional benefit that it is guided by the exact posterior distribution, and that it leaves this distribution invariant for small step sizes. Such as transformation could thus take us arbitrarily close to the exact posterior distribution if we can apply it for a sufficient number of times. In practice, however, Hamiltonian Variational Inference is very demanding computationally. Also, it requires an auxiliary variational bound to account for the auxiliary variables, which can impede progress if the bound is not sufficiently tight.

An alternative method for increasing the flexibility of variational inference, is the introduction of auxiliary latent variables  (Salimans et al., 2015; Ranganath et al., 2015; Tran et al., 2015), discussed in  3.2.1, and corresponding auxiliary inference models. Latent variable models with multiple layers of stochastic variables, such as the one used in our experiments, are often equivalent to such auxiliary-variable methods. We combine deep latent variable models with IAF in our experiments, benefiting from both techniques.

Chapter 4 Deeper Generative Models

In the previous chapter we explain advanced strategies for improving inference models. In this chapter, we review strategies for learning deeper generative model, such as inference and learning with multiple latent variables or observed variables, and techniques for improving the flexibility of the generative models p𝜽(𝐱,𝐳).

4.1 Inference and Learning with Multiple Latent Variables

The generative model p𝜽(𝐱,𝐳), and corresponding inference model qϕ(𝐳|𝐱) can be parameterized as any directed graph. Both 𝐱 and 𝐳 can be composed of multiple variables with some topological ordering. It may not be immediately obvious how to optimize such models in the VAE framework; it is, however, quite straightforward, as we will now explain.

Let 𝐳={𝐳1,,𝐳K}, and qϕ(𝐳|𝐱)=qϕ(𝐳1,,𝐳K|𝐱) where subscript corresponds with the topological ordering of each variable. Given a datapoint 𝐱, computation of the ELBO estimator consists of two steps:

  1. 1.

    Sampling 𝐳qϕ(𝐳|𝐱). In case of multiple latent variables, this means ancestral sampling the latent variables one by one, in topological ordering defined by the inference model’s directed graph. In pseudo-code, the ancestral sampling step looks like:

    for i=1K: (4.1)
    𝐳iqϕ(𝐳i|Pa(𝐳i)) (4.2)

    where Pa(𝐳i) are the parents of variable 𝐳i in the inference model, which may include 𝐱. In reparameterized (and differentiable) form, this is:

    for i=1K: (4.3)
    ϵip(ϵi) (4.4)
    𝐳i=𝐠i(ϵi,Pa(𝐳i),ϕ) (4.5)
  2. 2.

    Evaluating the scalar value (logp𝜽(𝐱,𝐳)-logqϕ(𝐳|𝐱)) at the resulting sample 𝐳 and datapoint 𝐱. This scalar is the unbiased stochastic estimate lower bound on logp𝜽(𝐱). It is also differentiable and optimizable with SGD.

4.1.1 Choice of ordering

(a) VAE with bottom-up inference.
(b) VAE with top-down inference.
Figure 4.1: Illustration, taken from  Kingma et al. (2016), of two choices of directionality of the inference model. Sharing directionality of inference, as in (b), has the benefit that it allows for straightforward sharing of parameters between the generative model and the inference model.

It should be noted that the choice of latent variables’ topological ordering for the inference model, can be different from the choice of ordering for the generative model.

Since the inference model has the data as root node, while the generative model has the data as leaf node, one (in some sense) logical choice would be to let the topological ordering of the latent variables in the inference model, be the reverse of the ordering in the generative model.

In multiple works  (Salimans, 2016; Kaae Sønderby et al., 2016; Kingma et al., 2016) it has been shown that it can be advantageous to let the generative model and inference model share the topological ordering of latent variables. The two choices of ordering are illustrated in figure  4.1. One advantage of shared ordering, as explained in these works, is that this allows us to easily share parameters between the inference and generative models, leading to faster learning and better solutions.

To see why this might be a good idea, one should realize that the true posterior over the latent variables, is a function of the prior:

p𝜽(𝐳|𝐱)p𝜽(𝐳)p𝜽(𝐱|𝐳) (4.6)

Likewise, the posterior of a latent variable given its parents (in the generative model), is:

p𝜽(𝐳i|𝐱,Pa(𝐳i))p𝜽(𝐳i|Pa(𝐳i))p𝜽(𝐱|𝐳i,Pa(𝐳i)) (4.7)

Optimization of the generative model changes both p𝜽(𝐳i|Pa(𝐳i)) and p𝜽(𝐱|𝐳i,Pa(𝐳i)). By coupling the inference model qϕ(𝐳i|𝐱,Pa(𝐳i)) and prior p𝜽(𝐳i|Pa(𝐳i)), changes in p𝜽(𝐳i|Pa(𝐳i)) can be directly reflected in changes in qϕ(𝐳i|Pa(𝐳i)).

This coupling is especially straightforward when p𝜽(𝐳i|Pa(𝐳i)) is Gaussian distributed. The inference model can be directly specified as the product of this Gaussian distribution, with a learned quadratic pseudo-likelihood term: qϕ(𝐳i|Pa(𝐳i),𝐱)=p𝜽(𝐳i|Pa(𝐳i))l~(𝐳i;𝐱,Pa(𝐳i))/Z, where Z is tractable to compute. This idea explored by  (Salimans, 2016) and  (Kaae Sønderby et al., 2016). In principle this idea could be extended to a more general class of conjugate priors, by no work on this is known at the time of writing.

A less constraining variant, explored by  Kingma et al. (2016), is to simply let the neural network that parameterizes qϕ(𝐳i|Pa(𝐳i),𝐱) be partially specified by a part of the neural network that parameterizes p𝜽(𝐳i|Pa(𝐳i)). In general, we can let the two distributions share parameters. This allows for more complicated posteriors, like normalizing flows or IAF.

4.2 Alternative methods for increasing expressivity of generative models

Typically, especially with large data sets, we wish to choose an expressive class of directed models, such that it can feasibly approximate the true distribution. Popular strategies for specifying expressive models are:

  • Introduction of latent variables into the directed models, and optimization through (amortized) variational inference, as explained in this work.

  • Full autoregression: factorization of distributions into univariate (one-dimensional) conditionals, or at least very low-dimensional conditionals (section 4.3).

  • Specification of distributions through invertible transformations with tractable Jacobian determinant (section 4.4).

Synthesis from fully autoregressive models models is relatively slow, since the length of computation for synthesis from such models is linear in the dimensionality of the data. The length of computation of the log-likelihood of fully autoregressive models does not necesarilly scale with the dimensionality of the data In this respect, introduction of latent variables for improving expressivity is especially interesting when 𝐱 is very high-dimensional. It is relatively straightforward and computationally attractive, due to parallelizability, to specify directed models over high-dimensional variables where each conditional factorizes into independent distributions. For example, if we let p𝜽(𝐱j|Pa(𝐱j))=kp𝜽(xj,k|Pa(𝐱j), where each factor is a univariate Gaussian whose means and variance are nonlinear functions (specified by a neural network) of the parents Pa(𝐱j), then computations for both synthesis and evaluation of log-likelihood can be fully parallelized across dimensions k. See (Kingma et al., 2016) for experiments demonstrating a 100x improvement in speed of synthesis.

The best models to date, in terms of attained log-likelihood on test data, employ a combination of the three approaches listed above.

4.3 Autoregressive Models

A powerful strategy for modeling high-dimensional data is to divide up the high-dimensional observed variables into small constituents (often single dimensional parts, or otherwise just parts with a small number of dimensions), impose a certain ordering, and to model their dependencies as a directed graphical model. The resulting directed graphical model breaks up the joint distribution into a product of a factors:

p𝜽(𝐱)=p𝜽(x1,,xD)=p𝜽(x1)j=2Tp𝜽(xj|Pa(𝐱j)) (4.8)

where D is the dimensionality of the data. This is known as an autoregressive (AR) model. In case of neural network based autoregressive models, we let the conditional distributions be parameterized with a neural network:

p𝜽(xj|𝐱<j)=p𝜽(xj|NeuralNet𝜽j(Pa(𝐱j))) (4.9)

In case of continuous data, autoregressive models can be interpreted as a special case of a more general approach: learning an invertible transformation from the data to a simpler, known distribution such as a Gaussian or Uniform distribution; this approach with invertible transformations is discussed in section  4.4. The techniques of autoregressive models and invertible transformations can be naturally combined with variational autoencoders, and at the time of writing, the best systems use a combination Rezende and Mohamed (2015); Kingma et al. (2016); Gulrajani et al. (2016).

A disadvantage of autoregressive models, compared to latent-variable models, is that ancestral sampling from autoregressive models is a sequential operation computation of 𝒪(D) length, i.e. proportional to the dimensionality of the data. Autoregressive models also requires one to choose a specific ordering of input elements (equation (4.8)). When no single natural one-dimensional ordering exists, like in two-dimensional images, this leads to a model with a somewhat awkward inductive bias.

4.4 Invertible transformations with tractable Jacobian determinant

In case of continuous data, autoregressive models can be interpreted as a special case of a more general approach: learning an invertible transformation with tractable Jacobian determinant (also called normalizing flow) from the data to a simpler, known distribution such as a Gaussian or Uniform distribution. If we use neural networks for such invertible mappings, this is a powerful and flexible approach towards probabilistic modeling of continuous data and nonlinear independent component analysis (Deco and Brauer, 1995).

Such normalizing flows iteratively update a variable, which is constrained to be of the same dimensionality as the data, to a target distribution. This constraint on the dimensionality if intermediate states of the mapping, can make such transformations more challenging to optimize than methods without such constraint. An obvious advantage, on the other hand, is that the likelihood and its gradient are tractable. In (Dinh et al., 2014, 2016), particularly interesting flows (NICE and Real NVP) were introduced, with equal computational cost and depth in both directions, making it both relatively cheap to optimize and to sample from such models. At the time of writing, no such models has not yet been demonstrated to lead to the similar performance as purely autoregressive or VAE-based models in terms of data log-likelihood, but this remains an active area of research.

4.5 Follow-Up Work

Some important applications and motivations for deep generative models and variational autoencoders are:

  • Representation learning: learning better representations of the data. Some uses of this are:

    • Data-efficient learning, such as semi-supervised learning

    • Visualisation of data as low-dimensional manifolds

  • Artificial creativity: plausible interpolation between data and extrapolation from data.

Here we will now highlight some concrete applications to representation learning and artificial creativity.

4.5.1 Representation Learning

In case of supervised learning, we typically aim to learn a conditional distribution: to predict the distribution over the possible values of a variable, given the value of some another variable. One such problem is that of image classification: given an image, the prediction of a distribution over the possible class labels. Through the yearly ImageNet competion (Russakovsky et al., 2015), it has become clear that deep convolutional neural networks (LeCun et al., 1998; Goodfellow et al., 2016) (CNNs), given a large amount of labeled images, are extraordinarily good at solving the image classification task. Modern versions of CNNs based on residual networks, which is a variant of LSTM-type neural networks (Hochreiter and Schmidhuber, 1997), now arguably achieves human-level classification accuracy on this task (He et al., 2015b, a).

When the number of labeled examples is low, solutions found with purely supervised approaches tend to exhibit poor generalization to new data. In such cases, generative models can be employed as an effective type of regularization. One particular strategy, presented in  Kingma et al. (2014), is to optimize the classification model jointly with a variational autoencoder over the input variables, sharing parameters between the two. The variational autoencoder, in this case, provides an auxiliary objective, improving the data efficiency of the classification solution. Through sharing of statistical strength between modeling problems, this can greatly improve upon the supervised classification error. Techniques based on VAEs are now among state of the art for semi-supervised classification (Maaløe et al., 2016), with on average under 1% classification error in the MNIST classification problem, when trained with only 10 labeled images per class, i.e. when more than 99.8% of the labels in the training set were removed. In concurrent work (Rezende et al., 2016b), it was shown that VAE-based semi-supervised learning can even do well when only a single sample per class is presented.

A standard supervised approach, GoogLeNet (Szegedy et al., 2015), which normally achieves near state-of-the-art performance on the ImageNet validation set, achieves only around 5% top-1 classification accuracy when trained with only 1% of the labeled images, as shown by Pu et al. (2016). In contrast, they show that a semi-supervised approach with VAEs achieves around 45% classification accuracy on the same task, when modeling the labels jointly with the labeled and unlabeled input images.

4.5.2 Understanding of data, and artificial creativity

Generative models with latent spaces allow us to transform the data into a simpler latent space, explore it in that space, and understand it better. A related branch of applications of deep generative models is the synthesis of plausible pseudo-data with certain desirable properties, sometimes coined as artificial creativity. Chemical Design
Figure 4.2: (a) Application of a VAE to chemical design in (Gómez-Bombarelli et al., 2016). A latent continuous representation 𝐳 of molecules is learned on a large dataset of molecules. (b) This continuous representation enables gradient-based search of new molecules that maximizes f(𝐳), a certain desired property.

One example of a recent scientific application of artificial creativity, is shown in  Gómez-Bombarelli et al. (2016). In this paper, a fairly straightforward VAE is trained on hundreds of thousands of existing chemical structures. The resulting continuous representation (latent space) is subsequently used to perform gradient-based optimization towards certain properties; the method is demonstrated on the design of drug-like molecules and organic light-emitting diodes. See figure 4.2. Natural Language Synthesis
Figure 4.3: An application of VAEs to interpolation between pairs of sentences, from (Bowman et al., 2015). The intermediate sentences are grammatically correct, and the topic and syntactic structure are typically locally consistent.

A similar approach was used to generating natural-language sentences from a continuous space by Bowman et al. (2015). In this paper, it is shown how a VAE can be successfully trained on text. The model is shown to succesfully interpolate between sentences, and for imputation of missing words. See figure 4.3. Astronomy

In (Ravanbakhsh et al., 2016), VAEs are applied to simulate observations of distant galaxies. This helps with the calibration of systems that need to indirectly detect the shearing of observations of distant galaxies, caused by weak gravitational lensing in the presence of dark matter between earth and those galaxies. Since the lensing effects are so weak, such systems need to be calibrated with ground-truth images with a known amount of shearing. Since real data is still limited, the proposed solution is to use deep generative models for synthesis of pseudo-data. Image (Re-)Synthesis
Figure 4.4: VAEs can be used for image resynthesis. In this example by  White (2016), an original image (left) is modified in a latent space in the direction of a smile vector, producing a range of versions of the original, from smiling to sadness.

A popular application is image (re)synthesis. One can optimize a VAE to form a generative model over images. One can synthesize images from the generative model, but the inference model (or encoder) also allows one to encode real images into a latent space. One can modify the encoding in this latent space, then decode the image back into the observed space. Relatively simply transformations in the observed space, such as linear transformations, often translate into semantically meaningful modifications of the original image. One example, as demonstrated by White (2016), is the modification of images in latent space along a "smile vector" in order to make them more happy, or more sad looking. See figure 4.4 for an example.

4.5.3 Other relevant follow-up work

We unfortunately do not have space to discuss all follow-up work in depth, but will here highlight a selection of relevant recent work.

In addition to our original publication (Kingma and Welling, 2013), two later papers have proposed equivalent algorithms Rezende et al. (2014); Lázaro-Gredilla (2014), where the latter work applies the same reparameterization gradient method to the estimation of parameter posteriors, rather than amortized latent-variable inference.

In the appendix of (Kingma and Welling, 2013) we proposed to apply the reparameterization gradients to estimation of parameter posteriors. In (Blundell et al., 2015) this method, with a mixture-of-Gaussians prior and named Bayes by Backprop, was used in experiments with some promising early results. In (Kingma et al., 2015) we describe a refined method, the local reparameterization trick, for further decreasing the variance of the gradient estimator, and applied it to estimation of Gaussian parameter posteriors. Further results were presented in (Louizos et al., 2017; Louizos and Welling, 2017, 2016) with increasingly sophisticated choices of priors and approximate posteriors. In  (Kingma et al., 2015; Gal and Ghahramani, 2016), a similar reparameterization was used to analyze Dropout as a Bayesian method, coined Variational Dropout. In  (Molchanov et al., 2017) this method was further analyzed and refined. Various papers have applied reparameterization gradients for estimating parameter posteriors, including (Fortunato et al., 2017) in the context of recurrent neural networks and (Kucukelbir et al., 2016) more generally for Bayesian models and in (Tran et al., 2017) for deep probabilistic programming. A Bayesian nonparametric variational family based in the Gaussian Process using reparameterization gradients was proposed in (Tran et al., 2015).

Normalizing flows (Rezende and Mohamed, 2015) were proposed as a framework for improving the flexibility of inference models. In  Kingma et al. (2016), the first normalizing flow was proposed that scales well to high-dimensional latent spaces. The same principle was later applied in (Papamakarios et al., 2017) for density estimation, and further refined in (Huang et al., 2018). Various other flows were proposed in (Tomczak and Welling, 2016, 2017) and  (Berg et al., 2018).

As an alternative to (or in conjunction with) normalizing flows, one can use auxiliary variables to improve posterior flexibility. This principle was, to the best of our knowledge, first proposed in  Salimans et al. (2015). In this paper, the principle was used in a combination of variational inference with Hamiltonian Monte Carlo (HMC), with the momentum variables of HMC as auxiliary variables. Auxiliary variables were more elaborately discussed in in (Maaløe et al., 2016) as Auxiliary Deep Generative Models. Similarly, one can use deep models with multiple stochastic layers to improve the variational bound, as demonstrated in  (Sønderby et al., 2016a) and  (Sønderby et al., 2016b) as Ladder VAEs.

There has been plenty of follow-up work on gradient variance reduction for the variational parameters of discrete latent variables, as opposed to continuous latent variables for which reparameterization gradients apply. These proposals include NVIL  (Mnih and Gregor, 2014), MuProp  (Gu et al., 2015), Variational inference for Monte Carlo objectives (Mnih and Rezende, 2016), the Concrete distribution  (Maddison et al., 2016) and Categorical Reparameterization with Gumbel-Softmax (Jang et al., 2016).

The ELBO objective can be generalized into an importance-weighted objective, as proposed in  (Burda et al., 2015) (Importance-Weighted Autoencoders). This potentially reduces the variance in the gradient, but has not been discussed in-depth here since (as often the case with importance-weighted estimators) it can be difficult to scale to high-dimensional latent spaces. Other objectives have been proposed such as Rényi divergence variational inference (Li and Turner, 2016), Generative Moment Matching Networks (Li et al., 2015), objectives based on normalizing such as NICE and RealNVP flows (Sohl-Dickstein et al., 2015; Dinh et al., 2014), black-box α-divergence minimization (Hernández-Lobato et al., 2016) and Bi-directional Helmholtz Machines (Bornschein et al., 2016).

Various combinations with adversarial objectives have been proposed. In  (Makhzani et al., 2015), the "adversarial autoencoder" (AAE) was proposed, a probabilistic autoencoder that uses a generative adversarial networks (GAN) to perform variational inference. In  (Dumoulin et al., 2016) Adversarially Learned Inference (ALI) was proposed, which aims to minimize a GAN objective between the joint distributions qϕ(𝐱,𝐳) and p𝜽(𝐱,𝐳). Other hybrids have been proposed as well (Larsen et al., 2015; Brock et al., 2016; Hsu et al., 2017).

One of the most prominent, and most difficult, applications of generative models is image modeling. In (Kulkarni et al., 2015) (Deep convolutional inverse graphics network), a convolutional VAE was applied to modeling images with some success, building on work by  (Dosovitskiy et al., 2014) proposing convolutional networks for image synthesis. In  (Gregor et al., 2015) (DRAW), an attention mechanism was combined with a recurrent inference model and recurrent generative model for image synthesis. This approach was further extended in  (Gregor et al., 2016) (Towards Conceptual Compression) with convolutional networks, scalable to larger images, and applied to image compression. In (Kingma et al., 2016), deep convolutional inference models and generative models were also applied to images. Furthermore,  (Gulrajani et al., 2016) (PixelVAE) and  (Chen et al., 2016) (Variational Lossy Autoencoder) combined convolutional VAEs with the PixelCNN model (van den Oord et al., 2016b, a). Methods and VAE architectures for controlled image generation from attributes or text were studied in (Kingma et al., 2014; Yan et al., 2016; Mansimov et al., 2015; Brock et al., 2016; Yeh et al., 2016; White, 2016). Predicting the color of pixels based on a grayscale image is another promising application (Deshpande et al., 2016). The application to semi-supervised learning has been studied in  (Kingma et al., 2014; Pu et al., 2016; Xu et al., 2017) among other work.

Another prominent application of VAEs is modeling of text and or sequential data (Bayer and Osendorfer, 2014; Fabius and van Amersfoort, 2014; Krishnan et al., 2015; Bowman et al., 2015; Serban et al., 2016; Johnson et al., 2016; Karl et al., 2016; Fraccaro et al., 2016; Miao et al., 2016; Semeniuta et al., 2017; Zhao et al., 2017; Yang et al., 2017; Hu et al., 2017). VAEs have also been applied to speech and handwriting  Chung et al. (2015). Sequential models typically use recurrent neural networks, such as LSTMs (Hochreiter and Schmidhuber, 1997), as encoder and/or decoder. When modeling sequences, the validity of a sequence can sometimes by constrained by a context-free grammar. In this case, incorporation of the grammar in VAEs can lead to better models, as shown in  (Kusner et al., 2017) (Grammar VAEs), and applied to modeling molecules in textual representations.

Since VAEs can transform discrete observation spaces to continuous latent-variable spaces with approximately known marginals, they are interesting for use in model-based control (Watter et al., 2015; Pritzel et al., 2017). In  (Heess et al., 2015b) (Stochastic Value Gradients) it was shown that the re-parameterization of the observed variables, together with an observation model, can be used to compute novel forms of policy gradients. In  (Heess et al., 2015a), the reparameterization trick was used for memory-based control. Variational inference and reparameterization gradients have also been used for variational information maximisation for intrinsically motivated reinforcement learning (Mohamed and Rezende, 2015) and VIME (Houthooft et al., 2016) for improved exploration. Variational autoencoders have also been used as components in models that perform iterative reasoning about objects in a scene (Eslami et al., 2016).

In (Higgins et al., 2016) (β-VAE) it was proposed to strengthen the contribution of DKL(qϕ(𝐳|𝐱)||p𝜽(𝐳)), thus restricting the information flow through the latent space, which was shown to improve disentanglement of latent factors, further studied in (Chen et al., 2018).

Other applications include modeling of graphs (Kipf and Welling, 2016) (Variational Graph Autoencoders), learning of 3D structure from images (Rezende et al., 2016a), one-shot learning (Rezende et al., 2016b), learning nonlinear state space models (Krishnan et al., 2017), voice conversion from non-parallel corpora (Hsu et al., 2016), discrimination-aware (fair) representations  (Louizos et al., 2015) and transfer learning  (Edwards and Storkey, 2016).

The reparameterization gradient estimator discussed in this work has been extended in various directions (Ruiz et al., 2016), including acceptance-rejection sampling algorithms  (Naesseth et al., 2017). The gradient variance can in some cases be reduced by ’carving up the ELBO’  (Hoffman and Johnson, 2016; Roeder et al., 2017) and using a modified gradient estimator. A second-order gradient estimator has also been proposed in (Fan et al., 2015).

All in all, this remains an actively researched area with frequently exciting developments.

Chapter 5 Conclusion

Directed probabilistic models form an important aspect of modern artificial intelligence. Such models can be made incredibly flexible by parameterizing the conditional distributions with differentiable deep neural networks.

Optimization of such models towards the maximum likelihood objective is straightforward in the fully-observed case. However, one is often more interested in flexible models with latent variables, such as deep latent-variable models, or Bayesian models with random parameters. In both cases one needs to perform approximate posterior estimation for which variational inference (VI) methods are suitable. In VI, inference is cast as an optimization problem over newly introduced variational parameters, typically optimized towards the ELBO, a lower bound on the model evidence, or marginal likelihood of the data. Existing methods for such posterior inference were either relatively inefficient, or not applicable to models with neural networks as components. Our main contribution is a framework for efficient and scalable gradient-based variational posterior inference and approximate maximum likelihood learning.

In this paper we describe the variational autoencoder and its extensions. It is a combination of a deep latent-variable model (DLVM) with continuous latent variables, and an associated inference model. The DLVM is a type of generative model over the data. The inference model, also called encoder or recognition model, approximates the posterior distribution of the latent variables of the generative model. Both the generative model and the inference model are directed graphical models that are wholly or partially parameterized by deep neural networks. The parameters of the models, including the parameters of the neural networks such as the weights and biases, are jointly optimized by performing stochastic gradient ascent on the so-called evidence lower bound (ELBO). The ELBO is a lower bound on the marginal likelihood of the data, also called the variational lower bound. Stochastic gradients, necessary for performing SGD, are obtained through a basic reparameterization trick. The VAE framework is now a commonly used tool for various applications of probabilistic modeling and artificial creativity, and basic implementations are available in most major deep learning software libraries.

For learning flexible inference models, we proposed inverse autoregressive flows (IAF), a type of normalizing flow that allows scaling to high-dimensional latent spaces. The VAE framework is compatible with almost arbitrary differentiable neural networks. We showed that a VAE with an IAF posterior and ResNets, a novel type of neural network, can learn models of natural images that are close to state-of-the-art in terms of log-likelihood, while allowing for orders of magnitude faster sampling. An interesting direction for further exploration is comparison with transformations with computationally cheap inverses, such as NICE (Dinh et al., 2014) and Real NVP (Dinh et al., 2016). Application of such transformations in the VAE framework can potentially lead to relatively simple VAEs with a combination of powerful posteriors, priors and decoders. Such architectures can potentially rival or surpass purely autoregressive architectures (van den Oord et al., 2016a), while allowing much faster synthesis.

The proposed VAE framework remains the only framework in the literature that allows for both discrete and continuous observed variables, allows for efficient amortized latent-variable inference and fast synthesis, and which can produce close to state-of-the-art performance in terms of the log-likelihood of data.

Chapter 6 Appendix

6.1 Notation and definitions

6.1.1 Notation

Example(s) Description
𝐱,𝐲𝐳 With characters in bold we typically denote random vectors. We also use this notation for collections of random variables variables.
x,y,z With characters in italic we typically denote random scalars, i.e. single real-valued numbers.
𝐗,𝐘,𝐙 With bold and capitalized letters we typically denote random matrices.
Pa(𝐳) The parents of random variable 𝐳 in a directed graph.
diag(𝐱) Diagonal matrix, with the values of vector 𝐱 on the diagonal.
𝐱𝐲 Element-wise multiplication of two vectors. The resulting vector is (x1y1,,xKyK)T.
θ Parameters of a (generative) model are typically denoted with the Greek lowercase letter θ (theta).
ϕ Variational parameters are typically denoted with the bold Greek letter ϕ (phi).
p(𝐱),p(𝐳) Probability density functions (PDFs) and probability mass functions (PMFs), also simply called distributions, are denoted by p(.), q(.) or r(.).
p(𝐱,𝐲,𝐳) Joint distributions are denoted by p(.,.)
p(𝐱|𝐳) Conditional distributions are denoted by p(.|.)
p(.;θ),pθ(𝐱) The parameters of a distribution are denoted with p(.;θ) or equivalently with subscript pθ(.).
p(𝐱=𝐚),p(𝐱𝐚) We may use an (in-)equality sign within a probability distribution to distinguish between function arguments and value at which to evaluate. So p(𝐱=𝐚) denotes a PDF or PMF over variable 𝐱 evaluated at the value of variable 𝐚. Likewise, p(𝐱𝐚) denotes a CDF evaluated at the value of 𝐚.
p(.),q(.) We use different letters to refer to different probabilistic models, such as p(.) or q(.). Conversely, we use the same letter across different marginals/conditionals to indicate they relate to the same probabilistic model.

6.1.2 Definitions

Term Description
Probability density function (PDF) A function that assigns a probability density to each possible value of given continuous random variables.
Cumulative distribution function (CDF) A function that assigns a cumulative probability density to each possible value of given univariate continuous random variables.
Probability mass function (PMF) A function that assigns a probability mass to given discrete random variable.

6.1.3 Distributions

We overload the notation of distributions (e.g. p(𝐱)=𝒩(𝐱;𝝁,𝚺)) with two meanings: (1) a distribution from which we can sample, and (2) the probability density function (PDF) of that distribution.

Term Description
Categorical(x;𝐩) Categorical distribution, with parameter 𝐩 such that ipi=1.
Bernoulli(𝐱;𝐩) Multivariate distribution of independent Bernoulli. Bernoulli(𝐱;𝐩)=iBernoulli(xi;pi) with i:0pi1.
Normal(𝐱;𝝁,𝚺)=𝒩(𝐱;𝝁,𝚺) Multivariate Normal distribution with mean 𝝁 and covariance 𝚺. Chain rule of probability
p(𝐚,𝐛)=p(𝐚)p(𝐛|𝐚) (6.1) Bayes’ Rule
p(𝐚|𝐛)=p(𝐛|𝐚)p(𝐚)/p(𝐛) (6.2)

6.1.4 Bayesian Inference

Let p(θ) be a chosen marginal distribution over its parameters θ, called a prior distribution. Let 𝒟 be observed data, p(𝒟|θ)p𝜽(𝒟) be the probability assigned to the data under the model with parameters θ. Recall the chain rule in probability:


Simply re-arranging terms above, the posterior distribution over the parameters θ, taking into account the data 𝒟, is:

p(θ|𝒟)=p(𝒟|θ)p(θ)p(𝒟)p(𝒟|θ)p(θ) (6.3)

where the proportionality () holds since p(𝒟) is a constant that is not dependent on parameters θ. The formula above is known as Bayes’ rule, a fundamental formula in machine learning and statistics, and is of special importance to this work.

A principal application of Bayes’ rule is that it allows us to make predictions about future data 𝐱, that are optimal as long as the prior p(θ) and model class p𝜽(𝐱) are correct:


6.2 Alternative methods for learning in DLVMs

6.2.1 Maximum A Posteriori

From a Bayesian perspective, we can improve upon the maximum likelihood objective through maximum a posteriori (MAP) estimation, which maximizes the log-posterior w.r.t. θ. With i.i.d. data 𝒟, this is:

LMAP(θ) =logp(θ|𝒟) (6.4)
=logp(θ)+LML(θ)+constant (6.5)

The prior p(θ) in equation (6.5) has diminishing effect for increasingly large N. For this reason, in case of optimization with large datasets, we often choose to simply use the maximum likelihood criterion by omitting the prior from the objective, which is numerically equivalent to setting p(θ)=constant.

6.2.2 Variational EM with local variational parameters

Expectation Maximization (EM) is a general strategy for learning parameters in partially observed models (Dempster et al., 1977). See section 6.2.3 for a discussion of EM using MCMC. The method can be explained as coordinate ascent on the ELBO (Neal and Hinton, 1998). In case of of i.i.d. data, traditional variational EM methods estimate local variational parameters ϕ(i), i.e. a separate set of variational parameters per datapoint i in the dataset. In contrast, VAEs employ a strategy with global variational parameters.

EM starts out with some (random) initial choice of 𝜽 and ϕ(1:N). It then iteratively applies updates:

i=1,,N:ϕ(i)argmaxϕ(𝐱(i);𝜽,ϕ)   (E-step) (6.6)
𝜽argmax𝜽i=1N(𝐱(i);𝜽,ϕ)   (M-step) (6.7)

until convergence. Why does this work? Note that at the E-step:

argmaxϕ(𝐱;𝜽,ϕ) (6.8)
=argmaxϕ[logp𝜽(𝐱)-DKL(qϕ(𝐳|𝐱)||p𝜽(𝐳|𝐱))] (6.9)
=argminϕDKL(qϕ(𝐳|𝐱)||p𝜽(𝐳|𝐱)) (6.10)

so the E-step, sensibly, minimizes the KL divergence of qϕ(𝐳|𝐱) from the true posterior.

Secondly, note that if qϕ(𝐳|𝐱) equals p𝜽(𝐳|𝐱), the ELBO equals the marginal likelihood, but that for any choice of qϕ(𝐳|𝐱), the M-step optimizes a bound on the marginal likelihood. The tightness of this bound is defined by DKL(qϕ(𝐳|𝐱)||p𝜽(𝐳|𝐱)).

6.2.3 MCMC-EM

Another Bayesian approach towards optimizing the likelihood p𝜽(𝐱) with DLVMs is Expectation Maximization (EM) with Markov Chain Monte Carlo (MCMC). In case of MCMC, the posterior is approximated by a mixture of a set of approximately i.i.d. samples from the posterior, acquired by running a Markov chain. Note that posterior gradients in DLVMs are relatively affordable to compute by differentiating the log-joint distribution w.r.t. 𝐳:

𝐳logp𝜽(𝐳|𝐱) =𝐳log[p𝜽(𝐱,𝐳)/p𝜽(𝐱)] (6.11)
=𝐳[logp𝜽(𝐱,𝐳)-logp𝜽(𝐱)] (6.12)
=𝐳logp𝜽(𝐱,𝐳)-𝐳logp𝜽(𝐱) (6.13)
=𝐳logp𝜽(𝐱,𝐳) (6.14)

One version of MCMC which uses such posterior for relatively fast convergence, is Hamiltonian MCMC (Neal, 2011). A disadvantage of this approach is the requirement for running an independent MCMC chain per datapoint.

6.3 Stochastic Gradient Descent

We work with directed models where the objective per datapoint is scalar, and due to the differentiability of neural networks that compose them, the objective is differentiable w.r.t. its parameters θ. Due to the remarkable efficiency of reverse-mode automatic differentiation (also known as the backpropagation algorithm (Rumelhart et al., 1988)), the value and gradient (i.e. the vector of partial derivatives) of differentiable scalar objectives can be computed with equal time complexity. In SGD, we iteratively update parameters θ:

θt+1θt+αtθL~(θ,ξ) (6.15)

where αt is a learning rate or preconditioner, and L~(θ,ξ) is an unbiased estimate of the objective L(θ), i.e. 𝔼ξp(ξ)[L~(θ,ξ)]=L(θ). The random variable ξ could e.g. be a datapoint index, uniformly sampled from {1,,N}, but can also include different types of noise such posterior sampling noise in VAEs. In experiments, we have typically used the Adam and Adamax optimization methods for choosing αt (Kingma and Ba, 2015); these methods are invariant to constant rescaling of the objective, and invariant to constant re-scalings of the individual gradients. As a result, L~(θ,ξ) only needs to be unbiased up to proportionality. We iteratively apply eq. (6.15) until a stopping criterion is met. A simple but effective criterion is to stop optimization as soon as the probability of a holdout set of data starts decreasing; this criterion is called early stopping.


  • Banerjee [2008] Arindam Banerjee. An Analysis of Logistic Models: Exponential Family Connections and Online Performance. 2008.
  • Bayer and Osendorfer [2014] Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.
  • Bengio and Thibodeau-Laufer [2013] Yoshua Bengio and Éric Thibodeau-Laufer. Deep generative stochastic networks trainable by backprop. arXiv preprint arXiv:1306.1091, 2013.
  • Bengio et al. [2013] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE, 2013.
  • Berg et al. [2018] Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.
  • Blei et al. [2012] David M Blei, Michael I Jordan, and John W Paisley. Variational Bayesian inference with Stochastic Search. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1367–1374, 2012.
  • Blundell et al. [2015] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
  • Bornschein et al. [2016] Jorg Bornschein, Samira Shabanian, Asja Fischer, and Yoshua Bengio. Bidirectional helmholtz machines. In International Conference on Machine Learning, pages 2511–2519, 2016.
  • Bourlard and Kamp [1988] Hervé Bourlard and Yves Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological cybernetics, 59(4-5):291–294, 1988.
  • Bowman et al. [2015] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
  • Brock et al. [2016] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093, 2016.
  • Burda et al. [2015] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
  • Chen et al. [2018] Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.
  • Chen et al. [2016] Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.
  • Chung et al. [2015] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2980–2988, 2015.
  • Cremer and Duvenaud [2017] Quaid Cremer, Chris ad Morris and David Duvenaud. Re-interpreting importance weighted autoencoders. ICLR 2017, 2017.
  • Dayan et al. [1995] Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The helmholtz machine. Neural computation, 7(5):889–904, 1995.
  • Deco and Brauer [1995] Gustavo Deco and Wilfried Brauer. Higher order statistical decorrelation without information loss. Advances in Neural Information Processing Systems, pages 247–254, 1995.
  • Dempster et al. [1977] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38, 1977.
  • Deshpande et al. [2016] Aditya Deshpande, Jiajun Lu, Mao-Chuang Yeh, and David A Forsyth. Learning diverse image colorization. CoRR, abs/1612.01958, 1, 2016.
  • Dinh et al. [2014] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
  • Dinh et al. [2016] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
  • Dosovitskiy et al. [2014] Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. arXiv preprint arXiv:1411.5928, 2014.
  • Dumoulin et al. [2016] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
  • Edwards and Storkey [2016] Harrison Edwards and Amos Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.
  • Eslami et al. [2016] SM Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Geoffrey E Hinton, et al. Attend, infer, repeat: Fast scene understanding with generative models. In Advances In Neural Information Processing Systems, pages 3225–3233, 2016.
  • Fabius and van Amersfoort [2014] Otto Fabius and Joost R van Amersfoort. Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581, 2014.
  • Fan et al. [2015] Kai Fan, Ziteng Wang, Jeff Beck, James Kwok, and Katherine A Heller. Fast second order stochastic backpropagation for variational inference. In Advances in Neural Information Processing Systems, pages 1387–1395, 2015.
  • Fortunato et al. [2017] Meire Fortunato, Charles Blundell, and Oriol Vinyals. Bayesian recurrent neural networks. arXiv preprint arXiv:1704.02798, 2017.
  • Fraccaro et al. [2016] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Advances in neural information processing systems, pages 2199–2207, 2016.
  • Fu [2006] Michael C Fu. Gradient estimation. Handbooks in operations research and management science, 13:575–616, 2006.
  • Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, pages 1019–1027, 2016.
  • Germain et al. [2015] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for distribution estimation. arXiv preprint arXiv:1502.03509, 2015.
  • Gershman and Goodman [2014] Samuel Gershman and Noah Goodman. Amortized inference in probabilistic reasoning. In CogSci, 2014.
  • Glasserman [2013] Paul Glasserman. Monte Carlo methods in financial engineering, volume 53. Springer Science & Business Media, 2013.
  • Glynn [1990] Peter W Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
  • Gómez-Bombarelli et al. [2016] Rafael Gómez-Bombarelli, David Duvenaud, José Miguel Hernández-Lobato, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. arXiv preprint arXiv:1610.02415, 2016.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. Book in preparation for MIT Press, 2016. URL
  • Graves [2011] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
  • Gregor et al. [2013] Karol Gregor, Andriy Mnih, and Daan Wierstra. Deep AutoRegressive Networks. arXiv preprint arXiv:1310.8499, 2013.
  • Gregor et al. [2015] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
  • Gregor et al. [2016] Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. In Advances In Neural Information Processing Systems, pages 3549–3557, 2016.
  • Grover et al. [2017] Aditya Grover, Manik Dhar, and Stefano Ermon. Flow-gan: Bridging implicit and prescribed learning in generative models. arXiv preprint arXiv:1705.08868, 2017.
  • Gu et al. [2015] Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. arXiv preprint arXiv:1511.05176, 2015.
  • Gulrajani et al. [2016] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. PixelVAE: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
  • He et al. [2015a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015a.
  • He et al. [2015b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015b.
  • Heess et al. [2015a] Nicolas Heess, Jonathan J Hunt, Timothy P Lillicrap, and David Silver. Memory-based control with recurrent neural networks. arXiv preprint arXiv:1512.04455, 2015a.
  • Heess et al. [2015b] Nicolas Heess, Gregory Wayne, David Silver, Tim Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015b.
  • Hernández-Lobato et al. [2016] José Miguel Hernández-Lobato, Yingzhen Li, Mark Rowland, Daniel Hernández-Lobato, Thang Bui, and Richard Eric Turner. Black-box α-divergence minimization. 2016.
  • Higgins et al. [2016] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. 2016.
  • Hinton et al. [1995] Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The "wake-sleep" algorithm for unsupervised neural networks. SCIENCE, pages 1158–1158, 1995.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural computation, 9(8):1735–1780, 1997.
  • Hoffman and Johnson [2016] Matthew D Hoffman and Matthew J Johnson. Elbo surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016.
  • Hoffman et al. [2013] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
  • Houthooft et al. [2016] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.
  • Hsu et al. [2016] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang. Voice conversion from non-parallel corpora using variational auto-encoder. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016 Asia-Pacific, pages 1–6. IEEE, 2016.
  • Hsu et al. [2017] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang. Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks. arXiv preprint arXiv:1704.00849, 2017.
  • Hu et al. [2017] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. Controllable text generation. arXiv preprint arXiv:1703.00955, 2017.
  • Huang et al. [2018] Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. arXiv preprint arXiv:1804.00779, 2018.
  • Jang et al. [2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
  • Johnson et al. [2016] Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pages 2946–2954, 2016.
  • Jozefowicz et al. [2015] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2342–2350, 2015.
  • Kaae Sønderby et al. [2016] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. How to train deep variational autoencoders and probabilistic ladder networks. arXiv preprint arXiv:1602.02282, 2016.
  • Karl et al. [2016] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. Deep variational bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432, 2016.
  • Kavukcuoglu et al. [2008] Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. Fast inference in sparse coding algorithms with applications to object recognition. Technical Report CBLL-TR-2008-12-01, Computational and Biological Learning Lab, Courant Institute, NYU, 2008.
  • Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Proceedings of the International Conference on Learning Representations 2015, 2015.
  • Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. Proceedings of the 2nd International Conference on Learning Representations, 2013.
  • Kingma et al. [2014] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
  • Kingma et al. [2015] Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
  • Kingma et al. [2016] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
  • Kipf and Welling [2016] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
  • Kleijnen and Rubinstein [1996] Jack PC Kleijnen and Reuven Y Rubinstein. Optimization and sensitivity analysis of computer simulation models by the score function method. European Journal of Operational Research, 88(3):413–427, 1996.
  • Koller and Friedman [2009] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • Krishnan et al. [2015] Rahul G Krishnan, Uri Shalit, and David Sontag. Deep Kalman filters. arXiv preprint arXiv:1511.05121, 2015.
  • Krishnan et al. [2017] Rahul G Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In AAAI, pages 2101–2109, 2017.
  • Kucukelbir et al. [2016] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M Blei. Automatic differentiation variational inference. arXiv preprint arXiv:1603.00788, 2016.
  • Kulkarni et al. [2015] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.
  • Kusner et al. [2017] Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. arXiv preprint arXiv:1703.01925, 2017.
  • Larsen et al. [2015] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
  • Lázaro-Gredilla [2014] Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. ICML, 2014.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • LeCun et al. [2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • Li and Turner [2016] Yingzhen Li and Richard E Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, pages 1073–1081, 2016.
  • Li et al. [2015] Yujia Li, Kevin Swersky, and Richard S Zemel. Generative moment matching networks. In ICML, pages 1718–1727, 2015.
  • Linsker [1989] Ralph Linsker. An application of the principle of maximum information preservation to linear systems. Morgan Kaufmann Publishers Inc., 1989.
  • Louizos and Welling [2016] Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix gaussian posteriors. In International Conference on Machine Learning, pages 1708–1716, 2016.
  • Louizos and Welling [2017] Christos Louizos and Max Welling. Multiplicative normalizing flows for variational bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017.
  • Louizos et al. [2015] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoencoder. arXiv preprint arXiv:1511.00830, 2015.
  • Louizos et al. [2017] Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. arXiv preprint arXiv:1705.08665, 2017.
  • Maaløe et al. [2016] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.
  • Maddison et al. [2016] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
  • Makhzani et al. [2015] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
  • Mansimov et al. [2015] Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Generating images from captions with attention. arXiv preprint arXiv:1511.02793, 2015.
  • Miao et al. [2016] Yishu Miao, Lei Yu, and Phil Blunsom. Neural variational inference for text processing. In International Conference on Machine Learning, pages 1727–1736, 2016.
  • Mnih and Gregor [2014] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In The 31st International Conference on Machine Learning (ICML), 2014.
  • Mnih and Rezende [2016] Andriy Mnih and Danilo J Rezende. Variational inference for Monte Carlo objectives. arXiv preprint arXiv:1602.06725, 2016.
  • Mohamed and Rezende [2015] Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pages 2125–2133, 2015.
  • Molchanov et al. [2017] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.
  • Naesseth et al. [2017] Christian Naesseth, Francisco Ruiz, Scott Linderman, and David Blei. Reparameterization gradients through acceptance-rejection sampling algorithms. In Artificial Intelligence and Statistics, pages 489–498, 2017.
  • Neal [2011] R Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pages 113–162, 2011.
  • Neal and Hinton [1998] Radford M Neal and Geoffrey E Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Springer, 1998.
  • Paisley et al. [2012] John Paisley, David Blei, and Michael Jordan. Variational Bayesian inference with stochastic search. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1367–1374, 2012.
  • Papamakarios et al. [2017] George Papamakarios, Iain Murray, and Theo Pavlakou. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2335–2344, 2017.
  • Pritzel et al. [2017] Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. arXiv preprint arXiv:1703.01988, 2017.
  • Pu et al. [2016] Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin. Variational autoencoder for deep learning of images, labels and captions. In Advances in Neural Information Processing Systems, pages 2352–2360, 2016.
  • Ranganath et al. [2014] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pages 814–822, 2014.
  • Ranganath et al. [2015] Rajesh Ranganath, Dustin Tran, and David M Blei. Hierarchical variational models. arXiv preprint arXiv:1511.02386, 2015.
  • Ravanbakhsh et al. [2016] Siamak Ravanbakhsh, Francois Lanusse, Rachel Mandelbaum, Jeff Schneider, and Barnabas Poczos. Enabling dark energy science with deep generative models of galaxy images. arXiv preprint arXiv:1609.05796, 2016.
  • Rezende and Mohamed [2015] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of The 32nd International Conference on Machine Learning, pages 1530–1538, 2015.
  • Rezende et al. [2014] Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1278–1286, 2014.
  • Rezende et al. [2016a] Danilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3d structure from images. In Advances In Neural Information Processing Systems, pages 4997–5005, 2016a.
  • Rezende et al. [2016b] Danilo Jimenez Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. One-shot generalization in deep generative models. arXiv preprint arXiv:1603.05106, 2016b.
  • Roeder et al. [2017] Geoffrey Roeder, Yuhuai Wu, and David Duvenaud. Sticking the landing: An asymptotically zero-variance gradient estimator for variational inference. arXiv preprint arXiv:1703.09194, 2017.
  • Rosca et al. [2018] Mihaela Rosca, Balaji Lakshminarayanan, and Shakir Mohamed. Distribution matching in variational inference. arXiv preprint arXiv:1802.06847, 2018.
  • Roweis [1998] Sam Roweis. EM algorithms for PCA and SPCA. Advances in neural information processing systems, pages 626–632, 1998.
  • Ruiz et al. [2016] Francisco R Ruiz, Michalis Titsias RC AUEB, and David Blei. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems, pages 460–468, 2016.
  • Rumelhart et al. [1988] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1, 1988.
  • Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • Salakhutdinov and Larochelle [2010] Ruslan Salakhutdinov and Hugo Larochelle. Efficient learning of deep boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pages 693–700, 2010.
  • Salimans [2016] Tim Salimans. A structured variational auto-encoder for learning deep hierarchies of sparse features. arXiv preprint arXiv:1602.08734, 2016.
  • Salimans and Knowles [2013] Tim Salimans and David A Knowles. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4), 2013.
  • Salimans et al. [2015] Tim Salimans, Diederik P Kingma, and Max Welling. Markov Chain Monte Carlo and variational inference: Bridging the gap. In ICML, volume 37, pages 1218–1226, 2015.
  • Semeniuta et al. [2017] Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. A hybrid convolutional variational autoencoder for text generation. arXiv preprint arXiv:1702.02390, 2017.
  • Serban et al. [2016] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069, 2016.
  • Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585, 2015.
  • Sønderby et al. [2016a] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, S Kaae Sønderby, and Ole Winther. How to train deep variational autoencoders and probabilistic ladder networks. arXiv preprint arXiv:1602.02282, 2016a.
  • Sønderby et al. [2016b] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in neural information processing systems, pages 3738–3746, 2016b.
  • Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
  • Tomczak and Welling [2016] Jakub M Tomczak and Max Welling. Improving variational auto-encoders using householder flow. arXiv preprint arXiv:1611.09630, 2016.
  • Tomczak and Welling [2017] Jakub M Tomczak and Max Welling. Improving variational auto-encoders using convex combination linear inverse autoregressive flow. arXiv preprint arXiv:1706.02326, 2017.
  • Tran et al. [2015] Dustin Tran, Rajesh Ranganath, and David M Blei. The variational Gaussian process. arXiv preprint arXiv:1511.06499, 2015.
  • Tran et al. [2017] Dustin Tran, Matthew D Hoffman, Rif A Saurous, Eugene Brevdo, Kevin Murphy, and David M Blei. Deep probabilistic programming. arXiv preprint arXiv:1701.03757, 2017.
  • van den Oord et al. [2016a] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016a.
  • van den Oord et al. [2016b] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016b.
  • Vincent et al. [2010] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.
  • Watter et al. [2015] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pages 2746–2754, 2015.
  • White [2016] Tom White. Sampling generative networks: Notes on a few effective techniques. arXiv preprint arXiv:1609.04468, 2016.
  • Williams [1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • Wingate and Weber [2013] David Wingate and Theophane Weber. Automated variational inference in probabilistic programming. arXiv preprint arXiv:1301.1299, 2013.
  • Xu et al. [2017] Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. Variational autoencoder for semi-supervised text classification. In AAAI, pages 3358–3364, 2017.
  • Yan et al. [2016] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.
  • Yang et al. [2017] Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. arXiv preprint arXiv:1702.08139, 2017.
  • Yeh et al. [2016] Raymond Yeh, Ziwei Liu, Dan B Goldman, and Aseem Agarwala. Semantic facial expression editing using autoencoded flow. arXiv preprint arXiv:1611.09961, 2016.
  • Zhao et al. [2017] Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960, 2017.