Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse

  • 2019-11-06 16:34:04
  • James Lucas, George Tucker, Roger Grosse, Mohammad Norouzi

Abstract

Posterior collapse in Variational Autoencoders (VAEs) arises when the variational posterior distribution closely matches the prior for a subset of latent variables. This paper presents a simple and intuitive explanation for posterior collapse through the analysis of linear VAEs and their direct correspondence with Probabilistic PCA (pPCA). We explain how posterior collapse may occur in pPCA due to local maxima in the log marginal likelihood. Unexpectedly, we prove that the ELBO objective for the linear VAE does not introduce additional spurious local maxima relative to log marginal likelihood. We show further that training a linear VAE with exact variational inference recovers an identifiable global maximum corresponding to the principal component directions. Empirically, we find that our linear analysis is predictive even for high-capacity, non-linear VAEs and helps explain the relationship between the observation noise, local maxima, and posterior collapse in deep Gaussian VAEs.

 


University of Toronto; Google Brain
(James Lucas: intern at Google Brain)

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

The generative process of a deep latent variable model entails drawing a number of latent factors from the prior and using a neural network to convert such factors to real data points. Maximum likelihood estimation of the parameters requires marginalizing out the latent factors, which is intractable for deep latent variable models. The influential work of Kingma and Welling (2013) and Rezende et al. (2014) on Variational Autoencoders (VAEs) enables optimization of a tractable lower bound on the likelihood via a reparameterization of the Evidence Lower Bound (ELBO) (Jordan et al., 1999; Blei et al., 2017). This has led to a surge of recent interest in automatic discovery of the latent factors of variation for a data distribution based on VAEs and principled probabilistic modeling (Higgins et al., 2016; Bowman et al., 2015; Chen et al., 2018; Gomez-Bombarelli et al., 2018).

Code available at https://sites.google.com/view/dont-blame-the-elbo

Unfortunately, the quality and the number of the latent factors learned is influenced by a phenomenon known as posterior collapse, where the generative model learns to ignore a subset of the latent variables. Most existing papers suggest that posterior collapse is caused by the KL-divergence term in the ELBO objective, which directly encourages the variational distribution to match the prior (Bowman et al., 2015; Kingma et al., 2016; Sønderby et al., 2016). Thus, a wide range of heuristic approaches in the literature have attempted to diminish the effect of the KL term in the ELBO to alleviate posterior collapse (Bowman et al., 2015; Razavi et al., 2019; Sønderby et al., 2016; Huang et al., 2018). While holding the KL term responsible for posterior collapse makes intuitive sense, the mathematical mechanism of this phenomenon is not well understood. In this paper, we investigate the connection between posterior collapse and spurious local maxima in the ELBO objective through the analysis of linear VAEs. Unexpectedly, we show that spurious local maxima may arise even in the optimization of exact marginal likelihood, and such local maxima are linked with a collapsed posterior.

While linear autoencoders (Rumelhart et al., 1985) have been studied extensively (Baldi and Hornik, 1989; Kunin et al., 2019), little attention has been given to their variational counterpart from a theoretical standpoint. A well-known relationship exists between linear autoencoders and PCA – the optimal solution of a linear autoencoder has decoder weight columns that span the same subspace as the one defined by the principal components (Baldi and Hornik, 1989). Similarly, the maximum likelihood solution of probabilistic PCA (pPCA) (Tipping and Bishop, 1999) recovers the subspace of principal components. In this work, we show that a linear variational autoencoder can recover the solution of pPCA. In particular, by specifying a diagonal covariance structure on the variational distribution, one can recover an identifiable autoencoder, which at the global maximum of the ELBO recovers the exact principal components as the columns of the decoder’s weights. Importantly, we show that the ELBO objective for a linear VAE does not introduce any local maxima beyond the log marginal likelihood.

The study of linear VAEs gives us new insights into the cause of posterior collapse and the difficulty of VAE optimization more generally. Following the analysis of Tipping and Bishop (1999), we characterize the stationary points of pPCA and show that the variance of the observation model directly influences the stability of local stationary points corresponding to posterior collapse – it is only possible to escape these sub-optimal solutions by simultaneously reducing noise and learning better features. Our contributions include:

  • We verify that linear VAEs can recover the true posterior of pPCA. Further, we prove that the global optimum of the linear VAE recovers the principal components (not just their spanning sub-space). More importantly, we prove that using ELBO to train linear VAEs does not introduce any additional spurious local maxima relative to log marginal likelihood training.

  • While high-capacity decoders are often blamed for posterior collapse, we show that posterior collapse may occur when optimizing log marginal likelihood even without powerful decoders. Our experiments verify the analysis of the linear setting and show that these insights extend even to high-capacity non-linear VAEs. Specifically, we provide evidence that the observation noise in deep Gaussian VAEs plays a crucial role in overcoming local maxima corresponding to posterior collapse.

2 Preliminaries

Probabilistic PCA.

The probabilistic PCA (pPCA) model is defined as follows. Suppose latent variables 𝐳 ∈ ℝ^k generate data 𝐱 ∈ ℝ^n. A standard Gaussian prior is used for 𝐳 and a linear generative model with a spherical Gaussian observation model for 𝐱:

$$p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad p(\mathbf{x} \mid \mathbf{z}) = \mathcal{N}(\mathbf{W}\mathbf{z} + \boldsymbol{\mu}, \sigma^2 \mathbf{I}). \tag{1}$$

The pPCA model is a special case of factor analysis (Bartholomew, 1987), which uses a spherical covariance σ²𝐈 instead of a full covariance matrix. As pPCA is fully Gaussian, both the marginal distribution for 𝐱 and the posterior p(𝐳|𝐱) are Gaussian, and unlike factor analysis, the maximum likelihood estimates of 𝐖 and σ² are tractable (Tipping and Bishop, 1999).

Variational Autoencoders.

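Recently, amortized variational inference has gained popularity as a means to learn complicated latent variable models. In these models, the log marginal likelihood, log p(𝐱), is intractable but a variational distribution, denoted q(𝐳|𝐱), is used to approximate the posterior p(𝐳|𝐱), allowing tractable approximate inference using the Evidence Lower Bound (ELBO):

\begin{align}
\log p(\mathbf{x}) &= \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})}\big[\log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z} \mid \mathbf{x})\big] + D_{\mathrm{KL}}\big(q(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z} \mid \mathbf{x})\big) \tag{2} \\
&\geq \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})}\big[\log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z} \mid \mathbf{x})\big] \tag{3} \\
&= \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})}\big[\log p(\mathbf{x} \mid \mathbf{z})\big] - D_{\mathrm{KL}}\big(q(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\big) \quad (:= \mathrm{ELBO}) \tag{4}
\end{align}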

The ELBO (Jordan et al., 1999; Blei et al., 2017) consists of two terms, the KL divergence between the variational distribution, q(𝐳|𝐱), and prior, p(𝐳), and the expected conditional log-likelihood. The KL divergence forces the variational distribution towards the prior and so has reasonably been the focus of many attempts to alleviate posterior collapse. We hypothesize that the log marginal likelihood itself often encourages posterior collapse.
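The decomposition in Eqs. (2)-(4) can be checked numerically in the linear Gaussian case, where every term has a closed form. The following is a minimal sketch, not the authors' code: the dimensions, the random parameters, and the helper names (`gaussian_kl`, `log_marginal`, `elbo`) are illustrative assumptions, and the exact posterior is taken from the pPCA formula given as Eq. (6) in Section 4.1.

```python
import numpy as np

def gaussian_kl(m_q, S_q, m_p, S_p):
    """KL( N(m_q, S_q) || N(m_p, S_p) ) for full covariance matrices."""
    k = m_q.shape[0]
    S_p_inv = np.linalg.inv(S_p)
    diff = m_p - m_q
    _, logdet_q = np.linalg.slogdet(S_q)
    _, logdet_p = np.linalg.slogdet(S_p)
    return 0.5 * (np.trace(S_p_inv @ S_q) + diff @ S_p_inv @ diff
                  - k + logdet_p - logdet_q)

def log_marginal(x, W, mu, sigma2):
    """log p(x) for the linear Gaussian model: x ~ N(mu, W W^T + sigma2 I)."""
    n = x.shape[0]
    C = W @ W.T + sigma2 * np.eye(n)
    r = x - mu
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(C, r))

def elbo(x, W, mu, sigma2, m_q, S_q):
    """Eq. (4) for q(z|x) = N(m_q, S_q) and prior p(z) = N(0, I)."""
    n, k = W.shape
    r = x - W @ m_q - mu
    # E_q ||x - W z - mu||^2 = ||x - W m_q - mu||^2 + tr(W S_q W^T)
    expected_loglik = (-0.5 * n * np.log(2 * np.pi * sigma2)
                       - (r @ r + np.trace(W @ S_q @ W.T)) / (2 * sigma2))
    return expected_loglik - gaussian_kl(m_q, S_q, np.zeros(k), np.eye(k))

# Hypothetical problem sizes and parameters, chosen only for illustration.
rng = np.random.default_rng(0)
n, k, sigma2 = 6, 3, 0.5
W = rng.standard_normal((n, k))
mu = rng.standard_normal(n)
x = rng.standard_normal(n)

# An arbitrary (non-optimal) Gaussian variational distribution.
m_q = rng.standard_normal(k)
A = rng.standard_normal((k, k))
S_q = A @ A.T + 0.1 * np.eye(k)

# Exact posterior of the linear Gaussian model.
M = W.T @ W + sigma2 * np.eye(k)
post_mean = np.linalg.solve(M, W.T @ (x - mu))
post_cov = sigma2 * np.linalg.inv(M)

lhs = log_marginal(x, W, mu, sigma2)
bound = elbo(x, W, mu, sigma2, m_q, S_q)
gap = gaussian_kl(m_q, S_q, post_mean, post_cov)   # last term of Eq. (2)
```

Here `lhs` should equal `bound + gap` up to floating-point error, and `gap` vanishes only when `q` matches the exact posterior.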

In Variational Autoencoders (VAEs), two neural networks are used to parameterize q_ϕ(𝐳|𝐱) and p_θ(𝐱|𝐳), where ϕ and θ denote two sets of neural network weights. The encoder maps an input 𝐱 to the parameters of the variational distribution, and then the decoder maps a sample from the variational distribution back to the inputs.

Posterior collapse.

A dominant issue with VAE optimization is posterior collapse, in which the learned variational distribution is close to the prior. This reduces the capacity of the generative model, making it impossible for the decoder network to make use of the information content of all of the latent dimensions. While posterior collapse is widely acknowledged, formally defining it has remained a challenge. We introduce a formal definition in Section 6.2 which we use to measure posterior collapse in trained deep neural networks.

a) σ² = λ_4   b) σ² = λ_6   c) σ² = λ_8
Figure 1: Stationary points of pPCA. Two zero-columns of 𝐖 are perturbed in the directions of two orthogonal principal components (μ_5 and μ_7) and the optimization landscape around zero-columns is shown, where the goal is to maximize log marginal likelihood. The stability of the stationary points depends critically on σ² (the observation noise). Left: σ² is too large to capture either principal component. Middle: σ² is too large to capture one of the principal components. Right: σ² is able to capture both principal components.

3 Related Work

Dai et al. (2017) discuss the relationship between robust PCA methods (Candès et al., 2011) and VAEs. They show that at stationary points the VAE objective locally aligns with pPCA under certain assumptions. We study the pPCA objective explicitly and show a direct correspondence with linear VAEs. Dai et al. (2017) showed that the covariance structure of the variational distribution may smooth out the loss landscape. This is an interesting result whose interaction with ours is an exciting direction for future research.

He et al. (2019) motivate posterior collapse through an investigation of the learning dynamics of deep VAEs. They suggest that posterior collapse is caused by the inference network lagging behind the true posterior during the early stages of training. A related line of research studies issues arising from approximate inference causing a mismatch between the variational distribution and true posterior (Cremer et al., 2018; Kim et al., 2018; Hjelm et al., 2016). By contrast, we show that posterior collapse may exist even when the variational distribution matches the true posterior exactly.

Alemi et al. (2017) used an information theoretic framework to study the representational properties of VAEs. They show that with infinite model capacity there are solutions with equal ELBO and log marginal likelihood which span a range of representations, including posterior collapse. We find that even with weak (linear) decoders, posterior collapse may occur. Moreover, we show that in the linear case this posterior collapse is due entirely to the log marginal likelihood.

The most common approach for dealing with posterior collapse is to anneal a weight on the KL term during training from 0 to 1 (Bowman et al., 2015; Sønderby et al., 2016; Maaløe et al., 2019; Higgins et al., 2016; Huang et al., 2018). Unfortunately, this means that during the annealing process, one is no longer optimizing a bound on the log-likelihood. Also, it is difficult to design these annealing schedules and we have found that once regular ELBO training resumes the posterior will typically collapse again (Section 6.2).

Kingma et al. (2016) propose a constraint on the KL term, termed "free-bits", where the gradient of the KL term per dimension is ignored if the KL is below a given threshold. Unfortunately, this method reportedly has some negative effects on training stability (Razavi et al., 2019; Chen et al., 2016). Delta-VAEs (Razavi et al., 2019) instead choose prior and variational distributions such that the variational distribution can never exactly recover the prior, allocating free-bits implicitly. Several other papers have studied alternative formulations of the VAE objective (Rezende and Viola, 2018; Dai and Wipf, 2019; Alemi et al., 2017; Ma et al., 2019; Yeung et al., 2017). Dai and Wipf (2019) analyzed the VAE objective to improve image fidelity under Gaussian observation models and also discuss the importance of the observation noise. Other approaches have explored changing the VAE network architecture to help alleviate posterior collapse, for example by adding skip connections (Maaløe et al., 2019; Dieng et al., 2018).

Rolinek et al. (2018) observed that the diagonal covariance used in the variational distribution of VAEs encourages orthogonal representations. They use linearizations of deep networks to prove their results under a modification of the objective function by explicitly ignoring latent dimensions with posterior collapse. Our formulation is distinct in focusing on linear VAEs without modifying the objective function and proving an exact correspondence between the global solution of linear VAEs and the principal components.

Kunin et al. (2019) studied the optimization challenges in the linear autoencoder setting. They exposed an equivalence between pPCA and Bayesian autoencoders and point out that when σ² is too large information about the latent code is lost. A similar phenomenon is discussed in the supervised learning setting by Chechik et al. (2005). Kunin et al. (2019) also showed that suitable regularization allows the linear autoencoder to recover the principal components up to rotations. We show that linear VAEs with a diagonal covariance structure recover the principal components exactly.

4 Analysis of linear VAE

This section compares and analyzes the loss landscapes of both pPCA and linear variational autoencoders. We first discuss the stationary points of pPCA and then show that a simple linear VAE can recover the global optimum of pPCA. Moreover, when the data covariance eigenvalues are distinct, the linear VAE identifies the individual principal components, unlike pPCA, which recovers only the PCA subspace. Finally, we prove that ELBO does not introduce any additional spurious maxima to the loss landscape.

4.1 Probabilistic PCA Revisited

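The pPCA model (Eq. (1)) is a fully Gaussian linear model, thus we can compute both the marginal distribution for 𝐱 and the posterior p(𝐳|𝐱) in closed form:

\begin{align}
p(\mathbf{x}) &= \mathcal{N}(\boldsymbol{\mu}, \mathbf{W}\mathbf{W}^\top + \sigma^2 \mathbf{I}), \tag{5} \\
p(\mathbf{z} \mid \mathbf{x}) &= \mathcal{N}(\mathbf{M}^{-1}\mathbf{W}^\top(\mathbf{x} - \boldsymbol{\mu}), \sigma^2 \mathbf{M}^{-1}), \tag{6}
\end{align}

where 𝐌 = 𝐖ᵀ𝐖 + σ²𝐈. This model is particularly interesting to analyze in the setting of variational inference, as the ELBO can also be computed in closed form (see Appendix C).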
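As a sanity check on Eqs. (5) and (6), the posterior written in terms of 𝐌 can be compared with the result of conditioning the joint Gaussian over (𝐳, 𝐱) directly. This is a small sketch under assumed, randomly drawn parameters; the function names are hypothetical and not taken from the paper's code.

```python
import numpy as np

def ppca_posterior(x, W, mu, sigma2):
    """Posterior mean and covariance from Eq. (6), using M = W^T W + sigma2 I."""
    k = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(k)
    mean = np.linalg.solve(M, W.T @ (x - mu))
    cov = sigma2 * np.linalg.inv(M)
    return mean, cov

def posterior_by_conditioning(x, W, mu, sigma2):
    """Same posterior via standard Gaussian conditioning on the joint of (z, x)."""
    n, k = W.shape
    C = W @ W.T + sigma2 * np.eye(n)      # marginal covariance of x, Eq. (5)
    gain = np.linalg.solve(C, W).T        # W^T C^{-1}, shape (k, n)
    mean = gain @ (x - mu)
    cov = np.eye(k) - gain @ W
    return mean, cov

# Illustrative values only.
rng = np.random.default_rng(1)
n, k, sigma2 = 7, 3, 0.3
W = rng.standard_normal((n, k))
mu = rng.standard_normal(n)
x = rng.standard_normal(n)

mean_a, cov_a = ppca_posterior(x, W, mu, sigma2)
mean_b, cov_b = posterior_by_conditioning(x, W, mu, sigma2)
```

The two parameterizations agree by the Woodbury identity; the form in Eq. (6) only requires inverting a k×k matrix, which is cheaper when k is much smaller than n.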

Stationary points of pPCA

We now characterize the stationary points of pPCA, largely repeating the thorough analysis of Tipping and Bishop (1999) (see Appendix A of their paper). The maximum likelihood estimate of 𝝁 is the mean of the data. We can compute 𝐖_MLE and σ²_MLE as follows:

\begin{align}
\sigma^2_{\mathrm{MLE}} &= \frac{1}{n-k} \sum_{j=k+1}^{n} \lambda_j, \tag{7} \\
\mathbf{W}_{\mathrm{MLE}} &= \mathbf{U}_k (\boldsymbol{\Lambda}_k - \sigma^2_{\mathrm{MLE}} \mathbf{I})^{1/2} \mathbf{R}. \tag{8}
\end{align}

Here 𝐔_k corresponds to the first k principal components of the data with the corresponding eigenvalues λ_1, …, λ_k stored in the k×k diagonal matrix 𝚲_k. The matrix 𝐑 is an arbitrary rotation matrix which accounts for weak identifiability in the model. We can interpret σ²_MLE as the average variance lost in the projection. The MLE solution is the global optimum. Other stationary points correspond to zeroing out columns of 𝐖_MLE (posterior collapse).
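Eqs. (7) and (8) translate directly into a few lines of linear algebra. The sketch below is an illustration on synthetic data (the data, sizes, and function names are assumptions, not the paper's experimental setup); it takes 𝐑 = 𝐈 and compares the log marginal likelihood at the MLE with the value obtained after zeroing one column of 𝐖_MLE, which is one of the collapsed stationary points described above.

```python
import numpy as np

def ppca_mle(X, k):
    """Closed-form pPCA maximum likelihood estimate (Eqs. (7)-(8)) with R = I."""
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)          # sample covariance (1/N)
    eigvals, eigvecs = np.linalg.eigh(S)            # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    sigma2 = eigvals[k:].mean()                     # Eq. (7)
    W = eigvecs[:, :k] * np.sqrt(eigvals[:k] - sigma2)   # Eq. (8), column scaling
    return W, mu, sigma2, eigvals

def avg_log_marginal(W, sigma2, S):
    """Average log p(x) over data with mean mu and sample covariance S."""
    n = S.shape[0]
    C = W @ W.T + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + np.trace(np.linalg.solve(C, S)))

# Synthetic data with distinct per-axis scales (hypothetical, for illustration).
rng = np.random.default_rng(0)
scales = np.array([3.0, 2.0, 1.5, 1.0, 0.5, 0.25])
X = rng.standard_normal((2000, 6)) * scales

k = 3
W, mu, sigma2, eigvals = ppca_mle(X, k)
S = np.cov(X, rowvar=False, bias=True)

ll_mle = avg_log_marginal(W, sigma2, S)

# Zero out the last column: a collapsed stationary point at the same sigma2.
W_collapsed = W.copy()
W_collapsed[:, -1] = 0.0
ll_collapsed = avg_log_marginal(W_collapsed, sigma2, S)
```

With 𝐑 = 𝐈 the columns of `W` are orthogonal, so `W.T @ W` is diagonal, and `ll_collapsed` falls below `ll_mle`.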

Stability of 𝐖MLE

In this section we consider σ² to be fixed and not necessarily equal to the MLE solution. Eq. (8) remains a stationary point when an arbitrary fixed σ² is substituted for σ²_MLE. One surprising observation is that σ² directly controls the stability of the stationary points of the log marginal likelihood (see Appendix A). In Figure 1, we illustrate one such stationary point of pPCA for different values of σ². We computed this stationary point by taking 𝐖 to have three principal component columns and zeros elsewhere. Each plot shows the same stationary point perturbed by two orthogonal vectors corresponding to other principal components.

The stability of the pPCA stationary points depends on the size of σ²: as σ² increases, the stationary point tends towards a stable local maximum so that we cannot learn the additional components. Intuitively, the model prefers to explain deviations in the data with the larger observation noise. Fortunately, decreasing σ² will increase likelihood at these stationary points so that when learning σ² simultaneously these stationary points are saddle points (Tipping and Bishop, 1999). Therefore, learning σ² is necessary for gaining a full latent representation.
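The dependence of stability on σ² can be reproduced in a toy setting that mimics the construction of Figure 1. The sketch below is illustrative only: the eigenvalues 9, 8, …, 1 and the step size are assumptions, and the data covariance is taken to be diagonal so that the principal directions are the coordinate axes. A zero column of 𝐖 is nudged toward a principal direction, and the sign of the change in log marginal likelihood indicates whether the collapsed point is stable in that direction.

```python
import numpy as np

# Hypothetical spectrum: lambda_1..lambda_9 = 9, 8, ..., 1.
eigvals = np.arange(9.0, 0.0, -1.0)
S = np.diag(eigvals)      # data covariance; principal directions = coordinate axes
U = np.eye(9)

def avg_log_marginal(W, sigma2, S):
    """Average log marginal likelihood for covariance S at fixed sigma2."""
    n = S.shape[0]
    C = W @ W.T + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + np.trace(np.linalg.solve(C, S)))

def stationary_point(sigma2, num_kept=3, k=5):
    """W with the top `num_kept` principal-component columns and zeros elsewhere."""
    W = np.zeros((9, k))
    W[:, :num_kept] = U[:, :num_kept] * np.sqrt(eigvals[:num_kept] - sigma2)
    return W

def gain_from_perturbation(sigma2, component, step=0.1):
    """Change in log-likelihood when a zero column moves toward u_component."""
    W = stationary_point(sigma2)
    W_pert = W.copy()
    W_pert[:, -1] += step * U[:, component]
    return avg_log_marginal(W_pert, sigma2, S) - avg_log_marginal(W, sigma2, S)

# Perturb toward the 5th and 7th principal directions (0-based indices 4 and 6)
# for sigma2 equal to lambda_4, lambda_6 and lambda_8, as in Figure 1.
sigma2_values = {"lambda_4": eigvals[3], "lambda_6": eigvals[5], "lambda_8": eigvals[7]}
gains = {
    name: (gain_from_perturbation(s2, component=4),
           gain_from_perturbation(s2, component=6))
    for name, s2 in sigma2_values.items()
}
```

A negative entry in `gains` means the perturbation lowers the likelihood (the collapsed point is locally stable in that direction); a positive entry means the model can escape. In this toy setup the sign flips exactly when σ² drops below the eigenvalue of the perturbed direction.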

4.2 Linear VAEs recover pPCA

We now show that linear VAEs can recover the globally optimal solution to Probabilistic PCA. We will consider the following VAE model,

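$$p(\mathbf{x} \mid \mathbf{z}) = \mathcal{N}(\mathbf{W}\mathbf{z} + \boldsymbol{\mu}, \sigma^2 \mathbf{I}), \qquad q(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\mathbf{V}(\mathbf{x} - \boldsymbol{\mu}), \mathbf{D}), \tag{9}$$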

where 𝐃 is a diagonal covariance matrix, used globally for all of the data points. While this is a significant restriction compared to typical VAE architectures, which define an amortized variance for each input point, this is sufficient to recover the global optimum of the probabilistic model.

Lemma 1.

The global maximum of the ELBO objective (Eq. (4)) for the linear VAE (Eq. (9)) is identical to the global maximum for the log marginal likelihood of pPCA (Eq. (5)).

Proof.

Note that the global optimum of pPCA is defined up to an orthogonal transformation of the columns of 𝐖, i.e., any rotation 𝐑 in Eq. (8) results in a matrix 𝐖_MLE that given σ²_MLE attains maximum marginal likelihood. The linear VAE model defined in Eq. (9) is able to recover the global optimum of pPCA when 𝐑=𝐈. Recall from Eq. (6) that p(𝐳|𝐱) is defined in terms of 𝐌 = 𝐖ᵀ𝐖 + σ²𝐈. When 𝐑=𝐈, we obtain 𝐌 = 𝐖_MLEᵀ𝐖_MLE + σ²_MLE𝐈 = 𝚲_k, which is diagonal. Thus, setting 𝐕 = 𝐌⁻¹𝐖_MLEᵀ and 𝐃 = σ²_MLE𝐌⁻¹ = σ²_MLE𝚲_k⁻¹ recovers the true posterior with diagonal covariance at the global optimum. In this case, the ELBO equals the log marginal likelihood and is maximized when the decoder has weights 𝐖=𝐖_MLE. Because the ELBO lower bounds log-likelihood, the global maximum of the ELBO for the linear VAE is the same as the global maximum of the marginal likelihood for pPCA. ∎

The result of Lemma 1 is somewhat expected because the posterior of pPCA is Gaussian. Further details are given in Appendix C. In addition, we prove a more surprising result that suggests restricting the variational distribution to a Gaussian with a diagonal covariance structure allows one to identify the principal components at the global optimum of ELBO.

Corollary 1.

The global maximum of the ELBO objective (Eq. (4)) for the linear VAE (Eq. (9)) has the scaled principal components as the columns of the decoder network.

Proof.

Follows directly from the proof of Lemma 1 and Eq. (8). ∎

We discuss this result in Appendix B. This full identifiability is non-trivial and is not achieved even with the regularized linear autoencoder (Kunin et al., 2019).
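Lemma 1 and Corollary 1 can be illustrated numerically: with 𝐑 = 𝐈 the diagonal-covariance ELBO matches the log marginal likelihood, while a rotated decoder with the same marginal likelihood attains a strictly smaller ELBO. The sketch below uses a closed-form average ELBO derived for the model in Eq. (9); the spectrum, the rotation angle, and the helper names are assumptions made for illustration, and the encoder is set to the best mean map and best diagonal covariance for each decoder.

```python
import numpy as np

def avg_log_marginal(W, sigma2, S):
    n = S.shape[0]
    C = W @ W.T + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + np.trace(np.linalg.solve(C, S)))

def avg_elbo(W, V, d, sigma2, S):
    """Average ELBO for q(z|x) = N(V (x - mu), diag(d)), data covariance S."""
    n, k = W.shape
    resid = np.eye(n) - W @ V
    # E ||x - W z - mu||^2 = tr((I - W V) S (I - W V)^T) + sum_j d_j ||w_j||^2
    recon = np.trace(resid @ S @ resid.T) + np.sum(d * np.sum(W**2, axis=0))
    expected_loglik = -0.5 * n * np.log(2 * np.pi * sigma2) - recon / (2 * sigma2)
    kl = 0.5 * (np.sum(d) + np.trace(V @ S @ V.T) - k - np.sum(np.log(d)))
    return expected_loglik - kl

def best_diagonal_encoder(W, sigma2):
    """ELBO-optimal V and diagonal D for a fixed decoder W."""
    k = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(k)
    V = np.linalg.solve(M, W.T)       # posterior mean map M^{-1} W^T
    d = sigma2 / np.diag(M)           # optimal diagonal variances
    return V, d

# Hypothetical spectrum with distinct eigenvalues; principal axes = coordinate axes.
eigvals = np.array([5.0, 3.0, 2.0, 0.5, 0.3, 0.1])
S = np.diag(eigvals)
k = 2
sigma2 = eigvals[k:].mean()                                   # Eq. (7)
W_mle = np.eye(6)[:, :k] * np.sqrt(eigvals[:k] - sigma2)      # Eq. (8), R = I

theta = 0.5                                                   # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
W_rot = W_mle @ R                                             # Eq. (8), R != I

ll_mle = avg_log_marginal(W_mle, sigma2, S)
ll_rot = avg_log_marginal(W_rot, sigma2, S)
elbo_mle = avg_elbo(W_mle, *best_diagonal_encoder(W_mle, sigma2), sigma2, S)
elbo_rot = avg_elbo(W_rot, *best_diagonal_encoder(W_rot, sigma2), sigma2, S)
```

The marginal likelihood cannot tell `W_mle` and `W_rot` apart, but the ELBO with a diagonal 𝐃 can, which is the sense in which the diagonal restriction makes the principal components identifiable.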

So far, we have shown that at its global optimum the linear VAE recovers the pPCA solution, which enforces orthogonality of the decoder weight columns. However, the VAE is trained with the ELBO rather than the log marginal likelihood — often using SGD. The majority of existing work suggests that the KL term in the ELBO objective is responsible for posterior collapse. So, we should ask whether this term introduces additional spurious local maxima. Surprisingly, for the linear VAE model the ELBO objective does not introduce any additional spurious local maxima. We provide a sketch of the proof below with full details in Appendix C.

Theorem 1.

The ELBO objective for a linear VAE does not introduce any additional local maxima to the pPCA model.

Proof.

(Sketch) If the decoder has orthogonal columns, then the variational distribution recovers the true posterior at stationary points. Thus, the variational objective will exactly recover the log marginal likelihood. If the decoder does not have orthogonal columns then the variational distribution is no longer tight. However, the ELBO can always be increased by applying an infinitesimal rotation to the right-singular vectors of the decoder towards identity: 𝐖 ↦ 𝐖𝐑_ϵ (so that the decoder columns are closer to orthogonal). This works because the variational distribution can fit the posterior more closely while the log marginal likelihood is invariant to rotations of the weight columns. Thus, any additional stationary points in the ELBO objective must necessarily be saddle points. ∎

The theoretical results presented in this section provide new intuition for posterior collapse in VAEs. In particular, the KL between the variational distribution and the prior is not entirely responsible for posterior collapse — log marginal likelihood has a role. The evidence for this is two-fold. We have shown that log marginal likelihood may have spurious local maxima but also that in the linear case the ELBO objective does not add any additional spurious local maxima. Rephrased, in the linear setting the problem lies entirely with the probabilistic model. We should then ask, to what extent do these results hold in the non-linear setting?

5 Deep Gaussian VAEs

The deep Gaussian VAE consists of a decoder D_θ and an encoder E_ϕ. The ELBO objective can be expressed as,

$$\mathcal{L}(\mathbf{x}; \theta, \phi) = -\mathrm{KL}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\big) - \frac{1}{2\sigma^2}\, \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\big[\lVert D_\theta(\mathbf{z}) - \mathbf{x} \rVert^2\big] - \frac{1}{2}\log(2\pi\sigma^2) \tag{10}$$

The role of σ² in this objective invites a natural comparison to the β-VAE objective (Higgins et al., 2016), where the KL term is weighted by β ∈ ℝ⁺. Alemi et al. (2017) propose using small β values to force powerful decoders to utilize the latent variables, but this comes at the cost of poor ELBO. Practitioners must then use downstream task performance for model selection, thus sacrificing one of the primary benefits of likelihood-based models. However, for a given β, one can find a corresponding σ² (and a learning rate) such that the gradient updates to the network parameters are identical. Importantly, the Gaussian partition function for a Gaussian observation model (the last term on the RHS of Eq. (10)) prevents ELBO from deviating from the β-VAE’s objective with a β-weighted KL term while maintaining the benefits to representation learning when σ² is small. For the Gaussian VAE, this helps connect the dots between the role of local maxima and observation noise in posterior collapse vs. heuristic approaches that attempted to alleviate posterior collapse by diminishing the effect of the KL term (Bowman et al., 2015; Razavi et al., 2019; Sønderby et al., 2016; Huang et al., 2018). In the following section, we will study the nonlinear VAE empirically and explore connections to the linear theory.
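The correspondence between β and σ² can be made explicit with a short derivation. This is a sketch under two assumptions that are not stated in the text: the β-VAE objective is taken to use a unit-variance Gaussian likelihood with additive constants dropped, and the optimizer is plain gradient ascent with a fixed σ².

```latex
% Assumed beta-VAE objective (unit observation variance, constants dropped):
\mathcal{L}_\beta(\mathbf{x}; \theta, \phi)
  = -\beta\, \mathrm{KL}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\big)
    - \tfrac{1}{2}\, \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}
      \big[\lVert D_\theta(\mathbf{z}) - \mathbf{x} \rVert^2\big]

% Multiply Eq. (10) by beta and set sigma^2 = beta:
\beta\, \mathcal{L}(\mathbf{x}; \theta, \phi)\Big|_{\sigma^2 = \beta}
  = -\beta\, \mathrm{KL}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\big)
    - \tfrac{1}{2}\, \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}
      \big[\lVert D_\theta(\mathbf{z}) - \mathbf{x} \rVert^2\big]
    - \tfrac{\beta}{2} \log(2\pi\beta)

% The last term does not depend on (theta, phi), so
\nabla_{\theta, \phi}\, \mathcal{L}_\beta
  = \beta\, \nabla_{\theta, \phi}\, \mathcal{L}\Big|_{\sigma^2 = \beta}
```

Under these assumptions, a gradient step on the β-VAE objective with learning rate η coincides with a step on Eq. (10) at σ² = β with learning rate βη. The two objectives differ in value only through the partition-function term, which is what keeps Eq. (10) a valid bound on the log-likelihood.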

6 Experiments

In this section, we present empirical evidence found from studying two distinct claims. First, we verify our theoretical analysis of the linear VAE model. Second, we explore to what extent these insights apply to deep nonlinear VAEs.

6.1 Linear VAEs

We ran two sets of experiments on 1000 randomly chosen MNIST images. First, we trained linear VAEs with learnable σ² for a range of hidden dimensions. These VAEs were trained using the analytic ELBO (Appendix C.1) and without mini-batching gradients. For each model, we compared the final ELBO to the maximum-likelihood of pPCA finding them to be essentially indistinguishable (as predicted by Lemma 1 and Theorem 1). For the second set of experiments, we took the pPCA MLE solution for 𝐖 for each number of hidden dimensions and computed the likelihood under the observation noise which maximizes likelihood for 50 hidden dimensions. We observed that adding additional principal components (after 50) will initially improve likelihood but eventually adding more components (after 200) actually decreases the likelihood. In other words, the collapsed solution is actually preferred if the observation noise is not set correctly; we observe this theoretically through the stability of the stationary points (e.g. Figure 1).

Figure 2: The log marginal likelihood and optimal ELBO of MNIST pPCA solutions over increasing hidden dimension. Green represents the MLE solution (global maximum), the red dashed line is the optimal ELBO solution which matches the global optimum. The blue line shows the log marginal likelihood of the solutions using the full decoder weights when σ² is fixed to its MLE solution for 50 hidden dimensions.
Figure 3: Stochastic vs analytic ELBO training: using the analytic gradient of the ELBO led to faster convergence and better final ELBO (950.7 vs. 939.3).
Figure 4: VAEs with linear decoders trained on real-valued MNIST with nonlinear preprocessing (Papamakarios et al., 2017). Final average ELBO on training set are (ordered by legend): -1098.2, -1108.7, -1112.1, -1119.6.

Effect of stochastic ELBO estimates

In general, we are unable to compute the ELBO in closed form and so instead rely on unbiased Monte Carlo estimates using the reparameterization trick. These estimates add high-variance noise and can make optimization more challenging (Kingma and Welling, 2013). In the linear model, we can compare the solutions obtained using the stochastic ELBO gradients versus the analytic ELBO (Figure 3). As before, we use 1000 MNIST images to enable full-batch training, so that the only source of noise is the reparameterization trick (Kingma and Welling, 2013). Additional experimental details are in Appendix E. We found that stochastic optimization had slower convergence (when compared to analytic training with the same learning rate) and, unsurprisingly, reached a worse final training ELBO value (in other words, worse steady-state risk due to the gradient variance).
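The difference between the analytic and stochastic estimates can be seen on the reconstruction term alone. The sketch below is an illustration with made-up parameters, not the training code used for Figure 3: it compares the closed-form expectation of the squared error under a diagonal Gaussian q with a reparameterized Monte Carlo estimate.

```python
import numpy as np

def analytic_expected_sq_error(x, W, mu, m, d):
    """E_q ||x - W z - mu||^2 for q = N(m, diag(d)), in closed form."""
    r = x - W @ m - mu
    return r @ r + np.sum(d * np.sum(W**2, axis=0))

def reparam_expected_sq_error(x, W, mu, m, d, num_samples, rng):
    """Monte Carlo estimate of the same quantity via z = m + sqrt(d) * eps."""
    eps = rng.standard_normal((num_samples, m.shape[0]))
    z = m + np.sqrt(d) * eps
    resid = x - z @ W.T - mu
    return np.mean(np.sum(resid**2, axis=1))

# Hypothetical sizes and parameters, for illustration only.
rng = np.random.default_rng(0)
n, k = 8, 3
W = rng.standard_normal((n, k))
mu = rng.standard_normal(n)
x = rng.standard_normal(n)
m = rng.standard_normal(k)
d = np.array([0.5, 0.2, 0.1])

exact = analytic_expected_sq_error(x, W, mu, m, d)
single_sample = [reparam_expected_sq_error(x, W, mu, m, d, 1, rng) for _ in range(100)]
many_samples = reparam_expected_sq_error(x, W, mu, m, d, 200_000, rng)
```

The estimator is unbiased, so `many_samples` approaches `exact`, but the single-sample estimates that a typical training step would use scatter around it; that scatter is the gradient noise the paragraph above refers to.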

Nonlinear Encoders

With a linear decoder and nonlinear encoder, Lemma 1 still holds, and the optimal variational distribution is unchanged because the true posterior has not changed. However, Corollary 1 and Theorem 1 no longer hold in general. Even a deep linear encoder will not have a unique global maximum and new stationary points (possibly maxima) may be introduced to ELBO in general. To investigate how deeper networks may impact optimization of the probabilistic model, we trained linear decoders with varying encoders using ELBO. We do not expect the linear encoder to be outperformed and indeed the empirical results support this (Figure 4).

6.2 Investigating posterior collapse in deep nonlinear VAEs

We explored how the analysis of the linear VAEs extends to deep nonlinear models. To do so, we trained VAEs with Gaussian observation models on the MNIST (LeCun, 1998) and CelebA (Liu et al., 2015) datasets. We apply uniform dequantization as in Papamakarios et al. (2017) in each case. We also adopt the nonlinear logit preprocessing transformation from Papamakarios et al. (2017) to provide fair comparisons with existing work. We also report results of models trained directly in pixel space in the appendix (there is no significant difference for the hypotheses we test).

Measuring posterior collapse

In order to measure the extent of posterior collapse, we introduce the following definition. We say that latent dimension i has (ϵ,δ)-collapsed if ℙ_{𝐱∼p}[KL(q(z_i|𝐱) || p(z_i)) < ϵ] ≥ 1−δ. Note that the linear VAE can suffer (0,0)-collapse. To estimate this practically, we compute the proportion of data samples which induce a variational distribution with KL divergence less than ϵ and finally report the percentage of dimensions which have (ϵ,δ)-collapsed. Throughout this work, we fix δ=0.01 and vary ϵ.
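The estimation procedure described above can be sketched in a few lines. This is a minimal illustration, assuming a diagonal Gaussian encoder that outputs a mean and log-variance per latent dimension and a standard normal prior; the function names and the synthetic encoder outputs are hypothetical.

```python
import numpy as np

def diag_gaussian_kl_per_dim(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ) per sample and latent dimension."""
    return 0.5 * (np.exp(log_var) + mean**2 - 1.0 - log_var)

def collapsed_fraction(mean, log_var, eps, delta=0.01):
    """Fraction of latent dimensions that are (eps, delta)-collapsed.

    mean, log_var: arrays of shape (num_samples, num_latents) produced by the
    encoder on a set of data points.
    """
    kl = diag_gaussian_kl_per_dim(mean, log_var)
    frac_below = (kl < eps).mean(axis=0)       # share of inputs with KL < eps
    collapsed = frac_below >= 1.0 - delta      # per-dimension collapse flag
    return collapsed.mean(), collapsed

# Synthetic encoder outputs: dimensions 0 and 1 equal the prior for every
# input, dimensions 2 and 3 carry information about the input.
rng = np.random.default_rng(0)
num_samples = 1000
mean = np.zeros((num_samples, 4))
log_var = np.zeros((num_samples, 4))
mean[:, 2:] = 2.0 * rng.standard_normal((num_samples, 2))
log_var[:, 2:] = -2.0

fraction, collapsed = collapsed_fraction(mean, log_var, eps=0.01)
# fraction is 0.5: the first two of the four dimensions are flagged as collapsed
```

Sweeping `eps` over a range of thresholds gives a collapse curve of the kind shown in Figures 5 and 6.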

Investigating σ2

We trained MNIST VAEs with 2 hidden layers in both the decoder and encoder, ReLU activations, and 200 latent dimensions. We first evaluated training with fixed values of the observation noise, σ². This mirrors many public VAE implementations where σ² is fixed to 1 throughout training (also observed by Dai and Wipf (2019)); however, our linear analysis suggests that this is suboptimal. Then, we consider the setting where the observation noise and VAE weights are learned simultaneously.

In Table 1 we report the final ELBO of nonlinear VAEs trained on real-valued MNIST. For fixed σ², we found that the final models could have significant differences in ELBO which were maintained even after tuning σ² to the learned representations: the converged representations are worse when σ² is too large, as predicted by the linear model. Additionally, we report the final ELBO values when the model is trained while learning σ² with different initial values of σ². The gap in performance across different initializations is smaller than for fixed σ² but is still significant. The linear VAE does not predict this gap, which suggests that learning σ² correctly is more challenging in the nonlinear case.

Init σ²  Final σ²  ELBO               σ²-tuned ELBO   Tuned σ²  Posterior collapse (%)  KL divergence

MNIST

10.0     (fixed)   -1450.3±4.2        -1098.2±28.3    1.797     89.88                   28.8±1.4
1.0      (fixed)   -1022.1±5.4        -1018.3±5.3     1.145     27.38                   125.4±4.2
0.1      (fixed)   -3697.3±493.3      -1190.8±37.4    0.968     3.25                    368.7±94.6
0.01     (fixed)   -38612.5±1189.8    -2090.8±975.1   0.877     0.00                    695.9±118.1
0.001    (fixed)   -504259.1±49149.8  -1744.7±48.4    0.810     0.00                    756.2±12.6
10.0     1.320     -1022.2±4.5        -1022.3±4.6     1.318     73.75                   73.8±9.8
1.0      1.183     -1011.1±2.7        -1011.1±2.8     1.182     47.88                   106.3±2.5
0.1      1.194     -1025.4±8.6        -1025.4±8.6     1.195     29.25                   116.1±11.4
0.01     1.194     -1030.6±3.5        -1030.5±3.5     1.191     23.00                   121.9±7.7
0.001    1.208     -1038.7±5.6        -1038.8±5.6     1.209     27.00                   124.9±1.6

CELEBA 64

10.0     (fixed)   -73328.4±0.49      -55186.7±35.1   0.2040    80.56                   56.12±0.4
1.0      (fixed)   -59841.8±30.1      -51294.8±333.7  0.1020    2.52                    213.4±6.3
0.1      (fixed)   -50760.3±353.4     -50698.5±393.9  0.0883    32.72                   483.8±36.2
0.01     (fixed)   -82478.7±1823.3    -51373.9±213.3  0.0817    0.00                    1624.2±8.8
0.001    (fixed)   -531924.5±17177.6  -57381.5±512.6  0.0296    0.00                    2680.2±41.5
10.0     0.0962    -51109.5±408.2     -51109.5±408.3  0.0963    53.32                   364.5±26.4
1.0      0.0875    -50631.2±163.4     -50631.0±163.3  0.0875    54.76                   462.2±20.0
0.1      0.0863    -50646.9±269.0     -50645.9±267.5  0.0869    28.84                   520.9±11.7
0.01     0.0911    -51285.0±708.1     -51284.8±708.1  0.0963    5.64                    557.0±50.5
0.001    0.1040    -51695.1±322.4     -51694.8±322.7  0.0974    0.00                    537.5±46.2
Table 1: Evaluation of deep Gaussian VAEs (averaged over 5 trials) on real-valued MNIST and CelebA 64. Rows without a final σ² were trained with σ² fixed at its initial value; the remaining rows learn σ² during training. We report the ELBO on the training set in all cases. Collapse percent gives the percentage of latent dimensions which are within 0.01 KL of the prior for at least 99% of the encoder inputs.
Figure 5: Posterior collapse percentage as a function of ϵ-threshold for a deep VAE trained on MNIST. We measure posterior collapse for trained networks as the proportion of latent dimensions that are within ϵ KL divergence of the prior for at least a 1-δ proportion of the training data points (δ=0.01 in the plots).
Figure 6: Posterior collapse percentage as a function of ϵ-threshold for a deep VAE trained on MNIST. We measure posterior collapse for trained networks as the proportion of latent dimensions that are within ϵ KL divergence of the prior for at least a 1-δ proportion of the training data points (δ=0.01 in the plots).

Despite the large volume of work studying posterior collapse, it has not been measured, or even defined, in a consistent way. In Figure 5 and Figure 6 we measure posterior collapse for trained networks as described above (with $\delta=0.01$). By considering a range of $\epsilon$ values we found this measure to be (moderately) robust to stochasticity in data preprocessing. We observed that for large initial values of $\sigma^2$ the variational distribution matches the prior closely. This was true even when $\sigma^2$ is learned, suggesting that local optima may contribute to posterior collapse in deep VAEs.
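The collapse measure used in the tables and figures can be sketched in a few lines. The function name `collapse_percentage`, the array layout (one row per data point, one column per latent dimension) and the toy KL values are assumptions made for illustration; only the thresholds (0.01 KL, 99% of inputs) come from the text.

```python
import numpy as np

def collapse_percentage(kl_per_dim, eps=0.01, delta=0.01):
    """Percentage of latent dimensions that are (eps, delta)-collapsed.

    kl_per_dim has shape (num_points, num_latents) and holds
    KL(q(z_j | x_i) || p(z_j)) for data point i and latent dimension j.
    """
    kl_per_dim = np.asarray(kl_per_dim)
    # Fraction of data points for which each dimension is within eps KL of the prior.
    frac_close = (kl_per_dim < eps).mean(axis=0)
    # A dimension counts as collapsed if that fraction is at least 1 - delta.
    collapsed = frac_close >= 1.0 - delta
    return 100.0 * collapsed.mean()

# Toy KL values (hypothetical): two dimensions carry almost no information.
rng = np.random.default_rng(0)
kl = np.column_stack([
    rng.uniform(0.0, 0.001, size=1000),  # always below the threshold
    rng.uniform(0.5, 2.0, size=1000),    # always above the threshold
    np.zeros(1000),                      # exactly matches the prior
    rng.uniform(0.0, 1.0, size=1000),    # below the threshold only rarely
])
print(collapse_percentage(kl))
```

Sweeping `eps` over a range of values gives the kind of curve plotted in Figures 5 and 6.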

CelebA VAEs

We trained deep convolutional VAEs with 500 hidden dimensions on images from the CelebA dataset (resized to 64x64). We trained the CelebA VAEs with different fixed values of $\sigma^2$ and compared the ELBO before and after tuning $\sigma^2$ to the learned representations (Table 1). Further, we explored training the CelebA VAE while learning $\sigma^2$ over varied initializations of the observation noise. The VAE is sensitive to the initialization of the observation noise even when $\sigma^2$ is learned, in particular in terms of the number of collapsed dimensions.
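One plausible reading of "tuning $\sigma^2$ to the learned representations" is to hold the encoder and decoder fixed and plug in the $\sigma^2$ that maximizes the Gaussian ELBO, which is the mean expected squared reconstruction error divided by the data dimension. That reading, the names `gaussian_elbo` and `tune_sigma2`, and the toy inputs below are assumptions, not the authors' code.

```python
import numpy as np

def gaussian_elbo(sigma2, sq_err, kl, d):
    """Mean ELBO for p(x|z) = N(decoder(z), sigma2 * I).

    sq_err[i] is E_q ||x_i - decoder(z)||^2 and kl[i] is KL(q(z|x_i) || p(z)).
    """
    recon = -0.5 * sq_err / sigma2 - 0.5 * d * np.log(2.0 * np.pi * sigma2)
    return float(np.mean(recon - kl))

def tune_sigma2(sq_err, d):
    """Maximizer of gaussian_elbo over sigma2 with encoder and decoder held fixed."""
    return float(np.mean(sq_err) / d)

# Hypothetical per-example statistics from a trained model.
rng = np.random.default_rng(0)
d = 784
sq_err = rng.gamma(shape=5.0, scale=10.0, size=100)
kl = rng.uniform(5.0, 10.0, size=100)

best = tune_sigma2(sq_err, d)
print(best, gaussian_elbo(best, sq_err, kl, d), gaussian_elbo(1.0, sq_err, kl, d))
```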

7 Discussion

By analyzing the correspondence between linear VAEs and pPCA, this paper makes significant progress towards understanding the causes of posterior collapse. We show that for simple linear VAEs posterior collapse is caused by ill-conditioning of the stationary points in the log marginal likelihood objective. We demonstrate empirically that the same optimization issues play a role in deep non-linear VAEs. Finally, we find that linear VAEs are useful theoretical test-cases for evaluating existing hypotheses on VAEs and we encourage researchers to consider studying their hypotheses in the linear VAE setting.

8 Acknowledgements

This work was guided by many conversations with and feedback from our colleagues. In particular, we thank Durk Kingma, Alex Alemi, and Guodong Zhang for invaluable feedback on early versions of this work.

References

  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. Cited by: Appendix E.
  • A. A. Alemi, B. Poole, I. Fischer, J. V. Dillon, R. A. Saurous, and K. Murphy (2017) Fixing a broken ELBO. arXiv preprint arXiv:1711.00464. Cited by: Appendix D, §3, §3, §5.
  • J. Atchison and S. M. Shen (1980) Logistic-normal distributions: some properties and uses. Biometrika 67 (2), pp. 261–272. Cited by: §C.3.
  • P. Baldi and K. Hornik (1989) Neural networks and principal component analysis: learning from examples without local minima. Neural networks 2 (1), pp. 53–58. Cited by: §1.
  • D. J. Bartholomew (1987) Latent variable models and factor analysis. Oxford University Press, Inc. Cited by: §2.
  • D. M. Blei, A. Kucukelbir, and J. D. McAuliffe (2017) Variational inference: a review for statisticians. Journal of the American Statistical Association. Cited by: §1, §2.
  • S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio (2015) Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349. Cited by: §C.3, §1, §1, §3, §5.
  • E. J. Candès, X. Li, Y. Ma, and J. Wright (2011) Robust principal component analysis?. Journal of the ACM (JACM) 58 (3), pp. 11. Cited by: §3.
  • G. Chechik, A. Globerson, N. Tishby, and Y. Weiss (2005) Information bottleneck for gaussian variables. Journal of machine learning research 6 (Jan), pp. 165–188. Cited by: §3.
  • R. T. Q. Chen, X. Li, R. Grosse, and D. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. Advances in Neural Information Processing Systems. Cited by: §1.
  • X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel (2016) Variational lossy autoencoder. arXiv preprint arXiv:1611.02731. Cited by: §3.
  • C. Cremer, X. Li, and D. Duvenaud (2018) Inference suboptimality in variational autoencoders. arXiv preprint arXiv:1801.03558. Cited by: §3.
  • B. Dai, Y. Wang, J. Aston, G. Hua, and D. Wipf (2017) Hidden talents of the variational autoencoder. arXiv preprint arXiv:1706.05148. Cited by: Appendix D, §3.
  • B. Dai and D. Wipf (2019) Diagnosing and enhancing VAE models. In International Conference on Learning Representations, Cited by: Appendix D, §3, §6.2.
  • A. B. Dieng, Y. Kim, A. M. Rush, and D. M. Blei (2018) Avoiding latent variable collapse with generative skip models. arXiv preprint arXiv:1807.04863. Cited by: §3.
  • R. Gomez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernandez-Lobato, B. n. Sanchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik (2018) Automatic chemical design using a data-driven continuous representation of molecules. American Chemical Society Central Science. Cited by: §1.
  • J. He, D. Spokoyny, G. Neubig, and T. Berg-Kirkpatrick (2019) Lagging inference networks and posterior collapse in variational autoencoders. In International Conference on Learning Representations, Cited by: §3.
  • I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2016) Beta-VAE: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, Cited by: Appendix E, §1, §3, §5.
  • D. Hjelm, R. R. Salakhutdinov, K. Cho, N. Jojic, V. Calhoun, and J. Chung (2016) Iterative refinement of the approximate posterior for directed belief networks. In Advances in Neural Information Processing Systems, Cited by: §3.
  • C. Huang, S. Tan, A. Lacoste, and A. C. Courville (2018) Improving explorability in variational inference with annealed variational objectives. In Advances in Neural Information Processing Systems, Cited by: §1, §3, §5.
  • M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul (1999) An introduction to variational methods for graphical models. Machine learning. Cited by: §1, §2.
  • Y. Kim, S. Wiseman, A. C. Miller, D. Sontag, and A. M. Rush (2018) Semi-amortized variational autoencoders. arXiv preprint arXiv:1802.02550. Cited by: §3.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix E.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §6.1, footnote 2.
  • D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pp. 4743–4751. Cited by: §1, §3.
  • D. Kunin, J. M. Bloom, A. Goeva, and C. Seed (2019) Loss landscapes of regularized linear autoencoders. arXiv preprint arXiv:1901.08168. Cited by: Appendix B, §1, §3, §4.2.
  • Y. LeCun (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: §6.2.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), Cited by: Appendix E, §6.2.
  • X. Ma, C. Zhou, and E. Hovy (2019) MAE: mutual posterior-divergence regularization for variational autoencoders. In International Conference on Learning Representations, Cited by: §3.
  • L. Maaløe, M. Fraccaro, V. Liévin, and O. Winther (2019) BIVA: a very deep hierarchy of latent variables for generative modeling. arXiv preprint arXiv:1902.02102. Cited by: §3, §3.
  • G. Papamakarios, T. Pavlakou, and I. Murray (2017) Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, Cited by: Appendix E, Appendix E, Table 2, Table 3, Figure 4, §6.2.
  • K. B. Petersen et al. The matrix cookbook. Cited by: §C.2.
  • A. Razavi, A. van den Oord, B. Poole, and O. Vinyals (2019) Preventing posterior collapse with delta-VAEs. In International Conference on Learning Representations, Cited by: §1, §3, §5.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082. Cited by: §1.
  • D. J. Rezende and F. Viola (2018) Taming VAEs. arXiv preprint arXiv:1810.00597. Cited by: §3.
  • M. Rolinek, D. Zietlow, and G. Martius (2018) Variational autoencoders pursue PCA directions (by accident). arXiv preprint arXiv:1812.06775. Cited by: §3.
  • D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1985) Learning internal representations by error propagation. Technical report California Univ San Diego La Jolla Inst for Cognitive Science. Cited by: §1.
  • C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther (2016) Ladder variational autoencoders. In Advances in neural information processing systems, pp. 3738–3746. Cited by: §1, §3, §5.
  • M. E. Tipping and C. M. Bishop (1999) Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61 (3), pp. 611–622. Cited by: §A.1, Appendix A, §C.2, §1, §1, §2, §4.1, §4.1.
  • J. M. Tomczak and M. Welling (2017) VAE with a VampPrior. arXiv preprint arXiv:1705.07120. Cited by: Appendix D.
  • S. Yeung, A. Kannan, Y. Dauphin, and L. Fei-Fei (2017) Tackling over-pruning in variational autoencoders. arXiv preprint arXiv:1706.03643. Cited by: §3.

Appendix A Stationary points of pPCA

Here we briefly summarize the analysis of Tipping and Bishop [1999] with some simple additional observations. We recommend that interested readers study Appendix A of Tipping and Bishop [1999] for the full details. We begin by formulating the conditions for stationary points of $\sum_{\mathbf{x}_i}\log p(\mathbf{x}_i)$:

$$\mathbf{S}\mathbf{C}^{-1}\mathbf{W}=\mathbf{W} \tag{11}$$

where $\mathbf{S}$ denotes the sample covariance matrix (assuming we set $\boldsymbol{\mu}=\boldsymbol{\mu}_{MLE}$, which we do throughout), and $\mathbf{C}=\mathbf{W}\mathbf{W}^T+\sigma^2\mathbf{I}$ (note that its dimensionality differs from that of $\mathbf{M}$). There are three possible solutions to this equation: (1) $\mathbf{W}=\mathbf{0}$, (2) $\mathbf{C}=\mathbf{S}$, or (3) the more general solutions. Cases (1) and (2) are not particularly interesting to us, so we focus herein on (3).

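We can write $\mathbf{W}=\mathbf{U}\mathbf{L}\mathbf{V}^T$ using its singular value decomposition. Substituting back into the stationary point equation, we recover the following:

$$\mathbf{S}\mathbf{U}\mathbf{L}=\mathbf{U}(\sigma^2\mathbf{I}+\mathbf{L}^2)\mathbf{L} \tag{12}$$

Noting that $\mathbf{L}$ is diagonal, if the $j$th singular value ($l_j$) is non-zero, this gives $\mathbf{S}\mathbf{u}_j=(\sigma^2+l_j^2)\mathbf{u}_j$, where $\mathbf{u}_j$ is the $j$th column of $\mathbf{U}$. Thus, $\mathbf{u}_j$ is an eigenvector of $\mathbf{S}$ with eigenvalue $\lambda_j=\sigma^2+l_j^2$. For $l_j=0$, $\mathbf{u}_j$ is arbitrary.

Thus, all potential solutions can be written as $\mathbf{W}=\mathbf{U}_q(\mathbf{K}_q-\sigma^2\mathbf{I})^{1/2}\mathbf{R}$, with the diagonal entries of $\mathbf{K}_q$ written as $k_j=\sigma^2$ or $k_j=\sigma^2+l_j^2$ and with $\mathbf{R}$ representing an arbitrary orthogonal matrix.

From this formulation, one can show that the global optimum is attained with $\sigma^2=\sigma^2_{MLE}$ and $\mathbf{U}_q$ and $\mathbf{K}_q$ chosen to match the leading singular vectors and values of $\mathbf{S}$.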
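The closed-form optimum can be checked numerically against the stationarity condition in Eq. (11). The sketch below uses synthetic Gaussian data, takes $\mathbf{R}=\mathbf{I}$, and sets $\sigma^2_{MLE}$ to the average of the discarded eigenvalues of $\mathbf{S}$; the variable names and the data are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, n = 6, 2, 500
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # synthetic correlated data
Xc = X - X.mean(axis=0)                                # center with mu = mu_MLE
S = Xc.T @ Xc / n                                      # sample covariance

eigvals, eigvecs = np.linalg.eigh(S)                   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]   # now descending

sigma2_mle = eigvals[q:].mean()                        # average of the discarded eigenvalues
# W = U_q (K_q - sigma^2 I)^{1/2} R with R = I and K_q holding the top-q eigenvalues.
W_mle = eigvecs[:, :q] @ np.diag(np.sqrt(eigvals[:q] - sigma2_mle))

C = W_mle @ W_mle.T + sigma2_mle * np.eye(d)
residual = S @ np.linalg.solve(C, W_mle) - W_mle       # S C^{-1} W - W from Eq. (11)
print(np.abs(residual).max())
```

The residual also vanishes if `eigvecs[:, :q]` is replaced by any other $q$ eigenvectors whose eigenvalues exceed $\sigma^2$, which is why Eq. (11) alone does not single out the principal components.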

A.1 Stability of stationary point solutions

Consider stationary points of the form $\mathbf{W}=\mathbf{U}_q(\mathbf{K}_q-\sigma^2\mathbf{I})^{1/2}$, where $\mathbf{U}_q$ contains arbitrary eigenvectors of $\mathbf{S}$. The original pPCA paper shows that all solutions except the leading principal components correspond to saddle points in the optimization landscape. However, this analysis depends critically on $\sigma^2$ being set to the true maximum likelihood estimate. Here we repeat their analysis, considering other (fixed) values of $\sigma^2$.

We consider a small perturbation to a column of $\mathbf{W}$, of the form $\epsilon\mathbf{u}_j$. To analyze the stability of the perturbed solution, we check the sign of the dot-product of the perturbation with the likelihood gradient at $\mathbf{w}_i+\epsilon\mathbf{u}_j$. Ignoring terms in $\epsilon^2$, we can write the dot-product as

$$\epsilon N(\lambda_j/k_i-1)\,\mathbf{u}_j^T\mathbf{C}^{-1}\mathbf{u}_j \tag{13}$$

Now, $\mathbf{C}^{-1}$ is positive definite and so the sign depends only on $\lambda_j/k_i-1$. The stationary point is stable (a local maximum) only if the sign is negative. If $k_i=\lambda_i$ then the maximum is stable only when $\lambda_i>\lambda_j$; in words, the top $q$ principal components are stable. However, we must also consider the case $k_i=\sigma^2$. Tipping and Bishop [1999] show that if $\sigma^2=\sigma^2_{MLE}$, then this also corresponds to a saddle point, as $\sigma^2$ is the average of the smallest eigenvalues, meaning some perturbation will be unstable (except in a special case which is handled separately).

However, what happens if $\sigma^2$ is not set to the maximum likelihood estimate? In this case, it is possible that there are no unstable perturbation directions (that is, $\lambda_j<\sigma^2$ for too many $j$). Then, when $\sigma^2$ is fixed, there are local optima where $\mathbf{W}$ has zero columns: the same solutions that we observe in non-linear VAEs corresponding to posterior collapse. Note that when $\sigma^2$ is learned, in non-degenerate cases the local maxima presented above become saddle points where $\sigma^2$ is made smaller by its gradient. In practice, we find that even when $\sigma^2$ is learned, local maxima exist in the non-linear case.
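The sign test from Eq. (13) is easy to apply to a concrete spectrum. In this sketch the eigenvalues of $\mathbf{S}$ and the two values of $\sigma^2$ are hypothetical, and the helper names are invented for illustration; the logic is only the rule stated above, that a zero column ($k_i=\sigma^2$) is stable when every unused direction has $\lambda_j<\sigma^2$.

```python
import numpy as np

def perturbation_sign(lam_j, k_i):
    """Sign of Eq. (13) for epsilon > 0; positive means the perturbation increases the likelihood."""
    return np.sign(lam_j / k_i - 1.0)

def zero_column_is_stable(unused_eigvals, sigma2):
    """A zero column has k_i = sigma2; it is stable if no unused direction has a positive sign."""
    return all(perturbation_sign(lam, sigma2) < 0 for lam in unused_eigvals)

# Hypothetical spectrum of S. The model uses the top two directions and has one zero column.
eigvals = np.array([5.0, 2.0, 0.5, 0.1])
unused = eigvals[2:]

print(zero_column_is_stable(unused, sigma2=1.0))  # sigma2 exceeds every unused eigenvalue
print(zero_column_is_stable(unused, sigma2=0.3))  # the direction with eigenvalue 0.5 is unstable
```

With the larger fixed $\sigma^2$ the collapsed column sits at a local maximum; with the smaller one gradient ascent can escape along the direction whose eigenvalue exceeds $\sigma^2$.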

Appendix B Identifiability of the linear VAE

Linear autoencoders suffer from a lack of identifiability, which causes the decoder columns to span the principal component subspace instead of recovering the individual principal components. Kunin et al. [2019] showed that adding regularization to the linear autoencoder improves the identifiability, forcing the columns to be identified up to an arbitrary orthogonal transformation, as in pPCA. Here we show that linear VAEs are able to fully identify the principal components.

We once again consider the linear VAE from Eq. (9):

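$$p(\mathbf{x}\mid\mathbf{z})=\mathcal{N}(\mathbf{W}\mathbf{z}+\boldsymbol{\mu},\sigma^2\mathbf{I}),\qquad q(\mathbf{z}\mid\mathbf{x})=\mathcal{N}(\mathbf{V}(\mathbf{x}-\boldsymbol{\mu}),\mathbf{D}).$$

The output of the VAE, $\tilde{\mathbf{x}}$, is distributed as

$$\tilde{\mathbf{x}}\mid\mathbf{x}\sim\mathcal{N}(\mathbf{W}\mathbf{V}(\mathbf{x}-\boldsymbol{\mu})+\boldsymbol{\mu},\,\mathbf{W}\mathbf{D}\mathbf{W}^T).$$

Therefore, the output of the linear VAE is invariant to the following transformation:

$$\mathbf{W}\leftarrow\mathbf{W}\mathbf{A},\qquad \mathbf{V}\leftarrow\mathbf{A}^{-1}\mathbf{V},\qquad \mathbf{D}\leftarrow\mathbf{A}^{-1}\mathbf{D}\mathbf{A}^{-1}, \tag{14}$$

where $\mathbf{A}$ is a diagonal matrix with non-zero entries so that $\mathbf{D}$ is well-defined. However, this transformation changes the variational distribution, which affects the loss through the KL term. As argued in Corollary 1, this means that the global optimum is unique for the ELBO up to ordering of the eigenvalues/eigenvectors.

At the global optimum, the ordering can be recovered by computing the squared Euclidean norm of the columns of $\mathbf{W}$ (which correspond to the singular values) and ordering according to these quantities. In other words, $\mathbf{R}$ is a permutation matrix which can be computed exactly.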
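The rescaling argument can be illustrated numerically: the transformation in Eq. (14) leaves the output mean and covariance unchanged while changing the KL to the prior. The random parameters, the particular diagonal $\mathbf{A}$ and the helper `kl_to_prior` (which evaluates the standard Gaussian KL to a $\mathcal{N}(0,\mathbf{I})$ prior) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 5, 3
W = rng.normal(size=(d, q))
V = rng.normal(size=(q, d))
D = np.diag(rng.uniform(0.1, 1.0, size=q))      # diagonal posterior covariance
A = np.diag([2.0, -0.5, 3.0])                   # diagonal with non-zero entries
A_inv = np.linalg.inv(A)

# Transformed parameters from Eq. (14).
W2, V2, D2 = W @ A, A_inv @ V, A_inv @ D @ A_inv

def kl_to_prior(V, D, x_centered):
    """KL(N(V x_c, D) || N(0, I)) for diagonal D."""
    m = V @ x_centered
    return 0.5 * (-np.log(np.diag(D)).sum() + m @ m + np.trace(D) - len(m))

x_centered = rng.normal(size=d)                 # stands in for x - mu
same_mean = np.allclose(W @ V, W2 @ V2)
same_cov = np.allclose(W @ D @ W.T, W2 @ D2 @ W2.T)
kl_before = kl_to_prior(V, D, x_centered)
kl_after = kl_to_prior(V2, D2, x_centered)
print(same_mean, same_cov, kl_before, kl_after)
```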

Appendix C Stationary points of ELBO

Here we present details on the analysis of the stationary points of the ELBO objective. To begin, we first derive closed-form solutions to the components of the log marginal likelihood (including the ELBO). The VAE we focus on is the one presented in Eq. (9), with a linear encoder, linear decoder, Gaussian prior, and Gaussian observation model.

C.1 Analytic ELBO of the Linear VAE

Remember that one can express the log marginal likelihood as:

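$$\log p(\mathbf{x})=\underbrace{KL(q(\mathbf{z}|\mathbf{x})\,||\,p(\mathbf{z}|\mathbf{x}))}_{(A)}-\underbrace{KL(q(\mathbf{z}|\mathbf{x})\,||\,p(\mathbf{z}))}_{(B)}+\underbrace{\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})]}_{(C)}. \tag{15}$$

Each of the terms (A)-(C) can be expressed in closed form for the linear VAE. Note that the KL term (A) is minimized when the variational distribution is exactly the true posterior distribution. This is possible when the columns of the decoder are orthogonal.

The term (B) can be expressed as

$$KL(q(\mathbf{z}|\mathbf{x})\,||\,p(\mathbf{z}))=0.5\left(-\log\det\mathbf{D}+(\mathbf{x}-\boldsymbol{\mu})^T\mathbf{V}^T\mathbf{V}(\mathbf{x}-\boldsymbol{\mu})+\mathrm{tr}(\mathbf{D})-q\right). \tag{16}$$

The term (C) can be expressed as

$$\begin{aligned}
\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})] &= \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[-\frac{(\mathbf{W}\mathbf{z}-(\mathbf{x}-\boldsymbol{\mu}))^T(\mathbf{W}\mathbf{z}-(\mathbf{x}-\boldsymbol{\mu}))}{2\sigma^2}-\frac{d}{2}\log 2\pi\sigma^2\right] && (17)\\
&= \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\frac{-(\mathbf{W}\mathbf{z})^T(\mathbf{W}\mathbf{z})+2(\mathbf{x}-\boldsymbol{\mu})^T\mathbf{W}\mathbf{z}-(\mathbf{x}-\boldsymbol{\mu})^T(\mathbf{x}-\boldsymbol{\mu})}{2\sigma^2}-\frac{d}{2}\log 2\pi\sigma^2\right]. && (18)
\end{aligned}$$

Noting that $\mathbf{W}\mathbf{z}\sim\mathcal{N}(\mathbf{W}\mathbf{V}(\mathbf{x}-\boldsymbol{\mu}),\mathbf{W}\mathbf{D}\mathbf{W}^T)$, we can compute the expectation analytically and obtain

$$\begin{aligned}
\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})] = \frac{1}{2\sigma^2}\Big[&-\mathrm{tr}(\mathbf{W}\mathbf{D}\mathbf{W}^T)-(\mathbf{x}-\boldsymbol{\mu})^T\mathbf{V}^T\mathbf{W}^T\mathbf{W}\mathbf{V}(\mathbf{x}-\boldsymbol{\mu}) && (19)\\
&+2(\mathbf{x}-\boldsymbol{\mu})^T\mathbf{W}\mathbf{V}(\mathbf{x}-\boldsymbol{\mu})-(\mathbf{x}-\boldsymbol{\mu})^T(\mathbf{x}-\boldsymbol{\mu})\Big]-\frac{d}{2}\log 2\pi\sigma^2. && (20)
\end{aligned}$$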
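As a sanity check on the closed forms, the sketch below evaluates the analytic ELBO, (C) minus (B), for a single data point and compares it with the exact pPCA log marginal likelihood $\log\mathcal{N}(\mathbf{x};\boldsymbol{\mu},\mathbf{W}\mathbf{W}^T+\sigma^2\mathbf{I})$. The encoder is set to $\mathbf{V}=(\mathbf{W}^T\mathbf{W}+\sigma^2\mathbf{I})^{-1}\mathbf{W}^T$ and $\mathbf{D}=\sigma^2(\mathrm{diag}(\mathbf{W}^T\mathbf{W})+\sigma^2\mathbf{I})^{-1}$; the function names and random inputs are assumptions made for illustration.

```python
import numpy as np

def elbo_linear_vae(x, mu, W, V, D, sigma2):
    """Analytic ELBO = (C) - (B) for one data point, with diagonal D."""
    d, q = W.shape
    xc = x - mu
    m = V @ xc                                                  # encoder mean
    kl = 0.5 * (-np.log(np.diag(D)).sum() + m @ m + np.trace(D) - q)
    recon = (-np.trace(W @ D @ W.T) - m @ W.T @ W @ m
             + 2.0 * xc @ W @ m - xc @ xc) / (2.0 * sigma2) \
        - 0.5 * d * np.log(2.0 * np.pi * sigma2)
    return recon - kl

def log_marginal(x, mu, W, sigma2):
    """Exact pPCA log marginal likelihood log N(x; mu, W W^T + sigma2 I)."""
    d = W.shape[0]
    C = W @ W.T + sigma2 * np.eye(d)
    xc = x - mu
    return -0.5 * (d * np.log(2.0 * np.pi) + np.linalg.slogdet(C)[1]
                   + xc @ np.linalg.solve(C, xc))

def optimal_encoder(W, sigma2):
    q = W.shape[1]
    G = W.T @ W
    V = np.linalg.solve(G + sigma2 * np.eye(q), W.T)
    D = sigma2 * np.linalg.inv(np.diag(np.diag(G)) + sigma2 * np.eye(q))
    return V, D

rng = np.random.default_rng(0)
d, q, sigma2 = 6, 3, 0.5
mu, x = rng.normal(size=d), rng.normal(size=d)
Q, _ = np.linalg.qr(rng.normal(size=(d, q)))
W_orth = Q @ np.diag([2.0, 1.0, 0.5])          # orthogonal columns
W_gen = rng.normal(size=(d, q))                # generic, non-orthogonal columns

for W in (W_orth, W_gen):
    V, D = optimal_encoder(W, sigma2)
    print(elbo_linear_vae(x, mu, W, V, D, sigma2), log_marginal(x, mu, W, sigma2))
```

For the decoder with orthogonal columns the two numbers agree, while for the generic decoder the ELBO falls strictly below the log marginal likelihood because a diagonal $\mathbf{D}$ cannot match the full posterior covariance.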

C.2 Finding stationary points

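To compute the stationary points we must take derivatives with respect to $\boldsymbol{\mu},\mathbf{D},\mathbf{W},\mathbf{V},\sigma^2$. As before, we have $\boldsymbol{\mu}=\boldsymbol{\mu}_{MLE}$ at the global maximum and for simplicity we fix $\boldsymbol{\mu}$ here for the remainder of the analysis.

Taking the marginal likelihood over the whole dataset, at the stationary points we have

$$\begin{aligned}
\frac{\partial}{\partial\mathbf{D}}(-(B)+(C)) &= \frac{N}{2}\left(\mathbf{D}^{-1}-\mathbf{I}-\frac{1}{\sigma^2}\mathrm{diag}(\mathbf{W}^T\mathbf{W})\right)=0 && (21)\\
\frac{\partial}{\partial\mathbf{V}}(-(B)+(C)) &= \frac{N}{\sigma^2}\left(\mathbf{W}^T-(\mathbf{W}^T\mathbf{W}+\sigma^2\mathbf{I})\mathbf{V}\right)\mathbf{S}=0 && (22)\\
\frac{\partial}{\partial\mathbf{W}}(-(B)+(C)) &= \frac{N}{\sigma^2}\left(\mathbf{S}\mathbf{V}^T-\mathbf{W}\mathbf{D}-\mathbf{W}\mathbf{V}\mathbf{S}\mathbf{V}^T\right)=0 && (23)
\end{aligned}$$

The above are computed using standard matrix derivative identities [Petersen et al.]. These equations yield the expected solution for the variational distribution directly. From Eqs. (21) and (22) we compute $\mathbf{D}^*=\sigma^2(\mathrm{diag}(\mathbf{W}^T\mathbf{W})+\sigma^2\mathbf{I})^{-1}$ and $\mathbf{V}^*=\mathbf{M}^{-1}\mathbf{W}^T$, with $\mathbf{M}=\mathbf{W}^T\mathbf{W}+\sigma^2\mathbf{I}$, recovering the true posterior mean in all cases and getting the correct posterior covariance when the columns of $\mathbf{W}$ are orthogonal. We will now proceed with the proof of Theorem 1.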
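The gradient expressions can be checked against central finite differences of the dataset-level objective $-(B)+(C)$, written in terms of $\mathbf{S}$. In this sketch the $\mathbf{D}$-dependent term of Eq. (23) is written as $\mathbf{W}\mathbf{D}$ so that the shapes agree with $\mathbf{W}\in\mathbb{R}^{d\times q}$; the synthetic data, step size and helper names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, N, sigma2 = 4, 2, 10, 0.7
X = rng.normal(size=(N, d))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / N
W = rng.normal(size=(d, q))
V = rng.normal(size=(q, d))
D = np.diag(rng.uniform(0.2, 1.0, size=q))

def dataset_elbo(W, V, D):
    """Sum over the dataset of -(B) + (C), expressed through the sample covariance S."""
    neg_kl = 0.5 * (np.log(np.diag(D)).sum() - np.trace(V @ S @ V.T) - np.trace(D) + q)
    recon = (-np.trace(W @ D @ W.T) - np.trace(W @ V @ S @ V.T @ W.T)
             + 2.0 * np.trace(W @ V @ S) - np.trace(S)) / (2.0 * sigma2) \
        - 0.5 * d * np.log(2.0 * np.pi * sigma2)
    return N * (neg_kl + recon)

def numeric_grad(f, P, h=1e-5):
    """Central finite differences with respect to each entry of P."""
    G = np.zeros_like(P)
    for idx in np.ndindex(*P.shape):
        E = np.zeros_like(P)
        E[idx] = h
        G[idx] = (f(P + E) - f(P - E)) / (2.0 * h)
    return G

grad_D = N / 2.0 * (np.linalg.inv(D) - np.eye(q) - np.diag(np.diag(W.T @ W)) / sigma2)  # Eq. (21)
grad_V = N / sigma2 * (W.T - (W.T @ W + sigma2 * np.eye(q)) @ V) @ S                    # Eq. (22)
grad_W = N / sigma2 * (S @ V.T - W @ D - W @ V @ S @ V.T)                               # Eq. (23)

num_W = numeric_grad(lambda P: dataset_elbo(P, V, D), W)
num_V = numeric_grad(lambda P: dataset_elbo(W, P, D), V)
# D is constrained to be diagonal, so only its diagonal entries are compared.
num_D = np.diag(numeric_grad(lambda P: dataset_elbo(W, V, P), D))
print(np.abs(num_W - grad_W).max(), np.abs(num_V - grad_V).max(),
      np.abs(num_D - np.diag(grad_D)).max())
```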

See Theorem 1.

Proof.

If the columns of $\mathbf{W}$ are orthogonal then the log marginal likelihood is recovered exactly at all stationary points. This is a direct consequence of the posterior mean and covariance being recovered exactly at all stationary points, so that (A) is zero.

We must give separate treatment to the case where there is a stationary point without orthogonal columns of $\mathbf{W}$. Suppose we have such a stationary point. Using the singular value decomposition we can write $\mathbf{W}=\mathbf{U}\mathbf{L}\mathbf{R}^T$, where $\mathbf{U}$ and $\mathbf{R}$ are orthogonal matrices. Note that $\log p(\mathbf{x})$ is invariant to the choice of $\mathbf{R}$ [Tipping and Bishop, 1999]. However, the choice of $\mathbf{R}$ does affect the first term (A) of Eq. (15): this term is minimized when $\mathbf{R}=\mathbf{I}$, and thus the ELBO must increase.

To formalize this argument, we compute (A) at a stationary point. From above, at every stationary point the mean of the variational distribution exactly matches the true posterior mean. Thus the KL simplifies to:

$$\begin{aligned}
KL(q(\mathbf{z}|\mathbf{x})\,||\,p(\mathbf{z}|\mathbf{x})) &= \frac{1}{2}\left(\mathrm{tr}\left(\frac{1}{\sigma^2}\mathbf{M}\mathbf{D}\right)-q+q\log\sigma^2-\log(\det\mathbf{M}\det\mathbf{D})\right), && (24)\\
&= \frac{1}{2}\left(\mathrm{tr}(\mathbf{M}\tilde{\mathbf{M}}^{-1})-q-\log\frac{\det\mathbf{M}}{\det\tilde{\mathbf{M}}}\right), && (25)\\
&= \frac{1}{2}\left(\sum_{i=1}^{q}\frac{\mathbf{M}_{ii}}{\tilde{\mathbf{M}}_{ii}}-q-\log\det\mathbf{M}+\log\det\tilde{\mathbf{M}}\right), && (26)\\
&= \frac{1}{2}\left(\log\det\tilde{\mathbf{M}}-\log\det\mathbf{M}\right), && (27)
\end{aligned}$$

where $\tilde{\mathbf{M}}=\mathrm{diag}(\mathbf{W}^T\mathbf{W})+\sigma^2\mathbf{I}$. Now consider applying a small rotation to $\mathbf{W}$: $\mathbf{W}\leftarrow\mathbf{W}\mathbf{R}_\epsilon$. As the optimal $\mathbf{D}$ and $\mathbf{V}$ are continuous functions of $\mathbf{W}$, this corresponds to a small perturbation of these parameters too for a sufficiently small rotation. Importantly, $\log\det\mathbf{M}$ remains fixed for any orthogonal choice of $\mathbf{R}_\epsilon$ but $\log\det\tilde{\mathbf{M}}$ does not. Thus, we choose $\mathbf{R}_\epsilon$ to minimize this term. In this manner, (A) shrinks, meaning that the ELBO, $-(B)+(C)$, must increase. Thus if the stationary point existed, it must have been a saddle point.

We now describe how to construct such a small rotation matrix. First note that without loss of generality we can assume that $\det(\mathbf{R})=1$. (Otherwise, we can flip the sign of a column of $\mathbf{R}$ and the corresponding column of $\mathbf{U}$.) Additionally, we have $\mathbf{W}\mathbf{R}=\mathbf{U}\mathbf{L}$, which has orthogonal columns.

The special orthogonal group of determinant-1 orthogonal matrices is a compact, connected Lie group and therefore the exponential map from its Lie algebra is surjective. This means that we can find an upper-triangular matrix $\mathbf{B}$ such that $\mathbf{R}=\exp\{\mathbf{B}-\mathbf{B}^T\}$. Consider $\mathbf{R}_\epsilon=\exp\{\frac{1}{n(\epsilon)}(\mathbf{B}-\mathbf{B}^T)\}$, where $n(\epsilon)$ is an integer chosen to ensure that the elements of $\frac{1}{n(\epsilon)}(\mathbf{B}-\mathbf{B}^T)$ are within $\epsilon>0$ of zero. This matrix is a rotation in the direction of $\mathbf{R}$ which we can make arbitrarily close to the identity by a suitable choice of $\epsilon$. This is verified through the Taylor series expansion $\mathbf{R}_\epsilon=\mathbf{I}+\frac{1}{n(\epsilon)}(\mathbf{B}-\mathbf{B}^T)+O(\epsilon^2)$. Thus, we have identified a small perturbation to $\mathbf{W}$ (and $\mathbf{D}$ and $\mathbf{V}$) which decreases the posterior KL (A) but keeps the log marginal likelihood constant. Thus, the ELBO increases and the stationary point must be a saddle point. ∎
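The rotation argument can be illustrated for $q=2$, where every rotation is $\exp\{\theta\mathbf{J}\}$ with $\mathbf{J}$ the $2\times 2$ skew-symmetric generator, so $\mathbf{R}_\epsilon$ is simply a rotation by $\theta/n$. The sketch evaluates the posterior KL gap from Eq. (27) before and after a small step towards $\mathbf{R}$; the angle, singular values, $n=10$ and helper names are assumptions chosen for illustration.

```python
import numpy as np

def posterior_kl_gap(W, sigma2):
    """Eq. (27): 0.5 * (log det M_tilde - log det M) at the optimal V and D."""
    q = W.shape[1]
    G = W.T @ W
    M = G + sigma2 * np.eye(q)
    M_tilde = np.diag(np.diag(G)) + sigma2 * np.eye(q)
    return 0.5 * (np.linalg.slogdet(M_tilde)[1] - np.linalg.slogdet(M)[1])

def rotation(theta):
    """2x2 rotation, equal to exp(theta * [[0, -1], [1, 0]])."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

sigma2, theta, n = 0.5, 0.6, 10
U, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(5, 2)))
L = np.diag([2.0, 1.0])
R = rotation(theta)
W = U @ L @ R.T                                  # columns of W are not orthogonal

gap_before = posterior_kl_gap(W, sigma2)
gap_small_step = posterior_kl_gap(W @ rotation(theta / n), sigma2)  # W R_eps
gap_full = posterior_kl_gap(W @ R, sigma2)                          # W R = U L

logdet_M_before = np.linalg.slogdet(W.T @ W + sigma2 * np.eye(2))[1]
W_step = W @ rotation(theta / n)
logdet_M_after = np.linalg.slogdet(W_step.T @ W_step + sigma2 * np.eye(2))[1]
print(gap_before, gap_small_step, gap_full, logdet_M_before - logdet_M_after)
```

The small step lowers the gap while leaving $\log\det\mathbf{M}$ unchanged, and the full rotation removes the gap entirely.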

C.3 Bernoulli Probabilistic PCA

We would like to extend our linear analysis to the case where we have a Bernoulli observation model, as this setting also suffers severely from posterior collapse. The analysis may also shed light on more general categorical observation models which have also been used. Typically, in these settings a continuous latent space is still used (for example, Bowman et al. [2015]).

We will consider the following model,

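$$p(\mathbf{z})=\mathcal{N}(0,\mathbf{I}),\qquad p(\mathbf{x}|\mathbf{z})=\mathrm{Bernoulli}(\mathbf{y}),\qquad \mathbf{y}=\sigma(\mathbf{W}\mathbf{z}+\boldsymbol{\mu}) \tag{29}$$

where $\sigma$ denotes the sigmoid function, $\sigma(y)=1/(1+\exp(-y))$, and we assume an independent Bernoulli observation model over $\mathbf{x}$.

Unfortunately, under this model it is difficult to reason about the stationary points. There is no closed-form solution for the marginal likelihood $p(\mathbf{x})$ or the posterior distribution $p(\mathbf{z}|\mathbf{x})$. Numerical integration methods exist which may make it easy to evaluate this quantity in practice, but they will not immediately provide us with a good gradient signal.

We can compute the density function for $\mathbf{y}$ using the change of variables formula. Noting that $\mathbf{W}\mathbf{z}+\boldsymbol{\mu}\sim\mathcal{N}(\boldsymbol{\mu},\mathbf{W}\mathbf{W}^T)$, we recover the following logit-Normal distribution:

$$f(\mathbf{y})=\frac{1}{\sqrt{|2\pi\mathbf{W}\mathbf{W}^T|}}\,\frac{1}{\prod_i y_i(1-y_i)}\exp\left\{-\frac{1}{2}\left(\log\left(\frac{\mathbf{y}}{1-\mathbf{y}}\right)-\boldsymbol{\mu}\right)^T(\mathbf{W}\mathbf{W}^T)^{-1}\left(\log\left(\frac{\mathbf{y}}{1-\mathbf{y}}\right)-\boldsymbol{\mu}\right)\right\} \tag{30}$$

We can write the marginal likelihood as

$$\begin{aligned}
p(\mathbf{x}) &= \int p(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})\,d\mathbf{z}, && (31)\\
&= \mathbb{E}_{\mathbf{z}}\left[\mathbf{y}(\mathbf{z})^{\mathbf{x}}(1-\mathbf{y}(\mathbf{z}))^{1-\mathbf{x}}\right], && (32)
\end{aligned}$$

where $(\cdot)^{\mathbf{x}}$ is taken to be elementwise. Unfortunately, the expectation of a logit-normal distribution has no closed form [Atchison and Shen, 1980] and so we cannot tractably compute the marginal likelihood.

Similarly, under the ELBO we need to compute the expected reconstruction error. This can be written as

$$\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})]=\int\log\left(\mathbf{y}(\mathbf{z})^{\mathbf{x}}(1-\mathbf{y}(\mathbf{z}))^{1-\mathbf{x}}\right)\mathcal{N}(\mathbf{z};\mathbf{V}(\mathbf{x}-\boldsymbol{\mu}),\mathbf{D})\,d\mathbf{z}, \tag{33}$$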

another intractable integral.
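Although the marginal likelihood has no closed form, it can be estimated by simple Monte Carlo, which is one way to evaluate Eq. (32) numerically. The estimator below is a sketch and not part of the paper's analysis; the function names, sample count and toy parameters are assumptions. As noted above, such an estimate does not by itself give a good gradient signal.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mc_marginal_likelihood(x, W, mu, num_samples=10000, seed=0):
    """Monte Carlo estimate of E_z[prod_i y_i(z)^x_i (1 - y_i(z))^(1 - x_i)] with z ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(num_samples, W.shape[1]))
    y = sigmoid(z @ W.T + mu)                           # shape (num_samples, d)
    # Elementwise Bernoulli likelihood, multiplied over the d observed dimensions.
    lik = np.prod(np.where(x == 1, y, 1.0 - y), axis=1)
    return lik.mean()

rng = np.random.default_rng(1)
d, q = 8, 2
W = rng.normal(size=(d, q))
mu = rng.normal(size=d)
x = rng.integers(0, 2, size=d)                          # a binary observation

print(mc_marginal_likelihood(x, W, mu))
```

When $\mathbf{W}=\mathbf{0}$ the latent variable is ignored and the estimate reduces exactly to the product of independent Bernoulli probabilities, which gives a convenient correctness check.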

Appendix D Related Work (Extended)

Due to the large volume of work studying posterior collapse in variational autoencoders, we have included here an extended discussion of related work. We utilize this additional space to provide a more in-depth discussion of the related work presented in the main paper and to highlight additional work.

Tomczak and Welling [2017] introduce the VampPrior, a hierarchical learned prior for VAEs, and show empirically that such a learned prior can mitigate posterior collapse (which they refer to as inactive stochastic units). While the authors provide limited theoretical support for the efficacy of their method in reducing posterior collapse, they claim intuitively that by enabling multi-modal prior distributions the KL term is less likely to force inactive units, possibly by reducing the impact of local optima corresponding to posterior collapse.

In the main paper we discuss the work of Dai et al. [2017], who connect robust PCA methods and VAEs. In particular, Section 2 of their manuscript studies the case of a linear decoder and shows that, when the encoder takes the form of the optimal variational distribution, the ELBO of the resulting VAE collapses into the pPCA objective. We study the ELBO without optimality assumptions on the linear encoder and characterize the optimization landscape with no additional assumptions. They claim further that all minima of the (encoder-optimal) ELBO objective are globally optimal; we show in fact that for a linear encoder there is a fully identifiable global optimum.

Dai and Wipf [2019] discuss the importance of the observation noise, and in fact show that under some assumptions the optimal observation noise should shrink to zero (Theorem 4 in their work). These assumptions amount to the number of latent dimensions exceeding the dimensionality of the true data manifold. However, in the linear model (whose latent dimensions do not exceed the input space dimensionality) the optimal variance does not shrink towards zero and is instead given by the average of the variance lost in the linear projection. Note that this does not violate the results of Dai and Wipf [2019], but highlights the need to consider model capacity against data complexity, as in Alemi et al. [2017].

Appendix E Experiment details

We used TensorFlow [Abadi et al., 2015] for our experiments with linear and deep VAEs. In each case, the models were trained using a single GPU.

Visualizing stationary points of pPCA

For this experiment we computed the pPCA MLE using a subset of 1000 random training images from the MNIST dataset. We evaluate and plot the log marginal likelihood in closed form on this same subset. In this case, we did not dequantize or apply any nonlinear processing to the data.

Stochastic vs. Analytic VAE

We trained linear VAEs with 200 hidden dimensions. We used full-batch training with 1000 MNIST digits sampled randomly from the training set (the same data as used to produce Figure 2). We trained each model with the Adam optimizer and a fixed learning rate, grid searching over the learning rates {0.0001, 0.0003, 0.001, 0.003} to find the one which gave the best ELBO after 12000 training steps. For both models, 0.001 provided the best final ELBO.

MNIST VAE

The VAEs we trained on MNIST all had the same architecture: 784-1024-512-k-512-1024-784. The Gaussian likelihood is fairly uncommon for this dataset, which is nearly binary, but it provides a good setting for us to investigate our theoretical findings. To dequantize the data, we added uniform random noise and rescaled the pixel values to be in the range [0,1]. We then applied a nonlinear logistic transform as in [Papamakarios et al., 2017]. The VAE parameters were optimized jointly using the Adam optimizer [Kingma and Ba, 2014]. We trained the VAE for 1000 epochs total, keeping the learning rate fixed throughout. We performed a grid search over learning rates in the range {0.0001,0.0003,0.001,0.003} and reported results for the model which achieved the best training ELBO.
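The preprocessing described above can be sketched as uniform dequantization followed by a logit transform. The constant `lam`, which keeps values away from 0 and 1 before the logit, is an assumed hyperparameter; the text defers the exact settings to Papamakarios et al. [2017], and the function names here are hypothetical.

```python
import numpy as np

def dequantize_and_logit(pixels, rng, lam=1e-6):
    """pixels: integer array with values in {0, ..., 255}."""
    x = (pixels + rng.uniform(size=pixels.shape)) / 256.0  # uniform noise, rescaled into [0, 1)
    x = lam + (1.0 - 2.0 * lam) * x                        # squeeze into [lam, 1 - lam)
    return np.log(x) - np.log1p(-x)                        # logit(x)

def invert_logit(u, lam=1e-6):
    """Map transformed values back to the [0, 1) scale."""
    x = 1.0 / (1.0 + np.exp(-u))
    return (x - lam) / (1.0 - 2.0 * lam)

rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(4, 784))               # stand-in for MNIST images
u = dequantize_and_logit(pixels, rng)
recovered = invert_logit(u) * 256.0
print(u.min(), u.max())
```

After this transform the data are unbounded real values, which is a better match for the Gaussian observation model than raw pixel intensities.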

CelebA VAE

We used the convolutional architecture proposed by Higgins et al. [2016] trained on 64x64 images from the CelebA dataset [Liu et al., 2015]. Otherwise, the experimental procedure followed that of the MNIST VAEs with the nonlinear preprocessing hyperparameters set as in [Papamakarios et al., 2017].

E.1 Additional results

E.1.1 Evaluating KL Annealing

We found that KL-annealing may provide temporary relief from posterior collapse, but that if $\sigma^2$ is not learned simultaneously then the collapsed solution is recovered. In Figure 7 we show the proportion of units collapsed at each threshold for several fixed choices of $\sigma^2$ when $\beta$ is annealed from 0 to 1 over the first 100 epochs. The solid lines correspond to the final model while the dashed lines correspond to the model at 80 epochs of training. KL-annealing reduced posterior collapse initially, but the model eventually fell back to the collapsed solution.
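A KL-annealing schedule of this kind can be sketched as follows. The text states only that $\beta$ goes from 0 to 1 over the first 100 epochs; the linear shape of the ramp and the function names are assumptions.

```python
def kl_weight(epoch, warmup_epochs=100):
    """Linear warm-up of beta from 0 to 1 over the first `warmup_epochs` epochs."""
    return min(1.0, epoch / warmup_epochs)

def annealed_objective(expected_log_lik, kl, epoch, warmup_epochs=100):
    """Per-example objective E_q[log p(x|z)] - beta * KL(q(z|x) || p(z))."""
    return expected_log_lik - kl_weight(epoch, warmup_epochs) * kl

for epoch in (0, 50, 80, 100, 200):
    print(epoch, kl_weight(epoch))
```

At 80 epochs, where the dashed curves in Figure 7 are measured, the KL term under this schedule is still down-weighted; from epoch 100 onwards the objective is the standard ELBO.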

Figure 7: Proportion of inactive units thresholded by KL divergence when using 0-1 KL-annealing and a fixed value of $\sigma^2$. The solid line represents the final model while the dashed line is the model after only 80 epochs of training. KL-annealing reduces posterior collapse during the early stages of training but ultimately fails to escape these sub-optimal solutions as the KL weight is increased.
Figure 8: Comparing learned solutions using KL-annealing versus standard ELBO training when $\sigma^2$ is learned.

After finding that KL-annealing alone was insufficient to prevent posterior collapse, we explored KL-annealing while learning $\sigma^2$. Based on our analysis in the linear case we expect that this should work well: while $\beta$ is small the model should be able to learn to reduce $\sigma^2$. We trained using the same KL schedule and also with the standard ELBO while learning $\sigma^2$. The results are presented in Figure 8 and Figure 9. Under the ELBO objective, $\sigma^2$ is reduced somewhat but ultimately a large degree of posterior collapse is present. Using KL-annealing, the VAE is able to learn a much smaller $\sigma^2$ value and ultimately reduces posterior collapse. This suggests that the non-linear VAE dynamics may be similar to the linear case when suitably conditioned.

Figure 9: Learning $\sigma^2$ for CelebA VAEs with standard ELBO training and KL-annealing. KL-annealing enables a smaller $\sigma^2$ to be learned and reduces posterior collapse.

E.1.2 Full results tables

MNIST

| Init $\sigma^2$ | Final $\sigma^2$ | ELBO | $\sigma^2$-tuned ELBO | Tuned $\sigma^2$ | Posterior collapse (%) | KL divergence |
| --- | --- | --- | --- | --- | --- | --- |
| 30.0 | (fixed) | -1850.4±29.0 | -1374.9±199.0 | 4.451 | 95.00 | 10.9±6.7 |
| 10.0 | (fixed) | -1450.3±4.2 | -1098.2±28.3 | 1.797 | 89.88 | 28.8±1.4 |
| 3.0 | (fixed) | -1114.9±1.1 | -1018.8±1.0 | 1.361 | 76.75 | 58.5±1.4 |
| 1.0 | (fixed) | -1022.1±5.4 | -1018.3±5.3 | 1.145 | 27.38 | 125.4±4.2 |
| 0.3 | (fixed) | -1816.7±270.6 | -1104.6±6.2 | 1.275 | 2.00 | 179.3±85.9 |
| 0.1 | (fixed) | -3697.3±493.3 | -1190.8±37.4 | 0.968 | 3.25 | 368.7±94.6 |
| 0.03 | (fixed) | -18549.3±4892.0 | -1283.2±63.3 | 1.470 | 0.00 | 305.3±75.4 |
| 0.01 | (fixed) | -38612.5±1189.8 | -1403.1±21.0 | 1.006 | 0.00 | 560.9±32.4 |
| 0.003 | (fixed) | -139538.8±21148.5 | -2090.8±975.1 | 0.877 | 0.00 | 695.9±118.1 |
| 0.001 | (fixed) | -504259.1±49149.8 | -1744.7±48.4 | 0.810 | 0.00 | 756.2±12.6 |
| 30.0 | 1.478 | -1060.9±23.1 | -1061.0±23.0 | 1.476 | 33.75 | 70.9±13.8 |
| 10.0 | 1.32 | -1022.2±4.5 | -1022.3±4.6 | 1.318 | 73.75 | 73.8±9.8 |
| 3.0 | 1.178 | -1004.6±1.4 | -1004.5±1.3 | 1.181 | 58.38 | 99.8±1.5 |
| 1.0 | 1.183 | -1011.1±2.7 | -1011.1±2.8 | 1.182 | 47.88 | 106.3±2.5 |
| 0.3 | 1.195 | -1020.0±6.0 | -1019.9±6.1 | 1.191 | 37.75 | 111.6±6.1 |
| 0.1 | 1.194 | -1025.4±8.6 | -1025.4±8.6 | 1.195 | 29.25 | 116.1±11.4 |
| 0.03 | 1.197 | -1030.6±6.6 | -1030.5±6.6 | 1.198 | 22.62 | 120.2±10.5 |
| 0.01 | 1.194 | -1030.6±3.5 | -1030.5±3.5 | 1.191 | 23.00 | 121.9±7.7 |
| 0.003 | 1.19 | -1033.7±2.3 | -1033.6±2.3 | 1.187 | 16.62 | 126.4±6.8 |
| 0.001 | 1.208 | -1038.7±5.6 | -1038.8±5.6 | 1.209 | 27.00 | 124.9±1.6 |
Table 2: Full evaluation of deep Gaussian VAEs (averaged over 5 trials) on real-valued MNIST with nonlinear preprocessing [Papamakarios et al., 2017]. Collapse percent gives the percentage of latent dimensions which are within 0.01 KL of the prior for at least 99% of the encoder inputs.
CELEBA 64

| Init $\sigma^2$ | Final $\sigma^2$ | ELBO | $\sigma^2$-tuned ELBO | Tuned $\sigma^2$ | Posterior collapse (%) | KL divergence |
| --- | --- | --- | --- | --- | --- | --- |
| 30.0 | (fixed) | -79986.2±0.10 | -57883.8±19.3 | 0.423 | 93.68 | 26.0±0.2 |
| 10.0 | (fixed) | -73328.4±0.49 | -55186.7±35.1 | 0.204 | 80.56 | 56.12±0.4 |
| 3.0 | (fixed) | -66145.6±2.44 | -52828.5±58.6 | 0.132 | 20.64 | 120.4±1.4 |
| 1.0 | (fixed) | -59841.8±30.1 | -51294.8±333.7 | 0.102 | 2.52 | 213.4±6.3 |
| 0.3 | (fixed) | -54370.4±849.9 | -52155.2±1855.2 | 0.122 | 74.52 | 267.2±51.9 |
| 0.1 | (fixed) | -50760.3±353.4 | -50698.5±393.9 | 0.0883 | 32.72 | 483.8±36.2 |
| 0.03 | (fixed) | -64322.8±312.9 | -58077.9±206.2 | 0.0463 | 0.00 | 1521.1±11.6 |
| 0.01 | (fixed) | -82478.7±1823.3 | -51373.9±213.3 | 0.0817 | 0.00 | 1624.2±8.8 |
| 0.003 | (fixed) | -192967.7±4410.4 | -51978.4±159.3 | 0.0685 | 0.00 | 2108.4±26.2 |
| 0.001 | (fixed) | -531924.5±17177.6 | -57381.5±512.6 | 0.0296 | 0.00 | 2680.2±41.5 |
| 30.0 | 0.478 | -57773.0±3622.9 | -56068.5±2771.0 | 0.475 | 14.20 | 221.7±99.0 |
| 10.0 | 0.0962 | -51109.5±408.2 | -51109.5±408.3 | 0.0963 | 53.32 | 364.5±26.4 |
| 3.0 | 0.0891 | -50813.2±229.7 | -50813.3±229.7 | 0.0889 | 10.96 | 545.2±5.5 |
| 1.0 | 0.0875 | -50631.2±163.4 | -50631.0±163.3 | 0.0875 | 54.76 | 462.2±20.0 |
| 0.3 | 0.0890 | -50963.4±331.2 | -50963.2±331.3 | 0.0892 | 7.96 | 670.7±79.2 |
| 0.1 | 0.0863 | -50646.9±269.0 | -50645.9±267.5 | 0.0869 | 28.84 | 520.9±11.7 |
| 0.03 | 0.121 | -53263.4±71.5 | -53263.3±71.3 | 0.126 | 0.00 | 856.2±19.7 |
| 0.01 | 0.0911 | -51285.0±708.1 | -51284.8±708.1 | 0.0963 | 5.64 | 557.0±50.5 |
| 0.003 | 0.0952 | -51056.4±1216.9 | -51055.9±1217.4 | 0.094 | 0.80 | 577.4±30.4 |
| 0.001 | 0.104 | -51695.1±322.4 | -51694.8±322.7 | 0.0974 | 0.00 | 537.5±46.2 |
Table 3: Full evaluation of deep Gaussian VAEs (averaged over 5 trials) on real-valued CelebA with nonlinear preprocessing [Papamakarios et al., 2017]. Collapse percent gives the percentage of latent dimensions which are within 0.01 KL of the prior for at least 99% of the encoder inputs.
MNIST

| Init $\sigma^2$ | Final $\sigma^2$ | ELBO | $\sigma^2$-tuned ELBO | Tuned $\sigma^2$ | Posterior collapse (%) | KL divergence |
| --- | --- | --- | --- | --- | --- | --- |
| 30.0 | (fixed) | -6402.0±0.0 | -6248.4±197.2 | 22.323 | 0.00 | 0.0±0.0 |
| 10.0 | (fixed) | -5973.1±0.0 | -5821.0±194.6 | 7.443 | 0.00 | 0.0±0.0 |
| 3.0 | (fixed) | -5507.1±0.1 | -5360.4±185.4 | 2.235 | 1.70 | 0.6±0.3 |
| 1.0 | (fixed) | -5087.9±3.1 | -4954.7±156.9 | 0.747 | 0.00 | 4.5±2.3 |
| 0.3 | (fixed) | -4638.4±3.6 | -4516.8±137.9 | 0.225 | 0.00 | 12.5±1.5 |
| 0.1 | (fixed) | -4243.1±17.6 | -4154.6±62.1 | 0.076 | 0.00 | 25.6±3.0 |
| 0.03 | (fixed) | -3820.7±13.9 | -3785.2±26.6 | 0.027 | 0.00 | 55.8±2.1 |
| 0.01 | (fixed) | -3508.4±12.3 | -3483.5±13.1 | 0.009 | 0.00 | 112.8±6.7 |
| 0.003 | (fixed) | -3267.3±2.6 | -3247.1±2.8 | 0.003 | 0.00 | 252.2±2.1 |
| 0.001 | (fixed) | -3137.7±5.2 | -3136.7±5.4 | 0.001 | 0.00 | 422.7±2.6 |
| 30.0 | 0.067 | -4398.7±0.0 | -4398.7±0.0 | 0.067 | 0.00 | 0.0±0.0 |
| 10.0 | 0.044 | -4146.3±309.2 | -4146.3±309.2 | 0.044 | 0.00 | 30.1±36.9 |
| 3.0 | 0.01 | -3736.3±14.3 | -3736.4±14.3 | 0.010 | 0.00 | 73.7±1.9 |
| 1.0 | 0.008 | -3673.0±17.7 | -3672.9±17.7 | 0.008 | 0.00 | 85.2±2.5 |
| 0.3 | 0.006 | -3569.8±26.4 | -3569.8±26.4 | 0.006 | 0.00 | 100.8±3.7 |
| 0.1 | 0.003 | -3355.8±7.6 | -3355.8±7.6 | 0.003 | 0.00 | 151.7±2.4 |
| 0.03 | 0.001 | -3138.9±10.6 | -3139.0±10.6 | 0.001 | 0.00 | 275.4±3.1 |
| 0.01 | 0.001 | -3126.1±5.0 | -3126.1±5.0 | 0.001 | 0.00 | 349.3±5.4 |
| 0.003 | 0.001 | -3161.4±4.0 | -3161.3±4.0 | 0.001 | 0.00 | 373.5±7.5 |
| 0.001 | 0.001 | -3145.4±6.1 | -3145.4±6.1 | 0.001 | 0.00 | 378.4±7.7 |
Table 4: Evaluation of deep Gaussian VAEs (averaged over 5 trials) on real-valued MNIST without any nonlinear preprocessing. Collapse percent gives the percentage of latent dimensions which are within 0.01 KL of the prior for at least 99% of the encoder inputs.
CELEBA 64

| Init $\sigma^2$ | Final $\sigma^2$ | ELBO | $\sigma^2$-tuned ELBO | Tuned $\sigma^2$ | Posterior collapse (%) | KL divergence |
| --- | --- | --- | --- | --- | --- | --- |
| 30.0 | (fixed) | -79986.2±0.10 | -57883.8±19.3 | 0.423 | 93.68 | 26.0±0.19 |
| 10.0 | (fixed) | -73328.4±0.49 | -55186.7±35.1 | 0.204 | 80.56 | 56.12±0.42 |
| 3.0 | (fixed) | -66145.6±2.44 | -52828.5±58.6 | 0.132 | 20.64 | 120.4±1.37 |
| 1.0 | (fixed) | -59841.8±30.1 | -51294.8±333.7 | 0.102 | 2.52 | 213.4±6.3 |
| 0.3 | (fixed) | -54370.4±849.9 | -52155.2±1855.2 | 0.122 | 74.52 | 267.2±51.9 |
| 0.1 | (fixed) | -50760.3±353.4 | -50698.5±393.9 | 0.0883 | 32.72 | 483.8±36.2 |
| 0.03 | (fixed) | -64322.8±312.9 | -58077.9±206.2 | 0.0463 | 0.00 | 1521.1±11.6 |
| 0.01 | (fixed) | -82478.7±1823.3 | -51373.9±213.3 | 0.0817 | 0.00 | 1624.2±8.78 |
| 0.003 | (fixed) | -192967.7±4410.4 | -51978.4±159.3 | 0.0685 | 0.00 | 2108.4±26.2 |
| 0.001 | (fixed) | -531924.5±17177.6 | -57381.5±512.6 | 0.0296 | 0.00 | 2680.2±41.45 |
| 30.0 | 0.005 | -53179.6±450.2 | -53179.6±450.3 | 0.005 | 0.00 | 302.8±29.8 |
| 10.0 | 0.004 | -51748.5±178.2 | -51748.5±178.2 | 0.004 | 0.00 | 482.3±24.7 |
| 3.0 | 0.004 | -51548.9±154.1 | -51548.9±154.2 | 0.004 | 0.00 | 489.5±21.8 |
| 1.0 | 0.004 | -51356.9±79.1 | -51356.9±79.1 | 0.004 | 0.00 | 516.3±18.0 |
| 0.3 | 0.004 | -51767.7±369.2 | -51767.7±369.1 | 0.004 | 22.00 | 439.7±33.3 |
| 0.1 | 0.004 | -51637.3±163.3 | -51637.1±163.5 | 0.004 | 0.00 | 577.3±13.5 |
| 0.03 | 0.004 | -51792.6±163.4 | -51792.6±163.6 | 0.004 | 45.48 | 484.6±22.6 |
| 0.01 | 0.004 | -51925.1±99.8 | -51924.9±99.8 | 0.004 | 0.00 | 627.8±20.6 |
| 0.003 | 0.004 | -52111.2±149.0 | -52111.0±148.8 | 0.004 | 42.80 | 466.9±13.9 |
| 0.001 | 0.004 | -52060.1±171.8 | -52060.0±171.9 | 0.004 | 0.0 | 645.6±19.2 |
Table 5: Evaluation of deep Gaussian VAEs (averaged over 5 trials) on real-valued CelebA without any nonlinear preprocessing. Collapse percent gives the percentage of latent dimensions which are within 0.01 KL of the prior for at least 99% of the encoder inputs.
Figure 10: Posterior collapse percentage as a function of ϵ-threshold for a deep VAE trained on CelebA with fixed σ2. We measure posterior collapse for trained networks as the proportion of latent dimensions that are within ϵ KL divergence of the prior for at least a 1-δ proportion of the training data points (δ=0.01 in the plots).
Figure 11: Posterior collapse percentage as a function of ϵ-threshold for a deep VAE trained on CelebA with learned σ2. We measure posterior collapse for trained networks as the proportion of latent dimensions that are within ϵ KL divergence of the prior for at least a 1-δ proportion of the training data points (δ=0.01 in the plots).
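
The (ϵ, δ) collapse measure used in the tables and in Figures 10 and 11 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' evaluation code. It assumes a diagonal Gaussian posterior with a standard normal prior, so that the per-dimension KL divergence has a closed form. The function names and the arrays `mu` and `logvar` (encoder means and log-variances, one row per data point) are hypothetical, and the strict inequality used for "within ϵ" is also an assumption.

```python
import numpy as np

def gaussian_kl_per_dim(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, 1)) for each data point and latent dimension.

    mu, logvar: arrays of shape (n_points, n_latents).
    Assumes a diagonal Gaussian posterior and a standard normal prior.
    """
    return 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def collapse_percent(mu, logvar, eps=0.01, delta=0.01):
    """Percentage of latent dimensions that are (eps, delta)-collapsed.

    A dimension counts as collapsed if its KL to the prior is below eps
    for at least a (1 - delta) fraction of the data points.
    """
    kl = gaussian_kl_per_dim(mu, logvar)
    frac_within = (kl < eps).mean(axis=0)   # per-dimension fraction of points near the prior
    collapsed = frac_within >= 1.0 - delta  # boolean mask over latent dimensions
    return 100.0 * collapsed.mean()

def collapse_curve(mu, logvar, eps_values, delta=0.01):
    """Collapse percentage at each threshold in eps_values (cf. Figures 10 and 11)."""
    return [collapse_percent(mu, logvar, eps=e, delta=delta) for e in eps_values]
```

With `eps=0.01` and `delta=0.01` this matches the setting stated in the table captions (within 0.01 KL of the prior for at least 99% of the encoder inputs). Sweeping `eps_values` gives a curve of the kind plotted against the ϵ-threshold.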

E.1.3 Qualitative Results

Reconstructions from the KL-annealed CelebA model are shown in Figure 12, and latent-space interpolations in Figure 13. To produce the latter, we compute the variational means of three input points (top left, top right, bottom left) and interpolate linearly on the plane between them. We also extrapolate to a fourth point (bottom right), which lies on the plane defined by the other three.
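
The planar interpolation described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used to produce Figure 13. The function name `plane_interpolation_grid` and the argument `steps` are hypothetical. The fourth corner is taken to be the parallelogram completion z_tr + z_bl - z_tl, which is one natural choice of a point on the plane through the three variational means.

```python
import numpy as np

def plane_interpolation_grid(z_tl, z_tr, z_bl, steps=8):
    """Grid of latent codes on the plane through three latent means.

    z_tl, z_tr, z_bl: variational means (1-D arrays of equal length) for the
    top-left, top-right, and bottom-left inputs.
    Returns an array of shape (steps, steps, latent_dim) whose bottom-right
    corner is the extrapolated point z_tr + z_bl - z_tl (an assumption).
    """
    z_tl, z_tr, z_bl = (np.asarray(z, dtype=float) for z in (z_tl, z_tr, z_bl))
    t = np.linspace(0.0, 1.0, steps)
    u = t[None, :, None]  # horizontal coordinate, varies along columns
    v = t[:, None, None]  # vertical coordinate, varies along rows
    # Affine combination: start at the top-left mean and move along the two edges.
    return z_tl + u * (z_tr - z_tl) + v * (z_bl - z_tl)
```

Each grid entry would then be passed through the decoder to obtain one image of the interpolation plot. Only the three corner codes come from encoded inputs; every other entry, including the bottom-right corner, is generated from the affine combination.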

Figure 12: Reconstructions from the convolutional VAE trained with KL annealing on CelebA.
Figure 13: Latent space interpolations from the convolutional VAE trained with KL annealing on CelebA.