Abstract
Posterior collapse in Variational Autoencoders (VAEs) arises when the variational posterior distribution closely matches the prior for a subset of latent variables. This paper presents a simple and intuitive explanation for posterior collapse through the analysis of linear VAEs and their direct correspondence with Probabilistic PCA (pPCA). We explain how posterior collapse may occur in pPCA due to local maxima in the log marginal likelihood. Unexpectedly, we prove that the ELBO objective for the linear VAE does not introduce additional spurious local maxima relative to log marginal likelihood. We show further that training a linear VAE with exact variational inference recovers an identifiable global maximum corresponding to the principal component directions. Empirically, we find that our linear analysis is predictive even for high-capacity, nonlinear VAEs and helps explain the relationship between the observation noise, local maxima, and posterior collapse in deep Gaussian VAEs.
Don’t Blame the ELBO!
A Linear VAE Perspective on Posterior Collapse
James Lucas‡ (intern at Google Brain), George Tucker†, Roger Grosse‡, Mohammad Norouzi†
‡University of Toronto †Google Brain
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
1 Introduction
The generative process of a deep latent variable model entails drawing a number of latent factors from the prior and using a neural network to convert such factors to real data points. Maximum likelihood estimation of the parameters requires marginalizing out the latent factors, which is intractable for deep latent variable models. The influential work of Kingma and Welling (2013) and Rezende et al. (2014) on Variational Autoencoders (VAEs) enables optimization of a tractable lower bound on the likelihood via a reparameterization of the Evidence Lower Bound (ELBO) (Jordan et al., 1999; Blei et al., 2017). This has led to a surge of recent interest in automatic discovery of the latent factors of variation for a data distribution based on VAEs and principled probabilistic modeling (Higgins et al., 2016; Bowman et al., 2015; Chen et al., 2018; Gómez-Bombarelli et al., 2018). Code is available at https://sites.google.com/view/dontblametheelbo
Unfortunately, the quality and the number of the latent factors learned are influenced by a phenomenon known as posterior collapse, where the generative model learns to ignore a subset of the latent variables. Most existing papers suggest that posterior collapse is caused by the KL-divergence term in the ELBO objective, which directly encourages the variational distribution to match the prior (Bowman et al., 2015; Kingma et al., 2016; Sønderby et al., 2016). Thus, a wide range of heuristic approaches in the literature have attempted to diminish the effect of the KL term in the ELBO to alleviate posterior collapse (Bowman et al., 2015; Razavi et al., 2019; Sønderby et al., 2016; Huang et al., 2018). While holding the KL term responsible for posterior collapse makes intuitive sense, the mathematical mechanism of this phenomenon is not well understood. In this paper, we investigate the connection between posterior collapse and spurious local maxima in the ELBO objective through the analysis of linear VAEs. Unexpectedly, we show that spurious local maxima may arise even in the optimization of exact marginal likelihood, and such local maxima are linked with a collapsed posterior.
While linear autoencoders (Rumelhart et al., 1985) have been studied extensively (Baldi and Hornik, 1989; Kunin et al., 2019), little attention has been given to their variational counterpart from a theoretical standpoint. A well-known relationship exists between linear autoencoders and PCA: the optimal solution of a linear autoencoder has decoder weight columns that span the same subspace as the one defined by the principal components (Baldi and Hornik, 1989). Similarly, the maximum likelihood solution of probabilistic PCA (pPCA) (Tipping and Bishop, 1999) recovers the subspace of principal components. In this work, we show that a linear variational autoencoder can recover the solution of pPCA. In particular, by specifying a diagonal covariance structure on the variational distribution, one can recover an identifiable autoencoder, which at the global maximum of the ELBO recovers the exact principal components as the columns of the decoder’s weights. Importantly, we show that the ELBO objective for a linear VAE does not introduce any local maxima beyond the log marginal likelihood.
The study of linear VAEs gives us new insights into the cause of posterior collapse and the difficulty of VAE optimization more generally. Following the analysis of Tipping and Bishop (1999), we characterize the stationary points of pPCA and show that the variance of the observation model directly influences the stability of local stationary points corresponding to posterior collapse – it is only possible to escape these suboptimal solutions by simultaneously reducing noise and learning better features. Our contributions include:

• We verify that linear VAEs can recover the true posterior of pPCA. Further, we prove that the global optimum of the linear VAE recovers the principal components (not just their spanning subspace). More importantly, we prove that using ELBO to train linear VAEs does not introduce any additional spurious local maxima relative to log marginal likelihood training.

• While high-capacity decoders are often blamed for posterior collapse, we show that posterior collapse may occur when optimizing log marginal likelihood even without powerful decoders. Our experiments verify the analysis of the linear setting and show that these insights extend even to high-capacity nonlinear VAEs. Specifically, we provide evidence that the observation noise in deep Gaussian VAEs plays a crucial role in overcoming local maxima corresponding to posterior collapse.
2 Preliminaries
Probabilistic PCA.
The probabilistic PCA (pPCA) model is defined as follows. Suppose latent variables $\mathbf{z}\in {\mathbb{R}}^{k}$ generate data $\mathbf{x}\in {\mathbb{R}}^{n}$. A standard Gaussian prior is used for $\mathbf{z}$ and a linear generative model with a spherical Gaussian observation model for $\mathbf{x}$:
$\begin{array}{rl} p(\mathbf{z}) &= \mathcal{N}(\mathbf{0},\mathbf{I}), \\ p(\mathbf{x}\mid \mathbf{z}) &= \mathcal{N}(\mathbf{W}\mathbf{z}+\bm{\mu},{\sigma}^{2}\mathbf{I}). \end{array}$  (1)
The pPCA model is a special case of factor analysis (Bartholomew, 1987), which uses a spherical covariance ${\sigma}^{2}\mathbf{I}$ instead of a full covariance matrix. As pPCA is fully Gaussian, both the marginal distribution for $\mathbf{x}$ and the posterior $p(\mathbf{z}\mid \mathbf{x})$ are Gaussian, and unlike factor analysis, the maximum likelihood estimates of $\mathbf{W}$ and ${\sigma}^{2}$ are tractable (Tipping and Bishop, 1999).
Variational Autoencoders.
Recently, amortized variational inference has gained popularity as a means to learn complicated latent variable models. In these models, the log marginal likelihood, $\mathrm{log}p(\mathbf{x})$, is intractable but a variational distribution, denoted $q(\mathbf{z}\mid \mathbf{x})$, is used to approximate the posterior $p(\mathbf{z}\mid \mathbf{x})$, allowing tractable approximate inference using the Evidence Lower Bound (ELBO):
$\log p(\mathbf{x}) = {\mathbb{E}}_{q(\mathbf{z}\mid \mathbf{x})}[\log p(\mathbf{x},\mathbf{z})-\log q(\mathbf{z}\mid \mathbf{x})]+{D}_{KL}(q(\mathbf{z}\mid \mathbf{x})\parallel p(\mathbf{z}\mid \mathbf{x}))$  (2)
$\phantom{\log p(\mathbf{x})} \ge {\mathbb{E}}_{q(\mathbf{z}\mid \mathbf{x})}[\log p(\mathbf{x},\mathbf{z})-\log q(\mathbf{z}\mid \mathbf{x})]$  (3)
$\phantom{\log p(\mathbf{x})} = {\mathbb{E}}_{q(\mathbf{z}\mid \mathbf{x})}[\log p(\mathbf{x}\mid \mathbf{z})]-{D}_{KL}(q(\mathbf{z}\mid \mathbf{x})\parallel p(\mathbf{z}))\quad (:=\mathrm{ELBO})$  (4)
The ELBO (Jordan et al., 1999; Blei et al., 2017) consists of two terms: the KL divergence between the variational distribution, $q(\mathbf{z}\mid \mathbf{x})$, and the prior, $p(\mathbf{z})$, and the expected conditional log-likelihood. The KL divergence forces the variational distribution towards the prior and so has reasonably been the focus of many attempts to alleviate posterior collapse. We hypothesize that the log marginal likelihood itself often encourages posterior collapse.
In Variational Autoencoders (VAEs), two neural networks are used to parameterize ${q}_{\varphi}(\mathbf{z}\mid \mathbf{x})$ and ${p}_{\theta}(\mathbf{x}\mid \mathbf{z})$, where $\varphi $ and $\theta $ denote two sets of neural network weights. The encoder maps an input $\mathbf{x}$ to the parameters of the variational distribution, and then the decoder maps a sample from the variational distribution back to the inputs.
Posterior collapse.
A dominant issue with VAE optimization is posterior collapse, in which the learned variational distribution is close to the prior. This reduces the capacity of the generative model, making it impossible for the decoder network to make use of the information content of all of the latent dimensions. While posterior collapse is widely acknowledged, formally defining it has remained a challenge. We introduce a formal definition in Section 6.2 which we use to measure posterior collapse in trained deep neural networks.
3 Related Work
Dai et al. (2017) discuss the relationship between robust PCA methods (Candès et al., 2011) and VAEs. They show that at stationary points the VAE objective locally aligns with pPCA under certain assumptions. We study the pPCA objective explicitly and show a direct correspondence with linear VAEs. Dai et al. (2017) also showed that the covariance structure of the variational distribution may smooth out the loss landscape. This is an interesting result, and its interaction with ours is an exciting direction for future research.
He et al. (2019) motivate posterior collapse through an investigation of the learning dynamics of deep VAEs. They suggest that posterior collapse is caused by the inference network lagging behind the true posterior during the early stages of training. A related line of research studies issues arising from approximate inference causing a mismatch between the variational distribution and true posterior (Cremer et al., 2018; Kim et al., 2018; Hjelm et al., 2016). By contrast, we show that posterior collapse may exist even when the variational distribution matches the true posterior exactly.
Alemi et al. (2017) used an information theoretic framework to study the representational properties of VAEs. They show that with infinite model capacity there are solutions with equal ELBO and log marginal likelihood which span a range of representations, including posterior collapse. We find that even with weak (linear) decoders, posterior collapse may occur. Moreover, we show that in the linear case this posterior collapse is due entirely to the log marginal likelihood.
The most common approach for dealing with posterior collapse is to anneal a weight on the KL term during training from $0$ to $1$ (Bowman et al., 2015; Sønderby et al., 2016; Maaløe et al., 2019; Higgins et al., 2016; Huang et al., 2018). Unfortunately, this means that during the annealing process, one is no longer optimizing a bound on the log-likelihood. Also, it is difficult to design these annealing schedules, and we have found that once regular ELBO training resumes the posterior will typically collapse again (Section 6.2).
Kingma et al. (2016) propose a constraint on the KL term, termed "free-bits", where the gradient of the KL term per dimension is ignored if the KL is below a given threshold. Unfortunately, this method reportedly has some negative effects on training stability (Razavi et al., 2019; Chen et al., 2016). Delta-VAEs (Razavi et al., 2019) instead choose prior and variational distributions such that the variational distribution can never exactly recover the prior, allocating free-bits implicitly. Several other papers have studied alternative formulations of the VAE objective (Rezende and Viola, 2018; Dai and Wipf, 2019; Alemi et al., 2017; Ma et al., 2019; Yeung et al., 2017). Dai and Wipf (2019) analyzed the VAE objective to improve image fidelity under Gaussian observation models and also discuss the importance of the observation noise. Other approaches have explored changing the VAE network architecture to help alleviate posterior collapse, for example by adding skip connections (Maaløe et al., 2019; Dieng et al., 2018).
Rolinek et al. (2018) observed that the diagonal covariance used in the variational distribution of VAEs encourages orthogonal representations. They use linearizations of deep networks to prove their results under a modification of the objective function by explicitly ignoring latent dimensions with posterior collapse. Our formulation is distinct in focusing on linear VAEs without modifying the objective function and proving an exact correspondence between the global solution of linear VAEs and the principal components.
Kunin et al. (2019) studied the optimization challenges in the linear autoencoder setting. They exposed an equivalence between pPCA and Bayesian autoencoders and point out that when ${\sigma}^{2}$ is too large information about the latent code is lost. A similar phenomenon is discussed in the supervised learning setting by Chechik et al. (2005). Kunin et al. (2019) also showed that suitable regularization allows the linear autoencoder to recover the principal components up to rotations. We show that linear VAEs with a diagonal covariance structure recover the principal components exactly.
4 Analysis of linear VAE
This section compares and analyzes the loss landscapes of both pPCA and linear variational autoencoders. We first discuss the stationary points of pPCA and then show that a simple linear VAE can recover the global optimum of pPCA. Moreover, when the data covariance eigenvalues are distinct, the linear VAE identifies the individual principal components, unlike pPCA, which recovers only the PCA subspace. Finally, we prove that ELBO does not introduce any additional spurious maxima to the loss landscape.
4.1 Probabilistic PCA Revisited
The pPCA model (Eq. (1)) is a fully Gaussian linear model, thus we can compute both the marginal distribution for $\mathbf{x}$ and the posterior $p(\mathbf{z}\mid \mathbf{x})$ in closed form:
$p(\mathbf{x}) = \mathcal{N}(\bm{\mu},\mathbf{W}{\mathbf{W}}^{\top}+{\sigma}^{2}\mathbf{I}),$  (5)
$p(\mathbf{z}\mid \mathbf{x}) = \mathcal{N}({\mathbf{M}}^{-1}{\mathbf{W}}^{\top}(\mathbf{x}-\bm{\mu}),{\sigma}^{2}{\mathbf{M}}^{-1}),$  (6)
where $\mathbf{M}={\mathbf{W}}^{\top}\mathbf{W}+{\sigma}^{2}\mathbf{I}$. This model is particularly interesting to analyze in the setting of variational inference, as the ELBO can also be computed in closed form (see Appendix C).
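The closed forms above can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the function names (`log_marginal`, `posterior`, `elbo`) and the random toy parameters are illustrative assumptions. It evaluates the pPCA log marginal likelihood, the exact posterior, and the ELBO for a Gaussian $q(\mathbf{z}\mid \mathbf{x})$, and compares the ELBO at the exact posterior with the ELBO obtained after discarding the posterior's off-diagonal covariance.

```python
import numpy as np

def log_marginal(x, W, mu, sigma2):
    """log N(x; mu, W W^T + sigma2 I), the pPCA marginal."""
    n = x.shape[0]
    C = W @ W.T + sigma2 * np.eye(n)
    d = x - mu
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(C, d))

def posterior(x, W, mu, sigma2):
    """Mean and covariance of p(z | x), with M = W^T W + sigma2 I."""
    k = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(k)
    mean = np.linalg.solve(M, W.T @ (x - mu))
    cov = sigma2 * np.linalg.inv(M)
    return mean, cov

def elbo(x, W, mu, sigma2, q_mean, q_cov):
    """Closed-form ELBO for a Gaussian q(z | x) = N(q_mean, q_cov)."""
    n, k = W.shape
    r = x - mu - W @ q_mean
    # E_q ||x - W z - mu||^2 = ||x - W m - mu||^2 + tr(W S W^T)
    exp_sq = r @ r + np.trace(W @ q_cov @ W.T)
    recon = -0.5 * n * np.log(2 * np.pi * sigma2) - exp_sq / (2 * sigma2)
    _, logdet_q = np.linalg.slogdet(q_cov)
    # KL(N(m, S) || N(0, I))
    kl = 0.5 * (np.trace(q_cov) + q_mean @ q_mean - k - logdet_q)
    return recon - kl

# Toy parameters (assumed for illustration only).
rng = np.random.default_rng(0)
n, k, sigma2 = 5, 2, 0.5
W, mu, x = rng.normal(size=(n, k)), rng.normal(size=n), rng.normal(size=n)

m, S = posterior(x, W, mu, sigma2)
lml = log_marginal(x, W, mu, sigma2)
elbo_exact = elbo(x, W, mu, sigma2, m, S)                    # q equals the posterior
elbo_diag = elbo(x, W, mu, sigma2, m, np.diag(np.diag(S)))   # off-diagonals dropped
```

With the exact posterior plugged in, the bound should be tight (`elbo_exact` matches `lml` up to floating-point error), while a generic $\mathbf{W}$ with non-orthogonal columns leaves a gap for the diagonal approximation.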
Stationary points of pPCA
We now characterize the stationary points of pPCA, largely repeating the thorough analysis of Tipping and Bishop (1999) (see Appendix A of their paper). The maximum likelihood estimate of $\bm{\mu}$ is the mean of the data. We can compute ${\mathbf{W}}_{\mathrm{MLE}}$ and ${\sigma}_{\mathrm{MLE}}^{2}$ as follows:
${\sigma}_{\mathrm{MLE}}^{2} = \frac{1}{n-k}\sum _{j=k+1}^{n}{\lambda}_{j},$  (7)
${\mathbf{W}}_{\mathrm{MLE}} = {\mathbf{U}}_{k}{({\mathbf{\Lambda}}_{k}-{\sigma}_{\mathrm{MLE}}^{2}\mathbf{I})}^{1/2}\mathbf{R}.$  (8)
Here ${\mathbf{U}}_{k}$ corresponds to the first $k$ principal components of the data with the corresponding eigenvalues ${\lambda}_{1},\mathrm{\dots},{\lambda}_{k}$ stored in the $k\times k$ diagonal matrix ${\mathbf{\Lambda}}_{k}$. The matrix $\mathbf{R}$ is an arbitrary rotation matrix which accounts for weak identifiability in the model. We can interpret ${\sigma}_{MLE}^{2}$ as the average variance lost in the projection. The MLE solution is the global optimum. Other stationary points correspond to zeroing out columns of ${\mathbf{W}}_{\mathrm{MLE}}$ (posterior collapse).
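As a rough illustration of these stationary points, the sketch below builds the maximum likelihood solution from the eigendecomposition of a sample covariance, then compares the average log marginal likelihood at the MLE, at a rotated copy, and at a copy with one column zeroed. The synthetic data, the helper names (`ppca_mle`, `avg_log_lik`), and the choice $\mathbf{R}=\mathbf{I}$ are assumptions made for the example.

```python
import numpy as np

def ppca_mle(S, k):
    """pPCA maximum likelihood solution with R = I, from covariance S."""
    lam, U = np.linalg.eigh(S)
    lam, U = lam[::-1], U[:, ::-1]             # sort eigenvalues descending
    sigma2 = lam[k:].mean()                    # average discarded variance
    W = U[:, :k] * np.sqrt(lam[:k] - sigma2)   # scale each principal direction
    return W, sigma2

def avg_log_lik(S, W, sigma2):
    """Average log marginal likelihood of centered data with covariance S."""
    n = S.shape[0]
    C = W @ W.T + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + np.trace(np.linalg.solve(C, S)))

# Synthetic data with three strong and three weak directions (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6)) * np.array([3.0, 2.5, 2.0, 1.0, 0.8, 0.5])
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(Xc)

W_mle, sigma2_mle = ppca_mle(S, k=3)

# A rotation of the latent space mixing the first two columns.
theta = 0.7
R = np.eye(3)
R[:2, :2] = [[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]]

# "Collapsed" stationary point: the third column is zeroed out.
W_collapsed = W_mle.copy()
W_collapsed[:, 2] = 0.0

ll_mle = avg_log_lik(S, W_mle, sigma2_mle)
ll_rotated = avg_log_lik(S, W_mle @ R, sigma2_mle)
ll_collapsed = avg_log_lik(S, W_collapsed, sigma2_mle)
```

Because $\mathbf{W}\mathbf{R}{(\mathbf{W}\mathbf{R})}^{\top}=\mathbf{W}{\mathbf{W}}^{\top}$, `ll_rotated` should agree with `ll_mle`, whereas `ll_collapsed` should be strictly lower in this example.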
Stability of ${\mathbf{W}}_{\mathrm{MLE}}$
In this section we consider ${\sigma}^{2}$ to be fixed and not necessarily equal to the MLE solution. Equation 8 remains a stationary point when the general ${\sigma}^{2}$ is swapped in. One surprising observation is that ${\sigma}^{2}$ directly controls the stability of the stationary points of the log marginal likelihood (see Appendix A). In Figure 1, we illustrate one such stationary point of pPCA for different values of ${\sigma}^{2}$. We computed this stationary point by taking $\mathbf{W}$ to have three principal component columns and zeros elsewhere. Each plot shows the same stationary point perturbed by two orthogonal vectors corresponding to other principal components.
The stability of the pPCA stationary points depends on the size of ${\sigma}^{2}$ — as ${\sigma}^{2}$ increases the stationary point tends towards a stable local maximum so that we cannot learn the additional components. Intuitively, the model prefers to explain deviations in the data with the larger observation noise. Fortunately, decreasing ${\sigma}^{2}$ will increase likelihood at these stationary points so that when learning ${\sigma}^{2}$ simultaneously these stationary points are saddle points (Tipping and Bishop, 1999). Therefore, learning ${\sigma}^{2}$ is necessary for gaining a full latent representation.
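A toy calculation can make the role of ${\sigma}^{2}$ concrete. The sketch below assumes a three-dimensional data covariance with eigenvalues 4, 2, 1 along the coordinate axes and a two-column $\mathbf{W}$ whose second column has collapsed to zero. It measures how the likelihood changes when that column takes a small step towards the second principal direction, once with ${\sigma}^{2}$ below the second eigenvalue and once above it. All names and numbers are illustrative assumptions.

```python
import numpy as np

def avg_log_lik(S, W, sigma2):
    """Average log marginal likelihood of centered data with covariance S."""
    n = S.shape[0]
    C = W @ W.T + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + np.trace(np.linalg.solve(C, S)))

# Toy data covariance with eigenvalues 4, 2, 1 along the coordinate axes.
S = np.diag([4.0, 2.0, 1.0])

def gain_from_second_component(sigma2, eps=1e-2):
    """Likelihood change when the collapsed column moves towards the 2nd PC."""
    first_pc = np.sqrt(4.0 - sigma2)             # first column as in Eq. (8), R = I
    W_collapsed = np.array([[first_pc, 0.0], [0.0, 0.0], [0.0, 0.0]])
    W_perturbed = W_collapsed.copy()
    W_perturbed[1, 1] = eps                      # small step along the second PC
    return avg_log_lik(S, W_perturbed, sigma2) - avg_log_lik(S, W_collapsed, sigma2)

gain_small_noise = gain_from_second_component(sigma2=1.0)   # sigma^2 < lambda_2 = 2
gain_large_noise = gain_from_second_component(sigma2=3.0)   # sigma^2 > lambda_2 = 2
```

In this example the step increases the likelihood when ${\sigma}^{2}=1$ and decreases it when ${\sigma}^{2}=3$, which is consistent with the collapsed point becoming stable in that direction once the fixed observation noise exceeds the variance of the missing component.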
4.2 Linear VAEs recover pPCA
We now show that linear VAEs can recover the globally optimal solution to Probabilistic PCA. We will consider the following VAE model,
$\begin{array}{rl} p(\mathbf{x}\mid \mathbf{z}) &= \mathcal{N}(\mathbf{W}\mathbf{z}+\bm{\mu},{\sigma}^{2}\mathbf{I}), \\ q(\mathbf{z}\mid \mathbf{x}) &= \mathcal{N}(\mathbf{V}(\mathbf{x}-\bm{\mu}),\mathbf{D}), \end{array}$  (9)
where $\mathbf{D}$ is a diagonal covariance matrix, used globally for all of the data points. While this is a significant restriction compared to typical VAE architectures, which define an amortized variance for each input point, this is sufficient to recover the global optimum of the probabilistic model.
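Lemma 1.
The global maximum of the ELBO objective (Eq. (4)) for the linear VAE (Eq. (9)) is identical to the global maximum of the log marginal likelihood of pPCA (Eq. (5)).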
Proof.
Note that the global optimum of pPCA is defined up to an orthogonal transformation of the columns of $\mathbf{W}$, i.e., any rotation $\mathbf{R}$ in Eq. (8) results in a matrix ${\mathbf{W}}_{\mathrm{MLE}}$ that given ${\sigma}_{\mathrm{MLE}}^{2}$ attains maximum marginal likelihood. The linear VAE model defined in Eq. (9) is able to recover the global optimum of pPCA when $\mathbf{R}=\mathbf{I}$. Recall from Eq. (6) that $p(\mathbf{z}\mid \mathbf{x})$ is defined in terms of $\mathbf{M}={\mathbf{W}}^{\top}\mathbf{W}+{\sigma}^{2}\mathbf{I}$. When $\mathbf{R}=\mathbf{I}$, we obtain $\mathbf{M}={\mathbf{W}}_{\mathrm{MLE}}^{\top}{\mathbf{W}}_{\mathrm{MLE}}+{\sigma}_{\mathrm{MLE}}^{2}\mathbf{I}={\mathbf{\Lambda}}_{k}$, which is diagonal. Thus, setting $\mathbf{V}={\mathbf{M}}^{-1}{\mathbf{W}}_{\mathrm{MLE}}^{\top}$ and $\mathbf{D}={\sigma}_{\mathrm{MLE}}^{2}{\mathbf{M}}^{-1}={\sigma}_{\mathrm{MLE}}^{2}{\mathbf{\Lambda}}_{k}^{-1}$ recovers the true posterior with diagonal covariance at the global optimum. In this case, the ELBO equals the log marginal likelihood and is maximized when the decoder has weights $\mathbf{W}={\mathbf{W}}_{\mathrm{MLE}}$. Because the ELBO lower bounds the log-likelihood, the global maximum of the ELBO for the linear VAE is the same as the global maximum of the marginal likelihood for pPCA. ∎
The result of Lemma 1 is somewhat expected because the posterior of pPCA is Gaussian. Further details are given in Appendix C. In addition, we prove a more surprising result that suggests restricting the variational distribution to a Gaussian with a diagonal covariance structure allows one to identify the principal components at the global optimum of ELBO.
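Corollary 1.
The global optimum of the ELBO objective for the linear VAE (Eq. (9)) has the scaled principal components as the columns of the decoder weights.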
We discuss this result in Appendix B. This full identifiability is nontrivial and is not achieved even with the regularized linear autoencoder (Kunin et al., 2019).
So far, we have shown that at its global optimum the linear VAE recovers the pPCA solution, which enforces orthogonality of the decoder weight columns. However, the VAE is trained with the ELBO rather than the log marginal likelihood — often using SGD. The majority of existing work suggests that the KL term in the ELBO objective is responsible for posterior collapse. So, we should ask whether this term introduces additional spurious local maxima. Surprisingly, for the linear VAE model the ELBO objective does not introduce any additional spurious local maxima. We provide a sketch of the proof below with full details in Appendix C.
Theorem 1.
The ELBO objective for a linear VAE does not introduce any additional local maxima to the pPCA model.
Proof.
(Sketch) If the decoder has orthogonal columns, then the variational distribution recovers the true posterior at stationary points. Thus, the variational objective will exactly recover the log marginal likelihood. If the decoder does not have orthogonal columns then the variational distribution is no longer tight. However, the ELBO can always be increased by applying an infinitesimal rotation to the right-singular vectors of the decoder towards identity: ${\mathbf{W}}^{\prime}\leftarrow \mathbf{W}{\mathbf{R}}_{\epsilon}$ (so that the decoder columns are closer to orthogonal). This works because the variational distribution can fit the posterior more closely while the log marginal likelihood is invariant to rotations of the weight columns. Thus, any additional stationary points in the ELBO objective must necessarily be saddle points. ∎
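The rotation argument can be illustrated numerically. The sketch below relies on a short derivation that is ours rather than taken from the paper's appendix: with the variational mean matched to the posterior mean, the best diagonal covariance is ${D}_{ii}={\sigma}^{2}/{M}_{ii}$, and the remaining gap between the log marginal likelihood and the ELBO is $\frac{1}{2}(\sum_{i}\log {M}_{ii}-\log \det \mathbf{M})$, which is nonnegative by Hadamard's inequality and zero exactly when $\mathbf{M}={\mathbf{W}}^{\top}\mathbf{W}+{\sigma}^{2}\mathbf{I}$ is diagonal. The function names and the toy decoder are assumptions for illustration.

```python
import numpy as np

def diag_gap(W, sigma2):
    """log p(x) minus the ELBO for the best q with diagonal covariance."""
    k = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(k)
    _, logdet = np.linalg.slogdet(M)
    return 0.5 * (np.log(np.diag(M)).sum() - logdet)

def rotation(theta):
    """2x2 rotation acting on the latent (column) space of W."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Toy decoder with orthogonal columns, as at the pPCA optimum with R = I.
W = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
sigma2 = 0.5

gap_orthogonal = diag_gap(W, sigma2)
gap_rotated = diag_gap(W @ rotation(0.6), sigma2)
gap_less_rotated = diag_gap(W @ rotation(0.3), sigma2)
```

Since the log marginal likelihood is unchanged by these rotations, a smaller gap means a larger ELBO: in this example rotating from 0.6 back to 0.3 radians shrinks the gap, and it vanishes when the columns are orthogonal.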
The theoretical results presented in this section provide new intuition for posterior collapse in VAEs. In particular, the KL between the variational distribution and the prior is not entirely responsible for posterior collapse — log marginal likelihood has a role. The evidence for this is twofold. We have shown that log marginal likelihood may have spurious local maxima but also that in the linear case the ELBO objective does not add any additional spurious local maxima. Rephrased, in the linear setting the problem lies entirely with the probabilistic model. We should then ask, to what extent do these results hold in the nonlinear setting?
5 Deep Gaussian VAEs
The deep Gaussian VAE consists of a decoder ${D}_{\theta}$ and an encoder ${E}_{\varphi}$. The ELBO objective can be expressed as,
$$\mathcal{L}(\mathbf{x};\theta ,\varphi )=-\mathrm{KL}({q}_{\varphi}(\mathbf{z}\mid \mathbf{x})\parallel p(\mathbf{z}))-\frac{1}{2{\sigma}^{2}}{\mathbb{E}}_{{q}_{\varphi}(\mathbf{z}\mid \mathbf{x})}\left[{\parallel {D}_{\theta}(\mathbf{z})-\mathbf{x}\parallel}^{2}\right]-\frac{1}{2}\mathrm{log}(2\pi {\sigma}^{2})$$  (10)
The role of ${\sigma}^{2}$ in this objective invites a natural comparison to the $\beta$-VAE objective (Higgins et al., 2016), where the KL term is weighted by $\beta \in {\mathbb{R}}^{+}$. Alemi et al. (2017) propose using small $\beta $ values to force powerful decoders to utilize the latent variables, but this comes at the cost of poor ELBO. Practitioners must then use downstream task performance for model selection, thus sacrificing one of the primary benefits of likelihood-based models. However, for a given $\beta $, one can find a corresponding ${\sigma}^{2}$ (and a learning rate) such that the gradient updates to the network parameters are identical. Importantly, the Gaussian partition function for a Gaussian observation model (the last term on the RHS of Eq. (10)) prevents ELBO from deviating from the $\beta$-VAE’s objective with a $\beta$-weighted KL term while maintaining the benefits to representation learning when ${\sigma}^{2}$ is small. For the Gaussian VAE, this helps connect the dots between the role of local maxima and observation noise in posterior collapse vs. heuristic approaches that attempted to alleviate posterior collapse by diminishing the effect of the KL term (Bowman et al., 2015; Razavi et al., 2019; Sønderby et al., 2016; Huang et al., 2018). In the following section, we will study the nonlinear VAE empirically and explore connections to the linear theory.
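The correspondence between $\beta $ and ${\sigma}^{2}$ can be spelled out in a few lines. The derivation below is a sketch under one assumption that is not stated in the text: the $\beta$-VAE is taken to use a unit-variance Gaussian likelihood, so its reconstruction term is $-\frac{1}{2}\mathbb{E}[{\parallel {D}_{\theta}(\mathbf{z})-\mathbf{x}\parallel}^{2}]$ up to a constant. The symbol ${\mathcal{L}}_{\beta}$ is introduced here for illustration, and $\mathcal{L}$ denotes the Gaussian VAE objective with fixed ${\sigma}^{2}$.

```latex
\begin{aligned}
\mathcal{L}(\mathbf{x};\theta,\varphi)
  &= -\mathrm{KL}\big(q_{\varphi}(\mathbf{z}\mid\mathbf{x})\,\|\,p(\mathbf{z})\big)
     - \frac{1}{2\sigma^{2}}\,\mathbb{E}_{q_{\varphi}(\mathbf{z}\mid\mathbf{x})}
       \big[\|D_{\theta}(\mathbf{z})-\mathbf{x}\|^{2}\big]
     - \frac{1}{2}\log(2\pi\sigma^{2}), \\
\mathcal{L}_{\beta}(\mathbf{x};\theta,\varphi)
  &= -\beta\,\mathrm{KL}\big(q_{\varphi}(\mathbf{z}\mid\mathbf{x})\,\|\,p(\mathbf{z})\big)
     - \frac{1}{2}\,\mathbb{E}_{q_{\varphi}(\mathbf{z}\mid\mathbf{x})}
       \big[\|D_{\theta}(\mathbf{z})-\mathbf{x}\|^{2}\big], \\
\sigma^{2}\,\mathcal{L}(\mathbf{x};\theta,\varphi)
  &= \mathcal{L}_{\beta=\sigma^{2}}(\mathbf{x};\theta,\varphi)
     - \frac{\sigma^{2}}{2}\log(2\pi\sigma^{2}), \\
\nabla_{\theta,\varphi}\,\mathcal{L}
  &= \frac{1}{\sigma^{2}}\,\nabla_{\theta,\varphi}\,\mathcal{L}_{\beta=\sigma^{2}}.
\end{aligned}
```

Under this assumption, a gradient step of size $\eta $ on $\mathcal{L}$ with fixed ${\sigma}^{2}$ coincides with a step of size $\eta /{\sigma}^{2}$ on ${\mathcal{L}}_{\beta}$ with $\beta ={\sigma}^{2}$. The two objectives differ only through the partition-function term, which matters once ${\sigma}^{2}$ is learned or when objective values are compared.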
6 Experiments
In this section, we present empirical evidence found from studying two distinct claims. First, we verify our theoretical analysis of the linear VAE model. Second, we explore to what extent these insights apply to deep nonlinear VAEs.
6.1 Linear VAEs
We ran two sets of experiments on 1000 randomly chosen MNIST images. First, we trained linear VAEs with learnable ${\sigma}^{2}$ for a range of hidden dimensions (the VAEs were trained using the analytic ELBO (Appendix C.1) and without minibatching gradients). For each model, we compared the final ELBO to the maximum likelihood of pPCA, finding them to be essentially indistinguishable (as predicted by Lemma 1 and Theorem 1). For the second set of experiments, we took the pPCA MLE solution for $\mathbf{W}$ for each number of hidden dimensions and computed the likelihood under the observation noise which maximizes likelihood for 50 hidden dimensions. We observed that adding additional principal components (after 50) will initially improve likelihood but eventually adding more components (after 200) actually decreases the likelihood. In other words, the collapsed solution is actually preferred if the observation noise is not set correctly; we observe this theoretically through the stability of the stationary points (e.g. Figure 1).
Effect of stochastic ELBO estimates
In general, we are unable to compute the ELBO in closed form and so instead rely on unbiased Monte Carlo estimates using the reparameterization trick. These estimates add high-variance noise and can make optimization more challenging (Kingma and Welling, 2013). In the linear model, we can compare the solutions obtained using the stochastic ELBO gradients versus the analytic ELBO (Figure 3). As before, we use 1000 MNIST images to enable full-batch training, so that the only source of noise is the reparameterization trick (Kingma and Welling, 2013). Additional experimental details are in Appendix E. We found that stochastic optimization had slower convergence (when compared to analytic training with the same learning rate) and, unsurprisingly, reached a worse final training ELBO value (in other words, worse steady-state risk due to the gradient variance).
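To make the comparison concrete, the sketch below contrasts the analytic expected reconstruction term of the ELBO for a linear decoder with a reparameterized Monte Carlo estimate of the same quantity. It is a toy illustration rather than the experimental setup: the random parameters, the sample count, and the function names (`analytic_recon`, `sampled_recon`) are assumptions.

```python
import numpy as np

def analytic_recon(x, W, mu, sigma2, m, d):
    """E_q[log p(x | z)] for q = N(m, diag(d)), in closed form."""
    n = x.shape[0]
    r = x - mu - W @ m
    exp_sq = r @ r + np.sum((W ** 2) @ d)        # ||r||^2 + tr(W diag(d) W^T)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - exp_sq / (2 * sigma2)

def sampled_recon(x, W, mu, sigma2, m, d, num_samples, rng):
    """Monte Carlo estimate using reparameterized samples z = m + sqrt(d) * eps."""
    n = x.shape[0]
    eps = rng.normal(size=(num_samples, m.shape[0]))
    z = m + np.sqrt(d) * eps                     # shape: (num_samples, k)
    r = x - mu - z @ W.T                         # shape: (num_samples, n)
    log_p = -0.5 * n * np.log(2 * np.pi * sigma2) - (r ** 2).sum(axis=1) / (2 * sigma2)
    return log_p.mean(), log_p.std()

# Toy parameters (assumed for illustration only).
rng = np.random.default_rng(0)
n, k, sigma2 = 5, 2, 0.5
W, mu, x = rng.normal(size=(n, k)), rng.normal(size=n), rng.normal(size=n)
m, d = rng.normal(size=k), np.array([0.3, 0.1])

exact = analytic_recon(x, W, mu, sigma2, m, d)
num_samples = 100_000
mc_mean, mc_std = sampled_recon(x, W, mu, sigma2, m, d, num_samples, rng)
std_error = mc_std / np.sqrt(num_samples)        # noise level of the estimate
```

The estimate is unbiased, so `mc_mean` should sit within a few `std_error` of `exact`. With a single sample per data point, as is common in training, the noise is `mc_std` rather than `std_error`, which gives a sense of the variance that the stochastic gradients carry.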
Nonlinear Encoders
With a linear decoder and nonlinear encoder, Lemma 1 still holds, and the optimal variational distribution is the same because the true posterior has not changed. However, Corollary 1 and Theorem 1 no longer hold in general. Even a deep linear encoder will not have a unique global maximum, and new stationary points (possibly maxima) may be introduced to ELBO in general. To investigate how deeper networks may impact optimization of the probabilistic model, we trained linear decoders with varying encoders using ELBO. We do not expect the linear encoder to be outperformed and indeed the empirical results support this (Figure 4).
6.2 Investigating posterior collapse in deep nonlinear VAEs
We explored how the analysis of the linear VAEs extends to deep nonlinear models. To do so, we trained VAEs with Gaussian observation models on the MNIST (LeCun, 1998) and CelebA (Liu et al., 2015) datasets. We apply uniform dequantization as in Papamakarios et al. (2017) in each case. We also adopt the nonlinear logit preprocessing transformation from Papamakarios et al. (2017) to provide fair comparisons with existing work. We also report results of models trained directly in pixel space in the appendix (there is no significant difference for the hypotheses we test).
Measuring posterior collapse
In order to measure the extent of posterior collapse, we introduce the following definition. We say that latent dimension $i$ has $(\epsilon ,\delta )$-collapsed if ${\mathbb{P}}_{\mathbf{x}\sim p}[\mathrm{KL}(q({z}_{i}\mid \mathbf{x})\parallel p({z}_{i}))<\epsilon ]\ge 1-\delta $. Note that the linear VAE can suffer $(0,0)$-collapse. To estimate this practically, we compute the proportion of data samples which induce a variational distribution with KL divergence less than $\epsilon $ and finally report the percentage of dimensions which have $(\epsilon ,\delta )$-collapsed. Throughout this work, we fix $\delta =0.01$ and vary $\epsilon $.
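The measurement procedure can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the variational distribution is taken to be a diagonal Gaussian whose encoder outputs a mean and log-variance per latent dimension, the prior is standard normal, and a dimension is counted as collapsed when at least a $1-\delta $ fraction of the data gives a per-dimension KL below $\epsilon $. The function names and toy encoder outputs are hypothetical.

```python
import numpy as np

def kl_per_dim(mean, log_var):
    """KL(N(mean, exp(log_var)) || N(0, 1)) for each sample and latent dimension."""
    return 0.5 * (mean ** 2 + np.exp(log_var) - log_var - 1.0)

def percent_collapsed(mean, log_var, eps, delta=0.01):
    """Percentage of latent dimensions counted as (eps, delta)-collapsed."""
    kl = kl_per_dim(mean, log_var)               # shape: (num_samples, num_latents)
    frac_below = (kl < eps).mean(axis=0)         # per-dimension proportion of data
    return 100.0 * (frac_below >= 1.0 - delta).mean()

# Toy encoder outputs: dimension 0 is informative, dimensions 1 and 2 match the prior.
rng = np.random.default_rng(0)
mean = np.zeros((1000, 3))
log_var = np.zeros((1000, 3))
mean[:, 0] = rng.normal(scale=2.0, size=1000)
log_var[:, 0] = -2.0

collapsed = percent_collapsed(mean, log_var, eps=0.01)
```

In this toy case two of the three dimensions have exactly zero KL for every sample, so two thirds of the dimensions are reported as collapsed. Sweeping `eps` over a range, as the text does, shows how sensitive the count is to the threshold.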
Investigating ${\sigma}^{2}$
We trained MNIST VAEs with 2 hidden layers in both the decoder and encoder, ReLU activations, and 200 latent dimensions. We first evaluated training with fixed values of the observation noise, ${\sigma}^{2}$. This mirrors many public VAE implementations where ${\sigma}^{2}$ is fixed to 1 throughout training (also observed by Dai and Wipf (2019)). However, our linear analysis suggests that this is suboptimal. Then, we consider the setting where the observation noise and VAE weights are learned simultaneously.
In Table 1 we report the final ELBO of nonlinear VAEs trained on real-valued MNIST. For fixed ${\sigma}^{2}$, we found that the final models could have significant differences in ELBO which were maintained even after tuning ${\sigma}^{2}$ to the learned representations: the converged representations are worse when ${\sigma}^{2}$ is too large, as predicted by the linear model. Additionally, we report the final ELBO values when the model is trained while learning ${\sigma}^{2}$ with different initial values of ${\sigma}^{2}$. The gap in performance across different initializations is smaller than for fixed ${\sigma}^{2}$ but is still significant. The linear VAE does not predict this gap, which suggests that learning ${\sigma}^{2}$ correctly is more challenging in the nonlinear case.
MNIST

| Init $\sigma^{2}$ | Final $\sigma^{2}$ | ELBO | $\sigma^{2}$-tuned ELBO | Tuned $\sigma^{2}$ | Posterior collapse (%) | KL divergence |
| --- | --- | --- | --- | --- | --- | --- |
| 10.0 | fixed | $1450.3\pm 4.2$ | $1098.2\pm 28.3$ | 1.797 | 89.88 | $28.8\pm 1.4$ |
| 1.0 | fixed | $1022.1\pm 5.4$ | $1018.3\pm 5.3$ | 1.145 | 27.38 | $125.4\pm 4.2$ |
| 0.1 | fixed | $3697.3\pm 493.3$ | $1190.8\pm 37.4$ | 0.968 | 3.25 | $368.7\pm 94.6$ |
| 0.01 | fixed | $38612.5\pm 1189.8$ | $2090.8\pm 975.1$ | 0.877 | 0.00 | $695.9\pm 118.1$ |
| 0.001 | fixed | $504259.1\pm 49149.8$ | $1744.7\pm 48.4$ | 0.810 | 0.00 | $756.2\pm 12.6$ |
| 10.0 | 1.320 | $1022.2\pm 4.5$ | $1022.3\pm 4.6$ | 1.318 | 73.75 | $73.8\pm 9.8$ |
| 1.0 | 1.183 | $1011.1\pm 2.7$ | $1011.1\pm 2.8$ | 1.182 | 47.88 | $106.3\pm 2.5$ |
| 0.1 | 1.194 | $1025.4\pm 8.6$ | $1025.4\pm 8.6$ | 1.195 | 29.25 | $116.1\pm 11.4$ |
| 0.01 | 1.194 | $1030.6\pm 3.5$ | $1030.5\pm 3.5$ | 1.191 | 23.00 | $121.9\pm 7.7$ |
| 0.001 | 1.208 | $1038.7\pm 5.6$ | $1038.8\pm 5.6$ | 1.209 | 27.00 | $124.9\pm 1.6$ |

CELEBA 64

| Init $\sigma^{2}$ | Final $\sigma^{2}$ | ELBO | $\sigma^{2}$-tuned ELBO | Tuned $\sigma^{2}$ | Posterior collapse (%) | KL divergence |
| --- | --- | --- | --- | --- | --- | --- |
| 10.0 | fixed | $73328.4\pm 0.49$ | $55186.7\pm 35.1$ | 0.2040 | 80.56 | $56.12\pm 0.4$ |
| 1.0 | fixed | $59841.8\pm 30.1$ | $51294.8\pm 333.7$ | 0.1020 | 2.52 | $213.4\pm 6.3$ |
| 0.1 | fixed | $50760.3\pm 353.4$ | $50698.5\pm 393.9$ | 0.0883 | 32.72 | $483.8\pm 36.2$ |
| 0.01 | fixed | $82478.7\pm 1823.3$ | $51373.9\pm 213.3$ | 0.0817 | 0.00 | $1624.2\pm 8.8$ |
| 0.001 | fixed | $531924.5\pm 17177.6$ | $57381.5\pm 512.6$ | 0.0296 | 0.00 | $2680.2\pm 41.5$ |
| 10.0 | 0.0962 | $51109.5\pm 408.2$ | $51109.5\pm 408.3$ | 0.0963 | 53.32 | $364.5\pm 26.4$ |
| 1.0 | 0.0875 | $50631.2\pm 163.4$ | $50631.0\pm 163.3$ | 0.0875 | 54.76 | $462.2\pm 20.0$ |
| 0.1 | 0.0863 | $50646.9\pm 269.0$ | $50645.9\pm 267.5$ | 0.0869 | 28.84 | $520.9\pm 11.7$ |
| 0.01 | 0.0911 | $51285.0\pm 708.1$ | $51284.8\pm 708.1$ | 0.0963 | 5.64 | $557.0\pm 50.5$ |
| 0.001 | 0.1040 | $51695.1\pm 322.4$ | $51694.8\pm 322.7$ | 0.0974 | 0.00 | $537.5\pm 46.2$ |

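For a Gaussian observation model, tuning ${\sigma}^{2}$ to fixed learned representations has a closed form: with the encoder and decoder held fixed, the ELBO is maximized over ${\sigma}^{2}$ at the mean squared reconstruction error per data dimension. The sketch below illustrates this under stated assumptions. The function names and the precomputed statistics `sq_err` and `kl` are hypothetical, the numbers are toy values rather than results, and the tuning procedure actually used for the table may differ (for example, gradient-based optimization of ${\sigma}^{2}$).

```python
import numpy as np

def elbo_given_sigma2(sigma2, sq_err, kl, d):
    """Dataset-average Gaussian ELBO as a function of sigma^2 (encoder/decoder fixed).

    sq_err: (N,) values of E_q ||x - f(z)||^2 (hypothetical precomputed statistic)
    kl:     (N,) values of KL(q(z|x) || p(z))
    d:      data dimensionality
    """
    recon = -sq_err.mean() / (2.0 * sigma2) - 0.5 * d * np.log(2.0 * np.pi * sigma2)
    return float(recon - kl.mean())

def tune_sigma2(sq_err, d):
    """Closed-form maximizer of elbo_given_sigma2 over sigma^2."""
    return float(sq_err.mean() / d)

# Toy statistics standing in for a trained model (not real results).
rng = np.random.default_rng(0)
d = 784
sq_err = rng.uniform(500.0, 900.0, size=64)
kl = rng.uniform(20.0, 40.0, size=64)

sigma2_star = tune_sigma2(sq_err, d)
elbo_fixed = elbo_given_sigma2(1.0, sq_err, kl, d)
elbo_tuned = elbo_given_sigma2(sigma2_star, sq_err, kl, d)
```

Setting the derivative of the ELBO with respect to ${\sigma}^{2}$ to zero gives the estimate above, so the tuned ELBO can never be lower than the ELBO at the fixed value.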
Despite the large volume of work studying posterior collapse, it has not been measured (or even defined) in a consistent way. In Figure 5 and Figure 6 we measure posterior collapse for trained networks as described above (we chose $\delta =0.01$). By considering a range of $\epsilon$ values we found this was (moderately) robust to stochasticity in data preprocessing. We observed that for large choices of ${\sigma}^{2}$ initialization the variational distribution matches the prior closely. This was true even when ${\sigma}^{2}$ is learned, suggesting that local optima may contribute to posterior collapse in deep VAEs.
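The measurement can be sketched as follows. This assumes that the $(\epsilon, \delta)$ criterion referred to above marks a latent dimension as collapsed when its per-dimension KL to the standard normal prior falls below $\epsilon$ for at least a $1-\delta$ fraction of inputs. The function name and the diagonal-Gaussian encoder outputs `mean` and `var` are hypothetical.

```python
import numpy as np

def collapsed_fraction(mean, var, eps, delta=0.01):
    """Fraction of latent dimensions that are (eps, delta)-collapsed.

    mean, var: (N, q) arrays with the mean and variance of q(z_i | x_n).
    """
    # KL( N(mean, var) || N(0, 1) ) for each example and latent dimension.
    kl = 0.5 * (mean**2 + var - np.log(var) - 1.0)
    # For each dimension, the fraction of examples with KL below eps.
    frac_below = (kl < eps).mean(axis=0)
    return float((frac_below >= 1.0 - delta).mean())

# Toy example: dimensions 0 and 1 match the prior, dimension 2 does not.
mean = np.zeros((4, 3))
var = np.ones((4, 3))
mean[:, 2] = 3.0
frac = collapsed_fraction(mean, var, eps=0.01)
```

Sweeping `eps` over a range of values yields a curve of collapsed units by threshold.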
CelebA VAEs
We trained deep convolutional VAEs with 500 hidden dimensions on images from the CelebA dataset (resized to 64x64). We trained the CelebA VAEs with different fixed values of ${\sigma}^{2}$ and compared the ELBO before and after tuning ${\sigma}^{2}$ to the learned representations (Table 1). Further, we explored training the CelebA VAE while learning ${\sigma}^{2}$ over varied initializations of the observation noise. The VAE is sensitive to the initialization of the observation noise even when ${\sigma}^{2}$ is learned (in particular, in terms of the number of collapsed dimensions).
7 Discussion
By analyzing the correspondence between linear VAEs and pPCA, this paper makes significant progress towards understanding the causes of posterior collapse. We show that for simple linear VAEs posterior collapse is caused by ill-conditioning of the stationary points in the log marginal likelihood objective. We demonstrate empirically that the same optimization issues play a role in deep nonlinear VAEs. Finally, we find that linear VAEs are useful theoretical test cases for evaluating existing hypotheses on VAEs and we encourage researchers to consider studying their hypotheses in the linear VAE setting.
8 Acknowledgements
This work was guided by many conversations with and feedback from our colleagues. In particular, we thank Durk Kingma, Alex Alemi, and Guodong Zhang for invaluable feedback on early versions of this work.
References
TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
Fixing a broken ELBO. arXiv preprint arXiv:1711.00464.
Logistic-normal distributions: some properties and uses. Biometrika 67 (2), pp. 261–272.
Neural networks and principal component analysis: learning from examples without local minima. Neural Networks 2 (1), pp. 53–58.
Latent variable models and factor analysis. Oxford University Press, Inc.
Variational inference: a review for statisticians. Journal of the American Statistical Association.
Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
Robust principal component analysis? Journal of the ACM (JACM) 58 (3), pp. 11.
Information bottleneck for Gaussian variables. Journal of Machine Learning Research 6 (Jan), pp. 165–188.
Isolating sources of disentanglement in variational autoencoders. Advances in Neural Information Processing Systems.
Variational lossy autoencoder. arXiv preprint arXiv:1611.02731.
Inference suboptimality in variational autoencoders. arXiv preprint arXiv:1801.03558.
Hidden talents of the variational autoencoder. arXiv preprint arXiv:1706.05148.
Diagnosing and enhancing VAE models. In International Conference on Learning Representations.
Avoiding latent variable collapse with generative skip models. arXiv preprint arXiv:1807.04863.
Automatic chemical design using a data-driven continuous representation of molecules. American Chemical Society Central Science.
Lagging inference networks and posterior collapse in variational autoencoders. In International Conference on Learning Representations.
beta-VAE: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations.
Iterative refinement of the approximate posterior for directed belief networks. In Advances in Neural Information Processing Systems.
Improving explorability in variational inference with annealed variational objectives. In Advances in Neural Information Processing Systems.
An introduction to variational methods for graphical models. Machine Learning.
Semi-amortized variational autoencoders. arXiv preprint arXiv:1802.02550.
Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751.
Loss landscapes of regularized linear autoencoders. arXiv preprint arXiv:1901.08168.
The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
MAE: mutual posterior-divergence regularization for variational autoencoders. In International Conference on Learning Representations.
BIVA: a very deep hierarchy of latent variables for generative modeling. arXiv preprint arXiv:1902.02102.
Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems.
The matrix cookbook.
Preventing posterior collapse with delta-VAEs. In International Conference on Learning Representations.
Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
Taming VAEs. arXiv preprint arXiv:1810.00597.
Variational autoencoders pursue PCA directions (by accident). arXiv preprint arXiv:1812.06775.
Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science.
Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pp. 3738–3746.
Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61 (3), pp. 611–622.
VAE with a VampPrior. arXiv preprint arXiv:1705.07120.
Tackling over-pruning in variational autoencoders. arXiv preprint arXiv:1706.03643.
Appendix A Stationary points of pPCA
Here we briefly summarize the analysis of [Tipping and Bishop, 1999] with some simple additional observations. We recommend that interested readers study Appendix A of Tipping and Bishop [1999] for the full details. We begin by formulating the conditions for stationary points of ${\sum}_{{\mathbf{x}}_{i}}\mathrm{log}p({\mathbf{x}}_{i})$:
$$\mathbf{S}\mathbf{C}^{-1}\mathbf{W}=\mathbf{W}$$  (11)
Here $\mathbf{S}$ denotes the sample covariance matrix (assuming we set $\bm{\mu}={\bm{\mu}}_{MLE}$, which we do throughout), and $\mathbf{C}=\mathbf{W}\mathbf{W}^{T}+{\sigma}^{2}\mathbf{I}$ (note that the dimensionality is different to $\mathbf{M}$). There are three possible solutions to this equation: (1) $\mathbf{W}=\mathbf{0}$, (2) $\mathbf{C}=\mathbf{S}$, or (3) the more general solutions. (1) and (2) are not particularly interesting to us, so we focus herein on (3).
We can write $\mathbf{W}=\mathbf{U}\mathbf{L}\mathbf{V}^{T}$ using its singular value decomposition. Substituting back into the stationary points equation, we recover the following:
$$\mathbf{S}\mathbf{U}\mathbf{L}=\mathbf{U}({\sigma}^{2}\mathbf{I}+{\mathbf{L}}^{2})\mathbf{L}$$  (12)
Noting that $\mathbf{L}$ is diagonal, if the ${j}^{th}$ singular value (${l}_{j}$) is nonzero, this gives $\mathbf{S}{\mathbf{u}}_{j}=({\sigma}^{2}+{l}_{j}^{2}){\mathbf{u}}_{j}$, where ${\mathbf{u}}_{j}$ is the ${j}^{th}$ column of $\mathbf{U}$. Thus, ${\mathbf{u}}_{j}$ is an eigenvector of $\mathbf{S}$ with eigenvalue ${\lambda}_{j}={\sigma}^{2}+{l}_{j}^{2}$. For ${l}_{j}=0$, ${\mathbf{u}}_{j}$ is arbitrary.
Thus, all potential solutions can be written as $\mathbf{W}={\mathbf{U}}_{q}{({\mathbf{K}}_{q}-{\sigma}^{2}\mathbf{I})}^{1/2}\mathbf{R}$, where the diagonal entries of ${\mathbf{K}}_{q}$ are ${k}_{j}={\sigma}^{2}$ (for a zero singular value, ${l}_{j}=0$) or ${k}_{j}={\sigma}^{2}+{l}_{j}^{2}$, and $\mathbf{R}$ is an arbitrary orthogonal matrix.
From this formulation, one can show that the global optimum is attained with ${\sigma}^{2}={\sigma}_{MLE}^{2}$ and ${\mathbf{U}}_{q}$ and ${\mathbf{K}}_{q}$ chosen to match the leading singular vectors and values of $\mathbf{S}$.
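The closed-form solution can be checked numerically. The sketch below builds the maximum likelihood solution with $\mathbf{R}=\mathbf{I}$, takes ${\sigma}_{MLE}^{2}$ to be the average of the discarded eigenvalues (as in Tipping and Bishop [1999]), and evaluates the residual of the stationarity condition in Eq. (11). The function name `ppca_mle` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def ppca_mle(S, q):
    """pPCA maximum likelihood solution W = U_q (K_q - sigma^2 I)^{1/2}, with R = I."""
    evals, evecs = np.linalg.eigh(S)            # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]             # re-sort to descending
    evals, evecs = evals[order], evecs[:, order]
    sigma2 = evals[q:].mean()                   # average of the discarded eigenvalues
    W = evecs[:, :q] * np.sqrt(evals[:q] - sigma2)
    return W, sigma2

# Synthetic data with a clear spectrum (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6)) * np.array([3.0, 2.0, 1.5, 1.0, 0.7, 0.5])
S = np.cov(X, rowvar=False, bias=True)

W, sigma2 = ppca_mle(S, q=2)
C = W @ W.T + sigma2 * np.eye(S.shape[0])
residual = S @ np.linalg.solve(C, W) - W        # Eq. (11): S C^{-1} W - W
```

The residual vanishes up to floating-point error. The same check passes for any subset of eigenvectors, which is why Eq. (11) alone does not separate the global optimum from the other stationary points.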
A.1 Stability of stationary point solutions
Consider stationary points of the form $\mathbf{W}={\mathbf{U}}_{q}{({\mathbf{K}}_{q}-{\sigma}^{2}\mathbf{I})}^{1/2}$, where ${\mathbf{U}}_{q}$ contains arbitrary eigenvectors of $\mathbf{S}$. The original pPCA paper shows that all solutions except the leading principal components correspond to saddle points in the optimization landscape. However, this analysis depends critically on ${\sigma}^{2}$ being set to the true maximum likelihood estimate. Here we repeat their analysis, considering other (fixed) values of ${\sigma}^{2}$.
We consider a small perturbation to a column of $\mathbf{W}$, of the form $\epsilon{\mathbf{u}}_{j}$. To analyze the stability of the perturbed solution, we check the sign of the dot product of the perturbation with the likelihood gradient at ${\mathbf{w}}_{i}+\epsilon{\mathbf{u}}_{j}$. Ignoring terms in ${\epsilon}^{2}$, we can write the dot product as
$$\epsilon N({\lambda}_{j}/{k}_{i}-1){\mathbf{u}}_{j}^{T}{\mathbf{C}}^{-1}{\mathbf{u}}_{j}$$  (13)
Now, ${\mathbf{C}}^{-1}$ is positive definite and so the sign depends only on ${\lambda}_{j}/{k}_{i}-1$. The stationary point is stable (a local maximum) only if the sign is negative. If ${k}_{i}={\lambda}_{i}$ then the maximum is stable only when ${\lambda}_{i}>{\lambda}_{j}$; in words, the top $q$ principal components are stable. However, we must also consider the case $k={\sigma}^{2}$. Tipping and Bishop [1999] show that if ${\sigma}^{2}={\sigma}_{MLE}^{2}$, then this also corresponds to a saddle point, as ${\sigma}^{2}$ is the average of the smallest eigenvalues, meaning some perturbation will be unstable (except in a special case which is handled separately).
However, what happens if ${\sigma}^{2}$ is not set to the maximum likelihood estimate? In this case, it is possible that there are no unstable perturbation directions (that is, ${\lambda}_{j}<{\sigma}^{2}$ for too many $j$). When ${\sigma}^{2}$ is fixed in this way, there are local optima where $\mathbf{W}$ has zero columns: the same solutions that we observe in nonlinear VAEs corresponding to posterior collapse. Note that when ${\sigma}^{2}$ is learned, in non-degenerate cases the local maxima presented above become saddle points where ${\sigma}^{2}$ is made smaller by its gradient. In practice, we find that even when ${\sigma}^{2}$ is learned in the nonlinear case, local maxima exist.
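The sign test in Eq. (13) is easy to apply to a concrete spectrum. The following sketch assumes a hypothetical covariance with eigenvalues $4, 2, 1, 0.5$, a model with $q=3$ latent dimensions, and a stationary point that retains the top two eigenvectors and leaves the third column of $\mathbf{W}$ at zero. The function name is an illustrative assumption.

```python
import numpy as np

def zero_column_is_stable(eigvals, retained, sigma2):
    """Apply the sign test of Eq. (13) to a zero column of W, for which k_i = sigma^2.

    The column is perturbed along every eigenvector u_j not already used by W.
    It is stable only if lambda_j / sigma^2 - 1 < 0 for all such j.
    """
    unused = np.delete(np.asarray(eigvals, dtype=float), retained)
    return bool(np.all(unused / sigma2 - 1.0 < 0.0))

eigvals = [4.0, 2.0, 1.0, 0.5]    # hypothetical spectrum of S, with d = 4
retained = [0, 1]                 # columns of W aligned with the top two eigenvectors

# Fixed sigma^2 larger than every unused eigenvalue: the collapsed column is stable.
stable_when_large = zero_column_is_stable(eigvals, retained, sigma2=1.5)

# sigma^2 at its MLE for q = 3 (the single discarded eigenvalue, 0.5): unstable.
stable_at_mle = zero_column_is_stable(eigvals, retained, sigma2=0.5)
```

With ${\sigma}^{2}=1.5$ no perturbation direction increases the likelihood, so the zero column persists as a local maximum. At ${\sigma}^{2}=0.5$ the direction with eigenvalue $1$ is unstable and the point is a saddle.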
Appendix B Identifiability of the linear VAE
Linear autoencoders suffer from a lack of identifiability: the decoder columns span the principal component subspace but do not recover the individual principal directions. Kunin et al. [2019] showed that adding regularization to the linear autoencoder improves identifiability, forcing the columns to be identified up to an arbitrary orthogonal transformation, as in pPCA. Here we show that linear VAEs are able to fully identify the principal components.
We once again consider the linear VAE from Eq. (9):
$$\begin{aligned}p(\mathbf{x}\mid \mathbf{z})&=\mathcal{N}(\mathbf{W}\mathbf{z}+\bm{\mu},{\sigma}^{2}\mathbf{I}),\\ q(\mathbf{z}\mid \mathbf{x})&=\mathcal{N}(\mathbf{V}(\mathbf{x}-\bm{\mu}),\mathbf{D}).\end{aligned}$$
The output of the VAE, $\tilde{\mathbf{x}}$, is distributed as
$$\tilde{\mathbf{x}}\mid \mathbf{x}\sim \mathcal{N}(\mathbf{W}\mathbf{V}(\mathbf{x}-\bm{\mu})+\bm{\mu},\mathbf{W}\mathbf{D}\mathbf{W}^{T}).$$
Therefore, the output of the linear VAE is invariant to the following transformation:
$$\begin{aligned}\mathbf{W}&\leftarrow \mathbf{W}\mathbf{A},\\ \mathbf{V}&\leftarrow {\mathbf{A}}^{-1}\mathbf{V},\\ \mathbf{D}&\leftarrow {\mathbf{A}}^{-1}\mathbf{D}{\mathbf{A}}^{-1},\end{aligned}$$  (14)
where $\mathbf{A}$ is a diagonal matrix with nonzero entries so that $\mathbf{D}$ is well-defined. However, this transformation changes the variational distribution, which affects the loss through the KL term. As argued in Corollary 1, this means that the global optimum is unique for ELBO up to ordering of the eigenvalues/eigenvectors.
At the global optimum, the ordering can be recovered by computing the squared Euclidean norm of the columns of $\mathbf{W}$ (which correspond to the singular values) and ordering according to these quantities. In other words, $\mathbf{R}$ is a permutation matrix which can be computed exactly.
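The invariance in Eq. (14), and the fact that the KL term breaks it, can be illustrated numerically. The sketch below uses random parameters and an arbitrary diagonal $\mathbf{A}$, all of which are illustrative assumptions. The helper `kl_to_prior` evaluates the closed-form KL between $q(\mathbf{z}\mid\mathbf{x})$ and the standard normal prior for a diagonal $\mathbf{D}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 5, 3
W = rng.normal(size=(d, q))
V = rng.normal(size=(q, d))
D = np.diag(rng.uniform(0.5, 1.5, size=q))
A = np.diag([2.0, 0.5, 3.0])                 # arbitrary nonzero diagonal rescaling
A_inv = np.linalg.inv(A)

# Transformation of Eq. (14).
W2, V2, D2 = W @ A, A_inv @ V, A_inv @ D @ A_inv

def kl_to_prior(V, D, xc):
    """KL( N(V xc, D) || N(0, I) ) for diagonal D and centered input xc = x - mu."""
    m = V @ xc
    return 0.5 * (-np.log(np.diag(D)).sum() + m @ m + np.trace(D) - len(m))

xc = rng.normal(size=d)
same_mean = np.allclose(W @ V, W2 @ V2)                 # output mean map unchanged
same_cov = np.allclose(W @ D @ W.T, W2 @ D2 @ W2.T)     # output covariance unchanged
kl_before = kl_to_prior(V, D, xc)
kl_after = kl_to_prior(V2, D2, xc)
```

The output distribution is unchanged while the KL term is not, so the ELBO distinguishes between rescalings that a plain reconstruction loss cannot tell apart.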
Appendix C Stationary points of ELBO
Here we present details on the analysis of the stationary points of the ELBO objective. To begin, we first derive closed-form solutions to the components of the log marginal likelihood (including the ELBO). The VAE we focus on is the one presented in Eq. (9), with a linear encoder, linear decoder, Gaussian prior, and Gaussian observation model.
C.1 Analytic ELBO of the Linear VAE
Remember that one can express the log marginal likelihood as:
$$\log p(\mathbf{x})=\overbrace{KL(q(\mathbf{z}\mid\mathbf{x})\,\|\,p(\mathbf{z}\mid\mathbf{x}))}^{(A)}-\overbrace{KL(q(\mathbf{z}\mid\mathbf{x})\,\|\,p(\mathbf{z}))}^{(B)}+\overbrace{{\mathbb{E}}_{q(\mathbf{z}\mid\mathbf{x})}\left[\log p(\mathbf{x}\mid\mathbf{z})\right]}^{(C)}.$$  (15)
Each of the terms (A)-(C) can be expressed in closed form for the linear VAE. Note that the KL term (A) is minimized when the variational distribution is exactly the true posterior distribution. This is possible when the columns of the decoder are orthogonal.
The term (B) can be expressed as,
$$KL(q(\mathbf{z}\mid\mathbf{x})\,\|\,p(\mathbf{z}))=0.5\left(-\log\det\mathbf{D}+{(\mathbf{x}-\bm{\mu})}^{T}{\mathbf{V}}^{T}\mathbf{V}(\mathbf{x}-\bm{\mu})+tr(\mathbf{D})-q\right).$$  (16)
The term (C) can be expressed as,
$$\begin{aligned}{\mathbb{E}}_{q(\mathbf{z}\mid\mathbf{x})}\left[\log p(\mathbf{x}\mid\mathbf{z})\right]&={\mathbb{E}}_{q(\mathbf{z}\mid\mathbf{x})}\left[-\frac{{(\mathbf{W}\mathbf{z}-(\mathbf{x}-\bm{\mu}))}^{T}(\mathbf{W}\mathbf{z}-(\mathbf{x}-\bm{\mu}))}{2{\sigma}^{2}}-\frac{d}{2}\log 2\pi{\sigma}^{2}\right]\\ &={\mathbb{E}}_{q(\mathbf{z}\mid\mathbf{x})}\left[\frac{-{(\mathbf{W}\mathbf{z})}^{T}(\mathbf{W}\mathbf{z})+2{(\mathbf{x}-\bm{\mu})}^{T}\mathbf{W}\mathbf{z}-{(\mathbf{x}-\bm{\mu})}^{T}(\mathbf{x}-\bm{\mu})}{2{\sigma}^{2}}-\frac{d}{2}\log 2\pi{\sigma}^{2}\right].\end{aligned}$$  (17), (18)
Noting that $\mathbf{W}\mathbf{z}\sim \mathcal{N}(\mathbf{W}\mathbf{V}(\mathbf{x}-\bm{\mu}),\mathbf{W}\mathbf{D}\mathbf{W}^{T})$, we can compute the expectation analytically and obtain,
$$\begin{aligned}{\mathbb{E}}_{q(\mathbf{z}\mid\mathbf{x})}\left[\log p(\mathbf{x}\mid\mathbf{z})\right]=\frac{1}{2{\sigma}^{2}}\Big[&-tr(\mathbf{W}\mathbf{D}\mathbf{W}^{T})-{(\mathbf{x}-\bm{\mu})}^{T}{\mathbf{V}}^{T}{\mathbf{W}}^{T}\mathbf{W}\mathbf{V}(\mathbf{x}-\bm{\mu})\\ &+2{(\mathbf{x}-\bm{\mu})}^{T}\mathbf{W}\mathbf{V}(\mathbf{x}-\bm{\mu})-{(\mathbf{x}-\bm{\mu})}^{T}(\mathbf{x}-\bm{\mu})\Big]-\frac{d}{2}\log 2\pi{\sigma}^{2}.\end{aligned}$$  (19), (20)
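Terms (B) and (C) combine into a closed-form ELBO that is straightforward to evaluate. The sketch below implements Eq. (16) and Eqs. (19)-(20) for a single data point and compares the result with the exact log marginal likelihood of pPCA, $\log \mathcal{N}(\mathbf{x};\bm{\mu},\mathbf{W}\mathbf{W}^{T}+{\sigma}^{2}\mathbf{I})$. The function name, the random parameters, and the representation of the diagonal $\mathbf{D}$ as a vector `D_diag` are illustrative assumptions.

```python
import numpy as np

def linear_vae_elbo(x, W, V, D_diag, mu, sigma2):
    """Analytic ELBO -(B) + (C) of the linear VAE for one data point x."""
    d, q = W.shape
    xc = x - mu
    m = V @ xc                                           # variational mean V (x - mu)
    # Term (B), Eq. (16), with diagonal covariance D.
    kl = 0.5 * (-np.log(D_diag).sum() + m @ m + D_diag.sum() - q)
    # Term (C), Eqs. (19)-(20); tr(W D W^T) = sum_ij W_ij^2 D_jj.
    quad = (-np.sum(W**2 * D_diag) - m @ (W.T @ W) @ m
            + 2.0 * xc @ W @ m - xc @ xc)
    recon = quad / (2.0 * sigma2) - 0.5 * d * np.log(2.0 * np.pi * sigma2)
    return float(recon - kl)

rng = np.random.default_rng(0)
d, q, sigma2 = 5, 3, 0.3
W = rng.normal(size=(d, q))
V = rng.normal(size=(q, d))
D_diag = rng.uniform(0.2, 1.0, size=q)
mu = rng.normal(size=d)
x = rng.normal(size=d)

elbo = linear_vae_elbo(x, W, V, D_diag, mu, sigma2)

# Exact log marginal likelihood under pPCA, for comparison.
C = W @ W.T + sigma2 * np.eye(d)
xc = x - mu
log_px = float(-0.5 * (d * np.log(2.0 * np.pi) + np.linalg.slogdet(C)[1]
                       + xc @ np.linalg.solve(C, xc)))
```

For arbitrary $\mathbf{V}$ and $\mathbf{D}$ the ELBO lies below the log marginal likelihood, and the difference is exactly term (A) of Eq. (15).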
C.2 Finding stationary points
To compute the stationary points we must take derivatives with respect to $\bm{\mu},\mathbf{D},\mathbf{W},\mathbf{V},{\sigma}^{2}$. As before, we have $\bm{\mu}={\bm{\mu}}_{MLE}$ at the global maximum and for simplicity we fix $\bm{\mu}$ here for the remainder of the analysis.
Taking the marginal likelihood over the whole dataset, at the stationary points we have,
$$\frac{\partial}{\partial \mathbf{D}}\left(-(B)+(C)\right)=\frac{N}{2}\left({\mathbf{D}}^{-1}-\mathbf{I}-\frac{1}{{\sigma}^{2}}\text{diag}({\mathbf{W}}^{T}\mathbf{W})\right)=0$$  (21)
$$\frac{\partial}{\partial \mathbf{V}}\left(-(B)+(C)\right)=\frac{N}{{\sigma}^{2}}\left({\mathbf{W}}^{T}-({\mathbf{W}}^{T}\mathbf{W}+{\sigma}^{2}\mathbf{I})\mathbf{V}\right)\mathbf{S}=0$$  (22)
$$\frac{\partial}{\partial \mathbf{W}}\left(-(B)+(C)\right)=\frac{N}{{\sigma}^{2}}\left(\mathbf{S}{\mathbf{V}}^{T}-\mathbf{W}\mathbf{D}-\mathbf{W}\mathbf{V}\mathbf{S}{\mathbf{V}}^{T}\right)=0$$  (23)
The above are computed using standard matrix derivative identities [Petersen and others, ]. These equations yield the expected solution for the variational distribution directly. From Eq. (21) we compute ${\mathbf{D}}^{*}={\sigma}^{2}{(\text{diag}({\mathbf{W}}^{T}\mathbf{W})+{\sigma}^{2}\mathbf{I})}^{-1}$, and from Eq. (22) ${\mathbf{V}}^{*}={\mathbf{M}}^{-1}{\mathbf{W}}^{T}$ with $\mathbf{M}={\mathbf{W}}^{T}\mathbf{W}+{\sigma}^{2}\mathbf{I}$, recovering the true posterior mean in all cases and the correct posterior covariance when the columns of $\mathbf{W}$ are orthogonal. We will now proceed with the proof of Theorem 1.
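As a quick numerical check of these expressions, the sketch below computes ${\mathbf{D}}^{*}$ and ${\mathbf{V}}^{*}$ for a generic $\mathbf{W}$ and evaluates the bracketed factors of Eqs. (21) and (22), which should vanish. It also compares ${\mathbf{D}}^{*}$ with the true posterior covariance ${\sigma}^{2}{\mathbf{M}}^{-1}$. The function name and the random $\mathbf{W}$ are illustrative assumptions.

```python
import numpy as np

def optimal_variational_params(W, sigma2):
    """Stationary D* (returned as its diagonal) and V* for a given decoder W."""
    M = W.T @ W + sigma2 * np.eye(W.shape[1])
    D_star = sigma2 / np.diag(M)          # solves Eq. (21)
    V_star = np.linalg.solve(M, W.T)      # V* = M^{-1} W^T, solves Eq. (22)
    return D_star, V_star, M

rng = np.random.default_rng(1)
W = rng.normal(size=(5, 3))               # generic W: columns are not orthogonal
sigma2 = 0.4
D_star, V_star, M = optimal_variational_params(W, sigma2)

# Bracketed factors of Eqs. (21) and (22); both should be zero.
res_D = 1.0 / D_star - 1.0 - np.diag(W.T @ W) / sigma2
res_V = W.T - M @ V_star

# The true posterior covariance is sigma^2 M^{-1}; D* equals it only if M is diagonal.
true_cov = sigma2 * np.linalg.inv(M)
matches_true_cov = np.allclose(np.diag(D_star), true_cov)
```

For this non-orthogonal $\mathbf{W}$ the posterior mean is matched but the covariance is not, because a diagonal $\mathbf{D}$ cannot represent the off-diagonal entries of ${\sigma}^{2}{\mathbf{M}}^{-1}$.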
See Theorem 1.
Proof.
If the columns of $\mathbf{W}$ are orthogonal then the log marginal likelihood is recovered exactly at all stationary points. This is a direct consequence of the posterior mean and covariance being recovered exactly at all stationary points, so that term (A) of Eq. (15) is zero.
We must give separate treatment to the case where there is a stationary point without orthogonal columns of $\mathbf{W}$. Suppose we have such a stationary point. Using the singular value decomposition we can write $\mathbf{W}=\mathbf{U}\mathbf{L}\mathbf{R}^{T}$, where $\mathbf{U}$ and $\mathbf{R}$ are orthogonal matrices. Note that $\log p(\mathbf{x})$ is invariant to the choice of $\mathbf{R}$ [Tipping and Bishop, 1999]. However, the choice of $\mathbf{R}$ does affect the first term (A) of Eq. (15): this term is minimized when $\mathbf{R}=\mathbf{I}$, and thus the ELBO must increase.
To formalize this argument, we compute (A) at a stationary point. From above, at every stationary point the mean of the variational distribution exactly matches the true posterior. Thus the KL simplifies to:
$$\begin{aligned}KL(q(\mathbf{z}\mid\mathbf{x})\,\|\,p(\mathbf{z}\mid\mathbf{x}))&=\frac{1}{2}\left(tr\left(\frac{1}{{\sigma}^{2}}\mathbf{M}\mathbf{D}\right)-q+q\log{\sigma}^{2}-\log(\det\mathbf{M}\det\mathbf{D})\right),\\ &=\frac{1}{2}\left(tr(\mathbf{M}{\tilde{\mathbf{M}}}^{-1})-q-\log\frac{\det\mathbf{M}}{\det\tilde{\mathbf{M}}}\right),\\ &=\frac{1}{2}\left(\sum_{i=1}^{q}\frac{{\mathbf{M}}_{ii}}{{\tilde{\mathbf{M}}}_{ii}}-q-\log\det\mathbf{M}+\log\det\tilde{\mathbf{M}}\right),\\ &=\frac{1}{2}\left(\log\det\tilde{\mathbf{M}}-\log\det\mathbf{M}\right),\end{aligned}$$  (24)-(27)
where $\tilde{\mathbf{M}}=\text{diag}({\mathbf{W}}^{T}\mathbf{W})+{\sigma}^{2}\mathbf{I}$. Now consider applying a small rotation to $\mathbf{W}$: $\mathbf{W}\mapsto \mathbf{W}{\mathbf{R}}_{\epsilon}$. As the optimal $\mathbf{D}$ and $\mathbf{V}$ are continuous functions of $\mathbf{W}$, this corresponds to a small perturbation of these parameters too for a sufficiently small rotation. Importantly, $\log\det\mathbf{M}$ remains fixed for any orthogonal choice of ${\mathbf{R}}_{\epsilon}$ but $\log\det\tilde{\mathbf{M}}$ does not. Thus, we choose ${\mathbf{R}}_{\epsilon}$ to minimize this term. In this manner, (A) shrinks, meaning that the ELBO, $-(B)+(C)$, must increase. Thus if the stationary point existed, it must have been a saddle point.
We now describe how to construct such a small rotation matrix. First note that without loss of generality we can assume that $\det(\mathbf{R})=1$. (Otherwise, we can flip the sign of a column of $\mathbf{R}$ and the corresponding column of $\mathbf{U}$.) Additionally, we have $\mathbf{W}\mathbf{R}=\mathbf{U}\mathbf{L}$, which has orthogonal columns.
The special orthogonal group of determinant-1 orthogonal matrices is a compact, connected Lie group and therefore the exponential map from its Lie algebra is surjective. This means that we can find an upper-triangular matrix $\mathbf{B}$ such that $\mathbf{R}=\exp\{\mathbf{B}-{\mathbf{B}}^{T}\}$. Consider ${\mathbf{R}}_{\epsilon}=\exp\{\frac{1}{n(\epsilon)}(\mathbf{B}-{\mathbf{B}}^{T})\}$, where $n(\epsilon)$ is an integer chosen to ensure that the elements of $\frac{1}{n(\epsilon)}(\mathbf{B}-{\mathbf{B}}^{T})$ are within $\epsilon>0$ of zero. This matrix is a rotation in the direction of $\mathbf{R}$ which we can make arbitrarily close to the identity by a suitable choice of $\epsilon$. This is verified through the Taylor series expansion ${\mathbf{R}}_{\epsilon}=\mathbf{I}+\frac{1}{n(\epsilon)}(\mathbf{B}-{\mathbf{B}}^{T})+O({\epsilon}^{2})$. Thus, we have identified a small perturbation to $\mathbf{W}$ (and $\mathbf{D}$ and $\mathbf{V}$) which decreases the posterior KL (A) but keeps the log marginal likelihood constant. Thus, the ELBO increases and the stationary point must be a saddle point. ∎
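The construction in the proof can be traced numerically on a small example. The sketch below is an illustration under stated assumptions, not part of the proof: it picks a specific upper-triangular $\mathbf{B}$ with small entries, forms $\mathbf{R}=\exp\{\mathbf{B}-{\mathbf{B}}^{T}\}$ with a truncated Taylor series (the helper `expm_taylor` is a simple stand-in for a library matrix exponential), and evaluates the posterior KL of Eq. (27) before and after applying ${\mathbf{R}}_{\epsilon}$.

```python
import numpy as np

def expm_taylor(A, terms=30):
    """Matrix exponential by truncated Taylor series (adequate for small, well-scaled A)."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        result = result + term
    return result

def posterior_kl_gap(W, sigma2):
    """Eq. (27): 0.5 * (log det M_tilde - log det M) at the optimal V and D."""
    M = W.T @ W + sigma2 * np.eye(W.shape[1])
    return 0.5 * (np.log(np.diag(M)).sum() - np.linalg.slogdet(M)[1])

rng = np.random.default_rng(0)
sigma2 = 0.5
U, _ = np.linalg.qr(rng.normal(size=(6, 3)))     # orthonormal columns
L = np.diag([2.0, 1.0, 0.5])
B = np.array([[0.0, 0.2, 0.1],
              [0.0, 0.0, 0.15],
              [0.0, 0.0, 0.0]])                  # upper-triangular, small entries
skew = B - B.T
R = expm_taylor(skew)
W = U @ L @ R.T                                  # columns of W are not orthogonal

n = 10
R_eps = expm_taylor(skew / n)                    # small rotation in the direction of R
W_rot = W @ R_eps

gap_before = posterior_kl_gap(W, sigma2)
gap_after = posterior_kl_gap(W_rot, sigma2)
gap_aligned = posterior_kl_gap(W @ R, sigma2)    # W R = U L has orthogonal columns
```

In this example the small rotation lowers the posterior KL while $\log\det\mathbf{M}$, and hence the log marginal likelihood, is unchanged. The gap reaches zero once the columns are fully aligned.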
C.3 Bernoulli Probabilistic PCA
We would like to extend our linear analysis to the case where we have a Bernoulli observation model, as this setting also suffers severely from posterior collapse. The analysis may also shed light on more general categorical observation models which have also been used. Typically, in these settings a continuous latent space is still used (for example, Bowman et al. [2015]).
We will consider the following model,
$$\begin{aligned}p(\mathbf{z})&=\mathcal{N}(0,\mathbf{I}),\\ p(\mathbf{x}\mid\mathbf{z})&=\text{Bernoulli}(\mathbf{y}),\\ \mathbf{y}&=\sigma(\mathbf{W}\mathbf{z}+\bm{\mu}),\end{aligned}$$  (29)
where $\sigma$ denotes the sigmoid function, $\sigma(y)=1/(1+\exp(-y))$, and we assume an independent Bernoulli observation model over $\mathbf{x}$.
Unfortunately, under this model it is difficult to reason about the stationary points. There is no closed-form solution for the marginal likelihood $p(\mathbf{x})$ or the posterior distribution $p(\mathbf{z}\mid\mathbf{x})$. Numerical integration methods exist which may make it easy to evaluate this quantity in practice, but they will not immediately provide us with a good gradient signal.
We can compute the density function for $\mathbf{y}$ using the change of variables formula. Noting that $\mathbf{W}\mathbf{z}+\bm{\mu}\sim \mathcal{N}(\bm{\mu},\mathbf{W}\mathbf{W}^{T})$, we recover the following logit-normal distribution:
$$f(\mathbf{y})=\frac{1}{\sqrt{|2\pi\mathbf{W}\mathbf{W}^{T}|}}\frac{1}{\prod_{i}{y}_{i}(1-{y}_{i})}\exp\left\{-\frac{1}{2}{\left(\log\left(\frac{\mathbf{y}}{1-\mathbf{y}}\right)-\bm{\mu}\right)}^{T}{(\mathbf{W}\mathbf{W}^{T})}^{-1}\left(\log\left(\frac{\mathbf{y}}{1-\mathbf{y}}\right)-\bm{\mu}\right)\right\}$$  (30)
We can write the marginal likelihood as,
$$\begin{aligned}p(\mathbf{x})&=\int p(\mathbf{x}\mid\mathbf{z})p(\mathbf{z})\,d\mathbf{z},\\ &={\mathbb{E}}_{\mathbf{z}}\left[\mathbf{y}{(\mathbf{z})}^{\mathbf{x}}{(1-\mathbf{y}(\mathbf{z}))}^{1-\mathbf{x}}\right],\end{aligned}$$  (31), (32)
where ${(\cdot )}^{\mathbf{x}}$ is taken to be elementwise. Unfortunately, the expectation of a logit-normal distribution has no closed form [Atchison and Shen, 1980] and so we cannot tractably compute the marginal likelihood.
Similarly, under ELBO we need to compute the expected reconstruction error. This can be written as,
$${\mathbb{E}}_{q(\mathbf{z}\mid\mathbf{x})}[\log p(\mathbf{x}\mid\mathbf{z})]=\int \log\left[\mathbf{y}{(\mathbf{z})}^{\mathbf{x}}{(1-\mathbf{y}(\mathbf{z}))}^{1-\mathbf{x}}\right]\mathcal{N}(\mathbf{z};\mathbf{V}(\mathbf{x}-\bm{\mu}),\mathbf{D})\,d\mathbf{z},$$  (33)
another intractable integral.
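Although neither integral has a closed form, the marginal likelihood in Eq. (32) can be estimated by simple Monte Carlo over the prior. The sketch below is one such estimator, given as an illustration rather than a method used in this paper. The function name, sample count, and example parameters are assumptions.

```python
import numpy as np

def log_marginal_mc(x, W, mu, num_samples=10000, seed=0):
    """Monte Carlo estimate of log p(x) for the Bernoulli model of Eq. (29)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((num_samples, W.shape[1]))      # z ~ N(0, I)
    logits = z @ W.T + mu                                   # shape (S, d)
    # log Bernoulli likelihood: x * logit - softplus(logit), summed over pixels.
    log_px_given_z = (x * logits - np.logaddexp(0.0, logits)).sum(axis=1)
    # Numerically stable log of the sample mean of p(x | z).
    top = log_px_given_z.max()
    return float(top + np.log(np.mean(np.exp(log_px_given_z - top))))

rng = np.random.default_rng(1)
W = 0.5 * rng.normal(size=(3, 2))
mu = np.array([0.2, -0.5, 1.0])
x = np.array([1.0, 0.0, 1.0])
estimate = log_marginal_mc(x, W, mu)
```

When $\mathbf{W}=\mathbf{0}$ the latent variable has no effect and the estimator returns the exact independent Bernoulli log-likelihood, which is a convenient sanity check.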
Appendix D Related Work (Extended)
Due to the large volume of work studying posterior collapse in variational autoencoders, we have included here an extended discussion of related work. We use this additional space to provide a more in-depth discussion of the related work presented in the main paper and to highlight additional work.
Tomczak and Welling [2017] introduce the VampPrior, a hierarchical learned prior for VAEs. They show empirically that such a learned prior can mitigate posterior collapse (which they refer to as inactive stochastic units). While the authors provide limited theoretical support for the efficacy of their method in reducing posterior collapse, they claim intuitively that by enabling multimodal prior distributions the KL term is less likely to force inactive units, possibly by reducing the impact of local optima corresponding to posterior collapse.
In the main paper we discuss the work of Dai et al. [2017], which connects robust PCA methods and VAEs. In particular, Section 2 of their manuscript studies the case of a linear decoder and shows that, when the encoder takes the form of the optimal variational distribution, the ELBO of the resulting VAE collapses into the pPCA objective. We study the ELBO without optimality assumptions on the linear encoder and characterize the optimization landscape with no additional assumptions. They claim further that all minima of the (encoder-optimal) ELBO objective are globally optimal; we show in fact that for a linear encoder there is a fully identifiable global optimum.
Dai and Wipf [2019] discuss the importance of the observation noise, and in fact show that under some assumptions the optimal observation noise should shrink to zero (Theorem 4 in their work). These assumptions amount to the number of latent dimensions exceeding the dimensionality of the true data manifold. However, in the linear model (whose latent dimensions do not exceed the input space dimensionality) the optimal variance does not shrink towards zero and is instead given by the average variance lost in the linear projection (the mean of the discarded eigenvalues of $\mathbf{S}$). Note that this does not violate the results of Dai and Wipf [2019], but highlights the need to consider model capacity against data complexity, as in Alemi et al. [2017].
Appendix E Experiment details
We used TensorFlow [Abadi et al., 2015] for our experiments with linear and deep VAEs. In each case, the models were trained using a single GPU.
Visualizing stationary points of pPCA
For this experiment we computed the pPCA MLE using a subset of 1000 random training images from the MNIST dataset. We evaluate and plot the log marginal likelihood in closed form on this same subset. In this case, we did not dequantize or apply any nonlinear processing to the data.
Stochastic vs. Analytic VAE
We trained linear VAEs with 200 hidden dimensions. We used full-batch training with 1000 MNIST digits sampled randomly from the training set (the same data as used to produce Figure 2). We trained each model with the Adam optimizer and a fixed learning rate, grid searching over $\{0.0001,0.0003,0.001,0.003\}$ to find the learning rate which gave the best ELBO after 12000 training steps. For both models, 0.001 provided the best final ELBO.
MNIST VAE
The VAEs we trained on MNIST all had the same architecture: 784-1024-512-k-512-1024-784. The Gaussian likelihood is fairly uncommon for this dataset, which is nearly binary, but it provides a good setting for us to investigate our theoretical findings. To dequantize the data, we added uniform random noise and rescaled the pixel values to be in the range $[0,1]$. We then applied a nonlinear logistic transform as in [Papamakarios et al., 2017]. The VAE parameters were optimized jointly using the Adam optimizer [Kingma and Ba, 2014]. We trained the VAE for 1000 epochs total, keeping the learning rate fixed throughout. We performed a grid search over learning rates in the range $\{0.0001,0.0003,0.001,0.003\}$ and reported results for the model which achieved the best training ELBO.
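The preprocessing described above can be sketched as follows. This assumes 8-bit input pixels and a logit transform of the form $\text{logit}(\alpha+(1-2\alpha)x)$. The value of `alpha`, the function name, and the division by 256 are assumptions for illustration and are not specified in this appendix.

```python
import numpy as np

def dequantize_and_logit(pixels, alpha=1e-6, seed=0):
    """Dequantize 8-bit pixels with uniform noise, rescale to [0, 1), apply a logit.

    alpha is a hypothetical boundary constant that keeps the logit finite.
    """
    rng = np.random.default_rng(seed)
    x = (pixels.astype(np.float64) + rng.uniform(size=pixels.shape)) / 256.0
    x = alpha + (1.0 - 2.0 * alpha) * x          # squeeze away from 0 and 1
    return np.log(x) - np.log1p(-x)              # logit(x)

pixels = np.array([[0, 255], [128, 64]], dtype=np.uint8)
transformed = dequantize_and_logit(pixels)
```

The transform maps the nearly binary pixel intensities to an unbounded real-valued space, which suits a Gaussian observation model better than raw values in $[0,1]$.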
CelebA VAE
E.1 Additional results
E.1.1 Evaluating KL Annealing
We found that KL-annealing may provide temporary relief from posterior collapse but that if ${\sigma}^{2}$ is not learned simultaneously then the collapsed solution is recovered. In Figure 7 we show the proportion of units collapsed by threshold for several fixed choices of ${\sigma}^{2}$ when $\beta $ is annealed from 0 to 1 over the first 100 epochs. The solid lines correspond to the final model while the dashed line corresponds to the model at 80 epochs of training. KL-annealing was able to reduce posterior collapse initially but eventually fell back to the collapsed solution.
After finding that KL-annealing alone was insufficient to prevent posterior collapse, we explored KL-annealing while learning ${\sigma}^{2}$. Based on our analysis in the linear case we expect that this should work well: while $\beta $ is small the model should be able to learn to reduce ${\sigma}^{2}$. We trained using the same KL schedule and also with standard ELBO while learning ${\sigma}^{2}$. The results are presented in Figure 8 and Figure 9. Under the ELBO objective, ${\sigma}^{2}$ is reduced somewhat but ultimately a large degree of posterior collapse is present. Using KL-annealing, the VAE is able to learn a much smaller ${\sigma}^{2}$ value and ultimately reduces posterior collapse. This suggests that the nonlinear VAE dynamics may be similar to the linear case when suitably conditioned.
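A KL-annealing schedule of the kind used in this section can be written in a few lines. The sketch below assumes a linear ramp of $\beta$ from 0 to 1 over the first 100 epochs. The text specifies the endpoints and duration but not the shape of the ramp, so the linear form and the function names are assumptions.

```python
def kl_weight(epoch, warmup_epochs=100):
    """Linear KL-annealing weight: 0 at epoch 0, 1 from warmup_epochs onwards."""
    return min(1.0, epoch / warmup_epochs)

def annealed_objective(recon, kl, epoch, warmup_epochs=100):
    """Annealed ELBO: expected reconstruction term minus beta times the KL term."""
    return recon - kl_weight(epoch, warmup_epochs) * kl

# Example: the same reconstruction and KL values weighted at different epochs.
early = annealed_objective(recon=-1000.0, kl=50.0, epoch=10)
late = annealed_objective(recon=-1000.0, kl=50.0, epoch=200)
```

Once the warm-up ends the objective is the standard ELBO.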
E.1.2 Full results tables
MNIST

| Init $\sigma^{2}$ | Final $\sigma^{2}$ | ELBO | $\sigma^{2}$-tuned ELBO | Tuned $\sigma^{2}$ | Posterior collapse (%) | KL divergence |
| --- | --- | --- | --- | --- | --- | --- |
| 30.0 | fixed | $1850.4\pm 29.0$ | $1374.9\pm 199.0$ | 4.451 | 95.00 | $10.9\pm 6.7$ |
| 10.0 | fixed | $1450.3\pm 4.2$ | $1098.2\pm 28.3$ | 1.797 | 89.88 | $28.8\pm 1.4$ |
| 3.0 | fixed | $1114.9\pm 1.1$ | $1018.8\pm 1.0$ | 1.361 | 76.75 | $58.5\pm 1.4$ |
| 1.0 | fixed | $1022.1\pm 5.4$ | $1018.3\pm 5.3$ | 1.145 | 27.38 | $125.4\pm 4.2$ |
| 0.3 | fixed | $1816.7\pm 270.6$ | $1104.6\pm 6.2$ | 1.275 | 2.00 | $179.3\pm 85.9$ |
| 0.1 | fixed | $3697.3\pm 493.3$ | $1190.8\pm 37.4$ | 0.968 | 3.25 | $368.7\pm 94.6$ |
| 0.03 | fixed | $18549.3\pm 4892.0$ | $1283.2\pm 63.3$ | 1.470 | 0.00 | $305.3\pm 75.4$ |
| 0.01 | fixed | $38612.5\pm 1189.8$ | $1403.1\pm 21.0$ | 1.006 | 0.00 | $560.9\pm 32.4$ |
| 0.003 | fixed | $139538.8\pm 21148.5$ | $2090.8\pm 975.1$ | 0.877 | 0.00 | $695.9\pm 118.1$ |
| 0.001 | fixed | $504259.1\pm 49149.8$ | $1744.7\pm 48.4$ | 0.810 | 0.00 | $756.2\pm 12.6$ |
| 30.0 | 1.478 | $1060.9\pm 23.1$ | $1061.0\pm 23.0$ | 1.476 | 33.75 | $70.9\pm 13.8$ |
| 10.0 | 1.32 | $1022.2\pm 4.5$ | $1022.3\pm 4.6$ | 1.318 | 73.75 | $73.8\pm 9.8$ |
| 3.0 | 1.178 | $1004.6\pm 1.4$ | $1004.5\pm 1.3$ | 1.181 | 58.38 | $99.8\pm 1.5$ |
| 1.0 | 1.183 | $1011.1\pm 2.7$ | $1011.1\pm 2.8$ | 1.182 | 47.88 | $106.3\pm 2.5$ |
| 0.3 | 1.195 | $1020.0\pm 6.0$ | $1019.9\pm 6.1$ | 1.191 | 37.75 | $111.6\pm 6.1$ |
| 0.1 | 1.194 | $1025.4\pm 8.6$ | $1025.4\pm 8.6$ | 1.195 | 29.25 | $116.1\pm 11.4$ |
| 0.03 | 1.197 | $1030.6\pm 6.6$ | $1030.5\pm 6.6$ | 1.198 | 22.62 | $120.2\pm 10.5$ |
| 0.01 | 1.194 | $1030.6\pm 3.5$ | $1030.5\pm 3.5$ | 1.191 | 23.00 | $121.9\pm 7.7$ |
| 0.003 | 1.19 | $1033.7\pm 2.3$ | $1033.6\pm 2.3$ | 1.187 | 16.62 | $126.4\pm 6.8$ |
| 0.001 | 1.208 | $1038.7\pm 5.6$ | $1038.8\pm 5.6$ | 1.209 | 27.00 | $124.9\pm 1.6$ |


CELEBA 64

| Model init ${\sigma}^{2}$ | Model final ${\sigma}^{2}$ | ELBO | ${\sigma}^{2}$-tuned ELBO | Tuned ${\sigma}^{2}$ | Posterior collapse (%) | KL divergence |
| --- | --- | --- | --- | --- | --- | --- |
| 30.0 |  | $79986.2\pm 0.10$ | $57883.8\pm 19.3$ | 0.423 | 93.68 | $26.0\pm 0.2$ |
| 10.0 |  | $73328.4\pm 0.49$ | $55186.7\pm 35.1$ | 0.204 | 80.56 | $56.12\pm 0.4$ |
| 3.0 |  | $66145.6\pm 2.44$ | $52828.5\pm 58.6$ | 0.132 | 20.64 | $120.4\pm 1.4$ |
| 1.0 |  | $59841.8\pm 30.1$ | $51294.8\pm 333.7$ | 0.102 | 2.52 | $213.4\pm 6.3$ |
| 0.3 |  | $54370.4\pm 849.9$ | $52155.2\pm 1855.2$ | 0.122 | 74.52 | $267.2\pm 51.9$ |
| 0.1 |  | $50760.3\pm 353.4$ | $50698.5\pm 393.9$ | 0.0883 | 32.72 | $483.8\pm 36.2$ |
| 0.03 |  | $64322.8\pm 312.9$ | $58077.9\pm 206.2$ | 0.0463 | 0.00 | $1521.1\pm 11.6$ |
| 0.01 |  | $82478.7\pm 1823.3$ | $51373.9\pm 213.3$ | 0.0817 | 0.00 | $1624.2\pm 8.8$ |
| 0.003 |  | $192967.7\pm 4410.4$ | $51978.4\pm 159.3$ | 0.0685 | 0.00 | $2108.4\pm 26.2$ |
| 0.001 |  | $531924.5\pm 17177.6$ | $57381.5\pm 512.6$ | 0.0296 | 0.00 | $2680.2\pm 41.5$ |
| 30.0 | 0.478 | $57773.0\pm 3622.9$ | $56068.5\pm 2771.0$ | 0.475 | 14.20 | $221.7\pm 99.0$ |
| 10.0 | 0.0962 | $51109.5\pm 408.2$ | $51109.5\pm 408.3$ | 0.0963 | 53.32 | $364.5\pm 26.4$ |
| 3.0 | 0.0891 | $50813.2\pm 229.7$ | $50813.3\pm 229.7$ | 0.0889 | 10.96 | $545.2\pm 5.5$ |
| 1.0 | 0.0875 | $50631.2\pm 163.4$ | $50631.0\pm 163.3$ | 0.0875 | 54.76 | $462.2\pm 20.0$ |
| 0.3 | 0.0890 | $50963.4\pm 331.2$ | $50963.2\pm 331.3$ | 0.0892 | 7.96 | $670.7\pm 79.2$ |
| 0.1 | 0.0863 | $50646.9\pm 269.0$ | $50645.9\pm 267.5$ | 0.0869 | 28.84 | $520.9\pm 11.7$ |
| 0.03 | 0.121 | $53263.4\pm 71.5$ | $53263.3\pm 71.3$ | 0.126 | 0.00 | $856.2\pm 19.7$ |
| 0.01 | 0.0911 | $51285.0\pm 708.1$ | $51284.8\pm 708.1$ | 0.0963 | 5.64 | $557.0\pm 50.5$ |
| 0.003 | 0.0952 | $51056.4\pm 1216.9$ | $51055.9\pm 1217.4$ | 0.094 | 0.80 | $577.4\pm 30.4$ |
| 0.001 | 0.104 | $51695.1\pm 322.4$ | $51694.8\pm 322.7$ | 0.0974 | 0.00 | $537.5\pm 46.2$ |

MNIST

| Model init ${\sigma}^{2}$ | Model final ${\sigma}^{2}$ | ELBO | ${\sigma}^{2}$-tuned ELBO | Tuned ${\sigma}^{2}$ | Posterior collapse (%) | KL divergence |
| --- | --- | --- | --- | --- | --- | --- |
| 30.0 |  | $6402.0\pm 0.0$ | $6248.4\pm 197.2$ | 22.323 | 0.00 | $0.0\pm 0.0$ |
| 10.0 |  | $5973.1\pm 0.0$ | $5821.0\pm 194.6$ | 7.443 | 0.00 | $0.0\pm 0.0$ |
| 3.0 |  | $5507.1\pm 0.1$ | $5360.4\pm 185.4$ | 2.235 | 1.70 | $0.6\pm 0.3$ |
| 1.0 |  | $5087.9\pm 3.1$ | $4954.7\pm 156.9$ | 0.747 | 0.00 | $4.5\pm 2.3$ |
| 0.3 |  | $4638.4\pm 3.6$ | $4516.8\pm 137.9$ | 0.225 | 0.00 | $12.5\pm 1.5$ |
| 0.1 |  | $4243.1\pm 17.6$ | $4154.6\pm 62.1$ | 0.076 | 0.00 | $25.6\pm 3.0$ |
| 0.03 |  | $3820.7\pm 13.9$ | $3785.2\pm 26.6$ | 0.027 | 0.00 | $55.8\pm 2.1$ |
| 0.01 |  | $3508.4\pm 12.3$ | $3483.5\pm 13.1$ | 0.009 | 0.00 | $112.8\pm 6.7$ |
| 0.003 |  | $3267.3\pm 2.6$ | $3247.1\pm 2.8$ | 0.003 | 0.00 | $252.2\pm 2.1$ |
| 0.001 |  | $3137.7\pm 5.2$ | $3136.7\pm 5.4$ | 0.001 | 0.00 | $422.7\pm 2.6$ |
| 30.0 | 0.067 | $4398.7\pm 0.0$ | $4398.7\pm 0.0$ | 0.067 | 0.00 | $0.0\pm 0.0$ |
| 10.0 | 0.044 | $4146.3\pm 309.2$ | $4146.3\pm 309.2$ | 0.044 | 0.00 | $30.1\pm 36.9$ |
| 3.0 | 0.01 | $3736.3\pm 14.3$ | $3736.4\pm 14.3$ | 0.010 | 0.00 | $73.7\pm 1.9$ |
| 1.0 | 0.008 | $3673.0\pm 17.7$ | $3672.9\pm 17.7$ | 0.008 | 0.00 | $85.2\pm 2.5$ |
| 0.3 | 0.006 | $3569.8\pm 26.4$ | $3569.8\pm 26.4$ | 0.006 | 0.00 | $100.8\pm 3.7$ |
| 0.1 | 0.003 | $3355.8\pm 7.6$ | $3355.8\pm 7.6$ | 0.003 | 0.00 | $151.7\pm 2.4$ |
| 0.03 | 0.001 | $3138.9\pm 10.6$ | $3139.0\pm 10.6$ | 0.001 | 0.00 | $275.4\pm 3.1$ |
| 0.01 | 0.001 | $3126.1\pm 5.0$ | $3126.1\pm 5.0$ | 0.001 | 0.00 | $349.3\pm 5.4$ |
| 0.003 | 0.001 | $3161.4\pm 4.0$ | $3161.3\pm 4.0$ | 0.001 | 0.00 | $373.5\pm 7.5$ |
| 0.001 | 0.001 | $3145.4\pm 6.1$ | $3145.4\pm 6.1$ | 0.001 | 0.00 | $378.4\pm 7.7$ |

CELEBA 64

| Model init ${\sigma}^{2}$ | Model final ${\sigma}^{2}$ | ELBO | ${\sigma}^{2}$-tuned ELBO | Tuned ${\sigma}^{2}$ | Posterior collapse (%) | KL divergence |
| --- | --- | --- | --- | --- | --- | --- |
| 30.0 |  | $79986.2\pm 0.10$ | $57883.8\pm 19.3$ | 0.423 | 93.68 | $26.0\pm 0.19$ |
| 10.0 |  | $73328.4\pm 0.49$ | $55186.7\pm 35.1$ | 0.204 | 80.56 | $56.12\pm 0.42$ |
| 3.0 |  | $66145.6\pm 2.44$ | $52828.5\pm 58.6$ | 0.132 | 20.64 | $120.4\pm 1.37$ |
| 1.0 |  | $59841.8\pm 30.1$ | $51294.8\pm 333.7$ | 0.102 | 2.52 | $213.4\pm 6.3$ |
| 0.3 |  | $54370.4\pm 849.9$ | $52155.2\pm 1855.2$ | 0.122 | 74.52 | $267.2\pm 51.9$ |
| 0.1 |  | $50760.3\pm 353.4$ | $50698.5\pm 393.9$ | 0.0883 | 32.72 | $483.8\pm 36.2$ |
| 0.03 |  | $64322.8\pm 312.9$ | $58077.9\pm 206.2$ | 0.0463 | 0.00 | $1521.1\pm 11.6$ |
| 0.01 |  | $82478.7\pm 1823.3$ | $51373.9\pm 213.3$ | 0.0817 | 0.00 | $1624.2\pm 8.78$ |
| 0.003 |  | $192967.7\pm 4410.4$ | $51978.4\pm 159.3$ | 0.0685 | 0.00 | $2108.4\pm 26.2$ |
| 0.001 |  | $531924.5\pm 17177.6$ | $57381.5\pm 512.6$ | 0.0296 | 0.00 | $2680.2\pm 41.45$ |
| 30.0 | 0.005 | $53179.6\pm 450.2$ | $53179.6\pm 450.3$ | 0.005 | 0.00 | $302.8\pm 29.8$ |
| 10.0 | 0.004 | $51748.5\pm 178.2$ | $51748.5\pm 178.2$ | 0.004 | 0.00 | $482.3\pm 24.7$ |
| 3.0 | 0.004 | $51548.9\pm 154.1$ | $51548.9\pm 154.2$ | 0.004 | 0.00 | $489.5\pm 21.8$ |
| 1.0 | 0.004 | $51356.9\pm 79.1$ | $51356.9\pm 79.1$ | 0.004 | 0.00 | $516.3\pm 18.0$ |
| 0.3 | 0.004 | $51767.7\pm 369.2$ | $51767.7\pm 369.1$ | 0.004 | 22.00 | $439.7\pm 33.3$ |
| 0.1 | 0.004 | $51637.3\pm 163.3$ | $51637.1\pm 163.5$ | 0.004 | 0.00 | $577.3\pm 13.5$ |
| 0.03 | 0.004 | $51792.6\pm 163.4$ | $51792.6\pm 163.6$ | 0.004 | 45.48 | $484.6\pm 22.6$ |
| 0.01 | 0.004 | $51925.1\pm 99.8$ | $51924.9\pm 99.8$ | 0.004 | 0.00 | $627.8\pm 20.6$ |
| 0.003 | 0.004 | $52111.2\pm 149.0$ | $52111.0\pm 148.8$ | 0.004 | 42.80 | $466.9\pm 13.9$ |
| 0.001 | 0.004 | $52060.1\pm 171.8$ | $52060.0\pm 171.9$ | 0.004 | 0.0 | $645.6\pm 19.2$ |
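
This section does not spell out how the "Tuned ${\sigma}^{2}$" column is computed. One natural reading, assumed here, is that ${\sigma}^{2}$ is re-fit after training with all other parameters held fixed. For an isotropic Gaussian observation model the KL term does not depend on ${\sigma}^{2}$, so the maximizer has a closed form: the mean squared reconstruction error per data dimension. The sketch below illustrates that calculation; the function names are hypothetical and `x_recon` stands in for the decoder mean.

```python
import numpy as np

def tune_sigma2(x, x_recon):
    """sigma^2 maximizing the Gaussian log-likelihood with the decoder mean fixed.

    x, x_recon: arrays of shape (N, D). Setting the derivative of the average
    log-likelihood with respect to sigma^2 to zero gives the squared error
    averaged over all N * D entries.
    """
    return float(np.mean((x - x_recon) ** 2))

def gaussian_log_lik(x, x_recon, sigma2):
    """Average over examples of log N(x; x_recon, sigma2 * I)."""
    d = x.shape[1]
    sq_err = np.sum((x - x_recon) ** 2, axis=1)
    per_example = -0.5 * d * np.log(2.0 * np.pi * sigma2) - sq_err / (2.0 * sigma2)
    return float(np.mean(per_example))
```

Under this reading, a model that has already learned ${\sigma}^{2}$ to convergence should gain almost nothing from tuning, which is consistent with the rows above where the final and tuned ${\sigma}^{2}$ values nearly coincide.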
E.1.3 Qualitative Results
Reconstructions from the KL-annealed CelebA model are shown in Figure 12. We also show the output of interpolating in the latent space in Figure 13. To produce the latter plot, we compute the variational means of three input points (top left, top right, bottom left) and interpolate linearly on the plane between them. We also extrapolate to a fourth point (bottom right), which lies on the plane defined by the other three.
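
The interpolation procedure can be sketched as below. The text only states that the fourth point lies on the plane through the other three; the sketch assumes it is the parallelogram completion (top right + bottom left - top left), and the function name and grid size `n` are illustrative rather than taken from the experiments. Each grid entry would then be passed through the decoder to produce one image.

```python
import numpy as np

def latent_plane_grid(z_tl, z_tr, z_bl, n):
    """Return an (n, n, K) grid of latent codes on the plane through three points.

    z_tl, z_tr, z_bl: variational means of shape (K,) for the top-left,
    top-right, and bottom-left inputs. For n >= 2, grid[0, 0] = z_tl,
    grid[0, -1] = z_tr, grid[-1, 0] = z_bl, and grid[-1, -1] is the
    extrapolated corner z_tr + z_bl - z_tl.
    """
    s = np.linspace(0.0, 1.0, n)
    rows = s[:, None, None]  # vertical coordinate, top to bottom
    cols = s[None, :, None]  # horizontal coordinate, left to right
    return z_tl + cols * (z_tr - z_tl) + rows * (z_bl - z_tl)
```

Because the grid is affine in the two coordinates, every row and column is itself a straight-line interpolation between its endpoints.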