Adaptive Nonparametric Variational Autoencoder

  • 2019-06-07 18:32:46
  • Tingting Zhao, Zifeng Wang, Aria Masoomi, Jennifer G. Dy
  • 22


Clustering is used to find structure in unlabeled data by grouping similarobjects together. Cluster analysis depends on the definition of similarity inthe feature space. In this paper, we propose an Adaptive NonparametricVariational Autoencoder (AdapVAE) to perform end-to-end feature learning fromraw data jointly with cluster membership learning through a NonparametricBayesian modeling framework with deep neural networks. It has the advantage ofavoiding pre-definition of similarity or feature engineering. Our model relaxesthe constraint of fixing the number of clusters in advance by assigning aDirichlet Process prior on the latent representation in a low-dimensionalfeature space. It can adaptively detect novel clusters when new data arrivesbased on a learned model from historical data in an online unsupervisedlearning setting. We develop a joint online variational inference algorithm tolearn feature representations and cluster assignments via iterativelyoptimizing the evidence lower bound (ELBO). Our experimental resultsdemonstrate the capacity of our modelling framework to learn the number ofclusters automatically using data, the flexibility to detect novel clusterswith emerging data adaptively, the ability of high quality reconstruction andgeneration of samples without supervised information and the improvement overstate-of-the-art end-to-end clustering methods in terms of accuracy on bothimage and text corpora benchmark datasets.


Quick Read (beta)

Adaptive Nonparametric Variational Autoencoder Supplement

Tingting Zhao Electrical and Computer Engineering Department, Northeastern University Zifeng Wang Electrical and Computer Engineering Department, Northeastern University Aria Masoomi Electrical and Computer Engineering Department, Northeastern University Jennifer G. Dy Electrical and Computer Engineering Department, Northeastern University


Adaptive Nonparametric Variational Autoencoder Supplement



noticebox[b]Preprint. Under review.\[email protected]

1 Variational Inference for AdapVAE and ELBO Derivation

In this section, we provide the ELBO derivation. Recall that we use the variational distribution q⁒(π’š,𝒛,Ο•,𝒗|𝒙) to approximate the posterior distribution p⁒(π’š,𝒛,Ο•,𝒗|𝒙). Minimizing the Kullback-Leibler (KL) divergence between q⁒(π’š,𝒛,Ο•,𝒗|𝒙) and p⁒(π’š,𝒛,Ο•,𝒗|𝒙) is equivalent to maximizing the ELBO β„’ELBO. We first list the assumptions on the variational distribution q⁒(π’š,𝒛,Ο•,𝒗|𝒙) and then provide the ELBO derivation and the updating equations. We assume that

q⁒(π’š,𝒛,Ο•,𝒗|𝒙) =qψ⁒(𝒛|𝒙)⁒q⁒(π’š,𝒛,Ο•|𝒙)

Now, we list the variational distribution assumptions for qψ⁒(𝒛|𝒙), q⁒(π’š), q⁒(𝒗) and q⁒(Ο•) respectively.

q⁒(π’š,𝒗,Ο•|𝒙)=∏t=1T-1qΞ·t⁒(vt)⁒∏t=1TqΞΆt⁒(Ο•t)⁒∏n=1Nqρn⁒(yn), (1)

where T is the number of mixture components in the DP of the variational distribution and 𝒛nβˆΌπ’©β’(𝒛n|ΞΌt,Ξ›t-1).

  • β€’


  • β€’

    qΞΆt⁒(Ο•t)=q⁒(ΞΌt|Ξ›t)⁒q⁒(Ξ›t)=𝒩⁒(ΞΌt|mt,(Ξ²t⁒Λt)-1)⁒𝒲⁒(Ξ›t|Wt,Ξ½t), where Ο•t=(ΞΌt,Ξ›t).

  • β€’

    q⁒(yn)=Mult⁒(T,ρn), which is a Multinomial distribution.

  • β€’


Under our assumptions, the β„’ELBO⁒(x) can be rewritten as:

β„’ELBO⁒(x)=𝔼q⁒(π’š,𝒛,Ο•,𝒗|𝒙)⁒[log⁑p⁒(𝒙,π’š,𝒛,Ο•,𝒗)q⁒(π’š,𝒛,Ο•,𝒗|𝒙)]=𝔼q⁒(π’š,𝒛,Ο•,𝒗|𝒙)⁒[log⁑p⁒(𝒙|𝒛)⁒p⁒(𝒛|π’š,Ο•)⁒p⁒(π’š|𝒗)⁒p⁒(𝒗)⁒p⁒(Ο•)qψ(𝒛|𝒙)q(π’š|𝒙)q(𝒗)q(Ο•|𝒙))]=𝔼q⁒(π’š,𝒛,Ο•,𝒗|𝒙)⁒[log⁑p⁒(𝒙|𝒛)]+𝔼q⁒(π’š,𝒛,Ο•,𝒗|𝒙)⁒[log⁑p⁒(𝒛|π’š,Ο•)]+𝔼q⁒(π’š,𝒛,Ο•,𝒗|𝒙)⁒[log⁑p⁒(π’š|𝒗)]+𝔼q⁒(π’š,𝒛,Ο•,𝒗|𝒙)⁒[log⁑p⁒(𝒗)]+𝔼q⁒(π’š,𝒛,Ο•,𝒗|𝒙)⁒[log⁑p⁒(Ο•)]-𝔼q⁒(π’š,𝒛,Ο•,𝒗|𝒙)⁒[log⁑qψ⁒(𝒛|𝒙)]-𝔼q⁒(π’š,𝒛,Ο•,𝒗|𝒙)⁒[log⁑q⁒(π’š|𝒙)]-𝔼q⁒(π’š,𝒛,Ο•,𝒗|𝒙)⁒[log⁑q⁒(𝒗)]-𝔼q⁒(π’š,𝒛,Ο•,𝒗|𝒙)⁒[log⁑q⁒(Ο•|𝒙)] (2)

In our updating strategy, we adopt an alternating optimization strategy used by goyal2017nonparametric. To be specific, we update the VAE parameters (ΞΈ and ψ) and the latent variable (𝒛) given the current estimates of the DPMM parameters. When updating the DPMM parameters, the latent representation 𝒛 is treated as the observations for DPMM. The updates for the local cluster membership assignment parameter π’š, global parameters Ο• and 𝒗 simplifies to the updates in variational inference for DPMM developed by blei2006variational. Hence, we only list the updating equations and the expectation derivation involving q⁒(π’š), q⁒(Ο•) and q⁒(𝒗) at the end of this section. The notation summary is provided in the main paper. We focus on deriving the nonstandard terms involving the VAE parameters ΞΈ, ψ and latent representation 𝒛 first. (1) 𝔼qψ⁒(𝒛|𝒙)⁒q⁒(π’š)⁒q⁒(𝒗)⁒q⁒(Ο•)⁒[log⁑Pθ⁒(𝒙|𝒛)]: We use a neural network g to model the decoder with parameters ΞΈ, where (𝝁𝒙,logβ‘πˆπ’™2)=gθ⁒(𝒛) and Pθ⁒(𝒙|𝒛)=𝒩⁒(𝒙;𝝁𝒙,πˆπ’™2⁒𝑰). Hence, we have

𝔼 [logPΞΈ(𝒙|𝒛)]qψ⁒(𝒛|𝒙)⁒q⁒(π’š)⁒q⁒(𝒗)⁒q⁒(Ο•)=𝔼qψ⁒(𝒛|𝒙)⁒q⁒(π’š)⁒q⁒(𝒗)⁒q⁒(Ο•)(-12log(πˆπ’™2)j+(𝒙j-(𝝁𝒙)j)2(πˆπ’™2)j)
=-121Lβˆ‘i=1Nβˆ‘j=1D(log(𝝈x2)j(l)+(𝒙i⁒j-(𝝁x)j(l))2(𝝈x2)j(l)), (3)


  • β€’

    (𝝁𝒙)j: represents the jth element of 𝝁𝒙 for j=1,2,…,D.

  • β€’

    (πˆπ’™2)j: represents the jth element of πˆπ’™2 for j=1,2,…,D.

  • β€’

    𝒙i⁒j: represents the jth element of the ith observation.

(2) 𝔼qψ⁒(𝒛|𝒙)⁒q⁒(π’š)⁒q⁒(𝒗)⁒q⁒(Ο•)⁒[log⁑p⁒(𝒛|π’š,Ο•)]: We use neural network f to model the encoder with parameters ψ, where qψ⁒(𝒛|𝒙)=𝒩⁒(𝝁⁒(𝒙;ψ),𝝈2⁒(𝒙;ψ)) and (𝝁⁒(𝒙;ψ),log⁑𝝈2⁒(𝒙;ψ))=f⁒(𝒙;ψ). In VAE, we use the reparameterization trick to allow backpropagation:


We denote 𝒛^n as the estimated mean of the latent representation from the encoder given xn:


According to Equation 10.71 of bishop2006pattern, we have the following:

𝔼qψ⁒(𝒛|𝒙)⁒q⁒(π’š)⁒q⁒(𝒗)⁒q⁒(Ο•)⁒[log⁑p⁒(𝒛|π’š,Ο•)]=12β’βˆ‘k=1TNk⁒{log⁑Λ~k-D⁒βk-1-Ξ½k⁒Tr⁒(Sk⁒Wk)-Ξ½k⁒(zΒ―k-π’Žk)T⁒Wk⁒(zΒ―k-π’Žk)-D⁒log⁑(2⁒π)}, (4)


𝔼ϕk⁒[(𝒛^n-ΞΌk)T⁒Λk⁒(𝒛^n-ΞΌk)]=DΞ²k+Ξ½k⁒(𝒛^n-π’Žk)T⁒Wk⁒(𝒛^n-π’Žk)log⁑Λ~k=𝔼⁒[log⁑Λk]=βˆ‘j=1Dψ⁒(Ξ½k+1-i2)+D⁒log⁑2+log⁑|Wk|. (5)

(3) 𝔼⁒qψ⁒(𝒛|𝒙)⁒q⁒(π’š)⁒q⁒(𝒗)⁒q⁒(Ο•)⁒(log⁑qψ⁒(𝒛|𝒙)): We assume that qψ⁒(𝒛|𝒙)=𝒩⁒(𝝁⁒(𝒙;ψ),𝝈2⁒(𝒙;ψ)). Hence, 𝔼⁒qψ⁒(𝒛|𝒙)⁒q⁒(π’š)⁒q⁒(𝒗)⁒q⁒(Ο•)⁒(log⁑qψ⁒(𝒛|𝒙)) is equal to the negative entropy of a multivariate Gaussian distribution, which is:

𝔼⁒qψ⁒(𝒛|𝒙)⁒q⁒(π’š)⁒q⁒(𝒗)⁒q⁒(Ο•)⁒(log⁑qψ⁒(𝒛|𝒙))=12⁒log⁑(Det⁒(2⁒π⁒e⁒Σ)) (6)

where Ξ£=𝝈2⁒(𝒙;ψ)⁒I. When we update the VAE parameters ΞΈ and ψ and the latent representation 𝒛, the DPMM parameters will be fixed. Thus, the terms that do not involve 𝒛, ΞΈ, ψ will not contribute to the LELBO. Hence, by combining EquationΒ 3, 4 and 6, we obtain

β„’ELBO⁒(𝒙) =-12⁒Lβ’βˆ‘l=1Lβˆ‘k=1TNk⁒νk⁒{Tr⁒(Sk⁒Wk)+(zΒ―k-π’Žk)T⁒Wk⁒(zΒ―k-π’Žk)}
-121Lβˆ‘i=1Nβˆ‘j=1D(log(𝝈x2)j(l)+(𝒙i⁒j-(𝝁x)j(l))2/(𝝈x2)j(l))+12log(Det(2Ο€eΞ£)). (7)

Here, we list the standard variational inference updating equations and derivations for DPMM.

  • β€’


  • β€’


  • β€’


  • β€’


  • β€’


  • β€’


  • β€’


  • β€’

    Under the Gaussian-Wishart distribution assumption,



  • β€’

    Similarly, we have




A summary of notations for deriving the ELBO is listed below.

Table 1: Notations in the ELBO.
Notations in the ELBO
N: the total number of observations.
L: the number of samples we obtain when using SGVB and representation trick for VAE.
Ξ£: the diagonal covariance matrix of the encoder.
𝒙n: the nth observation.
p(π’ši=k)=Ξ³i⁒k: the probability of the ith observation in the kth cluster.
Nk=βˆ‘n=1NΞ³n⁒k: the sum of the probability of the total N observations in the kth cluster.
𝒛^n=𝝁⁒(𝒙n;ψ): the estimated mean of the latent representation from the encoder given 𝒙n.
𝒛¯k=1Nkβ’βˆ‘n=1NΞ³n⁒k⁒𝒛^n: the weighted mean of the kth cluster in the latent representation space.
Ξ²k=Ξ²0+Nk: the scalar precision for 𝒛k in the Normal-Wishart distribution.
π’Žk=1Ξ²k⁒(Ξ²0β’π’Ž0+Nk⁒zΒ―k): the updated formula for the mean of the Normal-Wishart distribution for the
kth cluster, where the prior π’Ž0 is usually chosen as a zero vector.
Wk-1=W0-1+Nk⁒Sk+Ξ²0⁒NkΞ²0+Nk⁒(𝒛¯k-π’Ž0)⁒(𝒛¯k-π’Ž0)T: the updated formula for Wk-1, the parameter
of Normal-Wishart distribution.
Ξ½k=Ξ½0+Nk: updated degrees of freedom of the Normal-Wishart distribution for the kth cluster.
Ο•: the variational parameters of the Normal Wishart mixing components in the DP.
Ξ·t: the variational parameters of a Beta distribution for the tth component defined in EquationΒ 1.
ΞΆt: the variational parameters of the Normal-Wishart distribution for Ο•t defined in EquationΒ 1.
ρn: the variational parameters of the categorical distribution for the cluster membership for each observation.

2 Benchmark Datasets Description

  • β€’

    MNIST: The MNIST dataset consists images of 70000 handwritten digits of 28Γ—28 pixel size. In order to compare fairly with previous methods, we did normalization and flattened each image to a 784Γ—1 vector.

  • β€’

    STL-10: The STL-10 dataset consists of color images of 96Γ—96 pixel size. There are 10 classes with 1300 examples each. Following previous works, we fed original images to ResNet (he2016deep) pretrained on ImageNet (deng2009imagenet) and used the last feature map after the 3Γ—3 average pooling layer. So the extracted feature is of size 2048Γ—1.

  • β€’

    REUTERS: The Reuters dataset contains about 810000 English news stories labeled with a category tree. Following previous works, we just used four root categories corporate/industrial, government/social, markets, and economics as labels and discard articles have multiple labels to get 685071 articles. We then randomly sampled a subset of 10000 articles called call REUTERS-10K. As our method are scalable by its online nature, we mainly experimented on REUTERS-10K.

3 Evaluation metrics

  • β€’

    Normalized Mutual Information (NMI) is a normalized metric for determining the quality of clustering. It can be used to compare different clusterings with different number of clusters. Its range is between zero and one, which represents no mutual information and perfect correlation. The NMI is defined as follows:


    where l is the ground-truth label, c is the cluster assignment by the algorithm, I and H represents mutual information and entropy respectively (the definition for I and H is the same among all the following metrics).

  • β€’

    Adjusted Random Index (ARI) ranges between zero and one. If it is close to zero, it represents random labeling independently of the number of clusters and samples; it equals to one when the clusterings are identical as the true one (up to a permutation). Given a set S of n samples, where C is the set of true classes, C={ci|i=1,…,nc} and K is the set of clusters, K={ki|i=1,…,nk}. Define A to be the contingency table produced by the clustering algorithm such that every element ai⁒j in A represents the number of samples that are members of class ci and elements of cluster kj. Therefore, the ARI is defined as follows according to hubert1985comparing:


    where ai=βˆ‘jai⁒j and bj=βˆ‘iai⁒j.

  • β€’

    Homogeneity Score (HS) is a homogeneity metric of a cluster labeling given the ground truth. A clustering satisfies homogeneity (with value one) if all of its clusters contain only data points which are members of a single class. In rosenberg2007v, they assume n is number of observations and share the same definition of C, K and A as in ARI. They define homogeneity as:

    h={1if H⁒(C,K)=0,1-H⁒(C|K)H⁒(C)else,


    Since H⁒(C|K)≀H⁒(C), the value of h is between zero and one. In the degenerate case where H⁒(C)=0, they define h to be 1.

  • β€’

    V-measure score (VM) is a metric to measure the agreenment of two independent clusterings on the same dataset. Its range is between zero and one where one stands for perfect complete clustering as the ground truth. V-measure is the weighted harmonic mean of homogeneity and completeness. rosenberg2007v define the completeness measure as follows, which is symmetrical to homogeneity defined as HS previously (definitions of parameters are the same as in HS):

    c={1if H⁒(K,C)=0,1-H⁒(K|C)H⁒(K)else,




    Similarly, in the degenerate case where H⁒(K)=0, they define c to be 1.
    The V-measure is defined as follows:


    Where Ξ² is the weighting factor. Note that if Ξ² is greater than one, completeness is weighted more strongly; if Ξ² is less than one, homogeneity is weighted more strongly.

4 Figures

Figure 1: We provide the tSNE (maaten2008visualizing) plot by transforming the ten dimensional latent representation to a two dimensional space for visualization. Different colors indicate the ground-truth classes corresponding to the fitted classes from our algorithm. The number of true clusters is nine (since the dataset does not include number two) and we obtain ten fitted clusters where the β€œmixture" cluster is composed of samples from multiple classes. This figure is best viewed in color.

5 Tables

Table 2: Clustering quality (%) comparison on benchmark datasets averaged over five replications. Both the average value and the standard error (in the parenthesis) are provided.
Dataset Method NMI ARI HS VM ACC
MNIST DEC 84.67 (2.25) 83.67 (4.53) 84.67 (2.25) 84.67 (2.25) 91.04 (3.39)
VaDE 80.35 (4.68) 74.06 (9.11) 79.86 (4.93) 80.36 (4.69) 80.87 (9.15)
VAE+DP 81.70 (0.825) 70.49 (1.654) 91.27 (0.215) 81.19 (0.904) 94.23 (0.183)
AdapVAE 85.72 (1.02) 83.53 (2.35) 89.34 (0.25) 85.65 (0.51) 93.85 (0.51)
Reuters10k DEC 46.56 (5.36) 46.86 (7.98) 48.44 (5.44) 46.52 (5.36) 67.27 (4.56)
VaDE 41.64 (4.73) 38.49 (5.44) 43.64 (4.88) 41.60 (4.73) 63.8 (5.08)
VAE + DP 41.62 (2.99) 37.93 (4.57) 46.64 (3.85) 41.34 (2.94) 78.14 (2.25)
AdapVAE 45.32 (1.79) 42.66 (5.73) 48.88 (1.86) 45.40 (2.04) 78.32 (1.79)
STL10 DEC 71.92 (2.66) 58.73 (5.09) 68.47 (3.48) 71.83 (2.72) 66.89 (4.26)
VaDE 68.35 (3.85) 59.42 (6.84) 67.24 (4.23) 68.37 (3.92) 72.10 (7.40)
VAE+DP 43.18 (1.41) 26.58 (1.32) 42.28 (1.03) 43.16 (1.39) 55.21 (1.45)
AdapVAE 75.26 (0.53) 70.72 (0.81) 77.61 (1.29) 75.22 (0.52) 85.28 (1.40)
Table 3: Precision and recall for each cluster using the "Historical" data without zero or two on MNIST. Number seven has the major contribution to the mixture cluster.
Precision Recall TrueClass FittedClass
0.999 0.914 1 1
0.908 0.962 3 0
0.952 0.97 4 3
0.996 0.837 5 7
0.997 0.901 6 5
0.995 0.78 7 6
0.909 0.935 8 3
0.992 0.885 9 4
0.4 0.214 Mixture (7) 8
Table 4: Precision and recall for each cluster using the "First" dataset with number zero as the novel class on MNIST. Number seven has the major contribution to the mixture cluster.
Precision Recall TrueClass FittedClass
0.999 0.937 0 0
0.9986 0.9127 1 2
0.9095 0.9473 3 1
0.972 0.975 4 3
0.973 0.923 5 8
0.995 0.932 6 5
0.994 0.741 7 7
0.973 0.923 8 6
0.996 0.921 9 4
0.333 0.254 Mixture (7) 9
Table 5: Precision and recall for each cluster using the "Second" dataset with number two as the novel class on MNIST. Number seven has the major contribution to the mixture class.
Precision Recall TrueClass FittedClass
0.998 0.96 0 1
1.000 0.904 1 5
0.976 0.967 2 0
0.918 0.91 3 4
0.987 0.964 4 8
0.991 0.834 5 7
0.995 0.937 6 2
0.992 0.712 7 10
0.942 0.948 8 3
0.989 0.939 9 6
0.366 0.283 Mixture (7) 9