ElasticInfoGAN: Unsupervised Disentangled Representation Learning in Imbalanced Data
Abstract
We propose a novel unsupervised generative model, ElasticInfoGAN, that learns to disentangle object identity from other low-level aspects in class-imbalanced datasets. We first investigate the issues surrounding the assumptions about uniformity made by InfoGAN (Chen et al. (2016)), and demonstrate that it fails to properly disentangle object identity in imbalanced data. Our key idea is to make the discovery of the discrete latent factor of variation invariant to identity-preserving transformations in real images, and use that as the signal to learn the latent distribution’s parameters. Experiments on both artificial (MNIST) and real-world (YouTubeFaces) datasets demonstrate the effectiveness of our approach in imbalanced data by: (i) better disentanglement of object identity as a latent factor of variation; and (ii) better approximation of class imbalance in the data, as reflected in the learned parameters of the latent distribution.
1 Introduction
Generative models aim to model the true data distribution, so that fake samples that seemingly belong to the modeled distribution can be generated (Ackley et al. (1985); Rabiner (1989); Blei et al. (2003)). Recent deep neural network based models such as Generative Adversarial Networks (Goodfellow et al. (2014); Salimans et al. (2016); Radford et al. (2016)) and Variational Autoencoders (Kingma and Welling (2014); Higgins et al. (2017)) have led to promising results in generating realistic samples for high-dimensional and complex data such as images. More advanced models show how to discover disentangled representations (Yan et al. (2016); Chen et al. (2016); Tran et al. (2017); Hu et al. (2018); Singh et al. (2019)), in which different latent dimensions can be made to represent independent factors of variation (e.g., pose, identity) in the data (e.g., human faces).
InfoGAN (Chen et al. (2016)), in particular, tries to learn an unsupervised disentangled representation by maximizing the mutual information between the discrete or continuous latent variables and the corresponding generated samples. For discrete latent factors (e.g., digit identities), it assumes that they are uniformly distributed in the data, and approximates them accordingly using a fixed uniform categorical distribution. Although this assumption holds true for many existing benchmark datasets (e.g., MNIST; LeCun (1998)), real-world data often follows a long-tailed distribution and rarely exhibits perfect balance between the categories. Indeed, applying InfoGAN to imbalanced data can result in incoherent groupings, since it is forced to discover potentially nonexistent factors that are uniformly distributed in the data; see Fig. 1.
In this work, we augment InfoGAN to discover disentangled categorical representations from imbalanced data. Our model, ElasticInfoGAN, makes two simple and intuitive modifications to InfoGAN. First, we change how the latent distribution is used to sample the latent variables: we assume no knowledge of the class imbalance and, instead of fixing the class probabilities beforehand, treat them as learnable parameters of the optimization process. To enable the flow of gradients back to the class probabilities, we employ the Gumbel-Softmax distribution (Jang et al. (2017); Maddison et al. (2017)), which acts as a proxy for the categorical distribution, generating differentiable samples with properties similar to those of categorical samples. Second, we require our network to assign the same latent category to an image $I$ and its transformed image ${I}^{\prime}$, which induces the discovered latent factors to be invariant to identity-preserving transformations like illumination, translation, rotation, and scale changes. Although there are multiple meaningful ways to partition unlabeled data (e.g., with digits, one partitioning could be based on identity, whereas another could be based on stroke width), we aim to discover the partitioning that groups objects according to a high-level factor like identity while being invariant to low-level “nuisance” factors like lighting, pose, and scale changes. Such partitionings focusing on object identity are more likely to be useful for downstream visual recognition applications (e.g., semi-supervised object recognition). In sum, our modifications to InfoGAN lead to better disentanglement and categorical grouping of the data (Fig. 1), while at the same time enabling the discovery of the original imbalance through the learned probability parameters of the Gumbel-Softmax distribution.
Importantly, these modifications do not impede InfoGAN’s ability to jointly model both continuous and discrete factors in either balanced or imbalanced data scenarios.
Our contributions can be summarized as follows: (1) To our knowledge, our work is the first to tackle the problem of unsupervised generative modeling of categorical disentangled representations in imbalanced data. We show qualitatively and quantitatively our superiority in comparison to InfoGAN and other relevant baselines. (2) Our work takes a step forward in the direction of modeling real data distributions, by not only explaining what modes of a factor of variation are present in the data, but also discovering their respective proportions.
2 Related Work
Disentangled representation learning
Learning disentangled representations of the data has a vast literature (Hinton et al. (2011); Vincent (2013); Yan et al. (2016); Chen et al. (2016); Mathieu et al. (2016); Tran et al. (2017); Denton and Birodkar (2017); Hu et al. (2018); Singh et al. (2019)). InfoGAN (Chen et al. (2016)) is one of the most popular unsupervised GAN-based disentanglement methods, which learns disentanglement by maximizing the mutual information between the latent codes and generated images. It has shown promising results for discovering meaningful latent factors in balanced datasets like MNIST (LeCun (1998)), CelebA (Liu et al. (2015)), and SVHN (Netzer et al. (2011)). The recent method of JointVAE (Dupont (2018)) extends beta-VAE (Higgins et al. (2017)) by jointly modeling both continuous and discrete factors, using Gumbel-Softmax sampling. However, both InfoGAN and JointVAE assume uniformly distributed data, and hence fail to be equally effective in imbalanced data, as evidenced by Fig. 1 and our experiments. Our work proposes modifications to InfoGAN to enable it to discover meaningful latent factors in imbalanced data.
Learning from imbalanced data
Real-world data often have a long-tailed distribution (Guo et al. (2016); Van Horn et al. (2018)), which can impede learning, since the model can become biased towards the dominant categories. To alleviate this issue, researchers have proposed resampling (Chawla et al. (2002); He et al. (2008); Shen et al. (2016); Buda et al. (2018); Zou et al. (2018)) and class reweighting techniques (Ting (2000); Huang et al. (2016); Dong et al. (2017); Mahajan et al. (2018)) to oversample rare classes and down-weight dominant classes. These methods have been shown to be effective in the supervised setting, in which the class distributions are known a priori. There are also unsupervised clustering methods that deal with imbalanced data with unknown class distributions (e.g., Nguwi and Cho (2010); You et al. (2018)). Our model works in the same unsupervised setting; however, unlike these methods, we propose an unsupervised generative model that learns to disentangle latent categorical factors in imbalanced data.
Leveraging data augmentation for unsupervised image grouping
Some works (Hui (2013); Dosovitskiy et al. (2015); Hu et al. (2017); Ji et al. (2019)) use data augmentation for transformation-invariant unsupervised clustering or representation learning. The main idea is to maximize the mutual information or similarity between the features of an image and its corresponding transformed image. However, unlike our approach, these methods do not target imbalanced data and do not perform generative modeling.
3 Approach
Let $\mathcal{X}=\{{x}_{1},{x}_{2},\mathrm{\dots},{x}_{N}\}$ be a dataset of $N$ unlabeled images from $k$ different classes. Nothing about the nature of the class imbalance is known beforehand. Our goal is twofold: (i) learn a generative model $G$ that disentangles object category from other aspects (e.g., digits in MNIST (LeCun (1998)), face identity in YouTubeFaces (Wolf et al. (2011))); (ii) recover the unknown true class imbalance distribution via the generative modeling process. In the following, we first briefly discuss InfoGAN (Chen et al. (2016)), which addressed this problem for the balanced setting. We then explain how InfoGAN can be extended to the scenario of imbalanced data.
3.1 Background: InfoGAN
Learning disentangled representations using the GAN (Goodfellow et al. (2014)) framework was introduced in InfoGAN (Chen et al. (2016)). The intuition is for generated samples to retain the information about latent variables, and consequently for latent variables to gain control over certain aspects of the generated image. In this way, different types of latent variables (e.g., discrete categorical vs. continuous) can control properties like discrete (e.g., digit identity) or continuous (e.g., digit rotation) variations in the generated images.
Formally, InfoGAN does this by maximizing the mutual information between the latent code $c$ and the generated samples $G(z,c)$, where $z\sim {P}_{noise}(z)$ and $G$ is the generator network. The mutual information $I(c,G(z,c))$ can then be used as a regularizer in the standard GAN training objective. Computing $I(c,G(z,c))$, however, requires the posterior $P(c|x)$, which is intractable. The authors circumvent this by using a lower bound of $I(c,G(z,c))$, which approximates $P(c|x)$ via a neural network based auxiliary distribution $Q(c|x)$. The training objective hence becomes:
$$\underset{G,Q}{\mathrm{min}}\underset{D}{\mathrm{max}}{V}_{InfoGAN}(D,G,Q)={V}_{GAN}(D,G)-{\lambda}_{1}{L}_{1}(G,Q),$$  (1) 
$${L}_{1}(G,Q)={E}_{c\sim P(c),x\sim G(z,c)}[\mathrm{log}Q(c|x)]+H(c),$$  (2) 
where $D$ is the discriminator network, and $H(c)$ is the entropy of the latent code distribution. Training with this objective results in latent codes $c$ having control over the different factors of variation in the generated images $G(z,c)$. To model discrete variations in the data, InfoGAN employs nondifferentiable samples from a uniform categorical distribution with fixed class probabilities; i.e., $c\sim Cat(K=k,p=1/k)$ where $k$ is the number of discrete categories to be discovered.
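To make Eq. 2 concrete, the following is a minimal numerical sketch (not the authors' code) of the bound ${L}_{1}$ under a uniform categorical prior. The helper names `entropy` and `mi_lower_bound` and the toy posterior values are assumptions introduced purely for illustration; in a real model $Q(c|x)$ would be a network output evaluated on generated images.

```python
import math

def entropy(probs):
    """Shannon entropy H(c) in nats of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mi_lower_bound(prior, posterior_of_true_code):
    """Sketch of L1 = E_c[log Q(c|x)] + H(c) from Eq. 2.

    posterior_of_true_code[c] stands in for Q(c | G(z, c)), i.e. the
    probability the auxiliary network assigns to the code that was actually
    fed to the generator (hypothetical values; no network is run here).
    """
    expected_log_q = sum(
        p * math.log(max(q, 1e-12))  # clamp to avoid log(0)
        for p, q in zip(prior, posterior_of_true_code)
    )
    return expected_log_q + entropy(prior)

k = 10
uniform_prior = [1.0 / k] * k  # InfoGAN's fixed Cat(K=k, p=1/k)

# Q always recovers the input code: the bound reaches its maximum, H(c).
perfect = mi_lower_bound(uniform_prior, [1.0] * k)
# Q ignores the image and guesses uniformly: the bound drops to zero.
blind = mi_lower_bound(uniform_prior, [1.0 / k] * k)

print(f"perfect Q: {perfect:.4f} nats, blind Q: {abs(blind):.4f} nats")
```

In this toy setting, maximizing ${L}_{1}$ rewards a generator whose images reveal the code that produced them, which is the intuition behind the regularizer in Eq. 1.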
3.2 ElasticInfoGAN
As shown in Fig. 1, applying InfoGAN to an imbalanced dataset results in suboptimal disentanglement, since the uniform prior assumption does not match the actual ground-truth data distribution of the discrete factor (e.g., digit identity). To address this, we propose two augmentations to InfoGAN. The first is to enable learning of the latent distribution’s parameters (class probabilities), which requires gradients to be backpropagated through latent code samples $c$, and the second is to enforce identity-preserving transformation invariance in the learned latent variables so that the resulting disentanglement favors groups that coincide with object identities.
Learning the prior distribution
To learn the prior distribution, we replace the fixed categorical distribution in InfoGAN with the Gumbel-Softmax distribution (Jang et al. (2017); Maddison et al. (2017)), which enables differentiable sampling. The continuous Gumbel-Softmax distribution can be smoothly annealed into a categorical distribution. Specifically, if ${p}_{1},{p}_{2},\mathrm{\dots},{p}_{k}$ are the class probabilities, then sampling of a $k$-dimensional vector $c$ can be done in a differentiable way:
$${c}_{i}=\frac{\text{exp}((\mathrm{log}({p}_{i})+{g}_{i})/\tau )}{{\sum}_{j=1}^{k}\text{exp}((\mathrm{log}({p}_{j})+{g}_{j})/\tau )}\qquad \text{for } i=1,\mathrm{\dots},k.$$  (3) 
Here ${g}_{i},{g}_{j}$ are samples drawn from $Gumbel(0,1)$, and $\tau $ (softmax temperature) controls the degree to which samples from Gumbel-Softmax resemble the categorical distribution. Low values of $\tau $ make the samples close to one-hot vectors.
In theory, InfoGAN’s behavior in the class-balanced setting (Fig. 1 left) can be replicated in the imbalanced case (where grouping becomes incoherent, Fig. 1 center) by simply replacing the fixed uniform categorical distribution with Gumbel-Softmax with learnable class probabilities ${p}_{i}$; i.e., gradients can flow back to update the class probabilities (which are uniformly initialized) to match the true class imbalance. Once the true imbalance is reflected in the class probabilities, proper categorical disentanglement (Fig. 1 right) becomes feasible.
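The sampling rule in Eq. 3 can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names, the class probabilities, and the temperatures are assumptions, and no gradients are computed here (in an automatic-differentiation framework the same expression would let gradients reach $\mathrm{log}({p}_{i})$).

```python
import math
import random

def sample_gumbel(rng):
    """Draw g ~ Gumbel(0, 1) by inverse transform: g = -log(-log(u))."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))

def gumbel_softmax(probs, noise, tau):
    """Relaxed one-hot sample c from Eq. 3, given Gumbel noise g_1..g_k."""
    scores = [(math.log(p) + g) / tau for p, g in zip(probs, noise)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
probs = [0.7, 0.2, 0.1]  # hypothetical class probabilities
noise = [sample_gumbel(rng) for _ in probs]

soft = gumbel_softmax(probs, noise, tau=5.0)   # high temperature: spread out
sharp = gumbel_softmax(probs, noise, tau=0.1)  # low temperature: near one-hot

# The argmax of a sample follows the Gumbel-max trick, so class i should
# win with frequency close to p_i.
trials = 20000
wins = [0] * len(probs)
for _ in range(trials):
    c = gumbel_softmax(probs, [sample_gumbel(rng) for _ in probs], tau=0.1)
    wins[c.index(max(c))] += 1
freqs = [w / trials for w in wins]
print(soft, sharp, freqs)
```

With the same noise, lowering $\tau $ concentrates the mass on one coordinate, while the empirical argmax frequencies stay close to the class probabilities.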
Empirically, however, this ideal behavior is not observed in a consistent manner. As shown in Fig. 3 (left), unsupervised grouping can focus on noncategorical attributes such as rotation of the digit. Although this is one valid way to group unlabeled data, our goal in this work is to prefer groupings that correspond to class identity as in Fig. 3 (right).
Learning object identities
To capture object identity as the factor of variation, we make another modification to InfoGAN. Specifically, to make the model focus on high-level object identity and be invariant to low-level factors like rotation, thickness, illumination, etc., we explicitly create these identity-preserving transformations on real images, and enforce the latent prediction $Q(c|x)$ to be invariant to these transformations. Note that such transformations (aka data augmentations) are standard for learning invariant representations for visual recognition tasks.
Formally, for any real image $x\sim {P}_{data}(x)$, we apply a set of transformations $\delta $ to obtain a transformed image ${x}^{\prime}=\delta (x)$. It is important to point out that these transformations are not learned during optimization. Instead, we use fixed, simple transformations which guarantee that the human-defined object identity label for the original image $x$ and the transformed image ${x}^{\prime}$ remains the same. For example, the digit identity of a ‘one’ from MNIST will remain the same if a rotation of $\pm 10$ degrees is applied. Similarly, a face identity will remain the same upon horizontal flipping. We hence formulate our transformation constraint loss function:
$${L}_{trans}(Q)=\mathsf{d}(Q({c}_{x}|x),Q({c}_{{x}^{\prime}}|{x}^{\prime}))$$  (4) 
where $\mathsf{d}(\cdot )$ is a distance metric (e.g., cosine distance), and $Q({c}_{x}|x)$, $Q({c}_{{x}^{\prime}}|{x}^{\prime})$ are the latent code predictions for real image $x$ and transformed image ${x}^{\prime}$, respectively. Note that ideally $Q(c|x)$, for either $x\sim {P}_{data}(x)$ or $x\sim {P}_{g}(G)$, should have low entropy (peaky class distribution) for proper inference about the latent object category. Eq. 2 automatically enforces a peaky class distribution for $Q(c|x)$ for $x\sim {P}_{g}(G)$, because the sampled input latent code $c$ from Gumbel-Softmax is peaky. For $x\sim {P}_{data}(x)$ though, Eq. 4 alone is not sufficient as it can be optimized in a suboptimal manner (e.g., if ${c}_{x}\approx {c}_{{x}^{\prime}}$, but both have high entropy). We hence add an additional entropy loss which forces ${c}_{x}$ and ${c}_{{x}^{\prime}}$ to have low entropy ($\mathsf{s}$) class distributions:
$${L}_{ent}(Q)=\mathsf{s}(Q({c}_{x}|x))+\mathsf{s}(Q({c}_{{x}^{\prime}}|{x}^{\prime})).$$  (5) 
The losses ${L}_{trans}$ and ${L}_{ent}$, along with Gumbel-Softmax, constitute our overall training objective:
$$\underset{G,Q}{\mathrm{min}}\underset{D}{\mathrm{max}}{L}_{final}={V}_{InfoGAN}(D,G,Q)+{\lambda}_{2}{L}_{trans}(Q)+{\lambda}_{3}{L}_{ent}(Q).$$  (6) 
${V}_{InfoGAN}$ plays the role of generating realistic images and associating the latent variables with some factor of variation in the data, while the addition of ${L}_{trans}$ pushes the discovered factor of variation to be close to object identity. Finally, ${L}_{ent}$ ensures that $Q$ behaves similarly for real and fake image distributions. The latent codes sampled from Gumbel-Softmax, the generated fake images, and the losses operating on fake images are all functions of the class probabilities ${p}_{i}$ too. Thus, during the minimization phase of Eq. 6, the gradients are used to optimize the class probabilities along with $G$ and $Q$ in the backward pass.
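The interplay between Eq. 4 and Eq. 5 can be illustrated with a small sketch. It assumes cosine distance for the distance metric (the example the text gives) and Shannon entropy in nats for the entropy term; the three prediction vectors are hypothetical $Q$ outputs, and the default weights follow the values of ${\lambda}_{2}$ and ${\lambda}_{3}$ reported in the appendix.

```python
import math

def cosine_distance(a, b):
    """d(a, b) = 1 - cos(a, b), the example distance mentioned for Eq. 4."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def entropy(q):
    """Shannon entropy (nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in q if p > 0.0)

def l_trans(q_x, q_xt):
    """Eq. 4: distance between predictions for x and its transformed copy."""
    return cosine_distance(q_x, q_xt)

def l_ent(q_x, q_xt):
    """Eq. 5: sum of the two prediction entropies."""
    return entropy(q_x) + entropy(q_xt)

def regularizer(q_x, q_xt, lambda2=10.0, lambda3=1.0):
    """lambda2 * L_trans + lambda3 * L_ent, the terms added in Eq. 6."""
    return lambda2 * l_trans(q_x, q_xt) + lambda3 * l_ent(q_x, q_xt)

# Hypothetical Q outputs for an image and its transformed copy (k = 3).
flat = [1 / 3, 1 / 3, 1 / 3]  # consistent but uninformative
peaky = [0.98, 0.01, 0.01]    # consistent and confident
flipped = [0.01, 0.98, 0.01]  # confident but inconsistent

print("flat/flat    ", l_trans(flat, flat), l_ent(flat, flat))
print("peaky/peaky  ", l_trans(peaky, peaky), l_ent(peaky, peaky))
print("peaky/flipped", l_trans(peaky, flipped), l_ent(peaky, flipped))
```

The flat pair satisfies the transformation constraint perfectly yet carries no category information, which is the failure case the entropy term is meant to penalize; only the peaky, consistent pair scores well on both terms.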
4 Experiments
In this section, we perform quantitative and qualitative analyses to demonstrate the advantage of ElasticInfoGAN in discovering categorical disentanglement for imbalanced datasets.
4.1 Datasets
We use: (1) MNIST (LeCun (1998)) and (2) YouTubeFaces (Wolf et al. (2011)). MNIST is by default a balanced dataset with 70k images, with a similar number of training samples for each of 10 classes. We artificially introduce imbalance over 50 random splits (max imbalance ratio 10:1 between the largest and smallest class). YouTubeFaces is a real-world imbalanced video dataset with a varying number of training samples (frames) for the 40 face identity classes (as used in Shah and Koltun (2018)). The smallest/largest class has 53/695 images, with a total of 10,066 tightly-cropped face images. All results are reported as the average over: (i) 50 runs (over 50 random imbalances) for MNIST, (ii) 5 runs over the same imbalanced dataset for YouTubeFaces (the imbalance statistics for all datasets are provided in the appendix).
We use MNIST to provide a proof-of-concept of our approach. For example, one of the ways in which different ‘ones’ in MNIST vary is rotation, which can be used as a factor (as opposed to object identity) to group data in imbalanced cases (recall Fig. 3 left). Thus, using rotation as a transformation in ${L}_{trans}$ should alleviate this problem. We ultimately care most about the YouTubeFaces results since it is more representative of real-world data, both in terms of challenging visual variations (e.g., facial pose, scale, expression, and lighting changes) as well as inherent class imbalance. For this reason, the effect of augmentations in ${L}_{trans}$ will be more reflective of how well our model can work on real-world data.
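One way such random imbalanced splits could be produced is sketched below. This is an assumption, not the procedure used in the experiments: the sampling scheme (uniform keep fractions), the function names, and the 7000-images-per-class figure (70k images over 10 classes) are all illustrative choices. The only property taken from the text is that the ratio between the largest and smallest class never exceeds 10:1.

```python
import math
import random

def random_imbalanced_counts(per_class, num_classes, max_ratio, rng):
    """Pick how many samples to keep per class so that max/min <= max_ratio.

    Assumption: keep fractions are drawn uniformly from [1/max_ratio, 1];
    the text does not specify the sampling scheme.
    """
    floor_count = math.ceil(per_class / max_ratio)  # smallest allowed class
    counts = []
    for _ in range(num_classes):
        frac = rng.uniform(1.0 / max_ratio, 1.0)
        counts.append(max(floor_count, int(per_class * frac)))
    return counts

def class_prior(counts):
    """Ground-truth class probabilities implied by the kept counts."""
    total = sum(counts)
    return [c / total for c in counts]

rng = random.Random(0)
# 50 hypothetical splits of a 10-class dataset with 7000 images per class.
splits = [random_imbalanced_counts(7000, 10, 10.0, rng) for _ in range(50)]
prior = class_prior(splits[0])
print(splits[0], [round(p, 3) for p in prior])
```

The resulting `prior` plays the role of the ground-truth class distribution that the learned latent probabilities are later compared against.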
4.2 Baselines and Evaluation Metrics
We design different baselines to show the importance of having learnable priors for different latent variables and applying our transformation constraints.

• Uniform InfoGAN (Chen et al. (2016)): This is the original InfoGAN with fixed and uniform categorical distribution.
• Ground-truth InfoGAN: This is InfoGAN with a fixed, but imbalanced categorical distribution where the class probabilities reflect the ground-truth class imbalance.
• Ground-truth InfoGAN + Transformation constraint: Similar to the previous baseline but with our data transformation constraint (${L}_{trans}$).
• Gumbel-softmax: In this case, InfoGAN does not have a fixed prior for the latent variables. Instead, the priors are learned using the Gumbel-softmax technique (Jang et al. (2017)).
• Gumbel-softmax + Transformation constraint: Apart from having a learnable prior we also apply our transformation constraint (${L}_{trans}$). This is a variant of our final approach that does not have the entropy loss (${L}_{ent}$).
• Gumbel-softmax + Transformation constraint + Entropy Loss (ElasticInfoGAN): This is our final model with all the losses, ${L}_{trans}$ and ${L}_{ent}$, in addition to ${V}_{InfoGAN}(D,G,Q)$.
• JointVAE (Dupont (2018)): We also include this VAE-based baseline, which performs joint modeling of disentangled discrete and continuous factors.
Our evaluation should capture how well we: (1) learn class-specific disentanglement for the imbalanced dataset, and (2) recover the ground-truth class distribution of the imbalanced dataset. To capture these aspects, we apply three evaluation metrics:

• Average Entropy (ENT): Evaluates two properties: (i) whether the images generated for a given categorical code belong to the same ground-truth class, i.e., whether the ground-truth class histogram for images generated for each categorical code has a low entropy; (ii) whether each ground-truth class is associated with a single unique categorical code. We generate 1000 images for each of the $k$ latent categorical codes and compute class histograms using a pretrained classifier to get a $k\times k$ matrix (where rows index latent categories and columns index ground-truth categories). We report the average entropy across the rows (tests (i)) and columns (tests (ii)). (We train the classifier by creating a training/validation split (80/20) on a per-class basis. Classification accuracies: (i) MNIST: 98%, (ii) YouTubeFaces: 96%. See appendix for details.)
• Normalized Mutual Information (NMI) (Xu et al. (2003)): We treat our latent category assignments of the fake images (we generate 1000 fake images for each categorical code) as one clustering, and the category assignments of the fake images by the pretrained classifier as another clustering. NMI measures the correlation between the two clusterings. NMI ranges from 0 to 1; the higher the NMI, the stronger the correlation.
• Root Mean Square Error (RMSE) between predicted and actual class distributions: Measures the accuracy of approximating the true class distribution of the imbalanced dataset. Since the learned latent distribution may not be aligned to the ground-truth distribution (e.g., the first dimension of the learned distribution might capture 9’s in MNIST whereas the first dimension of the ground-truth distribution may be for 0’s), we need a way to align the two. For this, we use the pretrained classifier to classify the generated images for a latent variable and assign the variable to the most frequent class. If more than one latent variable is assigned to the same class, then their priors are added before computing the distance to the known prior of the ground-truth class.
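The ENT and RMSE computations described above can be sketched as follows. This is a hedged illustration rather than the evaluation code used for the reported numbers: the function names and toy histograms are made up, and three details are assumptions (entropy in nats, reporting the row and column averages alongside their plain mean, and taking the RMSE as a mean over all $k$ classes with unassigned classes receiving zero mass).

```python
import math

def entropy(counts):
    """Entropy (nats) of a histogram; an all-zero histogram counts as 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def average_entropy(hist):
    """ENT sketch: hist[i][j] = images from latent code i classified as class j.

    Returns (row average, column average, mean of the two). Combining the two
    into one number by a plain mean is an assumption.
    """
    k = len(hist)
    rows = sum(entropy(row) for row in hist) / k
    cols = sum(entropy([hist[i][j] for i in range(k)]) for j in range(k)) / k
    return rows, cols, 0.5 * (rows + cols)

def aligned_rmse(hist, learned_prior, true_prior):
    """RMSE after mapping each latent code to its most frequent class."""
    k = len(hist)
    aligned = [0.0] * k
    for i, row in enumerate(hist):
        cls = row.index(max(row))         # most frequent class for code i
        aligned[cls] += learned_prior[i]  # priors sharing a class are added
    mse = sum((a - t) ** 2 for a, t in zip(aligned, true_prior)) / k
    return math.sqrt(mse), aligned

# Toy 3-class example: codes are a clean permutation of the classes.
hist = [[0, 1000, 0], [1000, 0, 0], [0, 0, 1000]]
ent = average_entropy(hist)
rmse, aligned = aligned_rmse(hist, [0.2, 0.7, 0.1], [0.7, 0.2, 0.1])

# Codes 0 and 1 both collapse onto class 0, so class 1 receives no mass.
hist_bad = [[900, 100, 0], [800, 0, 200], [0, 0, 1000]]
ent_bad = average_entropy(hist_bad)
rmse_bad, aligned_bad = aligned_rmse(hist_bad, [0.3, 0.3, 0.4], [0.7, 0.2, 0.1])
print(ent, rmse, ent_bad, rmse_bad)
```

In the first example the learned prior matches the ground truth once the permutation is undone, so both metrics are zero; in the second, the duplicated class raises both the entropy and the RMSE.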
4.3 Implementation Details
Transformations ($\delta $) used: (i) MNIST: Rotation ($\pm 10$ deg) + Zoom ($\pm 0.1\times $); (ii) YouTubeFaces: Random flipping + Random cropping (scale image by $1.1\times $ and crop a $64\times 64$ patch) + Gamma contrast (gamma $\sim U(0.3,4.0)$). Additional details are in the appendix.
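As a rough illustration of what a fixed, non-learned $\delta $ might look like, the sketch below combines a random horizontal flip with gamma contrast on a tiny grayscale image stored as nested lists. It is an assumption-laden sketch: the 0.5 flip probability, the power-law form of gamma contrast on pixel values scaled to $[0,1]$, and all function names are illustrative, and random cropping is left out for brevity.

```python
import random

def horizontal_flip(img):
    """Mirror an image stored as a list of rows."""
    return [row[::-1] for row in img]

def gamma_contrast(img, gamma):
    """Apply v -> v ** gamma to pixel values already scaled to [0, 1]."""
    return [[v ** gamma for v in row] for row in img]

def delta(img, rng, flip_prob=0.5, gamma_range=(0.3, 4.0)):
    """Sketch of an identity-preserving transform: random flip + gamma.

    The 0.5 flip probability is an assumption; the gamma range follows the
    text. Random cropping is omitted to keep the example short.
    """
    if rng.random() < flip_prob:
        out = horizontal_flip(img)
    else:
        out = [row[:] for row in img]  # copy so the input is never mutated
    return gamma_contrast(out, rng.uniform(*gamma_range))

rng = random.Random(0)
image = [[0.0, 0.25, 0.5], [0.75, 1.0, 0.5]]  # tiny hypothetical image
transformed = delta(image, rng)
print(transformed)
```

Neither operation changes which person or digit is depicted, which is the property ${L}_{trans}$ relies on.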
Table 1: Disentanglement quality on imbalanced MNIST and YouTubeFaces, measured by NMI (higher is better) and ENT (lower is better).
Method  MNIST NMI  MNIST ENT  YouTubeFaces NMI  YouTubeFaces ENT
JointVAE  0.6801  0.7006  0.4384  1.7203
Uniform InfoGAN  0.7765  0.4569  0.6729  1.0299
Ground-truth InfoGAN  0.7827  0.4196  0.6832  0.9577
Ground-truth InfoGAN + Transformation constraint  0.7926  0.3965  0.7349  0.8392
Gumbel-softmax  0.8360  0.3260  0.7704  0.7561
Gumbel-softmax + Transformation constraint  0.8678  0.2585  0.7572  0.7229
ElasticInfoGAN (Ours)  0.8778  0.2348  0.7768  0.7240
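Table 2: RMSE between the learned and ground-truth class distributions (lower is better).
Method  MNIST RMSE  YouTubeFaces RMSE
Gumbel-softmax  0.03207  0.02118
Gumbel-softmax + Transformation constraint  0.03283  0.01732
ElasticInfoGAN (Ours)  0.02699  0.01552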
4.4 Quantitative Evaluation
We first evaluate disentanglement quality as measured by NMI and average entropy (ENT); see Table 1. ElasticInfoGAN outperforms InfoGAN, JointVAE, and the other baselines on every metric except ENT on YouTubeFaces, where the variant without the entropy loss is marginally lower (0.7229 vs. 0.7240). In particular, our full model improves NMI by 0.101 and 0.104, and reduces ENT by 0.222 and 0.306, compared to the Uniform InfoGAN baseline for MNIST and YouTubeFaces, respectively. The gap is even larger when compared to JointVAE: 0.1977 and 0.3384 in NMI, and 0.4658 and 0.9963 in ENT for MNIST and YouTubeFaces, respectively. This again is a result of the assumption of a uniform categorical prior by JointVAE, along with poorer quality generations. We see that our transformation constraint generally improves performance both when the ground-truth prior is known (Ground-truth InfoGAN vs. Ground-truth InfoGAN + Transformation constraint) and when the prior is learned (Gumbel-softmax vs. Gumbel-softmax + Transformation constraint). This shows that enforcing the network to learn groupings that are invariant to identity-preserving transformations helps it to learn a disentangled representation in which the latent dimensions correspond more closely to identity-based classes.
Also, learning the prior using the Gumbel-softmax leads to better categorical disentanglement than fixed uniform priors, which demonstrates the importance of learning the prior distribution in imbalanced data. Overall, our approach using Gumbel-softmax to learn the latent prior distribution together with our transformation constraint works better than applying them individually, which demonstrates their complementarity. Interestingly, using a fixed ground-truth prior (Ground-truth InfoGAN) does not result in better disentanglement than learning the prior (Gumbel-softmax). This requires further investigation, but we hypothesize that having a rigid prior makes optimization more difficult compared to allowing the network to converge to a distribution on its own, as there are multiple losses that need to be simultaneously optimized.
Finally, in Table 2, we evaluate how well the Gumbel-softmax can recover the ground-truth prior distribution. For this, we compute the RMSE between the learned prior distribution and the ground-truth prior distribution. Our full model (transformation constraint + entropy loss) produces the best estimate of the true class imbalance for both datasets, as evidenced by the lowest RMSE. Our improvement over the Gumbel-softmax baseline indicates the importance of our transformation ${L}_{trans}$ and entropy ${L}_{ent}$ losses in approximating the class imbalance.
4.5 Qualitative Evaluation
We next qualitatively evaluate the disentanglement achieved by our approach. Figs. 4, 5, and 7 show results for MNIST and YouTubeFaces. Overall, ElasticInfoGAN generates more consistent images for each latent code compared to Uniform InfoGAN and JointVAE. For example, in Fig. 4, ElasticInfoGAN only generates inconsistent images in the second row whereas the baseline approaches generate inconsistent images in several rows. Similarly, in Fig. 7, ElasticInfoGAN generates faces of the same person corresponding to a latent variable more consistently than the baselines. Both Uniform InfoGAN and JointVAE, on the other hand, tend to mix up identities within the same categorical code because they incorrectly assume a uniform prior distribution.
4.6 Modeling continuous factors
Finally, we demonstrate that ElasticInfoGAN does not impede modeling of continuous factors in the imbalanced setting. Specifically, one can augment the input with continuous latent codes (e.g., ${r}_{1},{r}_{2}\sim \mathrm{Unif}(-1,1)$) along with the existing categorical and noise vectors. In Fig. 6, we show the results of continuous code interpolation; we can see that each of the two continuous codes largely captures a particular continuous factor (stroke width on the left, and digit rotation on the right).
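A minimal sketch of how such an augmented generator input and a continuous-code interpolation could be assembled is given below. Everything beyond the two uniform continuous codes is an assumption made for illustration: the 62-dimensional Gaussian noise vector, the hard one-hot category, the concatenation order, and the function names are hypothetical, and no generator network is involved.

```python
import random

def make_generator_input(category, k, noise_dim, r1, r2, rng):
    """Concatenate [noise z | one-hot category c | continuous codes r1, r2]."""
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]  # assumed Gaussian noise
    c = [1.0 if i == category else 0.0 for i in range(k)]
    return z + c + [r1, r2]

def interpolation_row(category, k, noise_dim, steps, seed):
    """Sweep r1 over [-1, 1] while z, c, and r2 stay fixed."""
    row = []
    for s in range(steps):
        r1 = -1.0 + 2.0 * s / (steps - 1)
        rng = random.Random(seed)  # same seed, so every step reuses the same z
        row.append(make_generator_input(category, k, noise_dim, r1, 0.0, rng))
    return row

rng = random.Random(0)
r1, r2 = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)  # continuous codes
x_in = make_generator_input(category=3, k=10, noise_dim=62, r1=r1, r2=r2, rng=rng)
row = interpolation_row(category=3, k=10, noise_dim=62, steps=5, seed=1)
print(len(x_in), [v[-2] for v in row])
```

Feeding each vector in `row` to a trained generator would yield one row of an interpolation figure, with only the first continuous code changing between neighboring images.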
5 Conclusion
In this work, we proposed a new unsupervised generative model that learns categorical disentanglement in imbalanced data. Our model learns the class distribution of the imbalanced data and enforces invariance to identity-preserving transformations in the discrete latent variables. Our results demonstrate superior performance over alternative baselines. We hope this work will motivate other researchers to pursue this interesting research direction in generative modeling of imbalanced data.
6 Acknowledgments
This work was supported in part by NSF IIS-1751206, IIS-1748387, AWS ML Research Award, Google Cloud Platform research credits, and Adobe Data Science Research Award.
References
 A learning algorithm for boltzmann machines. Cognitive science. Cited by: §1.
 Latent dirichlet allocation. JMLR. Cited by: §1.
 A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks. Cited by: §2.
 SMOTE: synthetic minority oversampling technique. JAIR. Cited by: §2.
 Infogan: interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, Cited by: §A.1, §A.1, ElasticInfoGAN: Unsupervised Disentangled Representation Learning in Imbalanced Data, §1, §1, §2, Figure 2, §3.1, §3, 1st item.
 Unsupervised learning of disentangled representations from video. In NeurIPS, Cited by: §2.
 Class rectification hard mining for imbalanced deep learning. In ICCV, Cited by: §2.
 Discriminative unsupervised feature learning with exemplar convolutional neural networks. In TPAMI, Cited by: §2.
 Learning disentangled joint continuous and discrete representations. In NeurIPS, Cited by: §2, 7th item.
 Generative adversarial nets. In NeurIPS, Cited by: §1, §3.1.
 Msceleb1m: a dataset and benchmark for largescale face recognition. In ECCV, Cited by: §2.
 ADASYN: adaptive synthetic sampling approach for imbalanced learning. In IJCNN, Cited by: §2.
 Betavae: learning basic visual concepts with a constrained variational framework. In ICLR, Cited by: §1, §2.
 Transforming autoencoders. In ICANN, Cited by: §2.
 Disentangling factors of variation by mixing them. In CVPR, Cited by: §1, §2.
 Learning discrete representations via information maximizing selfaugmented training. In ICML, Cited by: §2.
 Learning deep representation for imbalanced classification. In CVPR, Cited by: §2.
 Direct modeling of complex invariances for visual object features. In ICML, Cited by: §2.
 Categorical reparameterization with gumbelsoftmax. ICLR. Cited by: §1, §3.2, 4th item.
 Invariant information clustering for unsupervised image classification and segmentation. In ICCV, Cited by: §2.
 Autoencoding variational bayes. ICLR. Cited by: §1.
 The mnist database of handwritten digits. http://yann. lecun. com/exdb/mnist/. Cited by: §1, §2, §3, §4.1.
 Deep learning face attributes in the wild. In ICCV, Cited by: §2.
 The concrete distribution: a continuous relaxation of discrete random variables. In ICLR, Cited by: §1, §3.2.
 Exploring the limits of weakly supervised pretraining. In ECCV, Cited by: §2.
 Disentangling factors of variation in deep representation using adversarial training. In NeurIPS, Cited by: §2.
 Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop, Cited by: §2.
 An unsupervised selforganizing learning with support vector ranking for imbalanced datasets. Expert Systems with Applications. Cited by: §2.
 A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE. Cited by: §1.
 Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR. Cited by: §1.
 Improved techniques for training gans. In NeurIPS, Cited by: §1.
 Deep continuous clustering. In arXiv, Cited by: §A.1, §4.1.
 Relay backpropagation for effective learning of deep convolutional neural networks. In ECCV, Cited by: §2.
 FineGAN: unsupervised hierarchical disentanglement for fine-grained object generation and discovery. In CVPR, Cited by: §1, §2.
 A comparative study of cost-sensitive boosting algorithms. In ICML, Cited by: §2.
 Disentangled representation learning GAN for pose-invariant face recognition. In CVPR, Cited by: §1, §2.
 The iNaturalist species classification and detection dataset. In CVPR, Cited by: §2.
 Representation learning: a review and new perspectives. TPAMI. Cited by: §2.
 Face recognition in unconstrained videos with matched background similarity. In CVPR, Cited by: §3, §4.1.
 StackGAN++: realistic image synthesis with stacked generative adversarial networks. TPAMI. Cited by: §A.1.
 Document clustering based on non-negative matrix factorization. In SIGIR, Cited by: 2nd item.
 Attribute2Image: conditional image generation from visual attributes. In ECCV, Cited by: §1, §2.
 Scalable exemplar-based subspace clustering on class-imbalanced data. In ECCV, Cited by: §2.
 Domain adaptation for semantic segmentation via class-balanced self-training. ECCV. Cited by: §2.
Appendix A Appendix
A.1 Implementation details (continued)
For MNIST, we operate on the original 28x28 image size, with a 10-dimensional categorical code to represent the 10 digit categories. For YouTubeFaces, we crop the faces using the provided bounding box annotations, resize them to 64x64 resolution, and use a 40-dimensional categorical code to represent 40 face identities (the first 40 categories in alphabetical order), as done in Shah and Koltun (2018). The pretrained classification architecture used for evaluation on MNIST consists of 2 Conv + 2 FC layers, with max pooling and ReLU after every convolutional layer. For YouTubeFaces classification, we fine-tune a ResNet50 network pretrained on VGGFace2 for face recognition. We set ${\lambda}_{1}=1$ (for ${L}_{1}$), ${\lambda}_{2}=10$ (for ${L}_{trans}$), and ${\lambda}_{3}=1$ (for ${L}_{ent}$). These hyperparameters were chosen to balance the magnitudes of the different loss terms. Finally, we observe that if the random initialization of the class probabilities is too skewed (only a few classes have high probability values), it becomes very difficult to optimize them to the ideal state. We hence initialize them with the uniform distribution, which makes training much more stable.
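As a rough illustration of the weighting and initialization described above, the sketch below combines three already-computed loss values with the stated weights and builds a uniform initial class distribution. The helper names, the plain weighted sum, and the softmax-over-logits parameterization of the class probabilities are assumptions made for illustration; the text does not confirm them.

```python
import math

# Loss weights stated in the text.
LAMBDA_1 = 1.0   # weight for L_1
LAMBDA_2 = 10.0  # weight for L_trans
LAMBDA_3 = 1.0   # weight for L_ent


def weighted_loss(l_1, l_trans, l_ent):
    """Hypothetical helper: plain weighted sum of the three loss values."""
    return LAMBDA_1 * l_1 + LAMBDA_2 * l_trans + LAMBDA_3 * l_ent


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uniform_class_logits(num_classes):
    """All-zero logits yield a uniform distribution under softmax
    (assumed parameterization of the learnable class probabilities)."""
    return [0.0] * num_classes


mnist_probs = softmax(uniform_class_logits(10))      # 10 digit categories
youtube_probs = softmax(uniform_class_logits(40))    # 40 face identities
print(mnist_probs[0], youtube_probs[0])
print(weighted_loss(0.5, 0.02, 0.3))
```

With ${\lambda}_{2}=10$, a transformation loss that is an order of magnitude smaller than the other terms contributes on a comparable scale, which is one way to read "balancing the magnitudes".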
ElasticInfoGAN architecture for MNIST:
We follow the exact architecture described in InfoGAN (Chen et al. (2016)). The generator network $G$ takes as input a 64-dimensional noise vector $z\sim \mathcal{N}(0,1)$ and 10-dimensional samples from the Gumbel-Softmax distribution. The discriminator $D$ and the latent code prediction network $Q$ share their layers except for the final fully connected layers.
ElasticInfoGAN architecture for YouTubeFaces:
We operate on cropped face images resized to 64x64 resolution. Our architecture is based on the one proposed in StackGAN-v2 (Xu et al. (2018)), where we use its 2-stage version for generating 64x64 resolution images. The input is a 100-dimensional noise vector $z\sim \mathcal{N}(0,1)$ and 40-dimensional samples ($c$) from the Gumbel-Softmax distribution. An initial fully connected layer maps the input (the concatenation of $z$ and $c$) to an intermediate feature representation. A series of upsampling + convolutional blocks (interleaved with batch normalization and Gated Linear Units) then increases the spatial resolution of the feature representation, going from 1024 channels (feature size: 4 x 4 x 1024) to 64 channels (feature size: 64 x 64 x 64). For the first stage, a convolutional network transforms the feature representation into a 3-channel output while maintaining the spatial resolution; this serves as the fake image from the first stage. The next stage takes the 64 x 64 x 64 features and forwards them through a network containing residual blocks and convolutional layers, again maintaining the spatial resolution of 64 x 64. For the second stage, a convolutional layer again maps the resulting features into a 64 x 64 resolution fake image, which is the one used by the model for evaluation purposes.
The discriminator networks are identical at both stages. Each consists of 4 convolutional layers interleaved with batch normalization and leaky ReLU layers, which serve as the common layers for both the $D$ and $Q$ networks. After that, $D$ has one non-shared convolutional layer which maps the feature representation into a scalar value reflecting the real/fake score. For $Q$, we have a pair of non-shared convolutional layers which map the feature representation into a 40-dimensional latent code prediction.
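To make the generator's feature progression concrete, the following sketch traces the feature shape from 4 x 4 x 1024 to 64 x 64 x 64. It assumes that each upsampling + convolution block doubles the spatial resolution and halves the channel count; the text gives only the two endpoints, so the intermediate shapes and the function name are illustrative.

```python
def generator_feature_shapes(start_res=4, start_channels=1024, target_res=64):
    """Trace (height, width, channels) through the upsampling blocks.

    Assumption: every block doubles the resolution and halves the channels.
    """
    shapes = [(start_res, start_res, start_channels)]
    res, channels = start_res, start_channels
    while res < target_res:
        res *= 2
        channels //= 2
        shapes.append((res, res, channels))
    return shapes


shapes = generator_feature_shapes()
for height, width, channels in shapes:
    print(f"{height} x {width} x {channels}")
```

Under this assumption, four blocks are enough to connect the two feature sizes stated in the text.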
Training of ElasticInfoGAN:
We train the generative and discriminative modules in a way similar to that described in Chen et al. (2016). We first update the discriminator based on the real/fake adversarial loss. In the next step, after computing the remaining losses (mutual information + ${L}_{trans}$ + ${L}_{ent}$), we update the generator ($G$), the latent code predictor ($Q$), and the latent distribution parameters at once. Our optimization process alternates between these two phases. For MNIST, we train all baselines for 200 epochs with a batch size of 64. For YouTubeFaces, we train until convergence, as measured via the qualitative realism of the generated images, with a batch size of 50. We set the temperature $\tau =0.1$ when sampling from the Gumbel-Softmax, which results in samples with very low entropy (very close to one-hot vectors from a categorical distribution).
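The effect of a low temperature can be checked numerically. The sketch below is a minimal, framework-free version of Gumbel-Softmax sampling, $\mathrm{softmax}((\log p + g)/\tau)$ with $g$ drawn from Gumbel(0, 1); it compares the average entropy of samples at $\tau = 0.1$ with a much higher temperature. The function names, the sample count, and the comparison temperature of 5.0 are assumptions chosen for illustration.

```python
import math
import random


def gumbel_softmax_sample(probs, tau, rng):
    """Draw one relaxed categorical sample: softmax((log p + g) / tau)."""
    scores = []
    for p in probs:
        # Gumbel(0, 1) noise via inverse transform; clamp u to avoid log(0).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))
        scores.append((math.log(p) + g) / tau)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(sample):
    """Shannon entropy (in nats) of one relaxed sample."""
    return -sum(x * math.log(x) for x in sample if x > 0.0)


def mean_entropy(probs, tau, n=2000, seed=0):
    """Average sample entropy over n draws with a fixed seed."""
    rng = random.Random(seed)
    return sum(entropy(gumbel_softmax_sample(probs, tau, rng)) for _ in range(n)) / n


probs = [0.1] * 10  # uniform prior over 10 categories
low_tau_entropy = mean_entropy(probs, tau=0.1)
high_tau_entropy = mean_entropy(probs, tau=5.0)
print(low_tau_entropy, high_tau_entropy)
```

A lower average entropy at $\tau = 0.1$ indicates samples that sit closer to one-hot vectors, at the cost of higher-variance gradients in practice.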
A.2 Ground truth class imbalance
Here we describe the exact class imbalance used in our experiments. For MNIST, we include below the 50 random imbalances created. For YouTubeFaces, we include the true ground truth class imbalance in the first 40 categories. The imbalances reflect the class frequency.
A.2.1 MNIST

• 0.147, 0.037, 0.033, 0.143, 0.136, 0.114, 0.057, 0.112, 0.143, 0.078
• 0.061, 0.152, 0.025, 0.19, 0.12, 0.036, 0.092, 0.185, 0.075, 0.064
• 0.173, 0.09, 0.109, 0.145, 0.056, 0.114, 0.075, 0.03, 0.093, 0.116
• 0.079, 0.061, 0.033, 0.139, 0.145, 0.135, 0.057, 0.062, 0.169, 0.121
• 0.053, 0.028, 0.111, 0.142, 0.13, 0.121, 0.107, 0.066, 0.125, 0.118
• 0.072, 0.148, 0.092, 0.081, 0.119, 0.172, 0.05, 0.109, 0.085, 0.073
• 0.084, 0.143, 0.07, 0.082, 0.059, 0.163, 0.156, 0.063, 0.074, 0.105
• 0.062, 0.073, 0.065, 0.183, 0.099, 0.08, 0.05, 0.16, 0.052, 0.177
• 0.139, 0.113, 0.074, 0.06, 0.068, 0.133, 0.142, 0.13, 0.112, 0.03
• 0.046, 0.128, 0.059, 0.112, 0.135, 0.164, 0.142, 0.125, 0.051, 0.037
• 0.107, 0.057, 0.154, 0.122, 0.05, 0.111, 0.032, 0.044, 0.136, 0.187
• 0.129, 0.1, 0.039, 0.112, 0.119, 0.095, 0.047, 0.14, 0.156, 0.064
• 0.146, 0.08, 0.06, 0.072, 0.051, 0.119, 0.176, 0.11, 0.158, 0.028
• 0.035, 0.051, 0.112, 0.143, 0.033, 0.165, 0.082, 0.165, 0.054, 0.161
• 0.041, 0.1, 0.073, 0.054, 0.155, 0.117, 0.091, 0.124, 0.142, 0.104
• 0.052, 0.139, 0.128, 0.133, 0.104, 0.107, 0.058, 0.137, 0.036, 0.107
• 0.055, 0.138, 0.059, 0.074, 0.08, 0.135, 0.085, 0.064, 0.172, 0.139
• 0.141, 0.156, 0.119, 0.062, 0.08, 0.022, 0.043, 0.159, 0.101, 0.118
• 0.11, 0.088, 0.033, 0.062, 0.089, 0.176, 0.161, 0.105, 0.144, 0.032
• 0.157, 0.111, 0.125, 0.099, 0.036, 0.119, 0.036, 0.05, 0.147, 0.121
• 0.119, 0.121, 0.117, 0.152, 0.026, 0.174, 0.027, 0.065, 0.151, 0.049
• 0.057, 0.07, 0.134, 0.118, 0.058, 0.185, 0.07, 0.13, 0.116, 0.063
• 0.102, 0.082, 0.135, 0.046, 0.128, 0.106, 0.116, 0.085, 0.133, 0.066
• 0.057, 0.193, 0.2, 0.123, 0.022, 0.154, 0.115, 0.025, 0.065, 0.047
• 0.056, 0.196, 0.168, 0.052, 0.116, 0.062, 0.099, 0.133, 0.065, 0.053
• 0.04, 0.022, 0.2, 0.194, 0.038, 0.033, 0.161, 0.097, 0.159, 0.056
• 0.04, 0.036, 0.119, 0.204, 0.16, 0.103, 0.089, 0.061, 0.136, 0.052
• 0.112, 0.189, 0.145, 0.163, 0.113, 0.031, 0.028, 0.062, 0.045, 0.112
• 0.071, 0.099, 0.113, 0.175, 0.082, 0.068, 0.03, 0.066, 0.133, 0.164
• 0.134, 0.074, 0.111, 0.091, 0.051, 0.119, 0.044, 0.085, 0.144, 0.148
• 0.103, 0.126, 0.084, 0.117, 0.084, 0.127, 0.131, 0.092, 0.117, 0.019
• 0.096, 0.121, 0.026, 0.046, 0.043, 0.124, 0.165, 0.04, 0.127, 0.213
• 0.117, 0.115, 0.125, 0.128, 0.081, 0.103, 0.073, 0.044, 0.137, 0.077
• 0.037, 0.021, 0.143, 0.165, 0.075, 0.111, 0.028, 0.132, 0.134, 0.154
• 0.154, 0.049, 0.128, 0.089, 0.082, 0.072, 0.034, 0.138, 0.108, 0.146
• 0.078, 0.141, 0.084, 0.139, 0.085, 0.062, 0.035, 0.174, 0.15, 0.053
• 0.112, 0.112, 0.128, 0.112, 0.107, 0.142, 0.032, 0.142, 0.063, 0.049
• 0.084, 0.091, 0.128, 0.129, 0.045, 0.105, 0.05, 0.091, 0.089, 0.188
• 0.062, 0.136, 0.112, 0.153, 0.091, 0.046, 0.089, 0.03, 0.161, 0.12
• 0.143, 0.1, 0.046, 0.166, 0.107, 0.191, 0.026, 0.078, 0.097, 0.047
• 0.077, 0.174, 0.05, 0.098, 0.028, 0.173, 0.067, 0.106, 0.096, 0.13
• 0.105, 0.022, 0.183, 0.056, 0.045, 0.103, 0.081, 0.135, 0.119, 0.149
• 0.083, 0.127, 0.126, 0.028, 0.209, 0.03, 0.066, 0.125, 0.1, 0.107
• 0.138, 0.142, 0.074, 0.091, 0.103, 0.067, 0.12, 0.04, 0.1, 0.124
• 0.058, 0.039, 0.088, 0.113, 0.093, 0.055, 0.162, 0.069, 0.168, 0.155
• 0.02, 0.162, 0.133, 0.138, 0.137, 0.051, 0.069, 0.032, 0.118, 0.14
• 0.071, 0.046, 0.134, 0.119, 0.159, 0.057, 0.039, 0.135, 0.057, 0.184
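Each row above is a vector of class frequencies over the 10 digit categories. As a hedged illustration of how such a vector could be turned into per-class sample counts, the sketch below applies largest-remainder rounding to the first imbalance. The subset size of 10000 and the rounding scheme are assumptions; the text does not describe how the imbalanced subsets were actually drawn.

```python
def counts_from_frequencies(freqs, total):
    """Convert class frequencies into integer counts that sum to `total`.

    Uses largest-remainder rounding (an illustrative choice).
    """
    scale = total / sum(freqs)
    raw = [f * scale for f in freqs]
    counts = [int(r) for r in raw]
    remainder = total - sum(counts)
    # Give the leftover samples to the classes with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:remainder]:
        counts[i] += 1
    return counts


# First MNIST imbalance from the list above (digits 0 through 9).
first_imbalance = [0.147, 0.037, 0.033, 0.143, 0.136, 0.114, 0.057, 0.112, 0.143, 0.078]
counts = counts_from_frequencies(first_imbalance, total=10000)  # hypothetical subset size
print(counts, sum(counts))
```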
A.2.2 YouTubeFaces

• 0.0189, 0.0131, 0.0242, 0.0201, 0.0284, 0.0225, 0.0526, 0.0103, 0.062, 0.0306, 0.0365, 0.0053, 0.0106, 0.027, 0.0339, 0.0333, 0.0091, 0.0063, 0.0115, 0.0162, 0.0236, 0.0466, 0.028, 0.069, 0.0119, 0.0063, 0.0241, 0.0053, 0.0064, 0.0241, 0.0053, 0.0375, 0.0277, 0.0562, 0.0594, 0.0258, 0.0082, 0.006, 0.0281, 0.0281
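A quick sanity check of the YouTubeFaces frequencies is sketched below: it confirms that the list has 40 entries, that they sum to approximately 1, and that the rounded minimum and maximum agree with the ground-truth min/max reported in A.3 (0.005265 and 0.069044). The variable name is illustrative.

```python
# Ground-truth class frequencies for the first 40 YouTubeFaces categories (from the list above).
youtube_faces = [
    0.0189, 0.0131, 0.0242, 0.0201, 0.0284, 0.0225, 0.0526, 0.0103, 0.062, 0.0306,
    0.0365, 0.0053, 0.0106, 0.027, 0.0339, 0.0333, 0.0091, 0.0063, 0.0115, 0.0162,
    0.0236, 0.0466, 0.028, 0.069, 0.0119, 0.0063, 0.0241, 0.0053, 0.0064, 0.0241,
    0.0053, 0.0375, 0.0277, 0.0562, 0.0594, 0.0258, 0.0082, 0.006, 0.0281, 0.0281,
]

print(len(youtube_faces))                        # number of categories
print(round(sum(youtube_faces), 4))              # total probability mass
print(min(youtube_faces), max(youtube_faces))    # rounded ground-truth min/max
```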
A.3 Discussion about evaluating predicted class imbalance in Sec. 4.2
To measure the ability of a generative model to approximate the class imbalance present in the data, we derive a metric in Section 4.2 of the main paper, the results of which are presented in Table 2. Even though we do get better results as measured by RMSE between the approximated and the original imbalance distribution, we would like to discuss certain flaws associated with this metric.
In its current form, we compute the class histogram for each latent code (using the pretrained classifier, which classifies each fake image into one of the ground-truth categories) and associate the latent code with the most frequent class. If multiple latent codes get associated with the same ground-truth class, there will be ground-truth classes for which the predicted class probability is zero. This is rarely an issue for MNIST, as it only has 10 ground-truth classes, and thus in most cases both our method and the baselines assign each latent code to a unique ground-truth class. However, for YouTubeFaces, after associating latent codes with the ground-truth categories in this manner, roughly 10-13 ground-truth classes (out of 40) get associated with 0 probability for both our approach and the baselines (due to multiple latent codes being associated with the same majority ground-truth class). Our metric may therefore be too strict, especially for difficult settings with many confusing ground-truth categories.
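The failure mode described above can be reproduced on a toy example. The sketch below assigns each latent code to its majority class and computes the RMSE against the ground-truth distribution. Summing the priors of codes that share a majority class is an assumption, as are the function names and the toy numbers; the exact definition is the one given in Section 4.2.

```python
import math
from collections import Counter


def predicted_class_distribution(code_priors, classifier_labels, num_classes):
    """Map each latent code to its majority class and accumulate its prior there.

    code_priors: learned prior probability of each latent code.
    classifier_labels: for each code, the classifier's labels of its generated images.
    Assumption: priors of codes that share a majority class are summed.
    """
    predicted = [0.0] * num_classes
    for prior, labels in zip(code_priors, classifier_labels):
        majority_class, _ = Counter(labels).most_common(1)[0]
        predicted[majority_class] += prior
    return predicted


def rmse(a, b):
    """Root mean squared error between two equal-length distributions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Toy example: 3 latent codes, 3 classes; codes 0 and 1 both map to class 0.
code_priors = [0.5, 0.3, 0.2]
classifier_labels = [[0, 0, 1], [0, 0, 0], [2, 2, 1]]
ground_truth = [0.5, 0.3, 0.2]

predicted = predicted_class_distribution(code_priors, classifier_labels, num_classes=3)
error = rmse(predicted, ground_truth)
print(predicted, error)
```

In this toy case class 1 receives zero predicted probability even though the learned priors match the ground-truth values exactly, which is the strictness the text refers to.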
The tricky part about evaluating how well the model approximates the class imbalance is that there are two key aspects that need to be measured simultaneously. Specifically, not only should (i) the raw probability values discovered match the ground-truth class imbalance distribution, but (ii) the class probabilities approximated by the latent codes must also correspond to the correct ground-truth classes. For example, if the original data had 80% of its samples from class A and 20% from class B, the generative model should not only estimate the imbalance as 80%-20%, but must also associate 80% with class A and 20% with class B (instead of 80% with class B and 20% with class A). Another way to evaluate whether a model is capturing the ground-truth class imbalance could be the FID score, but it is worth noting that a method can still have a good FID score without disentangling the different factors of variation.
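The 80%-20% example can be worked through directly. The sketch below contrasts an RMSE that respects class identity with one that compares only the sorted probability values; the second is a hypothetical comparison introduced here to show what is lost when aspect (ii) is ignored.

```python
import math


def rmse(a, b):
    """Root mean squared error between two equal-length lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def class_aware_rmse(pred, truth):
    """Compares probabilities class by class (aspects (i) and (ii) together)."""
    classes = sorted(truth)
    return rmse([pred[c] for c in classes], [truth[c] for c in classes])


def sorted_value_rmse(pred, truth):
    """Ignores class identity: compares only the sorted probability values (aspect (i) alone)."""
    return rmse(sorted(pred.values()), sorted(truth.values()))


ground_truth = {"A": 0.8, "B": 0.2}
correct = {"A": 0.8, "B": 0.2}
swapped = {"A": 0.2, "B": 0.8}

print(class_aware_rmse(correct, ground_truth), class_aware_rmse(swapped, ground_truth))
print(sorted_value_rmse(swapped, ground_truth))
```

The swapped estimate is indistinguishable from the correct one when only the values are compared, but incurs an error of about 0.6 once class identity is taken into account.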
Given the limitation of our metric on YouTubeFaces, we have also measured the min/max of the predicted prior values. For YouTubeFaces, the min/max of the predicted and ground-truth priors are: Gumbel-Softmax: Min 2.76748415e-05, Max 0.0819286481; Ours without ${L}_{ent}$: Min 0.00211485, Max 0.06152404; Ours complete: Min 0.00336615, Max 0.06798439; and ground truth: Min 0.005265, Max 0.069044. Our full method's min/max most closely matches that of the ground truth, and the overall ordering of the methods follows that of Table 2 using our RMSE-based metric.
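The ordering claimed above can be verified from the reported numbers. The sketch below ranks the methods by the summed absolute gap between their min/max and the ground-truth min/max. The summed-gap score is an illustrative choice rather than a metric from the paper, and the Gumbel-Softmax minimum is read as 2.76748415e-05, on the assumption that the exponent is negative since a minimum cannot exceed the maximum.

```python
# (min, max) of the predicted priors on YouTubeFaces, as reported in the text.
priors = {
    "Gumbel-Softmax": (2.76748415e-05, 0.0819286481),  # exponent sign assumed negative
    "Ours without L_ent": (0.00211485, 0.06152404),
    "Ours complete": (0.00336615, 0.06798439),
}
ground_truth = (0.005265, 0.069044)


def min_max_gap(pred, truth):
    """Summed absolute difference of the min and max values (illustrative score)."""
    return abs(pred[0] - truth[0]) + abs(pred[1] - truth[1])


ranking = sorted(priors, key=lambda name: min_max_gap(priors[name], ground_truth))
for name in ranking:
    print(name, min_max_gap(priors[name], ground_truth))
```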
In sum, we have made an effort to evaluate accurate class imbalance prediction in multiple ways, but it is important to note that this is an area which calls for better metrics to evaluate the model’s ability to approximate the class imbalance distribution.