MixNMatch: Multifactor Disentanglement and Encoding for Conditional Image Generation

  • 2019-11-26 18:49:39
  • Yuheng Li, Krishna Kumar Singh, Utkarsh Ojha, Yong Jae Lee


We present MixNMatch, a conditional generative model that learns to disentangle and encode background, object pose, shape, and texture from real images with minimal supervision, for mix-and-match image generation. We build upon FineGAN, an unconditional generative model, to learn the desired disentanglement and image generator, and leverage adversarial joint image-code distribution matching to learn the latent factor encoders. MixNMatch requires bounding boxes during training to model background, but requires no other supervision. Through extensive experiments, we demonstrate MixNMatch's ability to accurately disentangle, encode, and combine multiple factors for mix-and-match image generation, including sketch2color, cartoon2img, and img2gif applications. Our code/models/demo can be found at https://github.com/Yuheng-Li/MixNMatch



University of California, Davis


Figure 1: Conditional mix-and-match image generation. Our model, MixNMatch, can disentangle and encode up to four factors—background, object pose, shape, and texture—from real reference images, and can arbitrarily combine them to generate new images. The only supervision used to train our model is bounding box annotations to model background.

1 Introduction

Consider the real image of the yellow bird in Figure 1, 1st column. What would the bird look like in a different background, say that of the duck? How about in a different texture, perhaps that of the rainbow textured bird in the 2nd column? What if we wanted to keep its texture, but change its shape to that of the rainbow bird, and background and pose to that of the duck, as in the 3rd column? How about sampling shape, pose, texture, and background from four different reference images and combining them to create an entirely new image (last column)?


While research in conditional image generation has made tremendous progress [Isola-cvpr2017, zhu-iccv2017, park-cvpr2019], no existing work can simultaneously disentangle background, object pose, object shape, and object texture with minimal supervision, so that these factors can be combined from multiple real images for fine-grained controllable image generation. Learning disentangled representations with minimal supervision is an extremely challenging problem, since the underlying factors that give rise to the data are often highly correlated and intertwined. Methods that disentangle two such factors by taking two reference images as input (e.g., one for appearance and the other for pose) do exist [huang-eccv2018, joo-cvpr18, lee-eccv18, lorenz-cvpr2019, xiao-iccv2019], but they cannot disentangle other factors such as foreground vs. background appearance or pose vs. shape. Since only two factors can be controlled, these approaches cannot arbitrarily change, for example, the object’s background, shape, and texture, while keeping its pose the same. Other methods require strong supervision in the form of keypoint/pose/mask annotations [peng-iccv2017, balakrishnan-cvpr2018, ma-cvpr2018, esser-cvpr2018], which limits their scalability, and they still fall short of disentangling all four of the factors outlined above.

Our proposed conditional generative model, MixNMatch, aims to fill this void. MixNMatch learns to disentangle and encode background, object pose, shape, and texture latent factors from real images, and importantly, does so with minimal human supervision. This allows, for example, each factor to be extracted from a different real image, and then combined together for mix-and-match image generation; see Fig. 1. During training, MixNMatch only requires a loose bounding box around the object to model background, but requires no other supervision for modeling the object’s pose, shape, and texture.

Main idea.

Our goal of mix-and-match image generation i.e., generating a single synthetic image that combines different factors from multiple real reference images, requires a framework that can simultaneously learn (1) an encoder that encodes latent factors from real images into a disentangled latent code space, and (2) a generator that takes latent factors from the disentangled code space for image generation. To learn the generator and the disentangled code space, we build upon FineGAN [singh-cvpr2019], a generative model that learns to hierarchically disentangle background, object pose, shape, and texture with minimal supervision using information theory. However, FineGAN is conditioned only on latent random codes, and cannot be directly conditioned on real images for image generation. We therefore need a way to extract latent codes that control background, object pose, shape, and texture from real images, while preserving FineGAN’s hierarchical disentanglement properties. As we show in the experiments, a naive extension of FineGAN in which an encoder is trained to map a fake image into the codes that generated it is insufficient due to the domain gap between real and fake images.

To simultaneously achieve the above dual goals, we instead perform adversarial learning, whereby the joint distribution of real images and their extracted latent codes from the encoder, and the joint distribution of sampled latent random codes and corresponding generated images from the generator, are learned to be indistinguishable, similar to ALI [dumoulin-iclr2017] and BiGAN [donahue-iclr2017]. By enforcing matching joint image-code distributions, the encoder learns to produce latent codes that match the distribution of sampled codes with the desired disentanglement properties, while the generator learns to produce realistic images. To further encode a reference image’s shape and pose factors with high fidelity, we augment MixNMatch with a feature mode, in which higher-dimensional features of the image that preserve pixel-level structure, rather than low-dimensional codes, are mapped onto a richer form of the learned disentangled code space, again via distribution matching using adversarial learning.


Contributions.

(1) We introduce MixNMatch, a conditional generative model that learns to disentangle and encode background, object pose, shape, and texture factors from real images with minimal human supervision. This gives MixNMatch fine-grained control in image generation, where each factor can be uniquely controlled. MixNMatch can take as input either real reference images, sampled latent codes, or a mix of both. (2) Through various qualitative and quantitative evaluations, we demonstrate MixNMatch’s ability to accurately disentangle, encode, and combine multiple factors for mix-and-match image generation. Furthermore, we show that MixNMatch’s learned disentangled representation leads to state-of-the-art fine-grained object category clustering results of real images. (3) We demonstrate a number of interesting applications of MixNMatch including sketch2color, cartoon2img, and img2gif.

Figure 2: MixNMatch architecture. (a) Four different encoders, one for each factor, take a real image as input to predict the codes. (b) Four different latent codes are sampled and fed into the FineGAN generator to hierarchically generate images. (c) Four image-code pair discriminators optimize the encoders and generator, to match their joint image-code distributions.

2 Related work

Conditional image generation

has various forms, including models conditioned on a class label [odena-icml2017, miyato-iclr2018, brock-iclr2019] or text input [reed-icml2016, stackgan2, xu-cvpr2018, yin-cvpr2019]. A lot of work focuses on image-to-image translation, where an image from one domain is mapped onto another domain e.g., [Isola-cvpr2017, zhu-iccv2017, park-cvpr2019]. However, these methods typically lack the ability to explicitly disentangle the factors of variation in the data. Those that do learn disentangled representations focus on specific domains like faces/humans [tran-cvpr2017, peng-iccv2017, bao-cvpr2018, pumarola-eccv2018, balakrishnan-cvpr2018, ma-cvpr2018] or require clearly defined domains (e.g., pose vs. identity or style/attribute vs. content) [joo-cvpr18, huang-eccv2018, lee-eccv18, gonzalez-nips2018, liu-nips2018, xiao-iccv2019]. In contrast, MixNMatch is not specific to any object category, and does not require clearly defined domains as it disentangles multiple factors of variation within a single domain (e.g., natural images of birds). Moreover, unlike most unsupervised methods which can disentangle only two factors like shape and appearance [li-ijcai2018, shu-eccv2018, lorenz-cvpr2019], MixNMatch can disentangle four (background, object shape, pose, and texture).

Disentangled representation learning

aims to disentangle the underlying factors that give rise to real world data [chen-nips16, yan-eccv16, xing-cvpr2018, li-ijcai2018, shu-eccv2018, tulyakov-cvpr18, hu-cvpr18, karras-cvpr2019, lorenz-cvpr2019]. Most unsupervised methods are limited to disentangling at most two factors like shape and texture [li-ijcai2018, shu-eccv2018]. Others require strong supervision in the form of edge/keypoint/mask annotations or detectors [peng-iccv2017, balakrishnan-cvpr2018, ma-cvpr2018, esser-cvpr2018], or rely on video to automatically acquire identity labels [denton-nips2017, joo-cvpr18, xiao-iccv2019]. Our most related work is FineGAN [singh-cvpr2019], which leverages information theory [chen-nips16] to disentangle background, object pose, shape, and texture with minimal supervision. However, it is conditioned only on latent codes, and thus cannot perform image translation. We build upon this work to enable conditioning on real images. Importantly, we show that a naive extension is insufficient, and further improve the quality of our model’s image generations to preserve instance specific details from the reference images. Since MixNMatch is directly conditioned on images, its learned representation leads to better disentanglement and fine-grained clustering of real images.

3 Approach

Let $\mathcal{X} = \{x_1, \ldots, x_N\}$ be an unlabeled image collection of a single object category (e.g., birds). Our goal is to learn a conditional generative model, MixNMatch, which simultaneously learns to (1) encode latent background, object pose, shape, and texture factors associated with images in $\mathcal{X}$ into a disentangled latent code space (i.e., where each factor is uniquely controlled by a code), and (2) generate high quality images matching the true data distribution $P_{data}(x)$ by combining latent factors from the disentangled code space.

We first briefly review FineGAN [singh-cvpr2019], from which we base our generator. We then explain how to train our model to disentangle and encode background, object pose, shape, and texture from real images, so that it can combine different factors from different real reference images for mix-and-match image generation. Lastly, we introduce how to augment our model to preserve object shape and pose information from a reference image with high fidelity (i.e., at the pixel-level), while still altering the background and object texture according to their respective reference images.

3.1 Background: FineGAN

FineGAN [singh-cvpr2019] takes as input four randomly sampled latent codes (z, b, c, p) to hierarchically generate an image in three stages: (1) a background stage where the model only generates the background, conditioned on latent one-hot background code b; (2) a parent stage where the model generates the object’s shape, conditioned on latent one-hot parent code p, and stitches it to the existing background image; and (3) a child stage where the model fills in the object’s texture, conditioned on latent one-hot child code c. In both the parent and child stages, FineGAN automatically generates masks (without any mask supervision) to capture the appropriate shape and texture details. To disentangle the background, it relies on object bounding boxes (e.g., acquired through an object detector). To disentangle the remaining factors of variation without any supervision, FineGAN uses information theory (similar to InfoGAN [chen-nips16]), and imposes specific constraints on the relationships between the latent codes (detailed in Sec. 3.3). These induce the random noise vector z, background code b, parent code p, and child code c to capture the object pose, background, object shape, and object texture, respectively.

FineGAN is trained with three losses, one for each stage, which combine adversarial training [goodfellow-nips2014] and mutual information maximization [chen-nips16]. We simply denote its full loss as:

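$$\mathcal{L}_{finegan} = \mathcal{L}_{b} + \mathcal{L}_{p} + \mathcal{L}_{c}, \qquad (1)$$

where $\mathcal{L}_{b}$, $\mathcal{L}_{p}$, and $\mathcal{L}_{c}$ denote the losses in the background, parent, and child stages. For more details on these losses and the FineGAN architecture, please refer to [singh-cvpr2019].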

3.2 Paired image-code distribution matching

Although FineGAN can disentangle multiple factors to generate realistic images, it is conditioned on sampled latent codes, and cannot be conditioned on real images. A naive post-processing extension, in which encoders learn to map fake images to the codes that generated them, is insufficient due to the domain gap between real and fake images [singh-cvpr2019], as we show in our experiments.

Thus, in order to encode disentangled representations from real image inputs for conditional mix-and-match image generation, we need a way to extract the random vector $z$ (which controls object pose), $b$ (which controls the background), $p$ (which controls object shape), and $c$ (which controls object texture) codes from real images, while preserving the hierarchical disentanglement properties of FineGAN. For this, we propose to train four encoders, each of which predicts one of the $z, b, p, c$ codes from real input images. Since FineGAN has the ability to disentangle factors and generate images given random latent codes, we naturally resort to using it as our generator, keeping all the losses (i.e., $\mathcal{L}_{finegan}$) in the original framework to help the encoders learn the desired disentanglement.

Specifically, for each real training image x, we use the corresponding encoders to extract its z,b,p,c codes. However, we cannot simply input these codes to the generator to reconstruct the image, as the model would take a shortcut and degenerate into a simple autoencoder that does not preserve FineGAN’s disentanglement properties (factorization into background, pose, shape, texture), as we show in our experiments. We therefore leverage ideas from ALI [dumoulin-iclr2017] and BiGAN [donahue-iclr2017] to help the encoders learn the inverse mapping; i.e., a projection of real images into the code space, in a way that maintains the desired disentanglement properties.

The key idea is to perform adversarial learning [goodfellow-nips2014, donahue-iclr2017, dumoulin-iclr2017], so that the paired image-code distribution produced by the encoder $(x \sim P_{data}, \hat{y} \sim E(x))$ and the paired image-code distribution produced by the generator $(\hat{x} \sim G(y), y \sim P_{code})$ are matched. Here $E$ is the encoder, $G$ is the FineGAN generator, and $y$ is a placeholder for the latent codes $z, b, p, c$. $P_{data}$ is the data (real image) distribution and $P_{code}$ is the latent code distribution. (Footnote 1: Following FineGAN [singh-cvpr2019], we use a continuous noise vector $z \sim \mathcal{N}(0, 1)$; a categorical background code $b \sim \mathrm{Cat}(K = N_b, p = 1/N_b)$; a categorical parent code $p \sim \mathrm{Cat}(K = N_p, p = 1/N_p)$; and a categorical child code $c \sim \mathrm{Cat}(K = N_c, p = 1/N_c)$. $N_b$, $N_p$, $N_c$ are the number of background, parent, and child categories and are set as hyperparameters.) Formally, the input to the discriminator $D$ is an image-code pair. When training $D$, we set the paired real image $x$ and code $\hat{y}$ extracted from the encoder $E$ to be real, and the paired sampled input code $y$ and generated image $\hat{x}$ from the generator $G$ to be fake. Conversely, when training $G$ and $E$, we try to fool $D$ so that the paired distributions $P_{(data, E(x))}$ and $P_{(G(y), code)}$ are indistinguishable, via a paired adversarial loss:

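$$\mathcal{L}_{bi\_adv} = \min_{G,E} \max_{D} \; \mathbb{E}_{x \sim P_{data}} \mathbb{E}_{\hat{y} \sim E(x)} [\log D(x, \hat{y})] + \mathbb{E}_{y \sim P_{code}} \mathbb{E}_{\hat{x} \sim G(y)} [\log(1 - D(\hat{x}, y))]. \qquad (2)$$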

This loss simultaneously enforces (1) the generated images $\hat{x} \sim G(y)$ to look real, and (2) the extracted real image codes $\hat{y} \sim E(x)$ to capture the desired factors (i.e., pose, background, shape, appearance). Fig. 2 (a-c) shows our encoders, generator, and discriminators.
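As a minimal sketch of Eq. 2, the value function can be estimated from a batch of discriminator outputs. The function name `bi_adv_value`, the `eps` stabilizer, and the example probabilities are illustrative assumptions, not part of the released code; the discriminator is assumed to output probabilities in (0, 1).

```python
import math

def bi_adv_value(d_real_pairs, d_fake_pairs, eps=1e-12):
    """Monte Carlo estimate of the value function in Eq. 2.

    d_real_pairs: outputs D(x, y_hat) for real images paired with encoder codes.
    d_fake_pairs: outputs D(x_hat, y) for generated images paired with sampled codes.
    eps is a small constant (an assumption) that avoids log(0).
    """
    real_term = sum(math.log(d + eps) for d in d_real_pairs) / len(d_real_pairs)
    fake_term = sum(math.log(1.0 - d + eps) for d in d_fake_pairs) / len(d_fake_pairs)
    return real_term + fake_term

# D tries to maximize this value; G and E try to minimize it.
confident = bi_adv_value([0.9, 0.8], [0.1, 0.2])  # D separates the two pair types
fooled = bi_adv_value([0.5, 0.5], [0.5, 0.5])     # D cannot tell them apart
```

When the discriminator is fully fooled (all outputs 0.5), the value equals 2 log 0.5, which is lower than the value a confident discriminator achieves.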

3.3 Relaxing the latent code constraints

There is an important issue that we must address to ensure disentanglement in the extracted codes. FineGAN imposes strict code relationship constraints, which are key to inducing the desired disentanglement in an unsupervised way, but which can be difficult to realize in all real images. Specifically, during training, FineGAN constrains the sampled child codes into disjoint groups so that each group has the same unique parent code, and enforces the sampled background and child codes for each generated image to be the same [singh-cvpr2019]. This is because objects often differ in texture conditioned on a shared shape (e.g., different duck species share the same shape but differ in their texture details), and background is often correlated with specific object types (e.g., flying birds typically have sky as background).
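The code relationship constraints described above can be sketched as a sampling routine. This is a hedged illustration: the function name, the contiguous equal-sized grouping of child codes, and the small category counts are assumptions made for the example, and FineGAN's actual grouping may differ.

```python
import random

def sample_constrained_codes(n_parent, n_child, z_dim, rng):
    """Sample (z, b, p, c) indices under FineGAN-style constraints.

    Assumption: child codes are split into contiguous, equal-sized groups,
    one group per parent code. Tying b to c implies N_b = N_c.
    """
    assert n_child % n_parent == 0, "children must divide evenly among parents"
    group_size = n_child // n_parent
    c = rng.randrange(n_child)   # child (texture) category
    p = c // group_size          # parent (shape) category shared by the group
    b = c                        # background code tied to the child code
    z = [rng.gauss(0.0, 1.0) for _ in range(z_dim)]  # continuous pose noise
    return z, b, p, c

# Illustrative sizes only; the real values are dataset-specific hyperparameters.
z, b, p, c = sample_constrained_codes(n_parent=4, n_child=12, z_dim=8,
                                      rng=random.Random(0))
```

Under this scheme a sampled (b, p, c) triple is never an arbitrary combination, which is exactly the regularity a real image's extracted codes may fail to satisfy.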

However, for any real image, these strict relationships may not hold (e.g., a flying bird with trees in background), and would thus be difficult to enforce in its extracted codes. In this case, the discriminator would easily be able to tell whether the image-code pair is real or fake (based on the code relationships), which will cause issues with learning. It can also confuse the background b and texture c encoders since the background and child latent codes are always sampled to be the same.

Figure 3: Comparison between code mode & feature mode. Rows 1-3 are real reference images, from which we extract background b, texture c, and shape+pose p & z, respectively. Rows 4-5 are MixNMatch’s feature mode (which accurately preserves original shape information) and code mode (which preserves shape information at a semantic level) generations.

We address this issue in two ways. First, we train four separate discriminators, one for each code type. This prevents any discriminator from seeing the other codes, so it cannot discriminate based on the relationships between codes. Second, when training the encoders, we also provide as input fake images that are generated with randomly sampled codes with the code constraints removed. Specifically, we train the encoders E to predict back the sampled codes y that were used to generate the corresponding fake image:

$$\mathcal{L}_{code\_pred} = CE(E(G(y)), y), \qquad (3)$$

where CE() denotes cross-entropy loss, and y is a place holder for the latent codes b,p,c. (For continuous z, we use L1 loss.) This loss helps to guide each encoder, and in particular the b and c encoders, to learn the corresponding factor. Note that the above loss is used only to update the encoders E, as these fake images can have feature combinations that generally do not exist in the real data distribution (e.g., a duck on top of a tree).
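A minimal sketch of the code prediction loss follows, assuming the encoders emit unnormalized logits for the categorical codes and a real-valued vector for z. The function names, the equal weighting of the four terms, and the mean-reduced L1 are assumptions for illustration.

```python
import math

def cross_entropy(logits, target_index):
    """Cross-entropy of softmax(logits) against a one-hot target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target_index]

def l1_loss(pred, target):
    """Mean absolute error between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(pred, target)) / len(pred)

def code_pred_loss(pred, sampled):
    """pred: encoder outputs on a fake image G(y); sampled: the codes y used.

    Categorical codes b, p, c use cross-entropy; continuous z uses L1.
    """
    loss = sum(cross_entropy(pred[k], sampled[k]) for k in ("b", "p", "c"))
    return loss + l1_loss(pred["z"], sampled["z"])

sampled = {"b": 1, "p": 0, "c": 3, "z": [0.5, -0.5]}
uninformed = {"b": [0.0] * 4, "p": [0.0] * 2, "c": [0.0] * 4, "z": [0.5, -0.5]}
informed = {"b": [0.0, 5.0, 0.0, 0.0], "p": [5.0, 0.0],
            "c": [0.0, 0.0, 0.0, 5.0], "z": [0.5, -0.5]}
loss_uninformed = code_pred_loss(uninformed, sampled)
loss_informed = code_pred_loss(informed, sampled)
```

In training, only the encoder parameters would receive gradients from this term, consistent with the note above that the loss is used only to update E.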

3.4 Optional feature mode for exact shape and pose

Thus far, MixNMatch’s encoders can take in up to four different real images and encode them into b,z,p,c codes which model the background, object pose, shape, and texture, respectively. These codes can then be used by MixNMatch’s generator to generate realistic images, which combine the four factors from the respective reference images. We denote this setting as MixNMatch’s code mode. While the generated images already capture the factors with high accuracy (see Fig. 3, “code mode”), certain image translation applications may require exact pixel-level shape and pose alignment between a reference image and the output.

The main reason that MixNMatch in code mode cannot preserve exact pixel-level shape and pose details of a reference image is that the capacity of the latent code space is too small to model per-instance pixel-level details (typically, tens of dimensions for p, which is responsible for capturing shape). The code space must be small because its size must (roughly) match, e.g., the number of unique modes of the corresponding factor. In this section, we introduce MixNMatch’s optional feature mode to address this. Rather than encode a reference image into a low-dimensional code, the key idea is to directly learn a mapping from the image to a higher-dimensional feature space that preserves rich shape and pose (pixel-level) details.

Specifically, we take our learned MixNMatch generator $G$, and use it to train a new shape and pose feature extractor $S$, which takes as input a real image $x$ and outputs feature $S(x)$. Recall that $G$ takes as input a code $y$ to generate the image $G(y)$; i.e., $y \rightarrow G(y)$. Let’s denote an intermediate parent stage feature (which captures shape and pose) from the generator as $\phi(y)$; i.e., $y \rightarrow \phi(y) \rightarrow G(y)$. We use the standard adversarial loss [goodfellow-nips2014] to train $S$ so that the distribution of $S(x)$ matches that of $\phi(y)$ (i.e., only $S$ is learned and $\phi(y)$ is produced from the already trained $G$). Ultimately, this trains $S$ to produce features that match those sampled from the $\phi(y)$ distribution, which already has learned to encode shape and pose. Next, to enforce $S$ to preserve instance-specific shape and pose details of $x$ (i.e., so that the resulting generated image using $S(x)$ is spatially-aligned to $x$), we randomly generate fake images using $G$, and for each fake image $G(y)$, we enforce an L1 loss between the feature $\phi(y)$ generated using the sampled codes $y$ and the feature $S(G(y))$. See supp. for the network figure.
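One plausible way to write the two feature mode objectives described above is given below. The feature discriminator $D_{\phi}$ and the loss names $\mathcal{L}_{feat\_adv}$ and $\mathcal{L}_{feat\_rec}$ are hypothetical labels introduced here for illustration, and treating $\phi(y)$ as the "real" sample is an assumed convention; the exact formulation and any loss weighting are in the supplementary material.

```latex
\mathcal{L}_{feat\_adv} = \min_{S} \max_{D_{\phi}} \;
  \mathbb{E}_{y \sim P_{code}} \big[ \log D_{\phi}(\phi(y)) \big]
  + \mathbb{E}_{x \sim P_{data}} \big[ \log \big( 1 - D_{\phi}(S(x)) \big) \big],
\qquad
\mathcal{L}_{feat\_rec} = \mathbb{E}_{y \sim P_{code}}
  \big\| \phi(y) - S(G(y)) \big\|_{1}.
```

The first term aligns the distribution of $S(x)$ with that of $\phi(y)$, while the second ties $S$ to instance-level structure, since $\phi(y)$ is known exactly for every generated image $G(y)$.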

Once trained, we can use MixNMatch’s feature mode to extract the pixel-aligned pose and shape feature S(x) from an input image x, and combine it with the background b and texture c codes extracted from (up to) two reference images, to perform conditional mix-and-match image generation.

Figure 4: Varying a single factor. Real images are indicated with red boxes. For (a-d), the reference images on the left/top provide three/one factors. The center 3x3 images are generations. For example, in (a) the top row yellow bird has an upstanding pose with its head turned to the right, and the resulting images have the same pose.

4 Experiments

We evaluate MixNMatch’s conditional mix-and-match image generation results, its ability to disentangle each latent factor, and its learned representation for fine-grained object clustering of real images. We also showcase sketch2color, cartoon2img, and img2gif applications.


Datasets.

(1) CUB [wah-tech11]: 11,788 bird images from 200 classes; (2) Stanford Dogs [khosla-FGVC11]: 12,000 dog images from 120 classes; (3) Stanford Cars [krause-DRR2013]: 8,144 car images from 196 classes. We set the prior latent code distributions following FineGAN [singh-cvpr2019]. The only supervision we use is bounding boxes to model background during training.


Baselines.

We compare to a number of state-of-the-art GAN, disentanglement, and clustering methods. For all methods, we use the authors’ public code. The code for SC-GAN [kazemi-wacv2018] only has the unconditional version, so we implement its BiGAN [donahue-iclr2017] variant following the paper details.

Implementation details.

We train and generate 128 x 128 images. In feature mode (2nd stage) training, ϕ(y) is a learned distribution from the code mode (1st stage) and may not model the entire real feature distribution (e.g., due to mode collapse). Thus, we assume that patch-level features are better modeled, and apply a patch discriminator. For our feature mode, since the predicted object masks are often highly accurate, we can optionally directly stitch the foreground (if only changing background) or background (if only changing texture) from the corresponding reference image. When optimizing Eqn. 2, we add noise to D since the sampled c, p, b are one hot, while predicted c^, p^, b^ will never be one-hot. Full training details are in the supp.

4.1 Qualitative Results

Conditional mix-and-match image generation.

We show results on CUB, Stanford Cars, and Stanford Dogs; see Fig. 3. The first three rows show the background, texture, and shape + pose reference (real) images from which our model extracts b, c, and p & z, respectively, while the fourth and fifth rows show MixNMatch’s feature mode and code mode generation results, respectively.

Our feature mode results (4th row) demonstrate how well MixNMatch preserves shape and pose information from the reference images (3rd row), while transferring background and texture information (from the 1st and 2nd rows). For example, the generated bird in the second column preserves the exact pose and shape of the bird standing on the pipe (3rd row) and transfers the brownish bark background and rainbow object texture from the 1st and 2nd row images, respectively. Our code mode results (5th row) also capture the different factors from the reference images well, though not as well as the feature mode for pose and shape. Thus, this mode is more useful for applications in which inexact instance-level pose and shape transfer is acceptable (e.g., generating a completely new instance which captures the factors at a high level). Overall, these results highlight how well MixNMatch disentangles and encodes factors from real images, and preserves them in the generation.

Figure 5: Latent code interpolation. Images in the red boxes are real, and intermediate images are generated by linearly interpolating codes predicted by our encoders.

Note that here we take both z and p from the same reference image (row 3) in order to perform a direct comparison between the code and feature modes. We next show results of disentangling all four factors, including z and p.

Figure 6: sketch2color. First three rows are real reference images. Last row shows generation results of adding background and texture to the sketch images.

Disentanglement of factors.

Here we evaluate how well MixNMatch disentangles each factor (background b, texture c, pose z, shape p). Fig. 4 shows our disentanglement of each factor on CUB (results for Dogs and Cars are in the supp.). For each subfigure, the images in the top row and leftmost column (with red borders) are real reference images. The specific factors taken from each image are indicated in the top-left corner; e.g., in (a), pose is taken from the top row, while background, shape, texture are taken from the leftmost column. Note how we can make (a) a bird change poses by varying z, (b) change just the background by varying b, (c) colorize by varying c, and (d) change shape by varying p (e.g., see the duck example in 3rd column).

Latent code interpolation.

In Fig. 5 we encode the z, b, p, c codes from the two real images (first and last columns), linearly interpolate each code, and generate the interpolated images. MixNMatch produces perceptually smooth transitions for each factor, which again suggests that it has learned a highly disentangled latent space [karras-cvpr2019].
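The interpolation step can be sketched as a per-factor linear blend of two sets of encoder outputs. The dictionary layout, function names, and toy code values below are assumptions for illustration; each interpolated set of codes would then be passed to the generator.

```python
def lerp(a, b, t):
    """Linear interpolation between two equal-length vectors, t in [0, 1]."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def interpolate_codes(codes_a, codes_b, steps):
    """Return `steps` code sets moving from codes_a to codes_b (steps >= 2).

    Each code set is a dict mapping a factor name to its predicted code vector.
    """
    frames = []
    for i in range(steps):
        t = i / (steps - 1)
        frames.append({k: lerp(codes_a[k], codes_b[k], t) for k in codes_a})
    return frames

# Toy encoder outputs for two images (values are illustrative only).
codes_a = {"z": [0.0, 2.0], "b": [1.0, 0.0], "p": [1.0, 0.0], "c": [0.0, 1.0]}
codes_b = {"z": [4.0, 0.0], "b": [0.0, 1.0], "p": [1.0, 0.0], "c": [1.0, 0.0]}
frames = interpolate_codes(codes_a, codes_b, steps=3)
```

Because the encoders predict soft (non-one-hot) categorical codes, intermediate blends such as an equal mix of two background codes are ordinary inputs to the generator rather than out-of-range values.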

| Method | IS (Birds) | IS (Dogs) | IS (Cars) | FID (Birds) | FID (Dogs) | FID (Cars) |
|---|---|---|---|---|---|---|
| Simple-GAN | 31.85 ± 0.17 | 6.75 ± 0.07 | 20.92 ± 0.14 | 16.69 | 261.85 | 33.35 |
| InfoGAN [chen-nips16] | 47.32 ± 0.77 | 43.16 ± 0.42 | 28.62 ± 0.44 | 13.20 | 29.34 | 17.63 |
| LR-GAN [yang-iclr17] | 13.50 ± 0.20 | 10.22 ± 0.21 | 5.25 ± 0.05 | 34.91 | 54.91 | 88.80 |
| StackGANv2 [stackgan2] | 43.47 ± 0.74 | 37.29 ± 0.56 | 33.69 ± 0.44 | 13.60 | 31.39 | 16.28 |
| FineGAN [singh-cvpr2019] | 52.53 ± 0.45 | 46.92 ± 0.61 | 32.62 ± 0.37 | 11.25 | 25.66 | 16.03 |
| MixNMatch (Ours) | 53.03 ± 0.55 | 47.16 ± 0.67 | 32.68 ± 0.47 | 12.49 | 20.98 | 17.23 |

Table 1: Image quality & diversity. Inception Score (IS, higher is better) and FID (lower is better). MixNMatch generates diverse, high-quality images that compare favorably to state-of-the-art baselines.

sketch2color / cartoon2img.

MixNMatch can also adapt to other domains not seen during training. Figs. 6 and 7 show results where shape and pose information are taken from sketch / cartoon images. The results also indicate that MixNMatch learns part information, without supervision. For example, in Fig. 7 column 2, it can correctly transfer the black, white, and red part colors to the rubber duck.


img2gif.

MixNMatch can also be used to animate a static image; see Fig. 8 and supp. for a video result.

4.2 Quantitative Results

Image diversity and quality.

We compute Inception Score [salimans-nips16] and FID [FID] over 30K randomly generated images. We condition the generation only on sampled latent codes (by sampling z, p, c, b from their prior distributions; see Footnote 1), and not on real image inputs, for a fair comparison. Table 1 shows that MixNMatch generates diverse and realistic images that are competitive to state-of-the-art unconditional GAN methods.

Fine-grained object clustering.

We next evaluate MixNMatch’s learned representation for clustering real images into fine-grained object categories. We compare to state-of-the-art unsupervised deep clustering methods: FineGAN [singh-cvpr2019], JULE [yang-cvpr16], and DEPICT [dizaji-iccv17], and their stronger variants [singh-cvpr2019]: JULE-Res50 and DEPICT-Large. For evaluation metrics, we use Normalized Mutual Information (NMI) [xu-sigir03] and Accuracy [dizaji-iccv17], which measures the best mapping between predicted and ground truth labels.

To cluster real images, we use MixNMatch’s p (shape) and c (texture) encoders as fine-grained feature extractors. For each image, we concatenate their L2-normalized penultimate features, and run k-means clustering with k = # of ground-truth classes. MixNMatch’s features lead to significantly more accurate clusters than the baselines; see Table 2. JULE and DEPICT focus more on background and rough shape information instead of fine grained details, and thus have relatively low performance. Although FineGAN performs much better, to extract features for real images, it trains encoders post-hoc on fake images to repredict their corresponding latent codes (as it cannot directly condition its generator on real images) [singh-cvpr2019]. Thus, there is a domain gap to the real image domain. In contrast, MixNMatch’s encoders are trained to extract features from both real and fake images, so it does not suffer from domain differences.
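The clustering pipeline above can be sketched in a few lines. This is an illustrative, dependency-free version: the function names are assumptions, the toy 2-D features stand in for the encoders' penultimate activations, and the tiny Lloyd-style k-means with fixed initial centers is a stand-in for whatever k-means implementation was actually used.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def clustering_feature(shape_feat, texture_feat):
    """Concatenate the L2-normalized shape (p) and texture (c) features."""
    return l2_normalize(shape_feat) + l2_normalize(texture_feat)

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, init_centers, iters=20):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = [list(c) for c in init_centers]
    labels = []
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy (shape, texture) features for four images.
feats = [clustering_feature([3.0, 0.0], [0.0, 2.0]),
         clustering_feature([6.0, 0.0], [0.0, 4.0]),
         clustering_feature([0.0, 1.0], [5.0, 0.0]),
         clustering_feature([0.0, 2.0], [1.0, 0.0])]
labels = kmeans(feats, init_centers=[feats[0], feats[2]])
```

Normalizing each encoder's features separately before concatenation keeps either factor from dominating the Euclidean distances that k-means relies on.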

Figure 7: cartoon2img. Note how MixNMatch automatically learns part semantics, without supervision; e.g., in the second column, the colors of the texture reference (2nd row) are accurately transferred in the generation.
| Method | NMI (Birds) | NMI (Dogs) | NMI (Cars) | Accuracy (Birds) | Accuracy (Dogs) | Accuracy (Cars) |
|---|---|---|---|---|---|---|
| JULE [yang-cvpr16] | 0.204 | 0.142 | 0.232 | 0.045 | 0.043 | 0.046 |
| JULE-ResNet-50 [yang-cvpr16] | 0.203 | 0.148 | 0.237 | 0.044 | 0.044 | 0.050 |
| DEPICT [dizaji-iccv17] | 0.290 | 0.182 | 0.329 | 0.061 | 0.052 | 0.063 |
| DEPICT-Large [dizaji-iccv17] | 0.297 | 0.183 | 0.330 | 0.061 | 0.054 | 0.062 |
| FineGAN [singh-cvpr2019] | 0.403 | 0.233 | 0.354 | 0.126 | 0.079 | 0.078 |
| MixNMatch (Ours) | 0.426 | 0.299 | 0.364 | 0.137 | 0.109 | 0.084 |

Table 2: Fine-grained object clustering. Our approach outperforms state-of-the-art clustering methods.

Shape and texture disentanglement.

In order to quantitatively evaluate MixNMatch’s disentanglement of shape and texture, we propose the following evaluation metric: We randomly sample 5000 image pairs (A, B) and generate new images C, which take texture and background (codes c, b) from image A, and shape and pose from image B (codes p, z). If a model disentangles these factors well and preserves them in the generated images, then the position of part keypoints (e.g., beak, tail) in B should be close to that in C, while the texture of those keypoints in A should be similar to that in C; see Fig. 9.

To measure how well shape is preserved, we train a keypoint detector [he-iccv2017] on CUB, and use it to detect 15 keypoints in generated image C. We then calculate the L2-distance (in x,y coordinate space) to the corresponding visible keypoints in image B. To measure how well texture is preserved, for each keypoint in images A and C, we first crop a 16x16 patch centered on it. We then compute the χ2-distance between the L1-normalized color histograms of the corresponding patches in A and C. See supp. for more details.
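The shape measurement above can be sketched as follows. This is an illustrative helper only: the function name, the per-image averaging, and the boolean visibility list are assumptions, and the keypoint coordinates would come from the trained detector rather than being passed in by hand.

```python
import math

def keypoint_l2_distance(kps_b, kps_c, visible_b):
    """Mean Euclidean distance between matching keypoints of images B and C.

    kps_b, kps_c: lists of (x, y) tuples, one per keypoint, in the same order.
    visible_b: list of booleans, True where the keypoint is visible in B.
    Returns None when no keypoint is visible.
    """
    dists = [
        math.hypot(xb - xc, yb - yc)
        for (xb, yb), (xc, yc), vis in zip(kps_b, kps_c, visible_b)
        if vis
    ]
    return sum(dists) / len(dists) if dists else None

# Two keypoints; only the first is visible in B, so only it contributes.
example = keypoint_l2_distance([(0, 0), (10, 10)], [(3, 4), (0, 0)], [True, False])
```

A lower value means the keypoints in the generated image C stay closer to their positions in the shape reference B.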

Table 3 (top) shows the results averaged over all 15 keypoints among all 5000 image triplets. We compare to FineGAN [singh-cvpr2019], SC-GAN [kazemi-wacv2018], a generative model that disentangles style (texture) and content (geometrical information), and Deforming AE [shu-eccv2018], a generative autoencoder that disentangles shape and texture from real images via unsupervised deformation constraints. Fig. 9 shows qualitative comparisons. The results clearly indicate that MixNMatch can better disentangle and preserve shape and texture compared to the baselines. SC-GAN does not explicitly differentiate background and foreground and uses a condensed code space to model content and style, so it has difficulty transferring texture and shape accurately. Deforming AE fails because its assumption that an image can be factorized into a canonical template and a deformation field is difficult to realize in complicated shapes such as birds in CUB. Finally, FineGAN performs better than these methods, but it again is hindered by the domain gap.

Figure 8: image2gif. MixNMatch can combine the pose factor z from a reference video (top row), with the other factors in a static image (1st column) to animate the object.
Figure 9: Shape & texture disentanglement. Our approach preserves shape, texture better than strong baselines.
Method                              Shape   Texture
Deforming AE [shu-eccv2018]         69.97   0.792
SC-GAN [kazemi-wacv2018]            32.37   0.641
FineGAN [singh-cvpr2019]            21.04   0.602
MixNMatch (Ours)                    17.58   0.577

Code mode w/o paired adv loss       64.76   0.677
Code mode w/o code reprediction     47.28   0.708
Code mode w/ code constraint        43.26   0.592
Feature mode w/o L1 loss            57.82   0.626
Feature mode w/o adv loss           23.86   0.602
Table 3: Shape & texture disentanglement. (Top) Comparisons to baselines. (Bottom) Ablation studies. We report keypoint L2-distance and color histogram χ2-distance for measuring shape and texture disentanglement (lower is better).

Ablation Studies.

Finally, we study MixNMatch’s various components: 1) without the paired image-code adversarial loss, where we drop Eqn. 2 and instead feed the code predicted by the encoder directly to the generator, applying an L1 loss between the generated and real images; 2) without the code reprediction loss, where we do not apply Eqn. 3; 3) with the code reprediction loss but with code constraints, where we keep FineGAN’s code constraints when generating fake images; 4) without the feature mode L1 loss, where we only apply an adversarial loss between S(x) and ϕ(y); 5) without the feature mode adversarial loss, where we only use the L1 loss in feature mode training.

Table 3 (bottom) shows that all losses are necessary in code mode (first stage) training; otherwise, disentanglement cannot be learned properly. In feature mode (second stage) training, both adversarial and L1 losses are helpful, as they adapt the model to the real image domain to extract precise shape + pose information from the reference image.


There are some limitations worth discussing. First, our generated background can miss large structures, as we use a patch-level background discriminator. Second, the feature mode (second stage) training depends on, and is sensitive to, how well the model is trained in the code mode (first stage). Third, z controls the size of the object in addition to its pose, so the projected shape in the generated image may appear not to match the shape (p) reference image (e.g., a big bird seen from far away will look small). Finally, for reference images whose background and object texture are very similar, our model can fail to produce a good object mask, and thus generate an incomplete object.


This work was supported in part by NSF CAREER IIS-1751206, IIS-1748387, AWS ML Research Award, Adobe Data Science Research Award, and Google Cloud Platform research credits.



In this supplementary material, we first describe key training details. Next, we elaborate on our model’s feature mode (second stage) training. Then, in Sec. 3, we discuss the usage of bounding box annotations during training for background generation. In Sec. 4, we provide details on texture disentanglement, and report shape and texture disentanglement results for all 15 keypoints for all methods. In addition, we also compare the performance of our model in two modes (code and feature). Finally, in the last two sections, we show more qualitative disentanglement results and discuss the video clips which further demonstrate the disentanglement ability of our model.

1 Training details

We optimize our model using Adam with learning rate 0.0002, β1=0.5, β2=0.999 for 600 epochs. Following FineGAN [singh-cvpr2019], we crop all the images to 1.5× of their available bounding boxes.
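The cropping step can be illustrated with a small helper that enlarges a bounding box by a factor of 1.5. Scaling about the box center, rounding to integer pixels, and clipping to the image bounds are assumptions made for this sketch; the paper only states the 1.5× factor.

```python
def expand_bbox(x, y, w, h, img_w, img_h, scale=1.5):
    """Scale an (x, y, w, h) box about its center and clip it to the image.

    Returns (left, top, right, bottom) in integer pixel coordinates.
    """
    cx, cy = x + w / 2.0, y + h / 2.0
    new_w, new_h = w * scale, h * scale
    left = max(0, int(round(cx - new_w / 2.0)))
    top = max(0, int(round(cy - new_h / 2.0)))
    right = min(img_w, int(round(cx + new_w / 2.0)))
    bottom = min(img_h, int(round(cy + new_h / 2.0)))
    return left, top, right, bottom

# A 100x100 box at (100, 100) inside a 500x500 image.
crop = expand_bbox(100, 100, 100, 100, 500, 500)
```

When the enlarged box would extend past the image border, the clipped crop is smaller than 1.5× the original box on that side.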

As mentioned in the main paper, in our code mode (first stage) training, we use four paired discriminators to help the encoders learn disentanglement. Each paired discriminator has two initial branches of convolution blocks, which process the code and the image, respectively. Their outputs are then concatenated and fed into a series of convolution blocks to predict whether the input image-code pair is real or fake (during training, we treat the image-code pair from the encoders as real, and the image-code pair from the generator as fake). In the code branch, we add Gaussian noise after each activation layer to prevent the discriminator from trivially recognizing that the one-hot code in an image-code pair from the generator is fake (since the code produced by the encoders will never be one-hot). We also update the paired discriminators using Wasserstein GAN with gradient penalty [gulrajani-nips17].
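For reference, the standard WGAN-GP critic objective of [gulrajani-nips17] can be written for an image-code pair as below. This is a sketch under assumptions: applying the critic $D$ to pairs $(x, y)$, the penalty weight $\lambda$, and the interpolated pair $(\hat{x}, \hat{y})$ are our reading of how the objective would carry over to the paired setting, and are not spelled out in the text above.

```latex
\mathcal{L}_{D} =
  \mathbb{E}_{(\tilde{x}, \tilde{y}) \sim P_{g}} \big[ D(\tilde{x}, \tilde{y}) \big]
  - \mathbb{E}_{(x, y) \sim P_{r}} \big[ D(x, y) \big]
  + \lambda \, \mathbb{E}_{(\hat{x}, \hat{y})}
    \Big[ \big( \lVert \nabla_{(\hat{x}, \hat{y})} D(\hat{x}, \hat{y}) \rVert_{2} - 1 \big)^{2} \Big]
```

Here $P_{r}$ denotes pairs of a real image and its encoded code (treated as real), $P_{g}$ denotes pairs of a sampled code and its generated image (treated as fake), and $(\hat{x}, \hat{y})$ is sampled uniformly along straight lines between pairs drawn from $P_{r}$ and $P_{g}$.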

Figure 10: MixNMatch architecture for feature mode (second stage) training. We take an intermediate feature—shown in blue—from the parent generator (with parameters fixed) as the real distribution, and train a shape feature extractor S to predict that feature via an adversarial loss.
            Deforming AE      SC-GAN            FineGAN           MixNMatch (c)     MixNMatch (f)
            [shu-eccv2018]    [kazemi-wacv2018] [singh-cvpr2019]
Keypoint    shape  texture    shape  texture    shape  texture    shape  texture    shape  texture
back        75.08  0.816      27.60  0.679      16.69  0.637      19.40  0.612      14.66  0.615
beak        62.54  0.707      32.92  0.565      21.16  0.599      25.04  0.552      14.76  0.553
belly       61.58  0.873      30.86  0.778      19.56  0.683      22.59  0.674      17.80  0.651
breast      66.93  0.859      33.36  0.757      18.81  0.669      22.68  0.646      16.98  0.657
crown       81.75  0.773      32.52  0.631      19.31  0.614      22.75  0.578      14.23  0.577
forehead    70.64  0.759      29.29  0.572      18.67  0.570      22.49  0.518      12.83  0.519
left eye    66.13  0.809      27.84  0.586      17.87  0.540      20.87  0.492      13.44  0.493
left leg    70.32  0.800      34.53  0.573      26.03  0.585      26.86  0.535      23.61  0.550
left wing   68.53  0.809      34.98  0.714      25.40  0.609      27.24  0.612      24.79  0.625
nape        80.72  0.807      32.05  0.675      18.27  0.613      22.08  0.591      15.31  0.601
right eye   54.14  0.810      28.53  0.587      17.66  0.533      20.92  0.486      12.36  0.499
right leg   74.57  0.773      33.23  0.569      24.50  0.583      27.27  0.538      23.55  0.561
right wing  68.99  0.859      32.28  0.698      23.43  0.592      24.45  0.600      22.16  0.626
tail        67.42  0.635      42.52  0.591      28.97  0.617      29.53  0.566      22.29  0.571
throat      80.34  0.792      33.05  0.641      19.29  0.596      23.24  0.551      14.95  0.562
mean        69.98  0.792      32.37  0.641      21.04  0.602      23.83  0.570      17.58  0.577
Table 4: Shape & texture disentanglement. MixNMatch outperforms strong baselines in terms of both shape and texture disentanglement for all keypoints. (c) is code mode, (f) is feature mode.

2 Feature mode details

Fig. 10 shows our architecture for feature mode (second stage) training, in which we only train a shape and pose feature extractor S. Concretely, we fix the trained code mode (first stage) MixNMatch generator and treat it as a provider of the real feature distribution. With equal probability, we either randomly sample p and z codes from their prior distributions (categorical and normal, respectively) or predict p and z with their trained encoders from randomly sampled real images. We feed these codes into the fixed parent stage generator to get an intermediate feature ϕ(p,z) (we use the feature Fp output by generator Gp, following [singh-cvpr2019]). As this feature is the output of the parent stage generator, it only contains shape and pose information. Thus, by applying an adversarial loss on the feature extractor S to match the distribution of ϕ(p,z), we can extract shape and pose information from real images x.
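The code sampling described above might look like the following sketch. It assumes a uniform categorical prior for p, a standard normal prior for z, and a 50/50 choice between prior codes and encoder-predicted codes; `encode_p` and `encode_z` are hypothetical stand-ins for the trained encoders, and the resulting codes would then be passed to the fixed parent stage generator to obtain ϕ(p,z).

```python
import random

def sample_prior_codes(num_shapes, z_dim, rng):
    """Draw p as a one-hot categorical code and z from a standard normal."""
    p = [0.0] * num_shapes
    p[rng.randrange(num_shapes)] = 1.0
    z = [rng.gauss(0.0, 1.0) for _ in range(z_dim)]
    return p, z

def sample_codes(real_images, encode_p, encode_z, num_shapes, z_dim, rng):
    """With probability 0.5 use prior codes, otherwise encode a random real image."""
    if rng.random() < 0.5:
        return sample_prior_codes(num_shapes, z_dim, rng)
    x = rng.choice(real_images)
    return encode_p(x), encode_z(x)

# Example draw from the priors with a seeded generator.
rng = random.Random(0)
p_code, z_code = sample_prior_codes(num_shapes=5, z_dim=3, rng=rng)
```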

As mentioned in the main paper, we use a patch discriminator for this feature mode (second stage) training; specifically, we use a patch size of 34 x 34. Finally, in order to preserve instance-specific shape and pose details, we also generate fake images using our pretrained MixNMatch generator and compute their ϕ(p,z). Then, for each fake image, we input it into the feature extractor S, and apply an L1 loss between the resulting feature and its ϕ(p,z). In summary, our loss to train S is:

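$$\mathcal{L}_{S} = \mathcal{L}_{adv} + \mathcal{L}_{L1} \qquad (4)$$

where $\mathcal{L}_{adv} = \min_{S} \max_{D_S} \mathbb{E}_{\phi(p,z)}[\log(D_S(\phi(p,z)))] + \mathbb{E}_{x}[\log(1 - D_S(S(x)))]$ and $\mathcal{L}_{L1} = \lVert S(G(b,c,p,z)) - \phi(p,z) \rVert_1$. Here $D_S$ is the feature discriminator.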

3 Background modeling

As mentioned in the main paper, we only use bounding box annotations during training to model the background. Since we do not have any background training images without the object-of-interest (e.g., trees without bird), for each training image, we treat patches that are completely outside of the bounding box annotated (object) region as being the “real” background patches. We then train the background generator Gb to generate realistic background images, by applying a patch-level background discriminator Db using the adversarial loss, following [singh-cvpr2019].
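One way to picture the selection of "real" background patches is a sliding-window mask that keeps only patches with no overlap with the bounding box. The sketch below is illustrative: the patch size, stride, and the explicit window enumeration are assumptions, and the actual model may derive this mask from the discriminator's receptive fields instead.

```python
def background_patch_mask(img_w, img_h, bbox, patch, stride):
    """Mark sliding-window patches that lie completely outside the bounding box.

    bbox is (left, top, right, bottom) in pixels, with right/bottom exclusive.
    Returns a 2D list of booleans, True for patches usable as "real" background.
    """
    left, top, right, bottom = bbox
    mask = []
    for py in range(0, img_h - patch + 1, stride):
        row = []
        for px in range(0, img_w - patch + 1, stride):
            # Two axis-aligned rectangles overlap only if they overlap on both axes.
            overlaps = (
                px < right and px + patch > left
                and py < bottom and py + patch > top
            )
            row.append(not overlaps)
        mask.append(row)
    return mask

# An 8x8 image whose object occupies the bottom-right quadrant.
mask = background_patch_mask(8, 8, bbox=(4, 4, 8, 8), patch=4, stride=4)
```

Only the patches flagged True would be fed to the background discriminator Db as real examples.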

Once our model is trained, we do not need any bounding box annotations for image generation.

4 Shape & texture disentanglement evaluation

We first elaborate on how we evaluate texture disentanglement in Sec. 4.2 of the main paper. Recall that our goal is to take texture and background (codes c, b) from image A and shape and pose (codes p, z) from image B to generate a new image C. To measure how well texture information is disentangled and preserved in the generated image C, we first calculate 50 RGB cluster centers among 50,000 randomly sampled pixels from 1000 images (50 pixels per image) from the CUB dataset [wah-tech11]. We then run our pre-trained keypoint detector, and crop a 16x16 patch centered on each keypoint from images A and C. For each patch, we compute its histogram representation by assigning each pixel to one of the color centers. Finally, we calculate the χ2-distance between the L1-normalized color histograms of the patch in image A and the corresponding patch in image C. Since images A and B can have different poses and hence occluded parts, we only consider keypoints which are visible in both images.
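The texture measurement can be sketched in a few lines. The helper names are hypothetical, the two color centers in the example stand in for the 50 RGB cluster centers, and the χ2-distance uses the 0.5 · Σ (a − b)² / (a + b) convention, which is an assumption since the exact variant is not stated.

```python
def nearest_center(pixel, centers):
    """Index of the RGB cluster center closest to the pixel (squared Euclidean)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(pixel, centers[k])),
    )

def color_histogram(patch_pixels, centers):
    """L1-normalized histogram of a patch over the color centers."""
    counts = [0] * len(centers)
    for pixel in patch_pixels:
        counts[nearest_center(pixel, centers)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(centers)

def chi2_distance(h1, h2):
    """Chi-squared distance; empty bins in both histograms are skipped."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

# Toy example with two color centers (black and white).
centers = [(0, 0, 0), (255, 255, 255)]
hist_a = color_histogram([(0, 0, 0)] * 4, centers)        # all-dark patch from A
hist_c = color_histogram([(250, 250, 250)] * 4, centers)  # all-bright patch from C
distance = chi2_distance(hist_a, hist_c)
```

Under this convention the distance ranges from 0 for identical histograms to 1 for histograms with no shared bins.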

Next, in Table 4, we evaluate shape and texture disentanglement for all 15 keypoints. MixNMatch consistently outperforms the baselines for all keypoints. Our feature mode has the best performance for shape disentanglement due to its ability to preserve instance-specific shape and pose details. In contrast, our code mode model has the best performance for texture disentanglement. One reason that the feature mode texture disentanglement result is slightly worse than that of code mode is that MixNMatch in feature mode can sometimes generate suboptimal masks (due to very similar background and object texture in the shape and pose reference images), leading to incomplete image generations.

5 Additional results of varying a single factor

Figs. 11, 12, and 13 show additional disentanglement results of varying each factor for CUB, Dogs and Cars, respectively. These results supplement Fig. 4 from the main paper. In each sub-figure, images in the red boxes are real and we only change one factor indicated in the top left corner for generating the new images.

6 Video results

Finally, we include two videos demonstrating the disentanglement learned by MixNMatch. In MixNMatch.mp4, the four reference images on the top are real images which provide the four factors (background, shape, texture, and pose, respectively). The generated image is shown at the bottom. Each time we change one real reference image and smoothly translate the corresponding factor.

We also animate an object in a still image according to the movement of a different object from a reference video. In the two img2gif files, the frames from the reference video on the top are used to extract the z vector that controls object pose and location. On the left, we have a reference image from which shape, background, and texture (p, b, c) information is extracted. These factors are combined by MixNMatch to generate the new images at the bottom.
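The frame-by-frame combination can be summarized by the following sketch. The `encoders` dictionary and `generator` callable are hypothetical placeholders for the trained MixNMatch components, and the argument order (b, c, p, z) simply mirrors the notation G(b,c,p,z) used earlier.

```python
def animate(still_image, video_frames, encoders, generator):
    """Generate one output frame per reference video frame.

    encoders: dict of hypothetical callables under the keys "b", "p", "c", "z".
    generator: hypothetical callable taking (b, c, p, z) and returning an image.
    """
    # Background, shape, and texture come from the still image and stay fixed.
    b = encoders["b"](still_image)
    p = encoders["p"](still_image)
    c = encoders["c"](still_image)
    # Pose and location (z) are re-encoded from every reference frame.
    return [generator(b, c, p, encoders["z"](frame)) for frame in video_frames]

# Stub components that only record which input each factor came from.
stub_encoders = {k: (lambda x, k=k: f"{k}:{x}") for k in ("b", "p", "c", "z")}
stub_generator = lambda b, c, p, z: (b, c, p, z)
frames = animate("still", ["f0", "f1"], stub_encoders, stub_generator)
```

Since each frame is generated independently, nothing in this loop enforces temporal consistency between consecutive outputs.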

Notice how our generated bird follows the pose of the reference video bird well – e.g., it turns around and lifts its head at the end. These results clearly indicate that our model can correctly disentangle pose information from the real images. Since MixNMatch is not trained on any video data and does not use any temporal information, the generated video can be a bit sensitive and unstable in terms of the bird’s shape/size. Still, overall, each generated frame captures the factors from the respective image/video-frame very well to produce a realistic image with the corresponding properties.

Figure 11: Varying a single factor. Real images are indicated with red boxes. For (a-d), the reference images on the left/top provide three/one factors. The center 5x5 images are generations.
Figure 12: Varying a single factor. Real images are indicated with red boxes. For (a-d), the reference images on the left/top provide three/one factors. The center 5x5 images are generations.
Figure 13: Varying a single factor. Real images are indicated with red boxes. For (a-d), the reference images on the left/top provide three/one factors. The center 5x5 images are generations.