MixNMatch: Multifactor Disentanglement and Encoding for Conditional Image Generation
We present MixNMatch, a conditional generative model that learns to disentangle and encode background, object pose, shape, and texture from real images with minimal supervision, for mix-and-match image generation. We build upon FineGAN [singh-cvpr2019], an unconditional generative model, to learn the desired disentanglement and image generator, and leverage adversarial joint image-code distribution matching [donahue-iclr2017, dumoulin-iclr2017] to learn the latent factor encoders. MixNMatch requires bounding boxes during training to model background, but requires no other supervision. Through extensive experiments, we demonstrate MixNMatch’s ability to accurately disentangle, encode, and combine multiple factors for mix-and-match image generation, including sketch2color, cartoon2img, and img2gif applications. Our code/models/demo can be found at https://github.com/Yuheng-Li/MixNMatch
1 Introduction
Consider the real image of the yellow bird in Figure 1, 1st column. What would the bird look like in a different background, say that of the duck? How about in a different texture, perhaps that of the rainbow textured bird in the 2nd column? What if we wanted to keep its texture, but change its shape to that of the rainbow bird, and background and pose to that of the duck, as in the 3rd column? How about sampling shape, pose, texture, and background from four different reference images and combining them to create an entirely new image (last column)?
While research in conditional image generation has made tremendous progress [Isola-cvpr2017, zhu-iccv2017, park-cvpr2019], no existing work can simultaneously disentangle background, object pose, object shape, and object texture with minimal supervision, so that these factors can be combined from multiple real images for fine-grained controllable image generation. Learning disentangled representations with minimal supervision is an extremely challenging problem, since the underlying factors that give rise to the data are often highly correlated and intertwined. Works that disentangle two such factors by taking two reference images as input (e.g., one for appearance and the other for pose) do exist [huang-eccv2018, joo-cvpr18, lee-eccv18, lorenz-cvpr2019, xiao-iccv2019], but they cannot disentangle other factors such as foreground vs. background appearance or pose vs. shape. Since only two factors can be controlled, these approaches cannot arbitrarily change, for example, the object’s background, shape, and texture, while keeping its pose the same. Others require strong supervision in the form of keypoint/pose/mask annotations [peng-iccv2017, balakrishnan-cvpr2018, ma-cvpr2018, esser-cvpr2018], which limits their scalability, and they still fall short of disentangling all four of the factors outlined above.
Our proposed conditional generative model, MixNMatch, aims to fill this void. MixNMatch learns to disentangle and encode background, object pose, shape, and texture latent factors from real images, and importantly, does so with minimal human supervision. This allows, for example, each factor to be extracted from a different real image, and then combined together for mix-and-match image generation; see Fig. 1. During training, MixNMatch only requires a loose bounding box around the object to model background, but requires no other supervision for modeling the object’s pose, shape, and texture.
Our goal of mix-and-match image generation i.e., generating a single synthetic image that combines different factors from multiple real reference images, requires a framework that can simultaneously learn (1) an encoder that encodes latent factors from real images into a disentangled latent code space, and (2) a generator that takes latent factors from the disentangled code space for image generation. To learn the generator and the disentangled code space, we build upon FineGAN [singh-cvpr2019], a generative model that learns to hierarchically disentangle background, object pose, shape, and texture with minimal supervision using information theory. However, FineGAN is conditioned only on latent random codes, and cannot be directly conditioned on real images for image generation. We therefore need a way to extract latent codes that control background, object pose, shape, and texture from real images, while preserving FineGAN’s hierarchical disentanglement properties. As we show in the experiments, a naive extension of FineGAN in which an encoder is trained to map a fake image into the codes that generated it is insufficient due to the domain gap between real and fake images.
To simultaneously achieve the above dual goals, we instead perform adversarial learning, whereby the joint distribution of real images and their extracted latent codes from the encoder, and the joint distribution of sampled latent random codes and corresponding generated images from the generator, are learned to be indistinguishable, similar to ALI [dumoulin-iclr2017] and BiGAN [donahue-iclr2017]. By enforcing matching joint image-code distributions, the encoder learns to produce latent codes that match the distribution of sampled codes with the desired disentanglement properties, while the generator learns to produce realistic images. To further encode a reference image’s shape and pose factors with high fidelity, we augment MixNMatch with a feature mode in which higher-dimensional image features that preserve pixel-level structure, rather than low-dimensional codes, are mapped onto a richer form of the learned disentangled code space, again via distribution matching using adversarial learning.
(1) We introduce MixNMatch, a conditional generative model that learns to disentangle and encode background, object pose, shape, and texture factors from real images with minimal human supervision. This gives MixNMatch fine-grained control in image generation, where each factor can be uniquely controlled. MixNMatch can take as input either real reference images, sampled latent codes, or a mix of both. (2) Through various qualitative and quantitative evaluations, we demonstrate MixNMatch’s ability to accurately disentangle, encode, and combine multiple factors for mix-and-match image generation. Furthermore, we show that MixNMatch’s learned disentangled representation leads to state-of-the-art fine-grained object category clustering results of real images. (3) We demonstrate a number of interesting applications of MixNMatch including sketch2color, cartoon2img, and img2gif.
2 Related work
Conditional image generation
has various forms, including models conditioned on a class label [odena-icml2017, miyato-iclr2018, brock-iclr2019] or text input [reed-icml2016, stackgan2, xu-cvpr2018, yin-cvpr2019]. A lot of work focuses on image-to-image translation, where an image from one domain is mapped onto another domain e.g., [Isola-cvpr2017, zhu-iccv2017, park-cvpr2019]. However, these methods typically lack the ability to explicitly disentangle the factors of variation in the data. Those that do learn disentangled representations focus on specific domains like faces/humans [tran-cvpr2017, peng-iccv2017, bao-cvpr2018, pumarola-eccv2018, balakrishnan-cvpr2018, ma-cvpr2018] or require clearly defined domains (e.g., pose vs. identity or style/attribute vs. content) [joo-cvpr18, huang-eccv2018, lee-eccv18, gonzalez-nips2018, liu-nips2018, xiao-iccv2019]. In contrast, MixNMatch is not specific to any object category, and does not require clearly defined domains as it disentangles multiple factors of variation within a single domain (e.g., natural images of birds). Moreover, unlike most unsupervised methods which can disentangle only two factors like shape and appearance [li-ijcai2018, shu-eccv2018, lorenz-cvpr2019], MixNMatch can disentangle four (background, object shape, pose, and texture).
Disentangled representation learning
aims to disentangle the underlying factors that give rise to real-world data [chen-nips16, yan-eccv16, xing-cvpr2018, li-ijcai2018, shu-eccv2018, tulyakov-cvpr18, hu-cvpr18, karras-cvpr2019, lorenz-cvpr2019]. Most unsupervised methods are limited to disentangling at most two factors like shape and texture [li-ijcai2018, shu-eccv2018]. Others require strong supervision in the form of edge/keypoint/mask annotations or detectors [peng-iccv2017, balakrishnan-cvpr2018, ma-cvpr2018, esser-cvpr2018], or rely on video to automatically acquire identity labels [denton-nips2017, joo-cvpr18, xiao-iccv2019]. The work most related to ours is FineGAN [singh-cvpr2019], which leverages information theory [chen-nips16] to disentangle background, object pose, shape, and texture with minimal supervision. However, it is conditioned only on latent codes, and thus cannot perform image translation. We build upon this work to enable conditioning on real images. Importantly, we show that a naive extension is insufficient, and further improve the quality of our model’s image generations to preserve instance-specific details from the reference images. Since MixNMatch is directly conditioned on images, its learned representation leads to better disentanglement and fine-grained clustering of real images.
3 Approach
Let $\mathcal{I}$ be an unlabeled image collection of a single object category (e.g., birds). Our goal is to learn a conditional generative model, MixNMatch, which simultaneously learns to (1) encode latent background, object pose, shape, and texture factors associated with images in $\mathcal{I}$ into a disentangled latent code space (i.e., where each factor is uniquely controlled by a code), and (2) generate high quality images matching the true data distribution by combining latent factors from the disentangled code space.
We first briefly review FineGAN [singh-cvpr2019], on which we base our generator. We then explain how to train our model to disentangle and encode background, object pose, shape, and texture from real images, so that it can combine different factors from different real reference images for mix-and-match image generation. Lastly, we introduce how to augment our model to preserve object shape and pose information from a reference image with high fidelity (i.e., at the pixel level), while still altering the background and object texture according to their respective reference images.
3.1 Background: FineGAN
FineGAN [singh-cvpr2019] takes as input four randomly sampled latent codes ($z$, $b$, $p$, $c$) to hierarchically generate an image in three stages: (1) a background stage where the model only generates the background, conditioned on latent one-hot background code $b$; (2) a parent stage where the model generates the object’s shape, conditioned on latent one-hot parent code $p$, and stitches it to the existing background image; and (3) a child stage where the model fills in the object’s texture, conditioned on latent one-hot child code $c$. In both the parent and child stages, FineGAN automatically generates masks (without any mask supervision) to capture the appropriate shape and texture details. To disentangle the background, it relies on object bounding boxes (e.g., acquired through an object detector). To disentangle the remaining factors of variation without any supervision, FineGAN uses information theory (similar to InfoGAN [chen-nips16]), and imposes specific constraints on the relationships between the latent codes (detailed in Sec. 3.3). These induce the random noise vector $z$, background code $b$, parent code $p$, and child code $c$ to capture the object pose, background, object shape, and object texture, respectively.
FineGAN is trained with three losses, one for each stage, which combine adversarial training [goodfellow-nips2014] and mutual information maximization [chen-nips16]. We simply denote its full loss as:

$$\mathcal{L}_{finegan} = \mathcal{L}_b + \mathcal{L}_p + \mathcal{L}_c, \tag{1}$$

where $\mathcal{L}_b$, $\mathcal{L}_p$, and $\mathcal{L}_c$ denote the losses in the background, parent, and child stages. For more details on these losses and the FineGAN architecture, please refer to [singh-cvpr2019].
3.2 Paired image-code distribution matching
Although FineGAN can disentangle multiple factors to generate realistic images, it is conditioned on sampled latent codes, and cannot be conditioned on real images. A naive post-processing extension, in which encoders learn to map fake images to the codes that generated them, is insufficient due to the domain gap between real and fake images [singh-cvpr2019], as we show in our experiments.
Thus, in order to encode disentangled representations from real image inputs for conditional mix-and-match image generation, we need a way to extract the $z$ (the random vector, which controls object pose), $b$ (which controls the background), $p$ (which controls object shape), and $c$ (which controls object texture) codes from real images, while preserving the hierarchical disentanglement properties of FineGAN. For this, we propose to train four encoders, each of which predicts its corresponding code from real input images. Since FineGAN has the ability to disentangle factors and generate images given random latent codes, we naturally resort to using it as our generator, by keeping all the losses (i.e., the full FineGAN loss $\mathcal{L}_{finegan}$) in the original framework to help the encoders learn the desired disentanglement.
Specifically, for each real training image $x$, we use the corresponding encoders to extract its codes. However, we cannot simply input these codes to the generator to reconstruct the image, as the model would take a shortcut and degenerate into a simple autoencoder that does not preserve FineGAN’s disentanglement properties (factorization into background, pose, shape, texture), as we show in our experiments. We therefore leverage ideas from ALI [dumoulin-iclr2017] and BiGAN [donahue-iclr2017] to help the encoders learn the inverse mapping; i.e., a projection of real images into the code space, in a way that maintains the desired disentanglement properties.
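The key idea is to perform adversarial learning [goodfellow-nips2014, donahue-iclr2017, dumoulin-iclr2017], so that the paired image-code distribution $P_{(\mathrm{data}, E(x))}$ produced by the encoder and the paired image-code distribution $P_{(G(y), \mathrm{code})}$ produced by the generator are matched. Here $E$ is the encoder, $G$ is the FineGAN generator, and $y$ is a placeholder for the latent codes $z$, $b$, $p$, $c$. $P_{data}$ is the data (real image) distribution and $P_{code}$ is the latent code distribution. (Footnote 1: Following FineGAN [singh-cvpr2019], we use a continuous noise vector $z \sim \mathcal{N}(0, 1)$; a categorical background code $b \sim \mathrm{Cat}(K = N_b, p = 1/N_b)$; a categorical parent code $p \sim \mathrm{Cat}(K = N_p, p = 1/N_p)$; and a categorical child code $c \sim \mathrm{Cat}(K = N_c, p = 1/N_c)$. $N_b$, $N_p$, $N_c$ are the number of background, parent, and child categories and are set as hyperparameters.) Formally, the input to the discriminator $D$ is an image-code pair. When training $D$, we set the paired real image $x$ and code $E(x)$ extracted from the encoder to be real, and the paired sampled input code $y$ and generated image $G(y)$ from the generator to be fake. Conversely, when training $G$ and $E$, we try to fool $D$ so that the paired distributions $P_{(\mathrm{data}, E(x))}$ and $P_{(G(y), \mathrm{code})}$ are indistinguishable, via a paired adversarial loss:

$$\mathcal{L}_{bi\_adv} = \min_{G,E} \max_{D} \; \mathbb{E}_{x \sim P_{data}} \big[\log D(x, E(x))\big] + \mathbb{E}_{y \sim P_{code}} \big[\log\big(1 - D(G(y), y)\big)\big]. \tag{2}$$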
This loss will simultaneously enforce (1) the generated images to look real, and (2) the extracted real image codes to capture the desired factors (i.e., pose, background, shape, appearance). Fig. 2 (a-c) shows our encoders, generator, and discriminators.
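As a rough illustration of how the two sides of the paired adversarial game relate, the sketch below computes both objectives from discriminator outputs. It assumes the discriminator returns probabilities in (0, 1) and uses the standard log-loss form; the function name and the toy inputs are hypothetical, and the supplementary material states that the paired discriminators are actually updated with a Wasserstein objective with gradient penalty, so this is not the authors' training code.

```python
import math

def paired_adversarial_losses(d_real_pairs, d_fake_pairs, eps=1e-12):
    """Return (discriminator_loss, encoder_generator_loss).

    d_real_pairs: D's probabilities for (real image, encoded code) pairs.
    d_fake_pairs: D's probabilities for (generated image, sampled code) pairs.
    """
    # Monte Carlo estimate of E[log D(x, E(x))] + E[log(1 - D(G(y), y))].
    value = (
        sum(math.log(p + eps) for p in d_real_pairs) / len(d_real_pairs)
        + sum(math.log(1.0 - p + eps) for p in d_fake_pairs) / len(d_fake_pairs)
    )
    # D maximizes the value (so it minimizes its negative);
    # the encoder and generator minimize the value itself.
    return -value, value
```

A discriminator that cannot tell the two kinds of pairs apart outputs 0.5 everywhere, which gives a discriminator loss of 2 log 2; this is the equilibrium the encoder and generator push toward.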
3.3 Relaxing the latent code constraints
There is an important issue that we must address to ensure disentanglement in the extracted codes. FineGAN imposes strict code relationship constraints, which are key to inducing the desired disentanglement in an unsupervised way, but which can be difficult to realize in all real images. Specifically, during training, FineGAN constrains the sampled child codes into disjoint groups so that each group has the same unique parent code, and enforces the sampled background and child codes for each generated image to be the same [singh-cvpr2019]. This is because objects often differ in texture conditioned on a shared shape (e.g., different duck species share the same shape but differ in their texture details), and background is often correlated with specific object types (e.g., flying birds typically have sky as background).
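The code constraints described above can be sketched as a small sampling routine. This is only an illustration: it assumes the child categories divide evenly among the parents, that groups are contiguous index ranges, and that the background and child codes share the same number of categories; the function names are hypothetical.

```python
import random

def sample_constrained_codes(num_parents, num_children, rng=random):
    """Sample (background, parent, child) indices under FineGAN-style constraints."""
    # Assumption: children split into equally sized, contiguous groups.
    assert num_children % num_parents == 0, "children must divide evenly among parents"
    children_per_parent = num_children // num_parents
    child = rng.randrange(num_children)
    parent = child // children_per_parent  # every child group maps to one parent
    background = child                     # background code is tied to the child code
    return background, parent, child

def one_hot(index, size):
    """Encode a category index as a one-hot list."""
    return [1.0 if i == index else 0.0 for i in range(size)]
```

With, say, 20 parents and 200 children, each parent owns ten child codes, and the background index always equals the child index.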
However, for any real image, these strict relationships may not hold (e.g., a flying bird with trees in background), and would thus be difficult to enforce in its extracted codes. In this case, the discriminator would easily be able to tell whether the image-code pair is real or fake (based on the code relationships), which will cause issues with learning. It can also confuse the background and texture encoders since the background and child latent codes are always sampled to be the same.
We address this issue in two ways. First, we train four separate discriminators, one for each code type. This prevents any discriminator from seeing the other codes, so it cannot discriminate based on the relationships between codes. Second, when training the encoders, we also provide as input fake images that are generated with randomly sampled codes with the code constraints removed. Specifically, we train the encoders to predict back the sampled codes that were used to generate the corresponding fake image:

$$\mathcal{L}_{code\_pred} = CE\big(E(G(y)), y\big), \tag{3}$$

where $CE(\cdot)$ denotes cross-entropy loss, and $y$ is a placeholder for the latent codes $b$, $p$, $c$. (For continuous $z$, we use L1 loss.) This loss helps to guide each encoder, and in particular the $b$ and $c$ encoders, to learn the corresponding factor. Note that the above loss is used only to update the encoders $E$, as these fake images can have feature combinations that generally do not exist in the real data distribution (e.g., a duck on top of a tree).
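A minimal sketch of the code reprediction objective for a single fake image follows. It assumes the categorical encoders output probability vectors, that the four per-code losses are summed with equal weight, and that the codes are stored in dictionaries keyed by hypothetical names "z", "b", "p", "c"; none of these details are specified in the text.

```python
import math

def cross_entropy(predicted_probs, target_index, eps=1e-12):
    """Cross-entropy between a predicted categorical distribution and a one-hot target."""
    return -math.log(predicted_probs[target_index] + eps)

def l1_loss(predicted, target):
    """Mean absolute error, used here for the continuous noise vector."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

def code_reprediction_loss(predicted, sampled):
    """Sum the per-code losses for one fake image.

    predicted: probability lists under "b", "p", "c" and a float list under "z".
    sampled:   integer indices under "b", "p", "c" and a float list under "z".
    """
    categorical = sum(cross_entropy(predicted[k], sampled[k]) for k in ("b", "p", "c"))
    return categorical + l1_loss(predicted["z"], sampled["z"])
```

Because the fake images are generated with unconstrained codes, the background and child targets differ in general, which is what lets the two encoders separate.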
3.4 Optional feature mode for exact shape and pose
Thus far, MixNMatch’s encoders can take in up to four different real images and encode them into codes which model the background, object pose, shape, and texture, respectively. These codes can then be used by MixNMatch’s generator to generate realistic images, which combine the four factors from the respective reference images. We denote this setting as MixNMatch’s code mode. While the generated images already capture the factors with high accuracy (see Fig. 3, “code mode”), certain image translation applications may require exact pixel-level shape and pose alignment between a reference image and the output.
The main reason that MixNMatch in code mode cannot preserve exact pixel-level shape and pose details of a reference image is that the capacity of the latent code space is too small to model per-instance pixel-level details (typically, tens in dimension for $p$, which is responsible for capturing shape). It must be small because it must roughly match, e.g., the number of unique modes of the corresponding factor. In this section, we introduce MixNMatch’s optional feature mode to address this. Rather than encode a reference image into a low-dimensional code, the key idea is to directly learn a mapping from the image to a higher-dimensional feature space that preserves rich shape and pose (pixel-level) details.
Specifically, we take our learned MixNMatch generator $G$, and use it to train a new shape and pose feature extractor $S$, which takes as input a real image $x$ and outputs feature $S(x)$. Recall that $G$ takes as input a code $y$ to generate the image $G(y)$. Let’s denote an intermediate parent stage feature (which captures shape and pose) from the generator as $\phi(y)$. We use the standard adversarial loss [goodfellow-nips2014] to train $S$ so that the distribution of $S(x)$ matches that of $\phi(y)$ (i.e., only $S$ is learned and $\phi(y)$ is produced from the already trained $G$). Ultimately, this trains $S$ to produce features that match those sampled from the $\phi(y)$ distribution, which already has learned to encode shape and pose. Next, to enforce $S$ to preserve instance-specific shape and pose details of $x$ (i.e., so that the resulting generated image using $S(x)$ is spatially aligned to $x$), we randomly generate fake images using $G$, and for each fake image $G(y)$, we enforce an L1 loss between the feature $\phi(y)$ generated using the sampled codes and the feature $S(G(y))$. See supp. for the network figure.
Once trained, we can use MixNMatch’s feature mode to extract the pixel-aligned pose and shape feature from an input image, and combine it with the background and texture codes extracted from (up to) two reference images, to perform conditional mix-and-match image generation.
4 Experiments
We evaluate MixNMatch’s conditional mix-and-match image generation results, its ability to disentangle each latent factor, and its learned representation for fine-grained object clustering of real images. We also showcase sketch2color, cartoon2img, and img2gif applications.
(1) CUB [wah-tech11]: 11,788 bird images from 200 classes; (2) Stanford Dogs [khosla-FGVC11]: 12,000 dog images from 120 classes; (3) Stanford Cars [krause-DRR2013]: 8,144 car images from 196 classes. We set the prior latent code distributions following FineGAN [singh-cvpr2019]. The only supervision we use is bounding boxes to model background during training.
We compare to a number of state-of-the-art GAN, disentanglement, and clustering methods. For all methods, we use the authors’ public code. The code for SC-GAN [kazemi-wacv2018] only has the unconditional version, so we implement its BiGAN [donahue-iclr2017] variant following the paper details.
We train and generate 128 × 128 images. In feature mode (2nd stage) training, the distribution of the generator’s parent-stage features is learned in the code mode (1st stage) and may not model the entire real feature distribution (e.g., due to mode collapse). Thus, we assume that patch-level features are better modeled, and apply a patch discriminator. For our feature mode, since the predicted object masks are often highly accurate, we can optionally directly stitch the foreground (if only changing background) or background (if only changing texture) from the corresponding reference image. When optimizing Eqn. 2, we add noise to the discriminator $D$ since the sampled $b$, $p$, $c$ are one-hot, while the predicted $b$, $p$, $c$ will never be one-hot. Full training details are in the supp.
4.1 Qualitative Results
Conditional mix-and-match image generation.
We show results on CUB, Stanford Cars, and Stanford Dogs; see Fig. 3. The first three rows show the background, texture, and shape + pose reference (real) images from which our model extracts $b$, $c$, and $p$ & $z$, respectively, while the fourth and fifth rows show MixNMatch’s feature mode and code mode generation results, respectively.
Our feature mode results (4th row) demonstrate how well MixNMatch preserves shape and pose information from the reference images (3rd row), while transferring background and texture information (from 1st and 2nd rows). For example, the generated bird in the second column preserves the exact pose and shape of the bird standing on the pipe (3rd row) and transfers the brownish bark background and rainbow object texture from the 1st and 2nd row images, respectively. Our code mode results (5th row) also capture the different factors from the reference images well, though not as well as the feature mode for pose and shape. Thus, this mode is more useful for applications in which inexact instance-level pose and shape transfer is acceptable (e.g., generating a completely new instance which captures the factors at a high level). Overall, these results highlight how well MixNMatch disentangles and encodes factors from real images, and preserves them in the generation.
Note that here we take both $z$ and $p$ from the same reference image (row 3) in order to perform a direct comparison between the code and feature modes. We next show results of disentangling all four factors, including $z$ and $p$.
Disentanglement of factors.
Here we evaluate how well MixNMatch disentangles each factor (background $b$, texture $c$, pose $z$, shape $p$). Fig. 4 shows our disentanglement of each factor on CUB (results for Dogs and Cars are in the supp.). For each subfigure, the images in the top row and leftmost column (with red borders) are real reference images. The specific factors taken from each image are indicated in the top-left corner; e.g., in (a), pose is taken from the top row, while background, shape, texture are taken from the leftmost column. Note how we can make (a) a bird change poses by varying $z$, (b) change just the background by varying $b$, (c) colorize by varying $c$, and (d) change shape by varying $p$ (e.g., see the duck example in 3rd column).
Latent code interpolation.
In Fig. 5 we encode the $z$, $b$, $p$, $c$ codes from the two real images (first and last columns), linearly interpolate each code, and generate the interpolated images. MixNMatch produces perceptually smooth transitions for each factor, which again suggests that it has learned a highly disentangled latent space [karras-cvpr2019].
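The interpolation step itself is simple; the sketch below shows one way to produce the intermediate codes, assuming each encoded code is a plain list of floats. The function name is hypothetical, and each interpolated code would then be passed to the generator in place of the encoded one.

```python
def interpolate_codes(code_a, code_b, num_steps):
    """Linearly interpolate between two code vectors, endpoints included."""
    assert num_steps >= 2, "need at least the two endpoints"
    steps = []
    for i in range(num_steps):
        t = i / (num_steps - 1)  # t runs from 0.0 (code_a) to 1.0 (code_b)
        steps.append([(1.0 - t) * a + t * b for a, b in zip(code_a, code_b)])
    return steps
```

If both endpoint codes are probability vectors (for example, softmax outputs of a categorical encoder), every interpolated code also sums to one, since it is a convex combination of the two.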
Table 1: Inception Score (IS; higher is better) and FID (lower is better).
|Method|IS (Birds)|IS (Dogs)|IS (Cars)|FID (Birds)|FID (Dogs)|FID (Cars)|
|---|---|---|---|---|---|---|
|Simple-GAN|31.85 ± 0.17|6.75 ± 0.07|20.92 ± 0.14|16.69|261.85|33.35|
|InfoGAN [chen-nips16]|47.32 ± 0.77|43.16 ± 0.42|28.62 ± 0.44|13.20|29.34|17.63|
|LR-GAN [yang-iclr17]|13.50 ± 0.20|10.22 ± 0.21|5.25 ± 0.05|34.91|54.91|88.80|
|StackGANv2 [stackgan2]|43.47 ± 0.74|37.29 ± 0.56|33.69 ± 0.44|13.60|31.39|16.28|
|FineGAN [singh-cvpr2019]|52.53 ± 0.45|46.92 ± 0.61|32.62 ± 0.37|11.25|25.66|16.03|
|MixNMatch (Ours)|53.03 ± 0.55|47.16 ± 0.67|32.68 ± 0.47|12.49|20.98|17.23|
sketch2color / cartoon2img.
MixNMatch can also adapt to other domains not seen during training. Figs. 6 and 7 show results where shape and pose information are taken from sketch / cartoon images. The results also indicate that MixNMatch learns part information, without supervision. For example, in Fig. 7 column 2, it can correctly transfer the black, white, and red part colors to the rubber duck.
img2gif.
MixNMatch can also be used to animate a static image; see Fig. 8 and supp. for a video result.
4.2 Quantitative Results
Image diversity and quality.
We compute Inception Score [salimans-nips16] and FID [FID] over 30K randomly generated images. We condition the generation only on sampled latent codes (by sampling $z$, $b$, $p$, $c$ from their prior distributions; see Footnote 1), and not on real image inputs, for a fair comparison. Table 1 shows that MixNMatch generates diverse and realistic images that are competitive with state-of-the-art unconditional GAN methods.
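For reference, the Inception Score is the exponential of the average KL divergence between each image's predicted class distribution and the marginal class distribution over all images. The sketch below computes it from a list of per-image class probabilities; how those probabilities are obtained (which classifier, whether scores are averaged over splits) is not covered here, and the function name is hypothetical.

```python
import math

def inception_score(class_probs, eps=1e-12):
    """IS = exp(mean over images of KL(p(y|x) || p(y)))."""
    num_images = len(class_probs)
    num_classes = len(class_probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[k] for p in class_probs) / num_images for k in range(num_classes)]
    total_kl = 0.0
    for p in class_probs:
        total_kl += sum(
            p[k] * (math.log(p[k] + eps) - math.log(marginal[k] + eps))
            for k in range(num_classes)
            if p[k] > 0.0  # zero-probability classes contribute nothing to the KL
        )
    return math.exp(total_kl / num_images)
```

The score is high when each image is classified confidently (quality) and the predicted classes are spread evenly across images (diversity); identical predictions for every image give the minimum score of 1.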
Fine-grained object clustering.
We next evaluate MixNMatch’s learned representation for clustering real images into fine-grained object categories. We compare to state-of-the-art unsupervised deep clustering methods: FineGAN [singh-cvpr2019], JULE [yang-cvpr16], and DEPICT [dizaji-iccv17], and their stronger variants [singh-cvpr2019]: JULE-Res50 and DEPICT-Large. For evaluation metrics, we use Normalized Mutual Information (NMI) [xu-sigir03] and Accuracy [dizaji-iccv17], which measures the best mapping between predicted and ground truth labels.
To cluster real images, we use MixNMatch’s $p$ (shape) and $c$ (texture) encoders as fine-grained feature extractors. For each image, we concatenate their L2-normalized penultimate features, and run $k$-means clustering with $k$ = # of ground-truth classes. MixNMatch’s features lead to significantly more accurate clusters than the baselines; see Table 2. JULE and DEPICT focus more on background and rough shape information instead of fine-grained details, and thus have relatively low performance. Although FineGAN performs much better, to extract features for real images, it trains encoders post-hoc on fake images to repredict their corresponding latent codes (as it cannot directly condition its generator on real images) [singh-cvpr2019]. Thus, there is a domain gap to the real image domain. In contrast, MixNMatch’s encoders are trained to extract features from both real and fake images, so it does not suffer from domain differences.
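The feature preparation described above can be sketched as follows, assuming the two encoders' penultimate features are available as lists of floats. The function names are hypothetical; the resulting vectors would then be handed to any k-means implementation.

```python
import math

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / (norm + eps) for v in vector]

def clustering_feature(shape_feature, texture_feature):
    """Concatenate L2-normalized shape and texture features for one image."""
    return l2_normalize(shape_feature) + l2_normalize(texture_feature)
```

Normalizing each part separately before concatenation keeps one encoder from dominating the Euclidean distances used by k-means simply because its activations have a larger scale.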
Shape and texture disentanglement.
In order to quantitatively evaluate MixNMatch’s disentanglement of shape and texture, we propose the following evaluation metric: We randomly sample 5000 image pairs (A, B) and generate new images C, which take texture and background (codes $c$, $b$) from image A, and shape and pose from image B (codes $p$, $z$). If a model disentangles these factors well and preserves them in the generated images, then the position of part keypoints (e.g., beak, tail) in B should be close to that in C, while the texture of those keypoints in A should be similar to that in C; see Fig. 9.
To measure how well shape is preserved, we train a keypoint detector [he-iccv2017] on CUB, and use it to detect 15 keypoints in generated image C. We then calculate the L2-distance (in x,y coordinate space) to the corresponding visible keypoints in image B. To measure how well texture is preserved, for each keypoint in images A and C, we first crop a 16x16 patch centered on it. We then compute the $\chi^2$-distance between the L1-normalized color histograms of the corresponding patches in A and C. See supp. for more details.
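The two measurements can be sketched as below. This is an illustration under assumptions: keypoints are (x, y) tuples with a per-keypoint visibility flag, the texture distance is taken to be a chi-square distance in its symmetric form with a factor of 1/2 (one common convention), and the function names are hypothetical.

```python
import math

def keypoint_distance(keypoints_b, keypoints_c, visible_b):
    """Mean L2 distance between corresponding keypoints, over those visible in B."""
    distances = [
        math.hypot(xb - xc, yb - yc)
        for (xb, yb), (xc, yc), visible in zip(keypoints_b, keypoints_c, visible_b)
        if visible
    ]
    return sum(distances) / len(distances) if distances else 0.0

def chi_square_distance(hist_a, hist_c, eps=1e-12):
    """Chi-square distance between two histograms after L1 normalization."""
    total_a, total_c = sum(hist_a), sum(hist_c)
    p = [h / (total_a + eps) for h in hist_a]
    q = [h / (total_c + eps) for h in hist_c]
    return 0.5 * sum((pi - qi) ** 2 / (pi + qi + eps) for pi, qi in zip(p, q))
```

With this convention the texture distance lies between 0 (identical histograms) and 1 (histograms with no overlapping bins), and lower values mean better preservation for both measures.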
Table 3 (top) shows the results averaged over all 15 keypoints among all 5000 image triplets. We compare to FineGAN [singh-cvpr2019], SC-GAN [kazemi-wacv2018], a generative model that disentangles style (texture) and content (geometrical information), and Deforming AE [shu-eccv2018], a generative autoencoder that disentangles shape and texture from real images via unsupervised deformation constraints. Fig. 9 shows qualitative comparisons. The results clearly indicate that MixNMatch can better disentangle and preserve shape and texture compared to the baselines. SC-GAN does not explicitly differentiate background and foreground and uses a condensed code space to model content and style, so it has difficulty transferring texture and shape accurately. Deforming AE fails because its assumption that an image can be factorized into a canonical template and a deformation field is difficult to realize in complicated shapes such as birds in CUB. Finally, FineGAN performs better than these methods, but it again is hindered by the domain gap.
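Table 3: Shape (keypoint L2-distance) and texture (color histogram distance) disentanglement; lower is better.
|Method|Shape|Texture|
|---|---|---|
|Deforming AE [shu-eccv2018]|69.97|0.792|
|Code mode w/o paired adv loss| | |
|Code mode w/o code reprediction|47.28|0.708|
|Code mode w/ code constraint|43.26|0.592|
|Feature mode w/o L1 loss|57.82|0.626|
|Feature mode w/o adv loss|23.86|0.602|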
Finally, we study MixNMatch’s various components: 1) no paired image-code adversarial loss, where we drop Eqn. 2 and instead directly feed the predicted codes from the encoders to the generator, applying an L1 loss between the generated and real images; 2) without code reprediction loss, where we do not apply Eqn. 3; 3) with code reprediction loss but with code constraints, where we keep FineGAN’s code constraints when generating fake images; 4) without feature mode L1 loss, where we only apply an adversarial loss between the extracted features and the generator’s parent-stage features; 5) without feature mode adversarial loss, where we only have the L1 loss in feature mode training.
Table 3 (bottom) shows that all losses are necessary in code mode (first stage) training; otherwise, disentanglement cannot be learned properly. In feature mode (second stage) training, both adversarial and L1 losses are helpful, as they adapt the model to the real image domain to extract precise shape + pose information from the reference image.
There are some limitations worth discussing. First, our generated background can miss large structures, as we use a patch-level background discriminator. Second, the feature mode (second stage) training depends on, and is sensitive to, how well the model is trained in the code mode (first stage). Third, the pose code $z$ controls the size of the object in addition to pose, and this may make the projected shape in the generated image look like it does not match the shape reference image (e.g., a big bird seen from far away will look small). Finally, for reference images whose background and object texture are very similar, our model can fail to produce a good object mask, and thus generate an incomplete object.
This work was supported in part by NSF CAREER IIS-1751206, IIS-1748387, AWS ML Research Award, Adobe Data Science Research Award, and Google Cloud Platform research credits.
In this supplementary material, we first describe key training details. Next, we elaborate on our model’s feature mode (second stage) training. Then, in Sec. 3, we discuss the usage of bounding box annotations during training for background generation. In Sec. 4, we provide details on texture disentanglement, and report shape and texture disentanglement results for all 15 keypoints for all methods. In addition, we compare the performance of our model in its two modes (code and feature). Finally, in the last two sections, we show more qualitative disentanglement results and discuss the video clips, which further demonstrate the disentanglement ability of our model.
1 Training details
We optimize our model using Adam with learning rate , , for 600 epochs. Following FineGAN [singh-cvpr2019], we crop all the images to 1.5× the size of their available bounding boxes.
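The cropping step can be illustrated with a short sketch. It assumes that the factor of 1.5 scales the width and height of the box about its center and that the result is clamped to the image; the function name `expand_box` and the `(x0, y0, x1, y1)` box layout are hypothetical and not taken from the released code.

```python
def expand_box(box, image_size, scale=1.5):
    """Scale an (x0, y0, x1, y1) box about its center and clamp it to the image.

    Assumption: the scale factor applies to both width and height.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    center_x, center_y = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = (x1 - x0) * scale / 2.0
    half_h = (y1 - y0) * scale / 2.0
    return (
        max(0.0, center_x - half_w),
        max(0.0, center_y - half_h),
        min(float(width), center_x + half_w),
        min(float(height), center_y + half_h),
    )
```

Clamping keeps the enlarged crop inside the image when the object sits near a border.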
As mentioned in the main paper, in our code mode (first stage) training, we use four paired discriminators to help the encoders learn disentanglement. Each paired discriminator has two initial branches of convolution blocks, which process the code and the image, respectively. Their outputs are then concatenated and fed into a series of convolution blocks to predict whether the input image-code pair is real or fake (during training, we treat the image-code pair from the encoders as real, and the image-code pair from the generator as fake). In the code branch, we add Gaussian noise after each activation layer to prevent the discriminator from trivially recognizing that the one-hot code in an image-code pair from the generator is fake (since the code predicted by the encoders will never be exactly one-hot). We update the paired discriminators using Wasserstein GAN with gradient penalty [gulrajani-nips17].
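The noise injection in the code branch can be sketched as follows. This is a simplified illustration, not the actual architecture: dense layers stand in for the convolution blocks, and the leaky ReLU activation, the noise scale `sigma`, and all function names are assumptions.

```python
import random


def linear(vector, weights, bias):
    """Dense layer; `weights` holds one row of coefficients per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def leaky_relu(vector, slope=0.2):
    """Element-wise leaky ReLU (the activation choice is an assumption)."""
    return [x if x > 0 else slope * x for x in vector]


def noisy_code_branch(code, layers, sigma, rng):
    """Pass a code through the layers, adding Gaussian noise after each activation."""
    hidden = list(code)
    for weights, bias in layers:
        hidden = leaky_relu(linear(hidden, weights, bias))
        # The noise blurs the gap between exactly one-hot codes (generator side)
        # and soft predicted codes (encoder side).
        hidden = [x + rng.gauss(0.0, sigma) for x in hidden]
    return hidden
```

With `sigma=0.0` the branch is deterministic; with a positive `sigma`, a one-hot input no longer produces a uniquely identifiable activation pattern.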
|Deforming AE [shu-eccv2018]||SC-GAN [kazemi-wacv2018]||FineGAN [singh-cvpr2019]||MixNMatch (c)||MixNMatch (f)|
2 Feature mode details
Fig. 10 shows the architecture of our feature mode (second stage) training, in which we only train a shape and pose feature extractor. Concretely, we fix the trained code mode (first stage) MixNMatch generator and treat it as a provider of the real feature distribution. We then obtain shape and pose codes in one of two ways, chosen with equal probability: we either randomly sample them from their prior distributions (categorical and normal, respectively), or predict them with the trained encoders from randomly sampled real images. We feed these codes into the fixed parent stage generator to get an intermediate feature (the choice of which generator output to use follows [singh-cvpr2019]). As this feature is the output of the parent stage generator, it only contains shape and pose information. Thus, by applying an adversarial loss on the feature extractor so that the features it extracts match the distribution of these generator features, we can extract shape and pose information from real images.
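The way codes are drawn for the fixed generator can be sketched as below. The sketch assumes a uniform categorical prior (giving a one-hot shape code) and a standard normal prior for the pose code; `encode_shape` and `encode_pose` are hypothetical stand-ins for the trained encoders, and the dimensions are illustrative only.

```python
import random


def sample_prior_codes(num_shapes, pose_dim, rng):
    """Draw a one-hot shape code (categorical) and a Gaussian pose code."""
    index = rng.randrange(num_shapes)
    shape_code = [1.0 if i == index else 0.0 for i in range(num_shapes)]
    pose_code = [rng.gauss(0.0, 1.0) for _ in range(pose_dim)]
    return shape_code, pose_code


def draw_codes(real_images, encode_shape, encode_pose, num_shapes, pose_dim, rng):
    """With probability 0.5 sample from the priors, otherwise encode a real image."""
    if rng.random() < 0.5:
        return sample_prior_codes(num_shapes, pose_dim, rng)
    image = rng.choice(real_images)
    return encode_shape(image), encode_pose(image)
```

Mixing both sources exposes the feature extractor's target distribution to codes the generator was trained on as well as to codes that real images actually produce.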
As mentioned in the main paper, we use a patch discriminator for this feature mode (second stage) training; specifically, we use a patch size of x . Finally, in order to preserve instance-specific shape and pose details, we also generate fake images using our pretrained MixNMatch generator and record the intermediate parent stage feature produced for each of them. Then, we feed each fake image into the feature extractor and apply an L1 loss between the resulting feature and the corresponding generator feature. In summary, the loss used to train the feature extractor combines this L1 term with the adversarial term computed by the feature discriminator.
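One plausible way to write the objective described above is sketched here. All notation is assumed for illustration: $S$ is the feature extractor, $D_f$ the feature discriminator, $y$ the latent codes, $\phi(y)$ the intermediate parent stage feature for those codes, $G(y)$ the generated image, and $x$ a real image. The standard minimax GAN form and the unweighted sum of the two terms are also assumptions, not statements about the paper's exact formulation.

```latex
% Hypothetical notation; the GAN form and equal weighting are assumptions.
\begin{align}
\mathcal{L}_{S} &= \mathcal{L}_{adv} + \mathcal{L}_{L1}, \\
\mathcal{L}_{adv} &= \min_{S} \max_{D_f} \;
    \mathbb{E}_{y}\big[\log D_f(\phi(y))\big]
    + \mathbb{E}_{x}\big[\log\big(1 - D_f(S(x))\big)\big], \\
\mathcal{L}_{L1} &= \min_{S} \;
    \mathbb{E}_{y}\big[\, \lVert S(G(y)) - \phi(y) \rVert_{1} \big].
\end{align}
```

The adversarial term pulls features of real images toward the generator's feature distribution, while the L1 term ties each generated image to the exact feature that produced it.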
3 Background modeling
As mentioned in the main paper, we only use bounding box annotations during training to model the background. Since we do not have any background training images without the object-of-interest (e.g., trees without bird), for each training image, we treat patches that are completely outside of the bounding box annotated (object) region as being the “real” background patches. We then train the background generator to generate realistic background images, by applying a patch-level background discriminator using the adversarial loss, following [singh-cvpr2019].
Once our model is trained, we do not need any bounding box annotations for image generation.
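The rule for selecting "real" background patches described above can be sketched as follows. The grid layout, `patch_size`, `stride`, and the function names are assumptions made for illustration; in the actual model, the patches correspond to the receptive fields of the patch-level discriminator.

```python
def outside_box(patch, box):
    """True if the patch and the box share no area (both are x0, y0, x1, y1)."""
    px0, py0, px1, py1 = patch
    bx0, by0, bx1, by1 = box
    return px1 <= bx0 or px0 >= bx1 or py1 <= by0 or py0 >= by1


def background_patches(image_size, box, patch_size, stride):
    """List grid patches that lie completely outside the object bounding box."""
    width, height = image_size
    patches = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patch = (left, top, left + patch_size, top + patch_size)
            if outside_box(patch, box):
                patches.append(patch)
    return patches
```

Any patch that touches the annotated region is discarded, so the discriminator never sees object pixels labeled as real background.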
4 Shape & texture disentanglement evaluation
We first elaborate on how we evaluate texture disentanglement in Sec. 4.2 of the main paper. Recall that our goal is to take the texture and background codes from image A and the shape and pose codes from image B to generate a new image C. To measure how well texture information is disentangled and preserved in the generated image C, we first compute 50 RGB cluster centers from 50,000 randomly sampled pixels drawn from 1000 images (50 pixels per image) of the CUB dataset [wah-tech11]. We then run our pre-trained keypoint detector and crop a 16×16 patch centered on each keypoint from images A and C. For each patch, we compute its histogram representation by assigning each pixel to one of the color centers. Finally, we calculate the χ²-distance between the L1-normalized color histograms of the patch in image A and the corresponding patch in image C. Since images A and B can have different poses and hence different occluded parts, we only consider keypoints that are visible in both images.
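The per-keypoint histogram comparison can be sketched as follows. The sketch assumes the chi-squared histogram distance in its common form, 0.5 · Σ (a − b)² / (a + b), and nearest-center assignment by squared Euclidean distance in RGB space; the function names are hypothetical.

```python
def nearest_center(pixel, centers):
    """Index of the color center closest to an (r, g, b) pixel."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(pixel, centers[k])),
    )


def color_histogram(pixels, centers):
    """L1-normalized histogram of pixel-to-center assignments."""
    counts = [0.0] * len(centers)
    for pixel in pixels:
        counts[nearest_center(pixel, centers)] += 1.0
    total = sum(counts)
    return [count / total for count in counts] if total else counts


def chi_square_distance(hist_a, hist_b):
    """0.5 * sum((a - b)^2 / (a + b)), skipping bins that are empty in both."""
    return 0.5 * sum(
        (a - b) ** 2 / (a + b) for a, b in zip(hist_a, hist_b) if a + b > 0
    )
```

For a given keypoint, `pixels` would hold the 256 pixels of the 16×16 patch, and the distance is computed between the histograms of the patches from images A and C.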
Next, in Table 4, we evaluate shape and texture disentanglement for all 15 keypoints. MixNMatch consistently outperforms the baselines for all keypoints. Our feature mode has the best performance for shape disentanglement due to its ability to preserve instance-specific shape and pose details. In contrast, our code mode has the best performance for texture disentanglement. One reason that the feature mode texture disentanglement result is slightly worse than that of code mode is that MixNMatch in feature mode can sometimes generate suboptimal masks (due to very similar background and object texture in the shape and pose reference images), leading to incomplete image generations.
5 Additional results of varying a single factor
Figs. 11, 12, and 13 show additional disentanglement results of varying each factor for CUB, Dogs and Cars, respectively. These results supplement Fig. 4 from the main paper. In each sub-figure, images in the red boxes are real and we only change one factor indicated in the top left corner for generating the new images.
6 Video results
Finally, we include two videos demonstrating the disentanglement learned by MixNMatch. In MixNMatch.mp4, the four reference images on the top are real images which provide the four factors (background, shape, texture, and pose, respectively). The generated image is shown at the bottom. Each time we change one real reference image and smoothly translate the corresponding factor.
We also animate an object in a still image according to the movement of a different object from a reference video. In the two img2gif files, the frames from the reference video at the top are used to extract the pose code, which controls object pose and location. On the left, we have a reference image from which the shape, background, and texture codes are extracted. These factors are combined by MixNMatch to generate the new images at the bottom.
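The img2gif procedure can be sketched as a simple loop. The `encoders` dictionary and the `generate` callable are hypothetical stand-ins for the trained MixNMatch encoders and generator; their names and call signatures are assumptions.

```python
def animate(still_image, video_frames, encoders, generate):
    """Re-pose the object of a still image using pose codes from video frames."""
    # Background, shape, and texture are extracted once from the still image.
    background = encoders["background"](still_image)
    shape = encoders["shape"](still_image)
    texture = encoders["texture"](still_image)
    frames = []
    for frame in video_frames:
        pose = encoders["pose"](frame)  # only the pose code comes from the video
        frames.append(generate(background, pose, shape, texture))
    return frames
```

Each output frame is generated independently of the others, so no temporal information is used.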
Notice how our generated bird follows the pose of the reference video bird well – e.g., it turns around and lifts its head at the end. These results clearly indicate that our model can correctly disentangle pose information from the real images. Since MixNMatch is not trained on any video data and does not use any temporal information, the generated video can be a bit sensitive and unstable in terms of the bird’s shape/size. Still, overall, each generated frame captures the factors from the respective image/video-frame very well to produce a realistic image with the corresponding properties.