Cycle In Cycle Generative Adversarial Networks for Keypoint-Guided Image Generation
In this work, we propose a novel Cycle In Cycle Generative Adversarial Network (C$^2$GAN) for the task of keypoint-guided image generation. The proposed C$^2$GAN is a cross-modal framework that jointly exploits keypoint and image data in an interactive manner. C$^2$GAN contains two different types of generators, i.e., a keypoint-oriented generator and an image-oriented generator. The two are mutually connected in an end-to-end learnable fashion and explicitly form three cycled sub-networks, i.e., one image generation cycle and two keypoint generation cycles. Each cycle not only aims at reconstructing the input domain but also produces useful output that is involved in the generation of another cycle. In this way, the cycles constrain each other implicitly, which provides complementary information from the two modalities and brings extra supervision across cycles, thus facilitating more robust optimization of the whole network. Extensive experimental results on two publicly available datasets, i.e., Radboud Faces (Langner et al., 2010) and Market-1501 (Zheng et al., 2015), demonstrate that our approach generates more photo-realistic images than state-of-the-art models.
1. Introduction
Humans have the ability to convert objects or scenes into another form just by imagining them, while this task is difficult for machines. For instance, we can easily generate mental images of different facial expressions and human poses. In this paper, we study how to enable machines to perform image-to-image translation tasks, which have many application scenarios, such as human-computer interaction, entertainment, virtual reality and data augmentation. One important benefit of this task is that it can augment training data by generating diverse images from given input images, which could in turn be employed to improve other recognition or detection tasks.
However, the task is still challenging since: (i) it needs to handle complex backgrounds with different illumination conditions, objects and occlusions; (ii) it needs a high-level semantic understanding of the mapping between the input images and the output images, since the objects in the inputs may have arbitrary poses, sizes, locations and self-occlusions. Recently, Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have shown the potential to solve this difficult task; they can be used, for instance, to convert a face with a neutral expression into different expressions or to transfer a person from a specific pose into different poses. GANs have produced promising results in many generative tasks, such as photo-realistic image generation (Isola et al., 2017; Brock et al., 2018; Karras et al., 2019; Zhu et al., 2017), video generation (Chan et al., 2018; Siarohin et al., 2019; Mathieu et al., 2016; Vondrick et al., 2016; Yan et al., 2017; Wang et al., 2018a), text generation (Yu et al., 2017), audio generation (Van Den Oord et al., 2016) and image inpainting (Dolhansky and Canton Ferrer, 2018; Zhang et al., 2019).
Recent works have developed powerful image translation systems, e.g., Pix2pix (Isola et al., 2017) and Pix2pixHD (Wang et al., 2018b), in supervised settings where image pairs are required. However, paired training data are usually difficult and expensive to obtain. To tackle this problem, CycleGAN (Zhu et al., 2017), DualGAN (Yi et al., 2017) and DiscoGAN (Kim et al., 2017) provide an interesting insight: the models can learn the mapping from one image domain to another with unpaired data. However, these models become inefficient when many domains are involved. For instance, with $m$ different image domains, CycleGAN, DiscoGAN and DualGAN need to train $m(m-1)$ generators and discriminators, while Pix2pix has to train $m(m-1)$ generator/discriminator pairs. Recently, Anoosheh et al. propose ComboGAN (Anoosheh et al., 2018), which only needs to train $m$ generator/discriminator pairs for $m$ different image domains. Tang et al. (Tang et al., 2018) propose GGAN, in which a dual generator and a discriminator can perform unpaired image-to-image translation for multiple domains. In addition, Choi et al. (Choi et al., 2018) propose StarGAN, in which a single generator/discriminator pair can perform unpaired image-to-image translation for multiple domains. While the computational complexity of StarGAN is $\Theta(1)$ with respect to the number of domains, this model is not effective in handling some specific image-to-image translation tasks such as person image generation (Ma et al., 2017; Siarohin et al., 2018) and hand gesture generation (Tang et al., 2018), in which image generation could involve infinitely many image domains, since human bodies and hand gestures in the wild can have arbitrary poses, sizes, appearances and locations.
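The efficiency argument can be made concrete by counting how many generator/discriminator pairs each family of methods must train. The following is a minimal sketch, assuming $m$ image domains and one dedicated pair per ordered translation direction for the pairwise methods; the function names are illustrative and not taken from any of the cited implementations.

```python
def pairwise_translation_models(m: int) -> int:
    """Generator/discriminator pairs when each ordered direction between
    two domains has its own pair, so the cost grows quadratically with m."""
    if m < 2:
        raise ValueError("need at least two domains")
    return m * (m - 1)


def per_domain_models(m: int) -> int:
    """One generator/discriminator pair per domain (ComboGAN-style)."""
    return m


def shared_models(m: int) -> int:
    """A single generator/discriminator pair for all domains (StarGAN-style)."""
    return 1


for m in (2, 5, 10):
    print(m, pairwise_translation_models(m), per_domain_models(m), shared_models(m))
```

With ten domains the pairwise approach already needs 90 pairs, which is why the count matters for tasks where poses cannot be enumerated as a small set of domains.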
To address these limitations, several works have been proposed to generate images based on object keypoints or human skeletons. Keypoints and skeletons carry information about object shape and position, which can be used to produce more photo-realistic images. For instance, Reed et al. (Reed et al., 2016a) propose the GAWWN model, which generates bird images conditioned on both text descriptions and object locations. Qiao et al. (Qiao et al., 2018) present GCGAN to generate facial expressions conditioned on the geometry information of facial landmarks. Song et al. (Song et al., 2018) propose G2GAN for facial expression synthesis. Siarohin et al. (Siarohin et al., 2018) introduce PoseGAN for pose-based human image generation. Tang et al. (Tang et al., 2018) propose GestureGAN for skeleton-guided hand gesture generation. Ma et al. (Ma et al., 2017) propose PG, which can generate person images using a conditional image and a target pose. An illustrative comparison among PG (Ma et al., 2017), PoseGAN (Siarohin et al., 2018) and the proposed C$^2$GAN is shown in Fig. 2. PG generates person images using the target keypoints, whereas PoseGAN needs both the target keypoints and the original keypoints as conditional inputs. Both methods only employ keypoint information as input guidance.
Current state-of-the-art keypoint-guided image translation methods such as PG (Ma et al., 2017) and PoseGAN (Siarohin et al., 2018) have two main issues: (i) both translate directly from an original domain to a target domain without considering the mutual translation between the two, although translation across different modalities in a joint network would bring rich cross-modal information; (ii) both simply employ the keypoint information as an input reference to guide the generation, without using generated keypoint information as a supervisory signal to further improve the network optimization. Both issues lead to unsatisfactory results.
To address these limitations, we propose a novel Cycle In Cycle Generative Adversarial Network (C$^2$GAN), in which three cycled sub-networks are explicitly formed to learn image translation across modalities in a unified network structure.
We have a basic image cycle, i.e., I2I2I, which aims at reconstructing the input and further refining the generated images. The keypoint information in C$^2$GAN is not only utilized as input guidance but also acts as output, meaning that the keypoint is also a generative objective. The input and output keypoints are connected by two keypoint cycles, i.e., K2G2K and K2R2K, each of which chains the image generator and the keypoint generator. In this way, the keypoint cycles can provide weak supervision to the generated images. The intuition behind the keypoint cycles is that if the generated keypoint is very close to the real keypoint, then the corresponding images should be similar. In other words, better keypoint generation will boost the image generation, and conversely the improved image generation can facilitate the keypoint generation. These three cycles inherently constrain each other during network optimization in an end-to-end training fashion.
Moreover, to better optimize the three cycles, we propose two novel cycle losses, i.e., an Image Cycle-consistency loss (IC) and a Keypoint Cycle-consistency loss (KC). With these cycle losses, the cycles can benefit from each other in joint learning. We also propose two cross-modal discriminators corresponding to the generators. We conduct extensive experiments on two different keypoint-guided image generation tasks, i.e., landmark-guided facial expression generation and keypoint-guided person pose generation. Extensive experimental results demonstrate that C$^2$GAN yields superior performance compared with state-of-the-art approaches.
In summary, the contributions of this paper are three-fold:
(i) We propose a novel cross-modal generative adversarial network named Cycle In Cycle Generative Adversarial Network (C$^2$GAN) for the keypoint-guided image generation task, which organizes the keypoint and the image data in an interactive generation manner in a joint deep network, instead of using the keypoint information only as guidance for the input.
(ii) The cycle-in-cycle structure is a new network design that explores effective utilization of cross-modal information for the keypoint-guided image generation task. The designed cycled sub-networks connect different modalities and implicitly constrain each other, leading to extra supervision signals for better image generation. We also investigate cross-modal discriminators and cycle losses for more robust network optimization.
(iii) Extensive results on two challenging tasks, i.e., landmark-guided facial expression generation and keypoint-guided person pose generation, demonstrate the effectiveness of the proposed C$^2$GAN and show more photo-realistic image generation compared with existing competing models.
2. Related Work
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have shown the capability of generating high-quality images (Wang and Gupta, 2016; Karras et al., 2018; Gulrajani et al., 2017; Brock et al., 2018; Karras et al., 2019). Although GANs are successful in many tasks, they also face many challenges, such as how to control the content of the generated images. To generate meaningful images that meet user requirements, the Conditional GAN (CGAN) (Mirza and Osindero, 2014) is proposed, where conditional information is employed to guide the image generation process. A CGAN model combines a vanilla GAN with external information, such as discrete class labels or tags (Odena, 2016; Perarnau et al., 2016; Duan et al., 2019), text descriptions (Reed et al., 2016a; Mansimov et al., 2015; Reed et al., 2016b), semantic maps (Regmi and Borji, 2018; Wang et al., 2018b; Tang et al., 2019b; Park et al., 2019), conditional images (Isola et al., 2017), object masks (Mo et al., 2019) or attention maps (Tang et al., 2019a; Ma et al., 2018; Chen et al., 2018; Mejjati et al., 2018). However, existing CGANs synthesize images based on global constraints such as a class label, text description or facial attribute; they do not provide control over pose, object location or object shape.
Image-to-Image Translation models use input-output data to learn a parametric mapping between inputs and outputs. For example, Isola et al. (Isola et al., 2017) propose Pix2pix, which employs a CGAN to learn a mapping function from input to output image domains. Wang et al. (Wang et al., 2018b) introduce the Pix2pixHD model for synthesizing high-resolution images from semantic label maps. However, most real-world tasks suffer from having few or no paired input-output samples available. To overcome this limitation, the unpaired image-to-image translation task has been proposed. Different from the prior works, unpaired image-to-image translation learns the mapping function without requiring paired training data (Zhu et al., 2017; Taigman et al., 2017; Tang et al., 2018; Yi et al., 2017; Tang et al., 2019a; Kim et al., 2017; Zhou et al., 2017; Anoosheh et al., 2018). For instance, Zhu et al. (Zhu et al., 2017) introduce the CycleGAN framework, which achieves unpaired image-to-image translation using the cycle-consistency loss. DualGAN is presented in (Yi et al., 2017), in which image translators are trained from two unlabeled image sets, each representing an image domain. Kim et al. (Kim et al., 2017) propose a method based on GANs that learns to discover relations between different domains.
However, existing paired and unpaired image translation approaches are inefficient and ineffective, as discussed in the introduction. Most importantly, these approaches cannot handle some specific image-to-image translation tasks such as person image generation (Ma et al., 2017; Siarohin et al., 2018), which could involve infinitely many image domains since a person can have arbitrary poses, sizes, appearances and locations in the wild.
Keypoint-guided Image-to-Image Translation. To address these limitations, several works (Ma et al., 2017; Siarohin et al., 2018; Reed et al., 2016c; Di et al., 2018; Yan et al., 2017) have been proposed to generate images based on object keypoints. For instance, Di et al. (Di et al., 2018) propose GPGAN to synthesize faces based on facial landmarks. Reed et al. (Reed et al., 2016c) present a PixelCNN model to generate images conditioned on part keypoints and text descriptions. Korshunova et al. (Korshunova et al., 2017) use facial keypoints to define the affine transformations of the alignment and realignment steps for face swapping. Wang et al. (Wang et al., 2018) propose CMM-Net for landmark-guided smile generation. Sun et al. (Sun et al., 2018) propose a two-stage framework that performs head inpainting conditioned on the facial landmarks generated in the first stage. Chan et al. (Chan et al., 2018) propose a method to transfer motion between human subjects in different videos based on pose stick figures. Yan et al. (Yan et al., 2017) propose a method to generate human motion sequences with a simple background using a CGAN and human skeleton information.
The aforementioned approaches focus on a single image generation task. In this paper, by contrast, we propose a novel Cycle In Cycle Generative Adversarial Network (C$^2$GAN), a multi-task model that handles two different tasks, i.e., image and keypoint generation, using one single network. During the training stage, the two tasks are mutually constrained by three cycles and benefit from each other. To the best of our knowledge, the proposed model is the first attempt to generate both the image and the keypoint domain in an interactive generation manner within a unified cycle-in-cycle GAN framework for the keypoint-guided image translation task. Training GANs is a complicated optimization task, and incorporating adversarial keypoint generation in training provides extra deep supervision to the image generation network compared to using supervision only from the image domain, thus facilitating the network optimization. Moreover, the keypoint generation aims not only to approximate the ground truth output but also to fool the discriminator, meaning that the generated keypoints should represent a real face or a real person pose. The correlations between these keypoints can be learned in the adversarial setting.
3. Cycle In Cycle GAN (C$^2$GAN)
We now present the proposed Cycle In Cycle Generative Adversarial Network (C$^2$GAN). First, we introduce the network structures of the three image and keypoint cycles and describe the corresponding generators and cross-modal discriminators. Second, we present the objective functions proposed for better optimization of the model. Finally, we introduce the implementation details of the whole model and the training procedure.
3.1. Model Overview
The goal of the proposed C$^2$GAN is to learn two different generators in one single network, i.e., a keypoint generator and an image generator. The two generators are mutually connected through three generative adversarial cycles, i.e., one image-oriented cycle and two keypoint-oriented cycles. In the training stage, all the cycled sub-networks are jointly optimized in an end-to-end fashion, and the generators benefit from each other due to the richer cross-modal information and the cross-cycle supervision. The core framework of the proposed C$^2$GAN is illustrated in Fig. 3. In the following, we describe the structure of the proposed C$^2$GAN in detail.
3.2. Image-Domain Generative Adversarial Cycle
I2I2I Cycle. The goal of the image cycle I2I2I is to (i) generate the target image from the input conditional image and the target keypoint, and then (ii) reconstruct the input image from the generated image and the keypoint of the input image. The I2I2I cycle therefore maps the input image and the target keypoint to a generated image, and then maps the generated image and the original keypoint back to a reconstruction of the input image.
Both steps use the same image generator. Previous works such as PG (Ma et al., 2017) and PoseGAN (Siarohin et al., 2018) have only the forward mapping from the input image to the target image. StarGAN (Choi et al., 2018) uses the target and original domain labels as conditional information to recover the input image. However, StarGAN can only handle tasks with a fixed number of categories, whereas the person pose generation task could involve infinitely many image domains, since a person in the wild can have arbitrary poses, sizes, appearances and locations. To overcome this limitation, we replace the two domain labels in StarGAN with the target keypoint and the original keypoint. Following PG, we represent the keypoints as heatmaps. We concatenate the input image and the target keypoint and feed them into the image generator to generate the target image. Next, we concatenate the generated image and the keypoint of the input image as inputs of the same generator to reconstruct the original image. In this way, forward and backward consistency is further enforced.
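To make the heatmap representation and the concatenation step concrete, here is a minimal sketch in plain Python. It assumes one binary channel per keypoint with a small disk around each location; the `radius` value, the function names and the nested-list image layout `[channel][row][col]` are illustrative assumptions, not details taken from the paper or from PG.

```python
def keypoints_to_heatmaps(keypoints, height, width, radius=2):
    """Return one binary height x width map per keypoint.

    keypoints: list of (row, col) tuples, or None for an undetected point.
    radius is a hypothetical choice, not a value taken from the paper.
    """
    heatmaps = []
    for point in keypoints:
        channel = [[0.0] * width for _ in range(height)]
        if point is not None:
            r0, c0 = point
            for r in range(max(0, r0 - radius), min(height, r0 + radius + 1)):
                for c in range(max(0, c0 - radius), min(width, c0 + radius + 1)):
                    if (r - r0) ** 2 + (c - c0) ** 2 <= radius ** 2:
                        channel[r][c] = 1.0
        heatmaps.append(channel)
    return heatmaps


def concat_channels(image_channels, heatmap_channels):
    """Stack image channels and keypoint channels into one generator input."""
    return list(image_channels) + list(heatmap_channels)


image = [[[0.5] * 8 for _ in range(8)] for _ in range(3)]  # toy 3 x 8 x 8 image
maps = keypoints_to_heatmaps([(2, 3), None], height=8, width=8, radius=1)
generator_input = concat_channels(image, maps)
print(len(generator_input))  # 3 image channels + 2 keypoint channels
```

The same concatenation is used twice in the cycle: once with the target keypoint for generation, and once with the original keypoint for reconstruction.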
Image Generator. We use the U-net architecture (Ronneberger et al., 2015) for our image generator. U-net comprises an encoder and a decoder with skip connections between them. We use the image generator twice: once for generating the target image and once for reconstructing the input image. To reduce model capacity, the generator shares parameters between image generation and reconstruction. For image generation, the goal of the generator is to produce an image, conditioned on the target keypoint image, that is similar to the real target image. For image reconstruction, the goal of the generator is to recover an image that looks close to the input image. By sharing parameters, the generator learns a combined data distribution over generation and reconstruction, which means that it receives twice as much data during optimization as generators without the parameter-sharing strategy.
Cross-modal Image Discriminator. Different from previous works such as PG (Ma et al., 2017), which employs a single-modal discriminator, we propose a novel cross-modal discriminator that receives both keypoint and image data as input, namely two images and one keypoint. More specifically, during the image generation stage the image discriminator aims to distinguish between the generated triplet (input image, target keypoint, generated image) and the real triplet (input image, target keypoint, real target image). We also propose an image adversarial loss based on the vanilla adversarial loss (Goodfellow et al., 2014). The image generator tries to minimize this loss while the image discriminator tries to maximize it. A similar image adversarial loss is defined for the image reconstruction mapping.
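The vanilla adversarial loss that the image adversarial loss builds on can be sketched as follows. This is a minimal illustration under the assumption that the cross-modal discriminator outputs one probability in (0, 1) per triplet; the function name and the `eps` stabilizer are not from the paper.

```python
import math


def adversarial_loss(d_real, d_fake, eps=1e-12):
    """Vanilla GAN value: mean log D(real) + mean log(1 - D(fake)).

    d_real and d_fake are discriminator outputs in (0, 1), one per triplet.
    The discriminator maximizes this value; the generator minimizes it.
    """
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term


confident = adversarial_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
fooled = adversarial_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(confident > fooled)  # a discriminator that separates the two scores higher
```

In the cross-modal setting only the discriminator input changes (a triplet instead of a single image); the form of the loss stays the same.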
Image Cycle-Consistency Loss. To better learn the image cycle I2I2I, we propose an image cycle-consistency loss, which requires the reconstructed image to closely match the input image. Note that we use the image generator twice with the parameter-sharing strategy, and the image cycle-consistency loss uses a distance that computes the pixel-to-pixel difference between the recovered image and the real input image.
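A pixel-to-pixel reconstruction penalty of this kind can be sketched as below. The choice of the mean absolute (L1) difference is an assumption made for illustration, since the exact distance is not specified here; images are nested lists indexed `[channel][row][col]`.

```python
def mean_abs_diff(a, b):
    """Mean absolute pixel difference between two images of equal shape,
    each given as nested lists indexed [channel][row][col].
    The L1 form is an assumed choice of distance."""
    total, count = 0.0, 0
    for chan_a, chan_b in zip(a, b):
        for row_a, row_b in zip(chan_a, chan_b):
            for pa, pb in zip(row_a, row_b):
                total += abs(pa - pb)
                count += 1
    return total / count


x = [[[0.0, 1.0], [0.5, 0.5]]]      # toy input image, 1 x 2 x 2
x_rec = [[[0.0, 0.5], [0.5, 1.0]]]  # toy reconstruction
print(mean_abs_diff(x, x_rec))  # 0.25
```

The same helper could serve for any other pixel-wise term that compares a generated map with its real counterpart.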
3.3. Keypoint-Domain Generative Adversarial Cycle
The motivation of the keypoint cycle is that, if the generated keypoint is similar to the real keypoint, then the corresponding two images should be very close, as we can see in Fig. 3. We have two keypoint cycles, K2G2K and K2R2K. Both provide a supervision signal for better optimizing the image cycle.
K2G2K Cycle. For the K2G2K cycle, we feed the input image and the target keypoint into the image generator to produce the target image. Then we employ the keypoint generator to produce a keypoint image from the generated image. The generated keypoint should be very close to the real target keypoint image.
K2R2K Cycle. For the K2R2K cycle, the generated image and the keypoint image of the input image are first concatenated and then fed into the image generator to produce the recovered image. We then use the keypoint generator to produce the keypoint image of the recovered image, which should be very similar to the real keypoint image of the input image.
The generated keypoints of the two cycles should closely match the target keypoint image and the input keypoint image, respectively. Note that the keypoint generator can share parameters between the two cycles, i.e., K2G2K and K2R2K.
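The three cycles can be summarized side by side. The notation below is assumed for illustration only: $x$ and $y$ denote the input and target images, $k_x$ and $k_y$ their keypoint images, $G_i$ and $G_k$ the image and keypoint generators, primes mark generated outputs and hats mark reconstructions.

```latex
\begin{aligned}
\text{I2I2I:}\quad & [x, k_y] \xrightarrow{G_i} y', \qquad [y', k_x] \xrightarrow{G_i} \hat{x} \approx x \\
\text{K2G2K:}\quad & [x, k_y] \xrightarrow{G_i} y' \xrightarrow{G_k} k_y' \approx k_y \\
\text{K2R2K:}\quad & [y', k_x] \xrightarrow{G_i} \hat{x} \xrightarrow{G_k} \hat{k}_x \approx k_x
\end{aligned}
```

Read this way, both keypoint cycles pass through the image generator before reaching the keypoint generator, which is how a keypoint error can act as supervision on the generated and recovered images.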
Keypoint Generator. We employ the U-net structure (Ronneberger et al., 2015) for our keypoint generator. Its input is an image and its output is a keypoint representation. The keypoint generator produces keypoints from the generated image and from the recovered image, which can provide extra supervision to the image generator.
Cross-Modal Keypoint Discriminator. The proposed keypoint discriminator is a cross-modal discriminator that receives both image and keypoint data as inputs, and the keypoint adversarial loss is defined on these cross-modal inputs. The keypoint generator tries to minimize the keypoint adversarial loss while the keypoint discriminator tries to maximize it. The keypoint discriminator aims to distinguish between fake pairs, which contain a generated keypoint, and real pairs, which contain a real keypoint. A similar keypoint adversarial loss is defined for the keypoint mapping of the other keypoint cycle.
Keypoint Cycle-Consistency Loss. To better learn both keypoint cycles, we propose a keypoint cycle-consistency loss. It uses a distance that computes the pixel-to-pixel difference between the generated keypoints of the two cycles and the corresponding real keypoints. During the training stage, the keypoint cycle-consistency loss backpropagates errors from the keypoint generator to the image generator, which facilitates the optimization of the image generator and thus improves the image generation.
3.4. Joint Optimization Objective
We also note that a pixel loss (Siarohin et al., 2018; Ma et al., 2017) can be used to reduce changes and constrain the generators. Thus we adopt an image pixel loss between the real target images and the generated images, using a pixel-wise distance as the loss measurement. Consequently, the complete objective is a weighted sum of the image and keypoint adversarial losses, the image and keypoint cycle-consistency losses and the image pixel loss, where five weighting parameters control the relative importance of the terms. The generators are trained to minimize this objective, while the discriminators are trained to maximize it.
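A weighted combination of five loss terms can be sketched as follows. The dictionary keys are illustrative names, and the grouping of weights (1 for the two adversarial terms, 10 for the three reconstruction-type terms) is an assumption that is consistent with the values 1, 10, 10, 1 and 10 reported in the parameter settings.

```python
# Assumed grouping: weight 1 for both adversarial terms and 10 for the three
# reconstruction-type terms, matching the values reported in the experiments.
WEIGHTS = {
    "image_adversarial": 1.0,
    "keypoint_adversarial": 1.0,
    "image_cycle": 10.0,
    "keypoint_cycle": 10.0,
    "image_pixel": 10.0,
}


def total_objective(losses, weights=WEIGHTS):
    """Weighted sum of the five loss terms; raises KeyError if one is missing."""
    return sum(weights[name] * losses[name] for name in weights)


# Toy per-term values, chosen only to exercise the function.
example = {
    "image_adversarial": 0.7,
    "keypoint_adversarial": 0.6,
    "image_cycle": 0.05,
    "keypoint_cycle": 0.02,
    "image_pixel": 0.08,
}
print(round(total_objective(example), 4))
```

Keeping the weights in one mapping makes it straightforward to run a grid search over the reconstruction weights while the adversarial weights stay fixed.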
3.5. Implementation Details
In this section, we introduce the detailed network implementation, the training strategy and the inference.
Network Architecture. For a fair comparison, we use the U-net architecture in PG (Ma et al., 2017) for our generators. The encoder of each generator is built from basic Convolution-BatchNorm-LReLU layers, and the decoder from basic Convolution-BatchNorm-ReLU layers. The leaky ReLUs in the encoder have a slope of 0.2, while the ReLUs in the decoder are not leaky. After the last layer, a Tanh function is used. We employ the PatchGAN discriminator (Isola et al., 2017; Zhu et al., 2017) for both the image discriminator and the keypoint discriminator. The discriminators are built from basic Convolution-BatchNorm-ReLU layers. All ReLUs are leaky, with a slope of 0.2. After the last layer, a convolution is applied to map the features to a 1-D output, followed by a Sigmoid function.
Training Strategy. We follow the standard optimization method from (Goodfellow et al., 2014) to optimize the proposed C$^2$GAN, i.e., we alternate between one gradient descent step on each of the generators and discriminators in turn. The proposed C$^2$GAN is trained end-to-end and generates the image and the keypoint image simultaneously, so that the generated keypoint benefits the quality of the generated image. Moreover, in order to slow down the rate at which the discriminators learn relative to the generators, we divide the objectives by 2 while optimizing the discriminators. To make the discriminators remember what they have judged right or wrong before, we use a history of generated images to update the discriminators, as in (Zhu et al., 2017). Moreover, we employ OpenFace (Amos et al., 2016) and OpenPose (Cao et al., 2017) to extract keypoint images on the Radboud Faces and Market-1501 datasets, respectively. Keypoints of the Market-1501 dataset are represented as heatmaps, as in PG (Ma et al., 2017). In contrast, on the Radboud Faces dataset we set the background of the heatmap to white and the keypoints to black.
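The history of generated images and the halved discriminator objective can be sketched as below. The buffer follows the style of the image pool used with CycleGAN; the default capacity of 50 and the 0.5 swap probability are assumptions borrowed from that implementation, and the class and function names are illustrative.

```python
import random


class ImageHistory:
    """Buffer of past generated samples, in the style of CycleGAN's image pool.

    capacity=50 and the 0.5 swap probability are assumptions borrowed from
    that implementation, not values stated in this paper.
    """

    def __init__(self, capacity=50, seed=None):
        self.capacity = capacity
        self.items = []
        self.rng = random.Random(seed)

    def query(self, sample):
        """Return the sample to show the discriminator for this step."""
        if len(self.items) < self.capacity:
            self.items.append(sample)
            return sample
        if self.rng.random() < 0.5:
            index = self.rng.randrange(self.capacity)
            old, self.items[index] = self.items[index], sample
            return old
        return sample


def discriminator_objective(real_loss, fake_loss):
    """Halve the summed loss to slow the discriminator relative to the generator."""
    return 0.5 * (real_loss + fake_loss)


history = ImageHistory(capacity=2, seed=0)
shown = [history.query(f"fake_{step}") for step in range(6)]
print(shown)
```

Once the buffer is full, the discriminator sometimes sees an older generated sample instead of the newest one, which keeps it from adapting only to the generator's latest outputs.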
Inference. At inference time, we follow the same settings as PG (Ma et al., 2017) and PoseGAN (Siarohin et al., 2018): we input an image and a target keypoint into the image generator and obtain the output target image. Similarly, the keypoint generator receives an image as input and outputs the keypoint of that image. We employ the same settings at both the training and inference stages.
4. Experiments
In this section, we first introduce the details of the datasets used in our experiments, and then we demonstrate the effectiveness of the proposed C$^2$GAN and training strategy by presenting and analyzing qualitative and quantitative results.
4.1. Experimental Setup
Datasets. We employ two public datasets to validate the proposed C$^2$GAN on two different tasks: the Radboud Faces dataset (Langner et al., 2010) for the landmark-guided facial expression generation task, and the Market-1501 dataset (Zheng et al., 2015) for the keypoint-guided person image generation task.
(i) The Radboud Faces dataset (Langner et al., 2010) contains over 8,000 color face images collected from 67 subjects with eight different emotional expressions, i.e., anger, fear, disgust, sadness, happiness, surprise, neutral and contempt. It contains 1,005 images for each emotion, captured from five cameras at different angles, and each subject is asked to show three different gaze directions. For each emotion, we randomly select 67% of the images as training data and the remaining 33% as testing data. Different from StarGAN (Choi et al., 2018), all the images in our experiments are re-scaled to a fixed resolution without any other pre-processing. For the landmark-guided facial expression generation task, we need pairs of images of the same face with two different expressions. We first remove those images in which the face is not detected correctly by the public OpenFace software (Amos et al., 2016), leading to 5,628 training image pairs and 1,407 testing image pairs.
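The per-emotion random split can be sketched as follows. This is a minimal illustration with hypothetical file names and a fixed seed for reproducibility; it does not model the later removal of images with failed face detection.

```python
import random


def split_per_emotion(images_by_emotion, train_fraction=0.67, seed=0):
    """Randomly split each emotion's images into train and test lists."""
    rng = random.Random(seed)
    train, test = {}, {}
    for emotion, images in images_by_emotion.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        train[emotion] = shuffled[:cut]
        test[emotion] = shuffled[cut:]
    return train, test


# Hypothetical file names; 1,005 images per emotion as stated above.
data = {"happiness": [f"happy_{i:04d}.jpg" for i in range(1005)]}
train, test = split_per_emotion(data)
print(len(train["happiness"]), len(test["happiness"]))
```

Splitting within each emotion, rather than over the pooled images, keeps the train/test ratio the same for every expression class.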
(ii) The Market-1501 dataset (Zheng et al., 2015) is a more challenging person re-identification dataset, and we use it for the person keypoint and person image generation task. This dataset contains 32,668 images of 1,501 persons captured from six disjoint surveillance cameras. Persons vary in pose, illumination, viewpoint and background in this dataset, which makes the person image generation task more challenging. We follow the setup in PoseGAN (Siarohin et al., 2018). For the training subset, we obtain 263,631 pairs, each composed of two images of the same person in different poses. For the testing subset, we randomly select 12,000 pairs. Note that no person appears in both the training and testing subsets.
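Building same-person image pairs can be sketched as below. The sketch assumes file names that begin with a person identifier followed by an underscore, and it assumes ordered (source, target) pairs; both are illustrative assumptions, and any filtering of images with failed pose detection is not modeled.

```python
from collections import defaultdict
from itertools import permutations


def build_pairs(filenames):
    """Group images by the identity prefix before the first underscore and
    return every ordered (source, target) pair within each identity."""
    by_person = defaultdict(list)
    for name in filenames:
        by_person[name.split("_")[0]].append(name)
    pairs = []
    for images in by_person.values():
        pairs.extend(permutations(images, 2))
    return pairs


# Hypothetical file names in an identity-first naming scheme.
names = ["0002_c1_a.jpg", "0002_c2_b.jpg", "0002_c3_c.jpg", "0007_c1_a.jpg"]
pairs = build_pairs(names)
print(len(pairs))  # 3 images of person 0002 give 3 * 2 = 6 ordered pairs
```

Because the number of pairs grows quadratically with the number of images per person, a dataset of tens of thousands of images can yield hundreds of thousands of training pairs.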
|Model||AMT||PSNR||SSIM|
|C$^2$GAN w/ I2I2I||25.3||21.2030||0.8449|
|C$^2$GAN w/ I2I2I+K2G2K||28.2||20.8708||0.8419|
|C$^2$GAN w/ I2I2I+K2R2K||28.7||21.0156||0.8437|
|C$^2$GAN w/ I2I2I+K2G2K+K2R2K||30.8||21.6262||0.8540|
|C$^2$GAN w/ Single-Modal D||26.4||21.2794||0.8426|
|C$^2$GAN w/ Non-Sharing G||32.9||21.6353||0.8611|
Parameter Setting. For both datasets, we apply left-right flipping for data augmentation, as in PG (Ma et al., 2017). For optimization, the proposed C$^2$GAN is trained with a batch size of 16 on the Radboud Faces dataset. For a fair comparison, all competing models are trained for 200 epochs on the Radboud Faces dataset. We use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.0002. For the person image generation task, we train the model for 90 epochs with a smaller batch size of 4.
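For keypoint-guided data, a left-right flip has to be applied to the keypoints as well as to the image. The following is a minimal sketch with a toy image stored as a list of pixel rows and keypoints stored as `(row, col)` tuples; the function name and data layout are illustrative.

```python
def flip_left_right(image_rows, keypoints):
    """Mirror an image (list of pixel rows) and its (row, col) keypoints.

    Semantic left/right labels (e.g. left vs. right shoulder) are not swapped
    here; whether to swap them is a separate design decision.
    """
    width = len(image_rows[0])
    flipped_image = [list(reversed(row)) for row in image_rows]
    flipped_points = [(r, width - 1 - c) for r, c in keypoints]
    return flipped_image, flipped_points


image = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
points = [(0, 0), (1, 2)]
flipped_image, flipped_points = flip_left_right(image, points)
print(flipped_image)   # [[4, 3, 2, 1], [8, 7, 6, 5]]
print(flipped_points)  # [(0, 3), (1, 1)]
```

Flipping only the image would leave the keypoints pointing at the wrong locations, so the two must always be transformed together.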
Moreover, we found that the keypoint generator cannot produce accurate keypoints in the early training stage since the image generator produces blurry images during this phase. Therefore, we employ a pre-trained OpenPose model (Cao et al., 2017), which replaces the keypoint generator to produce keypoints with location coordinates at the beginning of the training stage. We also minimize the distance between the generated keypoints (from the generated image) and the corresponding ground truth keypoints (from the ground truth image). Finally, we incorporate the mask loss proposed in PG for person image generation task.
For the hyper-parameter setting, we fix the weights of the two adversarial losses to 1 and tune the rest using grid search. We found that setting the weights of the three reconstruction losses between 10 and 100 yields good performance. Thus, the five weights in Eq. (13) are set to 1, 10, 10, 1 and 10, respectively. The proposed C$^2$GAN is implemented using the public deep learning framework PyTorch. To speed up the training and testing processes, we use an NVIDIA TITAN Xp GPU with 12 GB of memory.
Evaluation Metric. We first adopt AMT perceptual studies to evaluate the quality of the generated images on both datasets similar to (Ma et al., 2017; Siarohin et al., 2018). To seek a quantitative measure that does not require human participation, Structural Similarity (SSIM) (Wang et al., 2004) and Peak Signal-to-Noise Ratio (PSNR) are employed to evaluate the quantitative quality of generated images on the Radboud Faces dataset.
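PSNR follows directly from the mean squared error between a reference image and a generated image. The sketch below assumes 8-bit grayscale images stored as lists of pixel rows, so the peak value is 255; SSIM is more involved and is usually taken from an image-processing library rather than re-implemented.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-size grayscale
    images given as lists of pixel rows. Returns inf for identical images."""
    squared_error, count = 0.0, 0
    for row_ref, row_gen in zip(reference, generated):
        for p_ref, p_gen in zip(row_ref, row_gen):
            squared_error += (p_ref - p_gen) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


real = [[100, 110], [120, 130]]
fake = [[101, 109], [121, 129]]
print(round(psnr(real, fake), 2))
```

Higher values indicate a generated image that is closer to the reference; every pixel being off by one intensity level, as in the toy example, corresponds to a mean squared error of 1.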
4.2. Model Analysis
We first investigate the effect of combining different individual generation cycles to demonstrate the importance of the proposed cycle-in-cycle network structure. We then evaluate the parameter-sharing strategy used in the generators for reducing the network capacity, and finally test the influence of the cross-modal discriminators on performance. All the comparison experiments are conducted by training the models for 50 epochs on the Radboud Faces dataset. Fig. 4 shows examples of the qualitative results and Table 1 shows the quantitative results.
|Model||Publish||AMT||SSIM||PSNR|
|GPGAN (Di et al., 2018)||ICPR 2018||0.3||0.8185||18.7211|
|PG (Ma et al., 2017)||NIPS 2017||28.4||0.8462||20.1462|
Influence of Individual Generation Cycle. To evaluate the influence of each individual generation cycle, we test four different combinations of the cycles, i.e., I2I2I, I2I2I+K2G2K, I2I2I+K2R2K, and I2I2I+K2G2K+K2R2K. All four baselines use the same training strategies and hyper-parameters. As we can see in Table 1, I2I2I, K2G2K and K2R2K are all critical to our final results, and removing any one of them degrades the generation performance. This confirms our initial intuition that using cross-modal information in a joint generation framework, and making the cycles constrain each other, boosts the final performance. I2I2I+K2G2K+K2R2K obtains the best performance, which is significantly better than the single-cycle image network I2I2I, demonstrating the effectiveness of the proposed CGAN.
Cross-Modal Discriminator vs. Single-Modal Discriminator. We also evaluate the influence of the proposed cross-modal discriminator (CGAN w/ I2I2I+K2G2K+K2R2K) on performance. Our baseline is the traditional single-modal discriminator (CGAN w/ Single-Modal D), in which the discriminator receives only images as input, i.e., the real input images and the generated images. From Table 1, it is clear that the proposed cross-modal discriminator performs better than the single-modal discriminator on all evaluation metrics, suggesting that the rich cross-modal information helps to learn a better discriminator and thus facilitates the optimization of the generator.
Parameter Sharing between Generators. Parameter sharing can remarkably reduce the number of parameters of the whole network. We further evaluate how parameter sharing affects generation performance. We test two different baselines. The first is CGAN w/ I2I2I+K2G2K+K2R2K, which shares the parameters of the two image generators and of the two keypoint generators, respectively. The second, CGAN w/ Non-Sharing G, learns the four generators separately. We can observe from Table 1 that the non-sharing model achieves slightly better performance than the sharing one. However, the non-sharing model has 217.6M parameters, which is double that of the sharing one. This means that parameter sharing is a good strategy for balancing performance and overhead.
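The parameter accounting above can be made concrete with a toy sketch. It is not the network code: `ToyGenerator` is a hypothetical stand-in that only stores a parameter count, and the equal split of the reported 217.6M parameters over the four generators (54.4M each) is an assumption made purely for illustration, since the per-generator sizes are not given here.

```python
class ToyGenerator:
    """Hypothetical stand-in for a generator network; stores only a parameter count."""

    def __init__(self, num_params_millions):
        self.num_params_millions = num_params_millions


def count_unique_params(generators):
    """Sum parameters over distinct generator objects, so a shared one counts once."""
    seen = {}
    for g in generators:
        seen[id(g)] = g.num_params_millions
    return sum(seen.values())


# Assumption: the 217.6M non-sharing parameters are split evenly over four generators.
PER_GENERATOR = 54.4

# Non-sharing: four separately learned generators.
non_sharing = [ToyGenerator(PER_GENERATOR) for _ in range(4)]

# Sharing: one image generator and one keypoint generator, each used in two roles.
image_g = ToyGenerator(PER_GENERATOR)
keypoint_g = ToyGenerator(PER_GENERATOR)
sharing = [image_g, image_g, keypoint_g, keypoint_g]

non_sharing_total = count_unique_params(non_sharing)
sharing_total = count_unique_params(sharing)
```

Under this assumption the sharing configuration holds about 108.8M parameters, which is consistent with the statement that the non-sharing model has twice as many.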
|Model||Publish||AMT (R2G)||AMT (G2R)||SSIM||IS||mask-SSIM||mask-IS|
|PG (Ma et al., 2017)||NIPS 2017||11.2||5.5||0.253||3.460||0.792||3.435|
|DPIG (Ma et al., 2018)||CVPR 2018||-||-||0.099||3.483||0.614||3.491|
|PoseGAN (Siarohin et al., 2018)||CVPR 2018||22.7||50.2||0.290||3.185||0.805||3.502|
4.3. Comparison against the State-of-the-Art
Competing Models. We consider several state-of-the-art keypoint-guided image generation models as our competitors, i.e., GPGAN (Di et al., 2018), PG (Ma et al., 2017), DPIG (Ma et al., 2018) and PoseGAN (Siarohin et al., 2018). Note that PG, DPIG and PoseGAN all need off-the-shelf keypoint detection models to extract the keypoints, and are therefore restricted to datasets or tasks where the keypoint information is available. Different from PG, DPIG and PoseGAN, which focus on the person image generation task, the proposed CGAN is a general model and learns image and keypoint generation simultaneously in a joint network. For a fair comparison, we implement all the models using the same setups as our approach.
Task 1: Landmark-Guided Facial Expression Generation. A qualitative comparison of different models on the Radboud Faces dataset is shown in Fig. 5. It is clear that GPGAN performs the worst among all the compared models, while the results of PG tend to be blurry. Compared with both GPGAN and PG, the results of CGAN are smoother, sharper and contain more image details.
Existing competitors such as PG (Ma et al., 2017) conduct their experiments on 256×256 resolution images. For a fair comparison with them, we also conduct experiments at this image resolution, but note that the proposed CGAN can be applied to images of any size with only a small architecture modification.
Task 2: Keypoint-Guided Person Image Generation. Fig. 8 shows the results of PG, PoseGAN and CGAN on the Market-1501 dataset. As we can see, the proposed CGAN is able to generate visually better images than PG and PoseGAN. For the first row of results in Fig. 8, CGAN generates reasonable results while PG cannot produce any meaningful content. PoseGAN generates the person but cannot preserve the color information. For the second row of results in Fig. 8, both PoseGAN and PG fail to generate the same child, while CGAN generates the same child with only a small part missing at the head. For the last row of results in Fig. 8, we can clearly observe the advantage of CGAN, as both PoseGAN and PG fail to generate the hat. We also provide more qualitative results of CGAN in Fig. 6. As we can see, the proposed CGAN generates photo-realistic images with convincing details. Moreover, the generated images are very close to the ground truths.
Finally, we also note that GANs are difficult to train and are prone to mode collapse. However, in keypoint-guided image generation tasks, avoiding mode collapse is not strictly necessary: given a person image and a target pose as input, the model tries to generate this particular person in this particular pose.
Quantitative Comparison of Both Tasks. We provide here quantitative results and analysis on both tasks. As shown in Table 2, CGAN achieves the best performance on the Radboud Faces dataset on all the metrics for the landmark-guided facial expression generation task. Moreover, we quantitatively compare the proposed CGAN with PoseGAN, DPIG and PG on the keypoint-guided person image generation task in Table 3. We can observe that CGAN obtains better performance than PG and DPIG on all the evaluation metrics except for IS. Compared with PoseGAN, CGAN yields very competitive performance. Specifically, we achieve better performance in terms of the AMT (R2G), IS, mask-SSIM and mask-IS metrics.
4.4. Visualization of Keypoint Generation
CGAN is a cross-modal generation model: it is able to produce not only the target person but also the keypoint of the input image. Both generation tasks benefit from each other's improvement in an end-to-end training fashion. We present examples of the keypoint generation results on the Radboud Faces dataset in Fig. 7. The inputs are an image and a keypoint, and the outputs are the generated image and keypoint; the other images and keypoints are given for comparison. As we can see, the generated keypoint is very close to the real keypoint, which verifies the effectiveness of the keypoint generator and the joint learning strategy.
5. Conclusion
In this paper, we propose a novel Cycle In Cycle Generative Adversarial Network (CGAN) for the keypoint-guided image generation task. CGAN contains two different types of generators, i.e., a keypoint-oriented generator and an image-oriented generator. The image generator aims at reconstructing the target image based on a conditional image and the target keypoint, and the keypoint generator tries to generate the target keypoint and further provides cycle supervision to the image generator for generating more photo-realistic images. Both generators are connected in a unified network and can be optimized in an end-to-end fashion. Both qualitative and quantitative experimental results on facial expression and person pose generation tasks demonstrate that our proposed framework is effective at generating high-quality images with convincing details.
We want to thank the NVIDIA Corporation for the donation of the TITAN Xp GPUs used in this work.
- Amos et al. (2016) Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satyanarayanan. 2016. OpenFace: A general-purpose face recognition library with mobile applications. Technical Report. CMU-CS-16-118, CMU School of Computer Science.
- Anoosheh et al. (2018) Asha Anoosheh, Eirikur Agustsson, Radu Timofte, and Luc Van Gool. 2018. ComboGAN: Unrestrained Scalability for Image Domain Translation. In CVPR Workshop.
- Brock et al. (2018) Andrew Brock, Jeff Donahue, and Karen Simonyan. 2018. Large scale gan training for high fidelity natural image synthesis. In ICLR.
- Cao et al. (2017) Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR.
- Chan et al. (2018) Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. 2018. Everybody dance now. In ECCV Workshop.
- Chen et al. (2018) Xinyuan Chen, Chang Xu, Xiaokang Yang, and Dacheng Tao. 2018. Attention-GAN for object transfiguration in wild images. In ECCV.
- Choi et al. (2018) Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. 2018. StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. In CVPR.
- Di et al. (2018) Xing Di, Vishwanath A Sindagi, and Vishal M Patel. 2018. GP-GAN: gender preserving GAN for synthesizing faces from landmarks. In ICPR.
- Dolhansky and Canton Ferrer (2018) Brian Dolhansky and Cristian Canton Ferrer. 2018. Eye in-painting with exemplar generative adversarial networks. In CVPR.
- Duan et al. (2019) Bin Duan, Wei Wang, Hao Tang, Hugo Latapie, and Yan Yan. 2019. Cascade Attention Guided Residue Learning GAN for Cross-Modal Translation. arXiv preprint:1907.01826 (2019).
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In NIPS.
- Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. 2017. Improved training of wasserstein gans. In NIPS.
- Isola et al. (2017) Phillip Isola, Junyan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. In CVPR.
- Karras et al. (2018) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive growing of gans for improved quality, stability, and variation. In ICLR.
- Karras et al. (2019) Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In CVPR.
- Kim et al. (2017) Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, and Jiwon Kim. 2017. Learning to discover cross-domain relations with generative adversarial networks. In ICML.
- Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
- Korshunova et al. (2017) Iryna Korshunova, Wenzhe Shi, Joni Dambre, and Lucas Theis. 2017. Fast face-swap using convolutional neural networks. In ICCV.
- Langner et al. (2010) Oliver Langner, Ron Dotsch, Gijsbert Bijlstra, Daniel HJ Wigboldus, Skyler T Hawk, and AD Van Knippenberg. 2010. Presentation and validation of the Radboud Faces Database. Cognition and Emotion (2010).
- Ma et al. (2017) Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. 2017. Pose guided person image generation. In NIPS.
- Ma et al. (2018) Liqian Ma, Qianru Sun, Stamatios Georgoulis, Luc Van Gool, Bernt Schiele, and Mario Fritz. 2018. Disentangled person image generation. In CVPR.
- Ma et al. (2018) Shuang Ma, Jianlong Fu, Chang Wen Chen, and Tao Mei. 2018. DA-GAN: Instance-level image translation by deep attention generative adversarial networks. In CVPR.
- Mansimov et al. (2015) Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. 2015. Generating images from captions with attention. In ICLR.
- Mathieu et al. (2016) Michael Mathieu, Camille Couprie, and Yann LeCun. 2016. Deep multi-scale video prediction beyond mean square error. In ICLR.
- Mejjati et al. (2018) Youssef Alami Mejjati, Christian Richardt, James Tompkin, Darren Cosker, and Kwang In Kim. 2018. Unsupervised attention-guided image-to-image translation. In NeurIPS.
- Mirza and Osindero (2014) Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. arXiv preprint:1411.1784 (2014).
- Mo et al. (2019) Sangwoo Mo, Minsu Cho, and Jinwoo Shin. 2019. InstaGAN: Instance-aware Image-to-Image Translation. In ICLR.
- Odena (2016) Augustus Odena. 2016. Semi-supervised learning with generative adversarial networks. In ICML Workshop.
- Park et al. (2019) Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. Semantic image synthesis with spatially-adaptive normalization. In CVPR.
- Perarnau et al. (2016) Guim Perarnau, Joost van de Weijer, Bogdan Raducanu, and Jose M Álvarez. 2016. Invertible Conditional GANs for image editing. In NIPS Workshop.
- Qiao et al. (2018) Fengchun Qiao, Naiming Yao, Zirui Jiao, Zhihao Li, Hui Chen, and Hongan Wang. 2018. Geometry-Contrastive Generative Adversarial Network for Facial Expression Synthesis. arXiv preprint:1802.01822 (2018).
- Reed et al. (2016b) Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016b. Generative Adversarial Text-to-Image Synthesis. In ICML.
- Reed et al. (2016c) Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Victor Bapst, Matt Botvinick, and Nando de Freitas. 2016c. Generating interpretable images with controllable structure. Technical Report (2016).
- Reed et al. (2016a) Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. 2016a. Learning what and where to draw. In NIPS.
- Regmi and Borji (2018) Krishna Regmi and Ali Borji. 2018. Cross-view image synthesis using conditional gans. In CVPR.
- Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In MICCAI.
- Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training gans. In NIPS.
- Siarohin et al. (2019) Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. 2019. Animating arbitrary objects via deep motion transfer. In CVPR.
- Siarohin et al. (2018) Aliaksandr Siarohin, Enver Sangineto, Stephane Lathuiliere, and Nicu Sebe. 2018. Deformable GANs for Pose-based Human Image Generation. In CVPR.
- Song et al. (2018) Lingxiao Song, Zhihe Lu, Ran He, Zhenan Sun, and Tieniu Tan. 2018. Geometry guided adversarial facial expression synthesis. In ACM MM.
- Sun et al. (2018) Qianru Sun, Liqian Ma, Seong Joon Oh, Luc Van Gool, Bernt Schiele, and Mario Fritz. 2018. Natural and effective obfuscation by head inpainting. In CVPR.
- Taigman et al. (2017) Yaniv Taigman, Adam Polyak, and Lior Wolf. 2017. Unsupervised cross-domain image generation. In ICLR.
- Tang et al. (2018) Hao Tang, Wei Wang, Dan Xu, Yan Yan, and Nicu Sebe. 2018. GestureGAN for Hand Gesture-to-Gesture Translation in the Wild. In ACM MM.
- Tang et al. (2019b) Hao Tang, Dan Xu, Nicu Sebe, Yanzhi Wang, Jason J Corso, and Yan Yan. 2019b. Multi-channel attention selection gan with cascaded semantic guidance for cross-view image translation. In CVPR.
- Tang et al. (2019a) Hao Tang, Dan Xu, Nicu Sebe, and Yan Yan. 2019a. Attention-Guided Generative Adversarial Networks for Unsupervised Image-to-Image Translation. In IJCNN.
- Tang et al. (2018) Hao Tang, Dan Xu, Wei Wang, Yan Yan, and Nicu Sebe. 2018. Dual Generator Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. In ACCV.
- Van Den Oord et al. (2016) Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. In SSW.
- Vondrick et al. (2016) Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. 2016. Generating videos with scene dynamics. In NIPS.
- Wang et al. (2018a) Tingchun Wang, Mingyu Liu, Junyan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018a. Video-to-video synthesis. In NeurIPS.
- Wang et al. (2018b) Tingchun Wang, Mingyu Liu, Junyan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018b. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In CVPR.
- Wang et al. (2018) Wei Wang, Xavier Alameda-Pineda, Dan Xu, Pascal Fua, Elisa Ricci, and Nicu Sebe. 2018. Every smile is unique: Landmark-guided diverse smile generation. In CVPR.
- Wang and Gupta (2016) Xiaolong Wang and Abhinav Gupta. 2016. Generative image modeling using style and structure adversarial networks. In ECCV.
- Wang et al. (2004) Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE TIP 13, 4 (2004), 600–612.
- Yan et al. (2017) Yichao Yan, Jingwei Xu, Bingbing Ni, Wendong Zhang, and Xiaokang Yang. 2017. Skeleton-aided Articulated Motion Generation. In ACM MM.
- Yi et al. (2017) Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. 2017. DualGAN: Unsupervised Dual Learning for Image-to-Image Translation. In ICCV.
- Yu et al. (2017) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. In AAAI.
- Zhang et al. (2019) Jichao Zhang, Meng Sun, Jingjing Chen, Hao Tang, Yan Yan, Xueying Qin, and Nicu Sebe. 2019. GazeCorrection: Self-Guided Eye Manipulation in the wild using Self-Supervised Generative Adversarial Networks. arXiv preprint:1906.00805 (2019).
- Zheng et al. (2015) Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015. Scalable person re-identification: A benchmark. In ICCV.
- Zhou et al. (2017) Shuchang Zhou, Taihong Xiao, Yi Yang, Dieqiao Feng, Qinyao He, and Weiran He. 2017. GeneGAN: Learning Object Transfiguration and Attribute Subspace from Unpaired Data. In BMVC.
- Zhu et al. (2017) Junyan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.