Unpaired Image Translation via Adaptive Convolution-based Normalization

  • 2019-11-29 18:16:03
  • Wonwoong Cho, Kangyeol Kim, Eungyeup Kim, Hyunwoo J. Kim, Jaegul Choo

Abstract

Disentangling content and style information of an image has played an important role in recent success in image translation. In this setting, how to inject given style into an input image containing its own content is an important issue, but existing methods followed relatively simple approaches, leaving room for improvement especially when incorporating significant style changes. In response, we propose an advanced normalization technique based on adaptive convolution (AdaCoN), in order to properly impose style information into the content of an input image. In detail, after locally standardizing the content representation in a channel-wise manner, AdaCoN performs adaptive convolution where the convolution filter weights are dynamically estimated using the encoded style representation. The flexibility of AdaCoN can handle complicated image translation tasks involving significant style changes. Our qualitative and quantitative experiments demonstrate the superiority of our proposed method against various existing approaches that inject the style into the content.

 


Wonwoong Cho*, Kangyeol Kim*, Eungyeup Kim, Hyunwoo J. Kim, Jaegul Choo
Korea University
{tyflehd21,kky1994,yhy1254,hyunwoojkim,jchoo}@korea.ac.kr
* These authors contributed equally
  


1 Introduction

Unpaired image-to-image translation Zhu et al. (2017); Kim et al. (2017); Choi et al. (2018) has recently been an active research area. It aims to learn inter-domain mappings without paired images, so that deep neural networks can translate a given image from one domain to another (e.g., real photo → artwork). However, these methods bear a fundamental limitation: they generate a uni-modal output for a single image even when multiple diverse outputs may exist. In response, several approaches Lin et al. (2018); Huang et al. (2018); Lee et al. (2018); Xiao et al. (2018); Ma et al. (2019); Chang et al. (2018) have been proposed to achieve multi-modality, i.e., the capability of generating multiple outputs from a single input image by taking an additional input, such as an exemplar image conveying the detailed style information to transfer.

Although exemplar-based image translation achieves multi-modal outputs owing to its flexibility in reflecting an exemplar image that gives fine details of the intended style, the issue remains of how to properly impose the style feature extracted from the exemplar image onto a content image. Previous approaches Lee et al. (2018); Huang and Belongie (2017); Cho et al. (2019) commonly follow two steps: first standardizing the features and then applying a particular transformation. The first step can be regarded as removing the existing style information of the input image, and the second step imposes the exemplar style on the style-neutralized input feature.

As one of the state-of-the-art methods, adaptive instance normalization (AdaIN) Huang and Belongie (2017) has been successfully utilized to combine content and style in a number of studies Huang et al. (2018); Ma et al. (2019); Karras et al. (2019). AdaIN combines the two features by matching each channel's first-order statistics, e.g., the mean and the variance, in the content to those in the style. To this end, AdaIN first standardizes each channel of the content feature and then adaptively performs channel-wise scaling and shifting using parameters regressed from the style feature. Another recently proposed method, the group-wise deep whitening-and-coloring transformation (GDWCT) Cho et al. (2019), has shown a superior capability of imposing drastically different styles by matching higher-order statistics such as the covariance in addition to the first-order ones; we refer to this matching step as the coloring transformation.

In the above methods, we claim that the second step of imposing the target statistics can be viewed as a simpler variant or a special case of a convolution operation, as illustrated in Fig. 1. That is, (a) the channel-wise affine transformation used in AdaIN can be viewed as a channel-wise 1×1 convolution. (b) On the other hand, the coloring transformation of GDWCT, which matches the target covariance, can be considered a 1×1 convolution operation that generates each output channel as a linear combination of all the input channels. (An additional illustration can be found in the Appendix.) However, because a 1×1 filter operates on each spatial position independently, these methods have a limited capability of translating significant transfiguration and tend to fail in handling a dramatic shape change.
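The equivalence claimed in (a) and (b) can be checked numerically. The following NumPy sketch is illustrative only and is not the authors' code: the helper `conv1x1`, the tensor sizes, and the random `gamma`, `beta`, and `coloring` values are assumptions standing in for the parameters that AdaIN and GDWCT would obtain from the style feature.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution. x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

rng = np.random.default_rng(0)
C, H, W = 4, 5, 5  # toy sizes, chosen only for illustration
content = rng.normal(size=(C, H, W))

# Shared first step: standardize each channel over its spatial positions.
mean = content.mean(axis=(1, 2), keepdims=True)
std = content.std(axis=(1, 2), keepdims=True)
normalized = (content - mean) / (std + 1e-5)

# (a) AdaIN-style affine transform: per-channel scale and shift.
gamma = rng.normal(size=C)
beta = rng.normal(size=C)
adain_out = gamma[:, None, None] * normalized + beta[:, None, None]
# Same result from a 1x1 convolution with a diagonal weight matrix,
# i.e. every output channel reads only its own input channel.
adain_as_conv = conv1x1(normalized, np.diag(gamma), beta)

# (b) Coloring-style transform: a full C x C matrix mixes all input channels.
coloring = rng.normal(size=(C, C))
colored = (coloring @ normalized.reshape(C, H * W)).reshape(C, H, W)
colored_as_conv = conv1x1(normalized, coloring, np.zeros(C))

print(np.allclose(adain_out, adain_as_conv))
print(np.allclose(colored, colored_as_conv))
```

With a diagonal weight matrix the 1×1 convolution reproduces the channel-wise affine transform, and with a full matrix it reproduces the channel mixing; in neither case does an output position depend on neighboring spatial positions.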

From this unified perspective of a convolution operation, these existing methods rely only on its simpler forms, using 1×1 convolution filters, and thus the potential of leveraging general convolution operations with larger-than-1×1 filters when injecting the target style has not yet been fully explored.

Inspired by this, we propose adaptive convolution-based normalization (AdaCoN) as an advanced method to inject the target style into a given image. AdaCoN consists of two steps: standardization and adaptive convolution. First, the standardization is performed locally on each sub-region of an input activation map where the convolution filter is applied, similar to previous work Jarrett et al. (2009); Krizhevsky et al. (2012). Second, AdaCoN performs adaptive convolution where the (larger-than-1×1) convolution filter weights are dynamically estimated using the encoded style representation.

By taking into account spatial patterns due to a convolution operation, we hypothesize that AdaCoN is capable of flexibly performing a spatially-adaptive image translation, which can potentially handle complicated image translation tasks involving significant style changes. In this sense, AdaCoN has something in common with the recent success in patch-based style transfer Chen and Schmidt (2016); Gu et al. (2018) that dynamically applies different styles to each patch of an input image.

In order to verify the superiority of AdaCoN, we conduct both quantitative and qualitative experiments that compare different normalization methods while maintaining the same model architectures.

Figure 1: Comparisons of different normalization methods for image translation. Each existing method can be viewed as a special case or a variant of a convolution operation.

2 Related work

Unpaired image translation.

Unpaired image translation aims to transform an input image from one domain to another without paired images. Numerous approaches Zhu et al. (2017); Kim et al. (2017); Liu et al. (2017) have been proposed for this task. Recently, multimodal image translation methods, capable of yielding multiple different images given a particular image, have also been studied Huang et al. (2018); Lee et al. (2018); Cho et al. (2019). These studies take similar approaches to address the uni-modality problem of previous methods by incorporating an exemplar image as guidance for image translation. In addition, they assume that a latent image space can be disentangled into a content space that contains the underlying structure of images and a style space that maintains domain-specific features. However, they propose different methods for integrating the disentangled content feature from the input image and the style feature from the exemplar image. To be specific, inspired by AdaIN Huang and Belongie (2017), MUNIT Huang et al. (2018) adopts the idea of matching the statistics between the content and the style features. Extending this idea, GDWCT Cho et al. (2019) leverages higher-order statistics than the previous method, enhancing the quality of generated images. Meanwhile, DRIT Lee et al. (2018) simply concatenates the content and the style features to perform image translation. However, these methods have a limited capability of handling drastic changes between the domains. A recently proposed method called InstaGAN Mo et al. (2018) tackles this problem by taking a segmentation mask as an additional input, which serves as a strong hint for translation.

Adaptive convolution.

Unlike standard convolution layers, whose filter weights are trainable but input-independent values, an adaptive convolution layer uses filter weights that are dynamically determined by the input data. Based on this idea, dynamic filter networks Jia et al. (2016) take an auxiliary input image to determine the convolution filter weights in a video prediction task. Furthermore, Kang et al. (2017) showed that convolution filter weights derived from side information, such as the camera perspective or the noise level, can be utilized to improve the performance of a classification task. Recent studies have applied adaptive convolution to a variety of tasks such as semantic segmentation Harley et al. (2017); Su et al. (2019) and motion prediction Xue et al. (2016). In this paper, we propose AdaCoN, which adaptively obtains convolution weights associated with convolution-based normalization for an image translation task.

Figure 2: Overview of our networks.

3 Proposed Methods

In this section, we briefly describe our backbone networks for an image translation task. Afterwards, we concretely describe our proposed method in detail.

3.1 Translation backbone

Networks overview.

Let $x_A$ and $x_B$ denote images randomly sampled from two different domains $\mathcal{X}_A$ and $\mathcal{X}_B$, respectively. Given the two images, our networks translate $x_A$ from domain $\mathcal{X}_A$ to domain $\mathcal{X}_B$, as well as $x_B$ from domain $\mathcal{X}_B$ to domain $\mathcal{X}_A$. To this end, we adopt the disentangling strategy Huang et al. (2018); Lee et al. (2018); Cho et al. (2019) that decomposes an image into a domain-invariant content feature (e.g., the identity of a person) and a domain-specific style feature (e.g., the hair length in the female domain). This can be formulated as

$$z_A^c, z_A^s = E_A^c(x_A), E_A^s(x_A), \qquad z_B^c, z_B^s = E_B^c(x_B), E_B^s(x_B), \tag{1}$$

where $\{E_A^c, E_B^c\}$ are content encoders and $\{E_A^s, E_B^s\}$ are style encoders. By combining the content and the style features of the different domains, $\{(z_B^c, z_A^s), (z_A^c, z_B^s)\}$, and forwarding them to the decoders $\{G_A, G_B\}$, we obtain the translated results $\{x_{B\to A}, x_{A\to B}\}$, i.e.,

$$x_{A\to B} = G_B(\mathrm{AdaCoN}(z_A^c, z_B^s)), \qquad x_{B\to A} = G_A(\mathrm{AdaCoN}(z_B^c, z_A^s)), \tag{2}$$

where AdaCoN indicates our adaptive convolution-based normalization that combines the given content and style features. As shown in Fig. 2, for example, suppose that $\mathcal{X}_A$ and $\mathcal{X}_B$ are the woman and the man domains, respectively, and that our networks translate a woman image into a man image $x_{A\to B}$. (a) We first extract the content feature $z_A^c$ from a woman image $x_A$ and the style feature $z_B^s$ from a man image $x_B$ by forwarding each image into the content encoder $E_A^c$ and the style encoder $E_B^s$. (b) We next inject the style into the content feature through AdaCoN and (c) forward the combined feature into the decoder $G_B$, following Eq. (2). After obtaining a fake man image $x_{A\to B}$, (d) we feed the fake image to a discriminator $D_B$ that encourages the generated image distribution to be close to the real image distribution. (e) Lastly, we repeat the processes of (a)-(c) in order to obtain a reconstructed woman image $x_{A\to B\to A}$, enabling our networks to maintain the original identity. In this manner, our networks are trained to translate images between two different domains.

Loss functions.

Our networks are trained with several losses, each of which plays a crucial role in training. In order to avoid redundancy, we focus on the translation $(\mathcal{X}_A \to \mathcal{X}_B \to \mathcal{X}_A)$ from this point on. First, we leverage pixel-level reconstruction losses, namely the cycle-consistency loss and the identity loss Zhu et al. (2017), in order to encourage high-quality generated images. The image reconstruction losses can be represented as

$$\mathcal{L}_{cyc}^{A\to B\to A} = \mathbb{E}\left[\lVert x_{A\to B\to A} - x_A \rVert_1\right], \qquad \mathcal{L}_{i}^{A\to A} = \mathbb{E}\left[\lVert x_{A\to A} - x_A \rVert_1\right]. \tag{3}$$

We also use latent-level reconstruction losses that encourage the networks to impose style information while maintaining the original content during the forwarding phase. First, the style reconstruction loss is computed between the style features $(z_{A\to B}^s, z_B^s)$, which makes our networks properly reflect the style because $z_{A\to B}^s$ is constrained to be equivalent to $z_B^s$. Second, the content reconstruction loss is computed between $(z_A^c, z_{A\to B}^c)$, and this encourages the networks to maintain the original content $z_A^c$ after performing a translation. These two losses can be formulated as

$$\mathcal{L}_{s}^{A\to B} = \mathbb{E}\left[\lVert E_B^s(x_{A\to B}) - E_B^s(x_B) \rVert_1\right], \qquad \mathcal{L}_{c}^{A\to B} = \mathbb{E}\left[\lVert E_B^c(x_{A\to B}) - E_A^c(x_A) \rVert_1\right]. \tag{4}$$

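Lastly, the adversarial loss Goodfellow et al. (2014) is used to minimize the distance between the distribution of the real images in a target domain and that of the generated images. For this purpose, we adopt LSGAN Mao et al. (2017) as our adversarial loss, i.e.,

$$\mathcal{L}_{D_{adv}}^{B} = \frac{1}{2}\mathbb{E}\left[(D(x_B) - 1)^2\right] + \frac{1}{2}\mathbb{E}_{x_{A\to B}}\left[(D(x_{A\to B}))^2\right], \qquad \mathcal{L}_{G_{adv}}^{B} = \frac{1}{2}\mathbb{E}\left[(D(x_{A\to B}) - 1)^2\right]. \tag{5}$$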

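Note that our translation backbone is trained to translate in both directions, $(\mathcal{X}_A \to \mathcal{X}_B \to \mathcal{X}_A)$ and $(\mathcal{X}_B \to \mathcal{X}_A \to \mathcal{X}_B)$. Finally, our full loss is formulated as

$$\mathcal{L}_D = \mathcal{L}_{D_{adv}}^{A} + \mathcal{L}_{D_{adv}}^{B}, \tag{6}$$
$$\mathcal{L}_G = \mathcal{L}_{G_{adv}}^{A} + \mathcal{L}_{G_{adv}}^{B} + \lambda_{latent}(\mathcal{L}_{s} + \mathcal{L}_{c}) + \lambda_{pixel}(\mathcal{L}_{cyc} + \mathcal{L}_{i}^{A\to A} + \mathcal{L}_{i}^{B\to B}), \tag{7}$$

where each term without the domain notation is applied bidirectionally across the two domains, and we empirically set $\lambda_{latent}=1$ and $\lambda_{pixel}=10$.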
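As a rough illustration of how the terms in Eqs. (3)-(7) combine, the sketch below evaluates the LSGAN terms and the weighted generator objective on toy numbers. The function names and the toy loss values are hypothetical; the L1 terms are approximated by a mean absolute difference, which is an assumption about normalization that the text does not specify.

```python
import numpy as np

LAMBDA_LATENT = 1.0  # value reported in the text
LAMBDA_PIXEL = 10.0  # value reported in the text

def l1(a, b):
    """Mean absolute difference, standing in for the expected L1 norm."""
    return float(np.mean(np.abs(a - b)))

def lsgan_d_loss(d_real, d_fake):
    """Discriminator term of Eq. (5): real scores pushed to 1, fake scores to 0."""
    return 0.5 * float(np.mean((d_real - 1.0) ** 2)) + 0.5 * float(np.mean(d_fake ** 2))

def lsgan_g_loss(d_fake):
    """Generator term of Eq. (5): fake scores pushed to 1."""
    return 0.5 * float(np.mean((d_fake - 1.0) ** 2))

def generator_objective(adv, style, content, cyc, identity):
    """Eq. (7). Each argument is a list with one loss value per translation direction."""
    return (sum(adv)
            + LAMBDA_LATENT * (sum(style) + sum(content))
            + LAMBDA_PIXEL * (sum(cyc) + sum(identity)))

# Toy values for the two directions (A->B and B->A); not measured results.
total = generator_objective(
    adv=[0.1, 0.2], style=[0.3, 0.3], content=[0.1, 0.1],
    cyc=[0.05, 0.05], identity=[0.02, 0.02],
)
print(round(total, 6))
```

Because $\lambda_{pixel}$ is ten times $\lambda_{latent}$, the pixel-level reconstruction terms dominate the objective whenever the raw loss values are of similar magnitude.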

Figure 3: Overview of the style branch. The first step of the style branch is (a) the local standardization step that makes each local patch of the input activation map have a zero mean and a unit variance, i.e., neutralizes the original style. The second step is (b) the style injection into the standardized local patch by applying dynamically determined convolution filters. Detailed descriptions are found in Section 3.2.2.

3.2 Adaptive convolution-based normalization (AdaCoN)

The goal of AdaCoN is to produce an output feature $z_{out}$ that reflects the style of $z^s$ while maintaining the identity of $z^c$. The combined feature $z_{out}$ is used as the input to a decoder to generate a translated image. Note that we omit the domain notation in this section for brevity.

3.2.1 Basic components

As illustrated in Fig. 2 (h), AdaCoN is composed of a style branch (h1) that reflects the style and a content branch (h2) that aims to maintain the content identity. Given the content $z^c$ and the style $z^s$, the style branch learns to inject the style into the content. On the other hand, the content branch learns to keep the essential information of the given $z^c$, so that the output of AdaCoN can maintain its original identity. Lastly, in the joining step (h3), the outputs of the two branches are concatenated and forwarded into a subsequent convolution layer. Note that an additional analysis of this structure is provided in the Appendix.

3.2.2 Style branch

Standardization function.

$g_{AdaCoN}$ normalizes the content feature $z^c$ before the adaptive convolution is applied. Specifically, we compute the statistics of $z^c$ from each channel-wise local patch of size $k_H \times k_W$, where $k_H$ and $k_W$ are the kernel height and the kernel width, respectively. We use $g_{AdaCoN}$ because locally computed statistics can be more effective in normalizing a given feature than globally computed ones. Our standardization is formulated as

$$\bar{z}^c = g_{AdaCoN}(z^c) = \frac{\phi(z^c) - \mu_{k_H,k_W}(\phi(z^c))}{\sigma_{k_H,k_W}(\phi(z^c))}, \tag{8}$$

where $\phi$ denotes an unfolding operation that gathers every patch of $z^c$ into one tensor. Fig. 3(a) describes the procedure concretely. (a1) Given $z^c \in \mathbb{R}^{C\times H\times W}$, (a2) $\phi$ extracts each sliding local block of size $C\times k_H\times k_W$ from the zero-padded $z^c$, and the extracted blocks are gathered into one tensor in $\mathbb{R}^{H\times W\times C\times k_H\times k_W}$. (a3) In order to perform the standardization, we compute the mean and the standard deviation along the $k_H\times k_W$ dimensions. (a4) We then normalize the content feature using these local channel-wise statistics. That is, $g_{AdaCoN}$ performs a local normalization using the statistics of each local patch. Note that the $H$ and $W$ dimensions of $\phi(z^c)$ indicate the spatial coordinate from which a local patch is extracted, such that the number of patches is equal to $H\times W$. (a5) Finally, we obtain the patch-wise normalized feature in $\mathbb{R}^{Ck_Hk_W\times H\times W}$.
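Steps (a1)-(a5) of the local standardization in Eq. (8) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `unfold` and `local_standardize` are hypothetical, odd kernel sizes with "same" zero padding are assumed, and the small `eps` added to the standard deviation is an assumption for numerical safety that does not appear in Eq. (8).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def unfold(z, k_h, k_w):
    """phi: zero-pad z (C, H, W) and gather the k_h x k_w patch around each position.

    Returns an array of shape (H, W, C, k_h, k_w). Odd kernel sizes are assumed.
    """
    pad_h, pad_w = k_h // 2, k_w // 2
    padded = np.pad(z, ((0, 0), (pad_h, pad_h), (pad_w, pad_w)))
    # Window axes are appended at the end: (C, H, W, k_h, k_w).
    windows = sliding_window_view(padded, (k_h, k_w), axis=(1, 2))
    return windows.transpose(1, 2, 0, 3, 4)

def local_standardize(z, k_h=3, k_w=3, eps=1e-5):
    """g_AdaCoN: standardize every channel-wise local patch, as in Eq. (8)."""
    patches = unfold(z, k_h, k_w)                      # (H, W, C, k_h, k_w)
    mu = patches.mean(axis=(3, 4), keepdims=True)      # per patch and channel
    sigma = patches.std(axis=(3, 4), keepdims=True)
    normalized = (patches - mu) / (sigma + eps)
    h, w = normalized.shape[:2]
    # (H, W, C*k_h*k_w) -> (C*k_h*k_w, H, W)
    return normalized.reshape(h, w, -1).transpose(2, 0, 1)

rng = np.random.default_rng(0)
z_c = rng.normal(size=(2, 6, 7))   # toy content feature with C=2, H=6, W=7
z_bar = local_standardize(z_c, 3, 3)
print(z_bar.shape)                  # (18, 6, 7)
```

Each spatial position of the output holds one standardized patch per channel, so a subsequent 1×1 operation over the $Ck_Hk_W$ axis is equivalent to a $k_H\times k_W$ convolution applied to locally normalized inputs.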

Adaptive convolution layer.

$f_{AdaCoN}$ takes $z^s$ and $z^c$ as inputs and generates a stylized feature $z^{cs}$ as output. Specifically, as illustrated in Fig. 3 (b1), $f_{AdaCoN}$ first takes the style feature $z^s \in \mathbb{R}^{C\times k_H\times k_W}$ as an input and (b2) encodes $z^s$ into the convolution weights $\hat{z}^s \in \mathbb{R}^{C\times O\times k_H\times k_W}$, where $O$ is the number of output channels. Lastly, after unfolding the weights to the dimensions of $Ck_Hk_W\times 1\times 1$, we apply them in the form of a convolution operation and obtain the stylized feature $z^{cs}$ (b3-b5). The adaptive convolution is formulated as

$$\hat{z}^s = \psi(z^s), \qquad f_{AdaCoN}(\bar{z}^c, \hat{z}^s) = \sum_{k=1}^{k_H}\sum_{l=1}^{k_W}\left[\bar{z}^c(i+k-1,\, j+l-1)\,\hat{z}^s(k, l, n)\right], \tag{9}$$
$$\text{for } i = 1, \dots, H,\quad j = 1, \dots, W,\quad \text{and } n = 1, \dots, O,$$

where $\psi$ represents a function that learns to properly encode a given style $z^s$ as the convolution weights $\hat{z}^s$ of $f_{AdaCoN}$. $i$ and $j$ indicate the horizontal and the vertical coordinates, respectively, and $H$ and $W$ are the height and the width of $\bar{z}^c$, respectively. Lastly, we add the mean of the style, $\mu_s$, to the stylized feature $z^{cs}$, which can be viewed as a bias in the convolution operation.
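The adaptive convolution of Eq. (9) can be sketched as a contraction between the standardized patches and the style-derived weights. The sketch below makes several assumptions that the text does not settle: `psi` is a hypothetical stand-in (a fixed random linear map) for the learned encoder $\psi$, the sum is also taken over the $C$ input channels (Eq. (9) writes only the spatial sums), and the style mean $\mu_s$ is treated as a single scalar bias. The input `z_bar` is random here; in practice it would be the locally standardized content feature.

```python
import numpy as np

def psi(z_s, out_channels, rng):
    """Hypothetical stand-in for the learned encoder psi.

    Maps a style feature (C, k_h, k_w) to weights (C, O, k_h, k_w)
    with a fixed random linear map; the real psi is trained.
    """
    c = z_s.shape[0]
    proj = rng.normal(scale=0.1, size=(out_channels, c))
    return np.einsum("oc,ckl->cokl", proj, z_s)

def adaptive_conv(z_bar, z_hat, bias=0.0):
    """f_AdaCoN applied to unfolded input.

    z_bar: standardized patches, shape (C*k_h*k_w, H, W), channel-major layout.
    z_hat: style-derived weights, shape (C, O, k_h, k_w).
    Returns the stylized feature of shape (O, H, W).
    """
    c, _, k_h, k_w = z_hat.shape
    _, h, w = z_bar.shape
    patches = z_bar.reshape(c, k_h, k_w, h, w)
    # Sum over input channels and kernel positions for every output channel.
    return np.einsum("cklhw,cokl->ohw", patches, z_hat) + bias

rng = np.random.default_rng(0)
C, O, K, H, W = 3, 4, 3, 5, 6                 # toy sizes for illustration
z_bar = rng.normal(size=(C * K * K, H, W))    # placeholder for standardized patches
z_s = rng.normal(size=(C, K, K))              # toy style feature
z_hat = psi(z_s, O, rng)
z_cs = adaptive_conv(z_bar, z_hat, bias=z_s.mean())
print(z_cs.shape)                             # (4, 5, 6)
```

Since the weights are recomputed from every exemplar, two different style features applied to the same content yield different filters, which is what distinguishes this layer from a standard convolution with fixed trained weights.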

4 Experiments

This section describes the dataset and the baseline models we used for the experiments in Section 4.1. Subsequently, we discuss the comparison results with the baselines in Section 4.2. Lastly, we analyze our proposed method in detail in Section 4.3.

4.1 Experimental settings

Dataset

We conduct evaluations on diverse datasets. First, we use the CelebA dataset Liu et al. (2015), a widely used facial dataset involving multiple attributes. In order to construct a dataset with a large domain gap, we combine several attributes to form new domain pairs, such as (Male, Non-Bangs, Non-Smiling → Female, Bangs, Smiling). Second, we use the BAM dataset Wilber et al. (2017), composed of numerous artworks labeled with their artistic styles, such as watercolor and vector-graphic. We use Watercolor → Pen, Vector → Pen, and Oil → Pen in order to demonstrate that AdaCoN can perform image translation with a substantial domain difference. Finally, the Edges → Handbag Zhu et al. (2017) and Summer → Winter Isola et al. (2017) datasets are used to confirm the wide applicability of AdaCoN in diverse image translation tasks. We set the image size to 256×256 in all the experiments.

Baseline methods

We compare our proposed method with AdaIN Huang and Belongie (2017), as used in MUNIT, and GDWCT Cho et al. (2019). The main difference among them lies in the specific method of combining the content feature with the style feature. As for our own settings, we explore various hyperparameter values for AdaCoN, namely the kernel size {3, 7, 11} and the style dimension {8, 64, 128}. We empirically set the kernel size to 3 and the style dimension to 128. The results for those hyperparameters are reported in Section 4.3.

Training details

For training the models, we use the Adam optimizer (Kingma and Ba, 2015) with β1=0.5 and β2=0.999. We adopt the initialization method of He et al. (2015) for our models. We set the batch size to one and the learning rate to 0.0001. We decay the learning rate by half every 50,000 iterations, starting from 200,000 iterations. Every model used in the experiments is trained for 500,000 iterations on an NVIDIA TITAN Xp GPU, which takes 90 hours.
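The learning-rate schedule described above can be written as a small function. This is a sketch under one assumption that the text leaves open: the first halving is applied exactly at iteration 200,000 (rather than 50,000 iterations later). The function name `learning_rate` and its parameter names are hypothetical.

```python
def learning_rate(iteration, base_lr=1e-4, decay_start=200_000, decay_every=50_000):
    """Step schedule: constant until decay_start, then halved every decay_every iterations.

    Assumes the first halving happens at iteration == decay_start.
    """
    if iteration < decay_start:
        return base_lr
    num_halvings = (iteration - decay_start) // decay_every + 1
    return base_lr * 0.5 ** num_halvings

for it in (0, 199_999, 200_000, 250_000, 499_999):
    print(it, learning_rate(it))
```

Under this assumption the rate is halved six times over the 500,000 training iterations, ending at 1/64 of its initial value.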

Evaluation metric

In order to evaluate the methods, we measure the classification accuracy as well as the content distance using a pretrained Inception-v3 model Szegedy et al. (2016). To be specific, the content distance is the L2 distance between the features of the input images and those of the translated ones, taken from an intermediate layer of Inception-v3. A lower content distance indicates that the gap between them is relatively small. On the other hand, style injection is evaluated by the classification accuracy. This is because a well-trained image translation model can transform the domain of an input image, so that a higher classification accuracy shows that the translation model successfully generates the prominent characteristics of the target domain. For the classification model, we take the pretrained Inception-v3 and fine-tune it on the CelebA dataset Liu et al. (2015). To evaluate the performance on the multi-attribute translation tasks, we train the classifiers on a multi-label dataset.
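The two metrics can be sketched as follows, assuming the Inception-v3 features and the attribute predictions have already been computed. The function names are hypothetical, averaging the per-image L2 distances is an assumption (the text does not state how distances are aggregated or normalized), and the all-attributes-correct rule follows the description in Section 4.2.1.

```python
import numpy as np

def content_distance(feat_input, feat_translated):
    """Mean L2 distance between per-image feature vectors; both arrays are (N, D)."""
    return float(np.mean(np.linalg.norm(feat_input - feat_translated, axis=1)))

def overall_accuracy(pred_attrs, target_attrs):
    """Percentage of images whose predicted attributes all match the target attributes.

    Both arrays are (N, num_attributes) with binary entries.
    """
    all_correct = np.all(pred_attrs == target_attrs, axis=1)
    return 100.0 * float(np.mean(all_correct))

# Toy example with two images; these numbers are not experimental results.
feats_in = np.zeros((2, 2))
feats_out = np.array([[3.0, 4.0], [0.0, 0.0]])
print(content_distance(feats_in, feats_out))   # 2.5

pred = np.array([[1, 0], [1, 1]])
target = np.array([[1, 0], [1, 0]])
print(overall_accuracy(pred, target))          # 50.0
```

Requiring every attribute to be correct makes the accuracy stricter as the number of target attributes grows, which is worth keeping in mind when comparing single-attribute and multi-attribute columns.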

4.2 Baseline comparison

This section reports the comparison of AdaCoN with the baseline methods. Quantitative results using the classification accuracy and the content distance are described in Section 4.2.1, and the qualitative results on the CelebA dataset Liu et al. (2015) are reported in Section 4.2.2.

Method   G1→G2       G2→G1       D1→D2       D2→D1       T1→T2       T2→T1       Z1→Z2       Z2→Z1
AdaIN    0.173/89.5  0.179/88.9  0.162/53.8  0.166/58.5  0.196/67.5  0.195/51.6  0.192/33.8  0.191/82.9
GDWCT    0.174/90.6  0.190/90.4  0.173/52.3  0.175/64.4  0.202/64.9  0.200/47.9  0.197/30.5  0.199/85.6
AdaCoN   0.186/91.6  0.184/90.0  0.202/62.3  0.202/66.5  0.193/67.7  0.197/57.5  0.199/36.5  0.201/86.7

Table 1: Content loss and overall classification results (%). We calculate the metrics in both directions on the CelebA dataset Liu et al. (2015). Each cell reports the content loss and the overall classification accuracy, respectively. Abbreviations: G1 (Male), G2 (Female), D1 (Young, Non-Smiling), D2 (Old, Smiling), T1 (Non-Bald, Young, Eyeglasses), T2 (Bald, Old, Non-Eyeglasses), Z1 (Male, Non-Bangs, Non-Smiling), Z2 (Female, Bangs, Smiling).

Figure 4: Comparisons with baselines; (a): G1→G2, (b): D1→D2, (c): T1→T2, (d): Z1→Z2

4.2.1 Quantitative comparison

The classification accuracy increases only when a translated output is correctly classified on every target attribute. As shown in Table 1, our model achieves a higher classification accuracy than the baselines in seven of the eight translation directions; the exception is (G2→G1), where GDWCT is higher by 0.4 points. Moreover, the gap between AdaCoN and the baselines tends to be larger in the multi-attribute translation tasks than in the single-attribute one. We believe this is because multi-attribute translation demands more substantial style injection than single-attribute translation. For example, in the case of (Z1→Z2), the translation networks must change the regions of the masculine characteristics, the hair, and the mouth in order to match the target attributes. In contrast, (G1→G2) requires changing only the regions of the masculine characteristics, so the amount of change the task demands is relatively small. As for the content distance, AdaCoN obtains the highest (i.e., worst) value in most translation cases, but the differences from the baselines are small (at most 0.04), which suggests that AdaCoN still largely maintains the content identity. Considering that our objective is a strong reflection of the style, we regard the loss of a small amount of content information as tolerable.
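The size of the accuracy gap can be read directly off Table 1. The short script below copies the accuracy values from the table and computes, for each direction, the margin of AdaCoN over the better of the two baselines; grouping the G columns as "single-attribute" and the D, T, and Z columns as "multi-attribute" follows the attribute lists in the table caption. The script is only an arithmetic aid and adds no new measurements.

```python
# Overall classification accuracy (%) copied from Table 1.
accuracy = {
    "G1->G2": {"AdaIN": 89.5, "GDWCT": 90.6, "AdaCoN": 91.6},
    "G2->G1": {"AdaIN": 88.9, "GDWCT": 90.4, "AdaCoN": 90.0},
    "D1->D2": {"AdaIN": 53.8, "GDWCT": 52.3, "AdaCoN": 62.3},
    "D2->D1": {"AdaIN": 58.5, "GDWCT": 64.4, "AdaCoN": 66.5},
    "T1->T2": {"AdaIN": 67.5, "GDWCT": 64.9, "AdaCoN": 67.7},
    "T2->T1": {"AdaIN": 51.6, "GDWCT": 47.9, "AdaCoN": 57.5},
    "Z1->Z2": {"AdaIN": 33.8, "GDWCT": 30.5, "AdaCoN": 36.5},
    "Z2->Z1": {"AdaIN": 82.9, "GDWCT": 85.6, "AdaCoN": 86.7},
}

def gap_over_best_baseline(row):
    """AdaCoN accuracy minus the higher of the two baseline accuracies."""
    return row["AdaCoN"] - max(row["AdaIN"], row["GDWCT"])

gaps = {task: gap_over_best_baseline(row) for task, row in accuracy.items()}
single = [g for task, g in gaps.items() if task.startswith("G")]
multi = [g for task, g in gaps.items() if not task.startswith("G")]
mean_single = sum(single) / len(single)
mean_multi = sum(multi) / len(multi)

print(f"single-attribute mean gap: {mean_single:.2f}")  # 0.30
print(f"multi-attribute mean gap:  {mean_multi:.2f}")   # 3.42
```

The per-direction margins range from -0.4 (G2→G1) to 8.5 (D1→D2) points, so the trend holds on average but not uniformly; for instance, the margin on T1→T2 is only 0.2 points.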

4.2.2 Qualitative comparison

Fig. 4 compares AdaCoN with the baselines on various attribute translation cases. The results show that AdaCoN reflects the style more strongly than the baselines. For example, in case (c) in the left macro column, whose target attributes are (Bald, Non-Eyeglasses, Old), AdaCoN applies the style of the exemplar to a considerable degree, so that its result depicts a bald, old man without eyeglasses. In contrast, both AdaIN and GDWCT keep the hair even though the style of the exemplar includes the bald attribute. Case (a) in the right macro column, whose target attribute is Male, shows how the amount of style reflection differs among the methods. Specifically, in order to transfer the male style, every method removes the make-up. Furthermore, AdaIN adds a beard while keeping the hair long, and GDWCT removes the hair region only incompletely, whereas AdaCoN clearly removes it. Since long hair is a dominant characteristic of the woman domain, the short hair in the output of AdaCoN supports its superior performance in style reflection.

4.3 Additional analysis

Figure 5: Kernel size comparison and justification of standardization function. We perform experiments in order to explore the effects of the kernel size of AdaCoN and justify our standardization function. We use (Oil → Pen) and (Watercolor → Pen) of the BAM dataset Wilber et al. (2017) in (a) and (b), respectively.

Effects of kernel size.

As shown in Fig. 5(a), the kernel size is related to spatial awareness. In the first row, the color of the hair on the chest of the woman in the content image differs from the color of the rest of her hair. Because a small receptive field is disadvantageous for recognizing the wide hair region, K3 (kernel size 3) fails to generate the hair on the chest naturally. On the other hand, K11 (kernel size 11) generates the hair region better because it has a larger receptive field. Furthermore, we observe that a larger kernel size leads to a larger amount of style reflection. For instance, the results of K11 reflect the style more strongly than those of K3, so that they distort the eyes and the mouth of the content in the first row and show a more conspicuous texture in the second row.

Effects of standardization function.

Fig. 5(b) shows the effects of the standardization function of AdaCoN. AdaCoN-g denotes the results from a model trained without the standardization function $g_{AdaCoN}$. As shown in both rows, $g_{AdaCoN}$ plays an essential role in injecting a style, because the model trained without it fails to perform the translation. We attribute this to a conflict between the style features of the content (input) image and those of the style (exemplar) image. Specifically, the input image carries both content and style features, so that if its style feature is not removed by $g_{AdaCoN}$, it interferes with the style feature extracted from the exemplar and degrades the style reflection performance. Consequently, the results indicate that our proposed standardization function based on local normalization is essential in AdaCoN.

Figure 6: Effects of style dimension and results from diverse dataset. (a) performs the translation of (Male → Female). (b) is conducted with (b1): Pen → Watercolor, (b3): Winter → Summer, (b4): Oil → Pen, (b5): Summer → Winter, (b2, b6): Male, Smile, Straight-Hair, Big-Nose → Female, Non-smile, Wavy-Hair, Small-Nose, individually.

Effects of style dimension (O) and results on diverse dataset.

We compare the effects of $O$, the number of channels of $z^{cs}$. As discussed in the Appendix, $O$ determines the extent to which the style is reflected in the output of AdaCoN, $z_{out}$. As illustrated in Fig. 6(a), the results indicate that the amount of style reflection is directly affected by $O$. For instance, (a1) shows that the hair region is clearly removed with $O=128$, while it is largely kept with $O=8$. We further observe that a beard, another dominant characteristic of the man domain, is nevertheless transferred with $O=8$. This shows that a low dimension of $z^{cs}$ tends to translate the domain with the minimum change. That is, the size of $O$ is positively correlated with the amount of style reflection, so it can be used to control the extent of the style reflection. Meanwhile, in order to verify that AdaCoN can be applied widely and robustly across diverse datasets, we conduct the experiment in Fig. 6(b). The results consistently show that AdaCoN can translate a given image with a rich style.

5 Conclusion

In this paper, we proposed a novel normalization method that can strongly inject the style of a given exemplar in image translation. AdaCoN locally standardizes the content representation in order to properly reflect the given style, and then applies to the standardized feature an adaptive convolution layer whose weights are dynamically extracted from the style encoding. Our experiments verify the superior performance of AdaCoN in drastic style injection. We believe AdaCoN can be useful in diverse challenging image translation tasks that have a large gap between the source and the target domains, such as multi-attribute translation. Finally, our normalization technique could potentially be used to incorporate additional information in various other tasks, such as object detection and semantic segmentation.

References

  • [1] H. Chang, J. Lu, F. Yu, and A. Finkelstein (2018) PairedCycleGAN: asymmetric style transfer for applying and removing makeup. In CVPR.
  • [2] T. Q. Chen and M. Schmidt (2016) Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337.
  • [3] W. Cho, S. Choi, D. Park, I. Shin, and J. Choo (2019) Image-to-image translation via group-wise deep whitening-and-coloring transformation. In CVPR.
  • [4] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In CVPR.
  • [5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS.
  • [6] S. Gu, C. Chen, J. Liao, and L. Yuan (2018) Arbitrary style transfer with deep feature reshuffle. In CVPR.
  • [7] A. W. Harley, K. G. Derpanis, and I. Kokkinos (2017) Segmentation-aware convolutional networks using local attention masks. In ICCV.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In ICCV.
  • [9] X. Huang and S. J. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV.
  • [10] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In ECCV.
  • [11] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR.
  • [12] K. Jarrett, K. Kavukcuoglu, Y. LeCun, et al. (2009) What is the best multi-stage architecture for object recognition? In ICCV.
  • [13] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool (2016) Dynamic filter networks. In NIPS.
  • [14] D. Kang, D. Dhar, and A. Chan (2017) Incorporating side information by adaptive convolution. In NIPS.
  • [15] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks.
  • [16] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim (2017) Learning to discover cross-domain relations with generative adversarial networks. In ICML.
  • [17] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR.
  • [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In NIPS.
  • [19] H. Lee, H. Tseng, J. Huang, M. Singh, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. In ECCV.
  • [20] J. Lin, Y. Xia, T. Qin, Z. Chen, and T. Liu (2018) Conditional image-to-image translation. In CVPR.
  • [21] M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. In NIPS.
  • [22] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In ICCV.
  • [23] L. Ma, X. Jia, S. Georgoulis, T. Tuytelaars, and L. V. Gool (2019) Exemplar guided unsupervised image-to-image translation with semantic consistency. In ICLR.
  • [24] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley (2017) Least squares generative adversarial networks. In ICCV.
  • [25] S. Mo, M. Cho, and J. Shin (2018) InstaGAN: instance-aware image-to-image translation. arXiv preprint arXiv:1812.10889.
  • [26] H. Su, V. Jampani, D. Sun, O. Gallo, E. Learned-Miller, and J. Kautz (2019) Pixel-adaptive convolutional neural networks. arXiv preprint arXiv:1904.05373.
  • [27] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In CVPR, Cited by: §4.1.
  • [28] M. J. Wilber, C. Fang, H. Jin, A. Hertzmann, J. Collomosse, and S. Belongie (2017) BAM! the behance artistic media dataset for recognition beyond photography. In ICCV, Cited by: Figure 5, §4.1.
  • [29] T. Xiao, J. Hong, and J. Ma (2018) ELEGANT: exchanging latent encodings with gan for transferring multiple face attributes. In ECCV, Cited by: §1.
  • [30] T. Xue, J. Wu, K. Bouman, and B. Freeman (2016) Visual dynamics: probabilistic future frame synthesis via cross convolutional networks. In NIPS, Cited by: §2.
  • [31] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. Cited by: §1, §2, §3.1, §4.1.

6 Appendix

6.1 Analysis on existing methods

In order to better understand the existing methods, this section reviews their principal operations and compares them.

6.1.1 Review on baselines

Previous methods typically consist of two steps: the first normalizes the content feature, and the second injects the style feature into the normalized content. We formulate this procedure as $f(g(c), s)$, where $g$ and $f$ denote the standardization and the style injection function, respectively. From this point of view, AdaIN can be written as

$$g_{\mathrm{AdaIN}}(c) = \frac{c - \mu_{H,W}(c)}{\sigma_{H,W}(c)}, \qquad f_{\mathrm{AdaIN}}(g(c), s) = \sigma_{H,W}(s)\, g(c) + \mu_{H,W}(s), \tag{10}$$

where $H$ and $W$ are the height and the width of an input feature. Each channel is normalized and combined independently. $\sigma_{H,W}$ and $\mu_{H,W}$ respectively denote the standard deviation and the mean computed along the $H$ and $W$ dimensions. In Eq. (10), the function $g$ normalizes an input content feature with its channel-wise mean and standard deviation. The function $f$ then transfers the mean $\mu_{H,W}(s)$ and the standard deviation $\sigma_{H,W}(s)$ of the style to the normalized content $g(c)$. Meanwhile, GDWCT can be represented as

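$$g_{\mathrm{GDWCT}}(c) = Q_c \Lambda_c^{-\frac{1}{2}} Q_c^{T} \left(c - \mu_{H,W}(c)\right), \qquad f_{\mathrm{GDWCT}}(g(c), s) = Q_s \Lambda_s^{\frac{1}{2}} Q_s^{T}\, g(c) + \mu_{H,W}(s), \tag{11}$$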

where the matrices $Q_c \Lambda_c Q_c^{T}$ and $Q_s \Lambda_s Q_s^{T}$ are obtained by the eigendecomposition of the channel covariance matrices of the content and the style features, respectively. $Q_c$ and $Q_s$ are square matrices whose columns are the eigenvectors, and $\Lambda_c$ and $\Lambda_s$ are diagonal matrices whose diagonal entries are the eigenvalues of the corresponding eigenvectors. In Eq. (11), the function $g$ plays a role similar to that in Eq. (10) but enforces a stricter rule: it normalizes not only the mean and the variance but also the covariance of an input feature, making its covariance matrix the identity matrix. The style injection function $f_{\mathrm{GDWCT}}$ then matches the first- and second-order statistics of the normalized content feature to those of the style feature.
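The two-step view $f(g(c), s)$ of Eqs. (10) and (11) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `adain`, `channel_eig`, and `whiten_and_color`, the `eps` stabilizers, and the single-feature (C, H, W) layout are choices made only for this example, and the whitening-and-coloring function applies the eigendecomposition of Eq. (11) directly, without any grouping of channels.

```python
import numpy as np


def adain(content, style, eps=1e-5):
    """Eq. (10) for features of shape (C, H, W); eps is an added stabilizer."""
    # g: standardize every content channel over its H and W dimensions.
    mu_c = content.mean(axis=(1, 2), keepdims=True)
    sigma_c = content.std(axis=(1, 2), keepdims=True)
    g_c = (content - mu_c) / (sigma_c + eps)
    # f: scale and shift each channel with the style statistics.
    mu_s = style.mean(axis=(1, 2), keepdims=True)
    sigma_s = style.std(axis=(1, 2), keepdims=True)
    return sigma_s * g_c + mu_s


def channel_eig(centered):
    """Eigendecomposition of the channel covariance of a centered (C, HW) matrix."""
    cov = centered @ centered.T / (centered.shape[1] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # cov is symmetric
    # Clip tiny negative eigenvalues caused by floating-point round-off.
    return eigvecs, np.clip(eigvals, 0.0, None)


def whiten_and_color(content, style, eps=1e-8):
    """Eq. (11) for features of shape (C, H, W); returns (g(c), f(g(c), s))."""
    C, H, W = content.shape
    c = content.reshape(C, -1)
    s = style.reshape(C, -1)
    mu_c = c.mean(axis=1, keepdims=True)
    mu_s = s.mean(axis=1, keepdims=True)
    q_c, lam_c = channel_eig(c - mu_c)
    q_s, lam_s = channel_eig(s - mu_s)
    # g: whiten, so that the channel covariance becomes the identity.
    g_c = q_c @ np.diag(1.0 / np.sqrt(lam_c + eps)) @ q_c.T @ (c - mu_c)
    # f: color with the style covariance, then add the style mean.
    out = q_s @ np.diag(np.sqrt(lam_s)) @ q_s.T @ g_c + mu_s
    return g_c.reshape(C, H, W), out.reshape(C, H, W)


rng = np.random.default_rng(0)
content = rng.normal(size=(4, 8, 8))
style = 2.0 * rng.normal(size=(4, 8, 8)) + 1.0

adain_out = adain(content, style)
whitened, wct_out = whiten_and_color(content, style)
```

In this sketch, `adain_out` matches only the per-channel mean and standard deviation of `style`, whereas `wct_out` also matches its full channel covariance, which is the stricter rule described above.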

6.1.2 Comparative analysis on baselines

The differences between the existing methods become clear when we regard them as special cases of the convolution operation. $f_{\mathrm{AdaIN}}$ in Eq. (10) can be represented as a $1 \times 1$ depth-wise convolution with a bias, since the adaptive parameters of $f_{\mathrm{AdaIN}}$ scale and shift each channel independently. Meanwhile, $f_{\mathrm{GDWCT}}$ in Eq. (11) can be viewed as a $1 \times 1$ convolution layer whose weights are $Q_s \Lambda_s^{\frac{1}{2}} Q_s^{T}$ and whose bias is $\mu_{H,W}(s)$. This is because multiplying a row vector of $Q_s \Lambda_s^{\frac{1}{2}} Q_s^{T} \in \mathbb{R}^{C \times C}$ by the matrix $g(c) \in \mathbb{R}^{C \times HW}$ generates a new row vector in $\mathbb{R}^{1 \times HW}$, which is identical to a $1 \times 1$ convolution with a single output channel. From this view, we can examine these style injection functions more closely. $f_{\mathrm{AdaIN}}$ can be expected to transfer the smallest amount of style because it injects the style into each channel separately, so its output stays relatively consistent with the content compared to other methods. On the other hand, $f_{\mathrm{GDWCT}}$ can be regarded as a stronger combining method than $f_{\mathrm{AdaIN}}$ because it generates each output channel as a linear combination of all content feature channels. Although GDWCT achieves more drastic style changes than AdaIN by mixing the channel information of the content, we claim that even more dramatic changes can be achieved if the spatial information is considered simultaneously. Hence, we propose an $n \times n$ adaptive convolution-based normalization whose weights are extracted from the style. We believe this can increase the capacity for transferring a given style.
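The convolutional view above can be checked numerically. The following NumPy sketch assumes a single feature of shape (C, H, W) and a helper `conv1x1` introduced only for this example; `g_c` is a random stand-in for an already normalized content feature. It shows that the AdaIN injection equals a $1 \times 1$ convolution with a diagonal weight matrix (the depth-wise case), while the coloring step of Eq. (11) equals a $1 \times 1$ convolution with a full $C \times C$ weight matrix.

```python
import numpy as np


def conv1x1(x, weight, bias):
    """1x1 convolution: x is (C, H, W), weight is (O, C), bias is (O,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]


rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
g_c = rng.normal(size=(C, H, W))            # stand-in for normalized content g(c)
style = 2.0 * rng.normal(size=(C, H, W)) + 1.0
mu_s = style.mean(axis=(1, 2))
sigma_s = style.std(axis=(1, 2))

# f_AdaIN: per-channel scale and shift, i.e. a diagonal (depth-wise) weight.
adain_direct = sigma_s[:, None, None] * g_c + mu_s[:, None, None]
adain_weight = np.diag(sigma_s)
adain_conv = conv1x1(g_c, adain_weight, mu_s)

# f_GDWCT: a full C x C coloring matrix mixes all channels.
s_centered = style.reshape(C, -1) - mu_s[:, None]
cov_s = s_centered @ s_centered.T / (H * W - 1)
lam_s, q_s = np.linalg.eigh(cov_s)
lam_s = np.clip(lam_s, 0.0, None)           # guard against round-off
coloring_weight = q_s @ np.diag(np.sqrt(lam_s)) @ q_s.T
coloring_direct = (coloring_weight @ g_c.reshape(C, -1)).reshape(C, H, W)
coloring_direct = coloring_direct + mu_s[:, None, None]
coloring_conv = conv1x1(g_c, coloring_weight, mu_s)
```

The only difference between the two cases in this sketch is the sparsity of the weight matrix: `adain_weight` has $C$ free entries, whereas `coloring_weight` has $C \times C$; an $n \times n$ kernel would add the spatial neighborhood on top of this.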

6.2 Discussion on branch-separation

Fully exploiting the adaptive convolution-based normalization at the intermediate layers may considerably distort the content information, because both the spatial and the channel information of the output features of AdaCoN differ entirely from those of the input features. Considering that one of the task objectives is to maintain the identity of the input, we posit that combining the adaptive convolution-based normalization with a general convolution layer is a reasonable choice for performing the translation. Moreover, by separating the branches, we can control the amount of style injection by changing $O$, the number of style dimensions: a small $O$ results in a weak injection of the style.
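One plausible reading of the branch separation is sketched below in NumPy. Everything specific here is an assumption made for illustration: the helper names `conv2d_same` and `branch_separated_layer`, the rule that the two branches are merged by channel concatenation, the omission of the standardization step, and the random tensors that stand in for learned weights and for weights predicted from a style encoding. The sketch only illustrates how $O$ bounds the number of output channels that the style can affect.

```python
import numpy as np


def conv2d_same(x, weight):
    """Stride-1, zero-padded convolution.

    x is (C, H, W); weight is (O, C, n, n) with odd n.
    """
    n = weight.shape[-1]
    pad = n // 2
    _, H, W = x.shape
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((weight.shape[0], H, W))
    for i in range(n):
        for j in range(n):
            # Each kernel offset is a 1x1 convolution of a shifted input.
            shifted = padded[:, i:i + H, j:j + W]
            out += np.einsum("oc,chw->ohw", weight[:, :, i, j], shifted)
    return out


def branch_separated_layer(x, static_weight, adaptive_weight):
    """Hypothetical mixing rule: concatenate static and style-driven channels."""
    static_out = conv2d_same(x, static_weight)      # (C - O, H, W)
    adaptive_out = conv2d_same(x, adaptive_weight)  # (O, H, W)
    return np.concatenate([static_out, adaptive_out], axis=0)


rng = np.random.default_rng(0)
C, H, W, n, O = 8, 6, 6, 3, 2
x = rng.normal(size=(C, H, W))
static_weight = rng.normal(size=(C - O, C, n, n))  # stand-in for learned weights
weight_a = rng.normal(size=(O, C, n, n))           # stand-in: predicted from style A
weight_b = rng.normal(size=(O, C, n, n))           # stand-in: predicted from style B

out_a = branch_separated_layer(x, static_weight, weight_a)
out_b = branch_separated_layer(x, static_weight, weight_b)
```

Under these assumptions, changing the style alters only the last $O$ output channels, so lowering $O$ leaves a larger share of the output style-independent.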

6.3 Additional results

Figs. 7 and 8 show additional results of AdaCoN on various image translation tasks.

Figure 7: Extra results of our model on CelebA dataset.
Figure 8: Extra results of our model on CelebA dataset.