Unpaired Image Translation via Adaptive Convolution-based Normalization
Abstract
Disentangling the content and style information of an image has played an important role in the recent success of image translation. In this setting, how to inject a given style into an input image that contains its own content is an important issue, but existing methods follow relatively simple approaches, leaving room for improvement, especially when incorporating significant style changes. In response, we propose an advanced normalization technique based on adaptive convolution (AdaCoN) in order to properly impose style information onto the content of an input image. In detail, after locally standardizing the content representation in a channel-wise manner, AdaCoN performs adaptive convolution where the convolution filter weights are dynamically estimated from the encoded style representation. The flexibility of AdaCoN allows it to handle complicated image translation tasks involving significant style changes. Our qualitative and quantitative experiments demonstrate the superiority of our proposed method over various existing approaches that inject the style into the content.
Wonwoong Cho†, Kangyeol Kim†, Eungyeup Kim, Hyunwoo J. Kim, Jaegul Choo
Korea University
{tyflehd21,kky1994,yhy1254,hyunwoojkim,jchoo}@korea.ac.kr
† These authors contributed equally.
1 Introduction
Recently, unpaired image-to-image translation Zhu et al. (2017); Kim et al. (2017); Choi et al. (2018) has been actively studied. It aims to learn inter-domain mappings without paired images, such that deep neural networks can translate a given image from one domain to another (e.g., real photo $\Rightarrow $ artwork). However, these methods bear a fundamental limitation: they generate a unimodal output for a single input image even when multiple diverse outputs may exist. In response, several approaches Lin et al. (2018); Huang et al. (2018); Lee et al. (2018); Xiao et al. (2018); Ma et al. (2019); Chang et al. (2018) have been proposed to achieve multimodality, i.e., the capability of generating multiple outputs from a single input image, by taking an additional input such as an exemplar image that conveys the detailed style information to transfer.
Although exemplar-based image translation achieves multimodal outputs owing to its flexibility in reflecting an exemplar image that gives fine details of the intended style, there still remains the issue of how to properly impose the style feature extracted from the exemplar image onto a content image. Previous approaches Lee et al. (2018); Huang and Belongie (2017); Cho et al. (2019) commonly follow two steps: first standardizing the features and then applying a particular transformation. The first step can be regarded as removing the existing style information of the input image, and the second step imposes the exemplar style on the style-neutralized input feature.
As one of the state-of-the-art methods, adaptive instance normalization (AdaIN) Huang and Belongie (2017) has been successfully utilized to combine content and style in a number of studies Huang et al. (2018); Ma et al. (2019); Karras et al. (2019). AdaIN incorporates different features by matching each channel’s first-order statistics, e.g., the mean and the variance, in the content to those in the style. To this end, AdaIN first standardizes each channel of the content feature and then adaptively performs channel-wise scaling and shifting using parameters regressed from the style feature. Another recently proposed method, the group-wise deep whitening-and-coloring transformation (GDWCT) Cho et al. (2019), has shown a superior capability of imposing drastically different styles by matching higher-order statistics such as the covariance, which we call the coloring transformation, in addition to the first-order ones.
We claim that, in the above methods, the second step of imposing the target statistics can be viewed as a simpler variant or a special case of a convolution operation, as illustrated in Fig. 1. That is, (a) the channel-wise affine transformation used in AdaIN can be viewed as a channel-wise $1\times 1$ convolution. (b) The coloring transformation of GDWCT, which matches the target covariance, can be considered a $1\times 1$ convolution that generates each output channel as a linear combination of all input channels. (An additional illustration can be found in the Appendix.) However, these methods tend to fail when a dramatic shape change is required, because their transformations have limited capacity to express significant transfiguration.
From this unified perspective of a convolution operation, the existing methods rely only on its simplest form, using $1\times 1$ convolution filters. The potential of leveraging general convolution operations with larger-than-$1\times 1$ filters when injecting a target style has therefore not yet been fully explored.
Inspired by this, we propose adaptive convolution-based normalization (AdaCoN) as an advanced method to inject the target style into a given image. AdaCoN is composed of two steps: standardization and adaptive convolution. First, the standardization is performed locally on each subregion of an input activation map where the convolution filter is applied, similar to previous work Jarrett et al. (2009); Krizhevsky et al. (2012). Second, AdaCoN performs adaptive convolution where the (larger-than-$1\times 1$) convolution filter weights are dynamically estimated from the encoded style representation.
Because a convolution operation takes spatial patterns into account, we hypothesize that AdaCoN is capable of flexibly performing a spatially-adaptive image translation, which can potentially handle complicated image translation tasks involving significant style changes. In this sense, AdaCoN has something in common with the recent success in patch-based style transfer Chen and Schmidt (2016); Gu et al. (2018), which dynamically applies different styles to each patch of an input image.
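The claim that both existing transformations are special cases of a $1\times 1$ convolution can be checked numerically. The following is a minimal sketch in plain Python, not the authors' code: the names `conv1x1`, `channelwise_affine`, `gamma`, and `beta` are illustrative assumptions, and the feature map is flattened to C channels by N spatial positions.

```python
def conv1x1(x, weight, bias):
    """1x1 convolution: x holds C_in lists of N values, weight is C_out x C_in."""
    n = len(x[0])
    return [
        [sum(w * x[c][p] for c, w in enumerate(row)) + b for p in range(n)]
        for row, b in zip(weight, bias)
    ]

def channelwise_affine(x, gamma, beta):
    """AdaIN-style second step: scale and shift each channel independently."""
    return [[g * v + b for v in ch] for ch, g, b in zip(x, gamma, beta)]

# Toy content feature with 2 channels and 3 spatial positions.
x = [[-1.0, 0.0, 1.0], [1.0, -1.0, 0.0]]
gamma, beta = [2.0, 0.5], [0.1, -0.2]

# A diagonal weight matrix makes the 1x1 convolution channel-wise.
diag = [[gamma[0], 0.0], [0.0, gamma[1]]]
affine_out = channelwise_affine(x, gamma, beta)
conv_out = conv1x1(x, diag, beta)

# A dense matrix (as in a coloring transformation) mixes all input channels.
dense = [[2.0, 1.0], [0.5, 0.5]]
mixed_out = conv1x1(x, dense, [0.0, 0.0])
```

In this toy setting `affine_out` and `conv_out` agree, while `mixed_out` shows each output channel depending on every input channel. Neither form looks at neighboring spatial positions, which is the gap that larger filters address.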
In order to verify the superiority of AdaCoN, we conduct both quantitative and qualitative experiments that compare different normalization methods while maintaining the same model architectures.
2 Related work
Unpaired image translation.
Unpaired image translation aims to transform an input image from one domain to another without paired images. Numerous approaches Zhu et al. (2017); Kim et al. (2017); Liu et al. (2017) have been proposed for this task. Recently, multimodal image translation methods, capable of yielding multiple different images from a particular input image, have also been studied Huang et al. (2018); Lee et al. (2018); Cho et al. (2019). These studies address the unimodality problem of previous methods in a similar way, by incorporating an exemplar image as guidance for image translation. In addition, they assume that a latent image space can be disentangled into a content space that contains the underlying structure of images and a style space that holds domain-specific features. However, they propose different methods for integrating the disentangled content feature of the input image and the style feature of the exemplar image. To be specific, inspired by AdaIN Huang and Belongie (2017), MUNIT Huang et al. (2018) adopts the idea of matching the statistics between the content and the style features. Extending this idea, GDWCT Cho et al. (2019) leverages higher-order statistics, enhancing the quality of the generated images. Meanwhile, DRIT Lee et al. (2018) simply concatenates the content and the style features to perform image translation. However, these methods have a limited capability to handle drastic changes between the domains. A recently proposed method called InstaGAN Mo et al. (2018) tackles this problem by taking a segmentation mask as an additional input, which serves as a strong hint for translation.
Adaptive convolution.
Unlike standard convolution layers, where the filter weights are trainable constants, an adaptive convolution layer uses varying filter weights that are dynamically determined by the input data. Based on this idea, dynamic filter networks Jia et al. (2016) take an auxiliary input image to determine the convolution filter weights in a video prediction task. Furthermore, Kang et al. (2017) showed that convolution filter weights derived from side information, such as the camera perspective or the noise level, can improve the performance of a classification task. Recent studies have applied adaptive convolution to a variety of tasks such as semantic segmentation Harley et al. (2017); Su et al. (2019) and motion prediction Xue et al. (2016). In this paper, we propose AdaCoN, which adaptively obtains the convolution weights used in convolution-based normalization for an image translation task.
3 Proposed Methods
In this section, we briefly describe our backbone networks for the image translation task and then describe our proposed method in detail.
3.1 Translation backbone
Networks overview.
Let ${x}_{A}$ and ${x}_{B}$ denote randomly sampled images from two different domains ${\mathcal{X}}_{A}$ and ${\mathcal{X}}_{B}$, respectively. Given the two images, our networks translate ${x}_{A}$ from domain ${\mathcal{X}}_{A}$ to domain ${\mathcal{X}}_{B}$ as well as ${x}_{B}$ from domain ${\mathcal{X}}_{B}$ to domain ${\mathcal{X}}_{A}$. To this end, we adopt the disentangling strategy Huang et al. (2018); Lee et al. (2018); Cho et al. (2019) that decomposes an image into a domain-invariant content feature (e.g., the identity of a person) and a domain-specific style feature (e.g., the hair length in the female domain). This can be formulated as
${z}_{A}^{c},{z}_{A}^{s}={E}_{A}^{c}({x}_{A}),{E}_{A}^{s}({x}_{A}),{z}_{B}^{c},{z}_{B}^{s}={E}_{B}^{c}({x}_{B}),{E}_{B}^{s}({x}_{B}),$  (1) 
where {${E}_{A}^{c}$, ${E}_{B}^{c}$} are content encoders and {${E}_{A}^{s}$, ${E}_{B}^{s}$} are style encoders. By combining the content and the style features of the different domains {$({z}_{B}^{c},{z}_{A}^{s})$, $({z}_{A}^{c},{z}_{B}^{s})$} and forwarding them to the decoders {${G}_{A},{G}_{B}$}, we obtain the translated results {${x}_{B\to A}$, ${x}_{A\to B}$}, i.e.,
${x}_{A\to B}={G}_{B}(\text{AdaCoN}({z}_{A}^{c},{z}_{B}^{s})),{x}_{B\to A}={G}_{A}(\text{AdaCoN}({z}_{B}^{c},{z}_{A}^{s})),$  (2) 
where AdaCoN indicates our adaptive convolution-based normalization that combines the given content and style features. As shown in Fig. 2, for example, let ${\mathcal{X}}_{A}$ and ${\mathcal{X}}_{B}$ be the woman and the man domains, respectively, and assume that our networks translate a woman into a man, ${x}_{A\to B}$. (a) We first extract the content feature ${z}_{A}^{c}$ from a woman image ${x}_{A}$ and the style feature ${z}_{B}^{s}$ from a man image ${x}_{B}$ by forwarding each image into the content encoder ${E}_{A}^{c}$ and the style encoder ${E}_{B}^{s}$. (b) We next inject the style into the content feature through AdaCoN and (c) forward the combined feature into the decoder ${G}_{B}$, consistent with Eq. (2). After obtaining a fake man image ${x}_{A\to B}$, (d) we feed the fake image to a discriminator ${D}_{B}$ that encourages the distribution of generated images to be close to that of real images. (e) Lastly, we repeat steps (a)-(c) in the reverse direction in order to obtain a reconstructed woman image ${x}_{A\to B\to A}$, which enables our networks to maintain the original identity. In this manner, our networks are trained to translate images between the two domains.
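The data flow of Eqs. (1) and (2) can be summarized as a short sketch. This is only an illustration under stated assumptions: the encoders, the AdaCoN module, and the decoder are passed in as plain callables, and the toy stand-ins below (`toy_content_encoder`, `toy_style_encoder`, `toy_adacon`, `toy_decoder`) are hypothetical placeholders rather than the networks used in the paper.

```python
def translate(x_src, x_ref, content_encoder, style_encoder, adacon, decoder):
    """Translate x_src into the domain of x_ref, following Eqs. (1)-(2)."""
    z_c = content_encoder(x_src)   # content code of the source image
    z_s = style_encoder(x_ref)     # style code of the exemplar image
    return decoder(adacon(z_c, z_s))

# Hypothetical stand-ins so the sketch runs end to end on flat lists.
def toy_content_encoder(x):
    mean = sum(x) / len(x)
    return [v - mean for v in x]   # keep structure, drop the global offset

def toy_style_encoder(x):
    return sum(x) / len(x)         # a single "style" number

def toy_adacon(z_c, z_s):
    return [v + z_s for v in z_c]  # placeholder for the real AdaCoN module

def toy_decoder(z):
    return list(z)

x_a = [1.0, 2.0, 3.0]
x_b = [10.0, 20.0, 30.0]
x_ab = translate(x_a, x_b, toy_content_encoder, toy_style_encoder,
                 toy_adacon, toy_decoder)
```

Swapping the roles of the two images and the domain-specific modules gives the reverse direction $B\to A$.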
Loss functions.
Our networks are trained with several losses, each of which plays a crucial role. In order to avoid redundancy, we focus on the translation $({\mathcal{X}}_{A}\to {\mathcal{X}}_{B}\to {\mathcal{X}}_{A})$ from this point on. First, we leverage pixel-level reconstruction losses, namely the cycle-consistency loss and the identity loss Zhu et al. (2017), in order to encourage high-quality generated images. The image reconstruction losses can be represented as
${\mathcal{L}}_{cyc}^{A\to B\to A}=\mathbb{E}\left[{\parallel {x}_{A\to B\to A}-{x}_{A}\parallel}_{1}\right],\quad {\mathcal{L}}_{i}^{A\to A}=\mathbb{E}\left[{\parallel {x}_{A\to A}-{x}_{A}\parallel}_{1}\right].$  (3) 
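As a concrete reading of Eq. (3), the sketch below evaluates both pixel-level terms for a single sample. It is an illustration only: images are flattened to lists, the expectation over the data is omitted, and the helper names `l1_distance` and `pixel_losses` are assumptions (many implementations also average over pixels rather than sum).

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equally sized flat images."""
    return sum(abs(p - q) for p, q in zip(a, b))

def pixel_losses(x_a, x_aba, x_aa):
    """Cycle-consistency and identity terms of Eq. (3) for one sample."""
    return l1_distance(x_aba, x_a), l1_distance(x_aa, x_a)

x_a = [0.0, 0.5, 1.0]       # toy input image, flattened
x_aba = [0.25, 0.5, 0.75]   # toy reconstruction after A -> B -> A
x_aa = [0.0, 0.5, 1.0]      # toy identity translation A -> A
cyc, idt = pixel_losses(x_a, x_aba, x_aa)
```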
We also use latent-level reconstruction losses that encourage the networks to impose the style information while maintaining the original content during the forward pass. First, the style reconstruction loss is computed between the style features $({z}_{A\to B}^{s},{z}_{B}^{s})$, which makes our networks properly reflect the style because ${z}_{A\to B}^{s}$ is constrained to be equivalent to ${z}_{B}^{s}$. Second, the content reconstruction loss is computed between $({z}_{A}^{c},{z}_{A\to B}^{c})$, which encourages the networks to maintain the original content ${z}_{A}^{c}$ after performing a translation. These two losses can be formulated as
${\mathcal{L}}_{s}^{A\to B}=\mathbb{E}[{\parallel {E}_{B}^{s}({x}_{A\to B})-{E}_{B}^{s}({x}_{B})\parallel}_{1}],\quad {\mathcal{L}}_{c}^{A\to B}=\mathbb{E}[{\parallel {E}_{B}^{c}({x}_{A\to B})-{E}_{A}^{c}({x}_{A})\parallel}_{1}].$  (4) 
Lastly, the adversarial loss Goodfellow et al. (2014) is used to minimize the distance of the two distributions of the real images in a target domain and the generated images. For this purpose, we exploit LSGAN Mao et al. (2017) as our adversarial loss, i.e.,
${\mathcal{L}}_{{D}_{adv}}^{B}=\frac{1}{2}\mathbb{E}[{(D({x}_{B})-1)}^{2}]+\frac{1}{2}{\mathbb{E}}_{{x}_{A\to B}}[{(D({x}_{A\to B}))}^{2}],\quad {\mathcal{L}}_{{G}_{adv}}^{B}=\frac{1}{2}\mathbb{E}[{(D({x}_{A\to B})-1)}^{2}].$  (5) 
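The least-squares objective of Eq. (5) can be sketched as follows, assuming the discriminator outputs are already available as lists of scalar scores. The function names `lsgan_d_loss` and `lsgan_g_loss` are illustrative, and the expectation is approximated by a mean over a toy batch.

```python
def mean(values):
    return sum(values) / len(values)

def lsgan_d_loss(d_real, d_fake):
    """Discriminator term: push real scores toward 1 and fake scores toward 0."""
    real_term = mean([(r - 1.0) ** 2 for r in d_real])
    fake_term = mean([f ** 2 for f in d_fake])
    return 0.5 * real_term + 0.5 * fake_term

def lsgan_g_loss(d_fake):
    """Generator term: push the scores of translated images toward 1."""
    return 0.5 * mean([(f - 1.0) ** 2 for f in d_fake])

d_real = [1.0, 0.5]   # toy scores D(x_B)
d_fake = [0.0, 0.5]   # toy scores D(x_{A->B})
d_loss = lsgan_d_loss(d_real, d_fake)
g_loss = lsgan_g_loss(d_fake)
```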
Note that our translation backbone is trained to translate in both directions of $({\mathcal{X}}_{A}\to {\mathcal{X}}_{B}\to {\mathcal{X}}_{A})$ and $({\mathcal{X}}_{B}\to {\mathcal{X}}_{A}\to {\mathcal{X}}_{B})$. Finally, our full loss is formulated as
${\mathcal{L}}_{D}={\mathcal{L}}_{{D}_{adv}}^{A}+{\mathcal{L}}_{{D}_{adv}}^{B},$  (6) 
${\mathcal{L}}_{G}={\mathcal{L}}_{{G}_{adv}}^{A}+{\mathcal{L}}_{{G}_{adv}}^{B}+{\lambda}_{latent}({\mathcal{L}}_{s}+{\mathcal{L}}_{c})+{\lambda}_{pixel}({\mathcal{L}}_{cyc}+{\mathcal{L}}_{i}^{A\to A}+{\mathcal{L}}_{i}^{B\to B}),$  (7) 
where each term without the domain notation is bidirectionally applied within two different domains, and we empirically set ${\lambda}_{latent}=1$ and ${\lambda}_{pixel}=10$.
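The weighting in Eq. (7) can be written out as a small helper. This is a sketch with hypothetical scalar inputs: `style`, `content`, and `cyc` are assumed to already contain the sum over both translation directions, and the loss values used below are made up for illustration.

```python
LAMBDA_LATENT = 1.0   # weight reported in the text for the latent-level terms
LAMBDA_PIXEL = 10.0   # weight reported in the text for the pixel-level terms

def generator_loss(adv_a, adv_b, style, content, cyc, idt_a, idt_b):
    """Combine the individual terms as in Eq. (7)."""
    latent = style + content
    pixel = cyc + idt_a + idt_b
    return adv_a + adv_b + LAMBDA_LATENT * latent + LAMBDA_PIXEL * pixel

total = generator_loss(adv_a=0.25, adv_b=0.25, style=0.5, content=0.5,
                       cyc=0.125, idt_a=0.0625, idt_b=0.0625)
```

With these weights, a unit of pixel-level error contributes ten times as much to the objective as a unit of latent-level error.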
3.2 Adaptive convolution-based normalization (AdaCoN)
The goal of AdaCoN is to produce an output feature ${z}_{out}$ that can reflect the style of ${z}_{s}$ while maintaining the identity of ${z}_{c}$. The combined feature ${z}_{out}$ is used as input to a decoder to generate a translated image. Note that we omit the domain notation in this section for brevity.
3.2.1 Basic components
As illustrated in Fig. 2 (h), AdaCoN is composed of a style branch (h1) that reflects the style and a content branch (h2) that aims to maintain the content identity. Given the content ${z}_{c}$ and the style ${z}_{s}$, the style branch learns to inject the style into the content. The content branch, on the other hand, learns to keep the essential information of the given ${z}_{c}$, so that the output of AdaCoN can maintain its original identity. Lastly, in the joining step (h3), the outputs of the two branches are concatenated and forwarded into a subsequent convolution layer. An additional analysis of this structure is provided in the Appendix.
3.2.2 Style branch
Standardization function.
${g}_{\text{AdaCoN}}$ normalizes the content feature ${z}_{c}$ before the adaptive convolution is applied. Specifically, we compute the statistics of ${z}_{c}$ from each channel-wise local patch of size $kH\times kW$, where $kH$ and $kW$ are the kernel height and the kernel width, respectively. We use ${g}_{\text{AdaCoN}}$ because locally computed statistics can be more effective in normalizing a given feature than globally computed ones. Our standardization is formulated as
${\overline{z}}_{c}={g}_{\text{AdaCoN}}({z}_{c})={\displaystyle \frac{\varphi ({z}_{c})-{\mu}_{kH,kW}(\varphi ({z}_{c}))}{{\sigma}_{kH,kW}(\varphi ({z}_{c}))}},$  (8) 
where $\varphi $ denotes an unfolding operation that gathers every patch of ${z}_{c}$ and combines them into one tensor. Fig. 3(a) describes the procedure concretely. (a1) Given ${z}_{c}\in {\mathbb{R}}^{C\times H\times W}$, (a2) $\varphi $ extracts each sliding local block in ${\mathbb{R}}^{C\times kH\times kW}$ from the zero-padded ${z}_{c}$, and the extracted blocks are combined into one tensor in ${\mathbb{R}}^{H\times W\times C\times kH\times kW}$. (a3) In order to perform the standardization, we compute the mean and the standard deviation along the $kH\times kW$ dimensions. (a4) We then normalize the content feature using these local channel-wise statistics. That is, ${g}_{\text{AdaCoN}}$ performs a local normalization using the statistics of each local patch. Note that the $H$ and $W$ dimensions of $\varphi ({z}_{c})$ indicate the spatial coordinate from which each local patch is extracted, so the number of patches is $H\times W$. (a5) Finally, we obtain the patch-wise normalized feature in ${\mathbb{R}}^{C\cdot kH\cdot kW\times H\times W}$.
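Steps (a1)-(a5) can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes a square, odd kernel size `k`, adds a small `eps` for numerical stability (Eq. (8) has no such term), and keeps the output as nested lists of shape C x H x W x (k*k) instead of the flattened layout described in (a5).

```python
import math

def unfold(channel, k):
    """Collect the zero-padded k x k patch around every position of an H x W map."""
    h, w, r = len(channel), len(channel[0]), k // 2

    def at(i, j):
        return channel[i][j] if 0 <= i < h and 0 <= j < w else 0.0

    return [
        [[at(i + di, j + dj) for di in range(-r, r + 1) for dj in range(-r, r + 1)]
         for j in range(w)]
        for i in range(h)
    ]

def standardize_patch(patch, eps=1e-5):
    """Subtract the patch mean and divide by the patch standard deviation."""
    mu = sum(patch) / len(patch)
    var = sum((v - mu) ** 2 for v in patch) / len(patch)
    sigma = math.sqrt(var + eps)   # eps is an added assumption
    return [(v - mu) / sigma for v in patch]

def local_standardize(z_c, k=3):
    """z_c: C x H x W nested lists -> C x H x W x (k*k) standardized patches."""
    return [
        [[standardize_patch(p) for p in row] for row in unfold(channel, k)]
        for channel in z_c
    ]

z_c = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]  # C=1, H=W=3
z_bar = local_standardize(z_c, k=3)
center = z_bar[0][1][1]   # standardized 3x3 patch around the middle position
```

Each patch is standardized independently and per channel, so two neighboring positions share raw values but not statistics.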
Adaptive convolution layer.
${f}_{\text{AdaCoN}}$ takes ${z}_{s}$ and ${z}_{c}$ as inputs and generates a stylized feature ${z}_{cs}$ as output. Specifically, as illustrated in Fig. 3 (b1), ${f}_{\text{AdaCoN}}$ first takes the style feature ${z}_{s}\in {\mathbb{R}}^{C\times kH\times kW}$ as an input and (b2) encodes ${z}_{s}$ into the convolution weights ${\widehat{z}}_{s}\in {\mathbb{R}}^{C\times O\times kH\times kW}$, where $O$ is the number of output channels. After unfolding the weights to the dimensions of ${\mathbb{R}}^{C\cdot kH\cdot kW\times 1\times 1}$, we apply them in the form of a convolution operation and obtain the stylized feature ${z}_{cs}$ (b3)-(b5). The adaptive convolution is formulated as
${\widehat{z}}_{s}=\psi ({z}_{s}),\quad {f}_{\text{AdaCoN}}({\overline{z}}_{c},{\widehat{z}}_{s})={\displaystyle \sum _{k=1}^{kH}}{\displaystyle \sum _{l=1}^{kW}}[{\overline{z}}_{c}(i+k-1,j+l-1)\,{\widehat{z}}_{s}(k,l,n)],$  (9) 
for $i=1,\dots ,H$, $j=1,\dots ,W$, and $n=1,\dots ,O$,
where $\psi $ represents a function that learns to encode a given style ${z}_{s}$ as the convolution weight ${\widehat{z}}_{s}$ of ${f}_{\text{AdaCoN}}$. $i$ and $j$ indicate the horizontal and the vertical coordinates, respectively, and $H$ and $W$ are the height and the width of ${\overline{z}}_{c}$, respectively. Lastly, we add the mean of the style ${\mu}_{s}$ to the stylized feature ${z}_{cs}$, which can be viewed as the bias of the convolution operation.
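The adaptive convolution of Eq. (9) can be sketched on pre-unfolded patches, where it reduces to a per-position dot product. Everything below is an illustrative assumption rather than the paper's implementation: `psi` is a hypothetical single linear layer, the style code is a flat vector, the contributions of the input channels are summed as in a standard convolution, and the toy patches are not actually standardized.

```python
def psi(z_s, proj):
    """Hypothetical style-to-weight encoder: one linear layer."""
    return [sum(p * s for p, s in zip(row, z_s)) for row in proj]

def reshape_weights(flat, n_out, n_in, taps):
    """Split a flat weight vector into n_out filters of n_in x taps values."""
    return [[flat[(n * n_in + c) * taps:(n * n_in + c + 1) * taps]
             for c in range(n_in)] for n in range(n_out)]

def adaptive_conv(patches, weights, bias):
    """patches: C x H x W x taps values; weights: O x C x taps; returns O x H x W."""
    h, w = len(patches[0]), len(patches[0][0])
    return [[[bias + sum(v * t
                         for c, taps in enumerate(filt)
                         for v, t in zip(patches[c][i][j], taps))
              for j in range(w)] for i in range(h)]
            for filt in weights]

# Toy setting: C=1 input channel, H=1, W=2, a 2x2 kernel (4 taps), O=2 outputs.
patches = [[[[1.0, -1.0, 0.5, -0.5], [0.0, 2.0, -2.0, 0.0]]]]
z_s = [1.0, 2.0]
proj = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0],      # filter 0: first tap only
        [0.5, 0.25], [0.5, 0.25], [0.5, 0.25], [0.5, 0.25]]  # filter 1: all taps 1.0
weights = reshape_weights(psi(z_s, proj), n_out=2, n_in=1, taps=4)
mu_s = sum(z_s) / len(z_s)   # style mean used as the bias
z_cs = adaptive_conv(patches, weights, mu_s)
```

Because the filter weights are computed from `z_s`, a different exemplar yields a different filter bank for the same content patches.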
4 Experiments
This section describes the dataset and the baseline models we used for the experiments in Section 4.1. Subsequently, we discuss the comparison results with the baselines in Section 4.2. Lastly, we analyze our proposed method in detail in Section 4.3.
4.1 Experimental settings
Dataset
We conduct evaluations on diverse datasets. First, we use the CelebA dataset Liu et al. (2015), a widely-used facial dataset annotated with multiple attributes. In order to construct a dataset with a large domain gap, we combine several attributes to form new domains, such as (Male, Non-Bangs, Non-Smiling $\Rightarrow $ Female, Bangs, Smiling). Second, we use the BAM dataset Wilber et al. (2017), composed of numerous artworks labeled with their artistic style, such as watercolor and vector graphic. We use Watercolor $\iff $ Pen, Vector $\iff $ Pen, and Oil $\iff $ Pen in order to demonstrate that AdaCoN can perform image translation across a substantial domain difference. Finally, the Edges $\iff $ Handbag Zhu et al. (2017) and Summer $\iff $ Winter Isola et al. (2017) datasets are used to confirm the wide applicability of AdaCoN to diverse image translation tasks. We set the image size to $256\times 256$ in all experiments.
Baseline methods
We compare our proposed method with AdaIN Huang and Belongie (2017), as exploited in MUNIT, and GDWCT Cho et al. (2019). The main difference among them lies in the specific method of combining the content feature with the style feature. For our method, we explore various settings by adjusting the hyperparameters of AdaCoN, namely the kernel size $\{3,7,11\}$ and the style dimension $\{8,64,128\}$. We empirically set the kernel size to 3 and the style dimension to 128. The results for these hyperparameters are reported in Section 4.3.
Training details
For training the models, we use the Adam optimizer (Kingma and Ba, 2015) with ${\beta}_{1}=0.5$ and ${\beta}_{2}=0.999$. We adopt the initialization method of (He et al., 2015) for our models. We set the batch size to one and the learning rate to 0.0001. Starting from 200,000 iterations, we halve the learning rate every 50,000 iterations. Every model used in the experiments is trained for 500,000 iterations on an NVIDIA TITAN Xp GPU, which takes 90 hours.
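The learning-rate schedule described above can be sketched as a step function. The text does not say whether the first halving happens exactly at iteration 200,000 or one interval later; the sketch assumes the former, and the function name `learning_rate` and its parameters are illustrative.

```python
def learning_rate(iteration, base_lr=1e-4, start=200_000, every=50_000):
    """Step decay: constant until `start`, then halved once per `every` iterations.

    Assumption: the first halving is applied at iteration == start.
    """
    if iteration < start:
        return base_lr
    halvings = (iteration - start) // every + 1
    return base_lr * 0.5 ** halvings

schedule = [learning_rate(it) for it in (0, 199_999, 200_000, 250_000, 499_999)]
```

Under this assumption the rate has been halved six times by the end of the 500,000 training iterations.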
Evaluation metric
In order to evaluate the methods, we measure the classification accuracy as well as the content distance using a pretrained Inception-v3 model Szegedy et al. (2016). To be specific, the content distance is the L2 distance between the features of the input images and those of the translated images, taken from an intermediate layer of Inception-v3. A lower content distance indicates that the gap between them is relatively small. Style injection, on the other hand, is evaluated by the classification accuracy. This is because a well-trained image translation model transforms the domain of the input image, so that a higher classification accuracy shows that the translation model successfully generates the prominent characteristics of the target domain. To obtain the classification model, we fine-tune the pretrained Inception-v3 on the CelebA dataset Liu et al. (2015). To evaluate the performance on the multi-attribute translation task, we train the classifiers with multi-label data.
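The two metrics can be sketched as follows, assuming the Inception-v3 features and the predicted attribute labels have already been computed. The helper names are illustrative, and the all-attributes-must-match rule for multi-attribute accuracy follows the description in Section 4.2.1.

```python
import math

def content_distance(feat_in, feat_out):
    """L2 distance between two flat feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_in, feat_out)))

def multi_attribute_accuracy(predictions, targets):
    """A sample counts as correct only if every target attribute matches."""
    hits = sum(1 for p, t in zip(predictions, targets) if p == t)
    return hits / len(targets)

dist = content_distance([0.0, 0.0], [3.0, 4.0])
acc = multi_attribute_accuracy(
    predictions=[(1, 0, 1), (1, 1, 1)],   # toy predicted attribute tuples
    targets=[(1, 0, 1), (1, 0, 1)],       # toy target attribute tuples
)
```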
4.2 Baseline comparison
This section reports the comparison of AdaCoN with the baseline methods. Quantitative results using the classification accuracy and the content distance are described in Section 4.2.1, and the qualitative results on the CelebA dataset Liu et al. (2015) are reported in Section 4.2.2.
Table 1: Content distance / classification accuracy of each method on CelebA Liu et al. (2015).
Method  ${\mathbf{G}}_{\mathbf{1}}\Rightarrow {\mathbf{G}}_{\mathbf{2}}$  ${\mathbf{G}}_{\mathbf{2}}\Rightarrow {\mathbf{G}}_{\mathbf{1}}$  ${\mathbf{D}}_{\mathbf{1}}\Rightarrow {\mathbf{D}}_{\mathbf{2}}$  ${\mathbf{D}}_{\mathbf{2}}\Rightarrow {\mathbf{D}}_{\mathbf{1}}$  ${\mathbf{T}}_{\mathbf{1}}\Rightarrow {\mathbf{T}}_{\mathbf{2}}$  ${\mathbf{T}}_{\mathbf{2}}\Rightarrow {\mathbf{T}}_{\mathbf{1}}$  ${\mathbf{Z}}_{\mathbf{1}}\Rightarrow {\mathbf{Z}}_{\mathbf{2}}$  ${\mathbf{Z}}_{\mathbf{2}}\Rightarrow {\mathbf{Z}}_{\mathbf{1}}$ 
AdaIN  0.173/89.5  0.179/88.9  0.162/53.8  0.166/58.5  0.196/67.5  0.195/51.6  0.192/33.8  0.191/82.9 
GDWCT  0.174/90.6  0.190/90.4  0.173/52.3  0.175/64.4  0.202/64.9  0.200/47.9  0.197/30.5  0.199/85.6 
AdaCoN  0.186/91.6  0.184/90.0  0.202/62.3  0.202/66.5  0.193/67.7  0.197/57.5  0.199/36.5  0.201/86.7 
4.2.1 Quantitative comparison
The classification accuracy increases when a translated output is correctly classified on every target attribute. As shown in Table 1, our model achieves the highest classification accuracy in seven of the eight translation cases, the exception being $({G}_{2}\Rightarrow {G}_{1})$, where GDWCT is slightly higher. Moreover, the gap between AdaCoN and the baselines tends to be larger in the multi-attribute translation tasks than in the single-attribute ones. We believe this is because multi-attribute translation demands a more considerable style injection than single-attribute translation. For example, in the case of $({Z}_{1}\Rightarrow {Z}_{2})$, the translation networks must change the regions of the masculine characteristics, the hair, and the mouth in order to reach the target attributes. The case of $({G}_{1}\Rightarrow {G}_{2})$, on the other hand, requires changing only the regions of the masculine characteristics, so the amount of change the task demands is relatively small. As for the content distance, AdaCoN obtains the highest value in most translation cases, but the differences are small, which suggests that AdaCoN still maintains the content identity. Considering that our objective is a strong reflection of the style, losing a small amount of content information is tolerable.
4.2.2 Qualitative comparison
Fig. 4 shows the comparison of AdaCoN with the baselines on various attribute translation cases. The results indicate that AdaCoN reflects the style more strongly than the baselines. For example, in case (c) in the left macro column, whose target attributes are (Bald, Non-Eyeglasses, Old), AdaCoN applies the style of the exemplar to a considerable degree, such that its result shows a bald and old man without eyeglasses. In contrast, both AdaIN and GDWCT keep the hair even though the style of the exemplar includes the bald attribute. Case (a) in the right macro column, whose target attribute is Male, shows how the amount of style reflection differs between the methods. Specifically, in order to transfer the style of a man, every method removes the makeup. AdaIN additionally generates a beard while keeping the hair long. GDWCT removes the hair region only incompletely, while AdaCoN removes it clearly. Since long hair is a dominant characteristic of the woman domain, the short hair in the output of AdaCoN supports its superior performance in style reflection.
4.3 Additional analysis
Effects of kernel size.
As shown in Fig. 5(a), the kernel size is related to spatial awareness. In the first row, the color of the hair on the chest of the woman in the content image differs from the color of the rest of her hair. Because a small receptive field is disadvantageous for recognizing the wide hair region, K3 fails to generate the hair on the chest naturally. K11, on the other hand, shows better results in generating the hair region because it has a larger receptive field. Furthermore, we observe that a larger kernel size leads to a larger amount of style reflection. For instance, the results of K11 reflect the style more strongly than those of K3, so that they distort the eyes and mouth of the content in the first row and show a more conspicuous texture in the second row.
Effects of standardization function.
Fig. 5(b) shows the effects of the standardization function of AdaCoN. ${\text{AdaCoN}}^{g}$ represents the results of a model trained without the standardization function ${g}_{\text{AdaCoN}}$. As shown in both rows, ${g}_{\text{AdaCoN}}$ plays an essential role in injecting a style, because the model trained without it fails to perform a translation. We attribute this to a conflict between the style features of the content (input) image and the style (exemplar) image. Specifically, the input image carries both content and style features, so if its style feature is not removed by ${g}_{\text{AdaCoN}}$, it interferes with the style feature extracted from the exemplar and degrades the style reflection. The results therefore indicate that our proposed standardization function based on local normalization is essential in AdaCoN.
Effects of style dimension ($O$) and results on diverse datasets.
We compare the effects of $O$, the number of channels of ${z}_{cs}$. As discussed in the Appendix, $O$ determines the extent to which the style is reflected in the output of AdaCoN, ${z}_{out}$. As illustrated in Fig. 6(a), the amount of style reflection is directly affected by $O$. For instance, (a1) shows that the hair region is clearly removed with $O=128$, while it is largely kept with $O=8$. We further observe that a beard, another dominant characteristic of the man domain, is nevertheless transferred with $O=8$. This shows that a low dimension of ${z}_{cs}$ tends to translate the domain with the minimum change. That is, the size of $O$ is positively correlated with the amount of style reflection, so it can be used to control the extent of the style reflection. Meanwhile, in order to verify that AdaCoN can be applied widely and robustly across diverse datasets, we conduct the experiment in Fig. 6(b). The results consistently show that AdaCoN can translate a given image with a rich style.
5 Conclusion
In this paper, we proposed a novel normalization method that can strongly inject the style of a given exemplar in image translation. AdaCoN locally standardizes the content representation in order to properly reflect the given style, and then applies to the standardized feature an adaptive convolution layer whose weights are dynamically extracted from the style encoding. Our experiments support the superior performance of AdaCoN in drastic style injection. We believe AdaCoN can be usefully exploited in diverse, challenging image translation tasks that have a large gap between the source and the target domains, such as multi-attribute translation. Finally, by incorporating additional information through our normalization technique, AdaCoN can potentially be used in various other tasks such as object detection and semantic segmentation.
References
 [1] (2018) PairedCycleGAN: asymmetric style transfer for applying and removing makeup. In CVPR, Cited by: §1.
 [2] (2016) Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337. Cited by: §1.
 [3] (2019) Image-to-image translation via group-wise deep whitening-and-coloring transformation. In CVPR, Cited by: §1, §1, §2, §3.1, §4.1.
 [4] (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, Cited by: §1.
 [5] (2014) Generative adversarial nets. In NIPS, Cited by: §3.1.
 [6] (2018) Arbitrary style transfer with deep feature reshuffle. In CVPR, Cited by: §1.
 [7] (2017) Segmentation-aware convolutional networks using local attention masks. In ICCV, Cited by: §2.
 [8] (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In ICCV, Cited by: §4.1.
 [9] (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, Cited by: §1, §1, §2, §4.1.
 [10] (2018) Multimodal unsupervised image-to-image translation. In ECCV, Cited by: §1, §1, §2, §3.1.
 [11] (2017) Image-to-image translation with conditional adversarial networks. In CVPR, Cited by: §4.1.
 [12] (2009) What is the best multi-stage architecture for object recognition? In ICCV, Cited by: §1.
 [13] (2016) Dynamic filter networks. In NIPS, Cited by: §2.
 [14] (2017) Incorporating side information by adaptive convolution. In NIPS, Cited by: §2.
 [15] (2019) A style-based generator architecture for generative adversarial networks. Cited by: §1.
 [16] (2017) Learning to discover cross-domain relations with generative adversarial networks. In ICML, Cited by: §1, §2.
 [17] (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.1.
 [18] (2012) Imagenet classification with deep convolutional neural networks. In NIPS, Cited by: §1.
 [19] (2018) Diverse image-to-image translation via disentangled representations. In ECCV, Cited by: §1, §1, §2, §3.1.
 [20] (2018) Conditional image-to-image translation. In CVPR, Cited by: §1.
 [21] (2017) Unsupervised image-to-image translation networks. In NIPS, Cited by: §2.
 [22] (2015) Deep learning face attributes in the wild. In ICCV, Cited by: §4.1, §4.1, §4.2, Table 1.
 [23] (2019) Exemplar guided unsupervised image-to-image translation with semantic consistency. In ICLR, Cited by: §1, §1.
 [24] (2017) Least squares generative adversarial networks. In ICCV, Cited by: §3.1.
 [25] (2018) InstaGAN: instance-aware image-to-image translation. arXiv preprint arXiv:1812.10889. Cited by: §2.
 [26] (2019) Pixel-adaptive convolutional neural networks. arXiv preprint arXiv:1904.05373. Cited by: §2.
 [27] (2016) Rethinking the inception architecture for computer vision. In CVPR, Cited by: §4.1.
 [28] (2017) BAM! the behance artistic media dataset for recognition beyond photography. In ICCV, Cited by: Figure 5, §4.1.
 [29] (2018) ELEGANT: exchanging latent encodings with gan for transferring multiple face attributes. In ECCV, Cited by: §1.
 [30] (2016) Visual dynamics: probabilistic future frame synthesis via cross convolutional networks. In NIPS, Cited by: §2.
 [31] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. Cited by: §1, §2, §3.1, §4.1.
6 Appendix
6.1 Analysis on existing methods
To better understand the existing methods, this section reviews their principal operations and compares them.
6.1.1 Review on baselines
The previous methods are typically composed of two steps: the first normalizes the content feature, and the second reflects the style feature in the normalized content. We formulate this procedure as $f(g(c),s)$, where $g$ and $f$ represent the standardization function and the style injection function, respectively. From this point of view, AdaIN can be written as
${g}_{\text{AdaIN}}(c)={\displaystyle \frac{c-{\mu}_{H,W}(c)}{{\sigma}_{H,W}(c)}},$  ${f}_{\text{AdaIN}}(g(c),s)={\sigma}_{H,W}(s)\,g(c)+{\mu}_{H,W}(s),$  (10)
where $H$ and $W$ are the height and the width of an input feature. Each channel is normalized and combined independently. ${\sigma}_{H,W}$ and ${\mu}_{H,W}$ respectively denote the standard deviation and the mean computed along the $H$ and $W$ dimensions. In Eq. (10), the function $g$ normalizes an input content feature with its channelwise mean and standard deviation. The function $f$, on the other hand, transfers the mean ${\mu}_{H,W}(s)$ and the standard deviation ${\sigma}_{H,W}(s)$ of the style to the normalized content $g(c)$. Meanwhile, GDWCT can be represented as
${g}_{\text{GDWCT}}(c)={Q}_{c}{\mathrm{\Lambda}}_{c}^{-\frac{1}{2}}{Q}_{c}^{T}(c-{\mu}_{H,W}(c)),$  ${f}_{\text{GDWCT}}(g(c),s)={Q}_{s}{\mathrm{\Lambda}}_{s}^{\frac{1}{2}}{Q}_{s}^{T}g(c)+{\mu}_{H,W}(s),$  (11)
where the matrices $\{{Q}_{c}{\mathrm{\Lambda}}_{c}{Q}_{c}^{T},{Q}_{s}{\mathrm{\Lambda}}_{s}{Q}_{s}^{T}\}$ are obtained by the eigendecomposition of the channel covariance matrices of the content and the style features, respectively. Each of $\{{Q}_{c},{Q}_{s}\}$ is a square matrix composed of the eigenvectors, and $\{{\mathrm{\Lambda}}_{c},{\mathrm{\Lambda}}_{s}\}$ are diagonal matrices whose diagonal entries are the eigenvalues of the corresponding eigenvectors in $\{{Q}_{c},{Q}_{s}\}$. In Eq. (11), the function $g$ plays a role similar to that in Eq. (10) but enforces a stricter rule: it normalizes not only the mean and the variance but also the covariance of an input feature, by making its covariance matrix the identity matrix. The style injection function ${f}_{\text{GDWCT}}$ then matches the first- and second-order statistics of the normalized content feature to those of the style feature.
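The two formulations reviewed above can be made concrete with a minimal NumPy sketch. It applies Eq. (10) and Eq. (11) as written to flattened features of shape $C\times HW$. The function names, the random test features, and the small `eps` regularizer are assumptions made for illustration, and the second function is plain whitening-and-coloring on a single group, not the group-wise, learned variant used in GDWCT [3].

```python
import numpy as np

def adain(c, s, eps=1e-5):
    """Eq. (10) on flattened features c, s of shape (C, H*W)."""
    # g: standardize every content channel over its spatial positions.
    g = (c - c.mean(axis=1, keepdims=True)) / (c.std(axis=1, keepdims=True) + eps)
    # f: rescale and shift with the channelwise style statistics.
    return s.std(axis=1, keepdims=True) * g + s.mean(axis=1, keepdims=True)

def channel_stats(x):
    """Channelwise mean (C, 1) and channel covariance (C, C) of x with shape (C, H*W)."""
    mu = x.mean(axis=1, keepdims=True)
    centered = x - mu
    return mu, centered @ centered.T / x.shape[1]

def whiten_and_color(c, s, eps=1e-5):
    """Eq. (11) as written, for a single group (illustrative, not the full GDWCT)."""
    mu_c, cov_c = channel_stats(c)
    mu_s, cov_s = channel_stats(s)
    eye = np.eye(c.shape[0])
    # eps * I keeps the eigenvalues strictly positive (an assumption for stability).
    lam_c, q_c = np.linalg.eigh(cov_c + eps * eye)
    lam_s, q_s = np.linalg.eigh(cov_s + eps * eye)
    # g: whitening, the covariance of the content becomes (close to) the identity.
    g = q_c @ np.diag(lam_c ** -0.5) @ q_c.T @ (c - mu_c)
    # f: coloring, the covariance and the mean of the style are imposed on g(c).
    return q_s @ np.diag(lam_s ** 0.5) @ q_s.T @ g + mu_s

rng = np.random.default_rng(0)
C, HW = 4, 256
content = rng.normal(size=(C, HW))
mixing = rng.normal(size=(C, C))              # correlates the style channels
style = mixing @ rng.normal(size=(C, HW)) + 2.0

out_adain = adain(content, style)
out_wc = whiten_and_color(content, style)
```

In this sketch, `out_adain` reproduces only the per-channel mean and standard deviation of `style`, whereas `out_wc` also reproduces its full channel covariance, which is the difference between Eq. (10) and Eq. (11) described above.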
6.1.2 Comparative analysis on baselines
The differences among the existing methods become clear when we regard them as special cases of the convolution operation. ${f}_{\text{AdaIN}}$ in Eq. (10) can be represented as a $1\times 1$ depthwise convolution with a bias, since the adaptive parameters of ${f}_{\text{AdaIN}}$ scale and shift each channel independently. Meanwhile, ${f}_{\text{GDWCT}}$ in Eq. (11) can be viewed as a $1\times 1$ convolution layer whose weights are ${Q}_{s}{\mathrm{\Lambda}}_{s}^{\frac{1}{2}}{Q}_{s}^{T}$ and whose bias is ${\mu}_{H,W}(s)$. This is because multiplying a row vector of ${Q}_{s}{\mathrm{\Lambda}}_{s}^{\frac{1}{2}}{Q}_{s}^{T}\in {\mathbb{R}}^{C\times C}$ by the matrix $g(c)\in {\mathbb{R}}^{C\times HW}$ generates a new row vector in ${\mathbb{R}}^{1\times HW}$, which is identical to a $1\times 1$ convolution with a single output channel. From this view, we can examine these style injection functions more closely. ${f}_{\text{AdaIN}}$ can be expected to transfer the lowest amount of style, as it injects the style separately along each channel, so that it yields relatively high consistency with the content compared to other methods. On the other hand, ${f}_{\text{GDWCT}}$ can be regarded as a stronger combining method than ${f}_{\text{AdaIN}}$ because it generates each output channel as a linear combination of the content feature channels. Even though GDWCT achieves more drastic style changes than AdaIN by mixing the channel information of the content, we claim that even more dramatic changes can be achieved if the spatial information is considered simultaneously. Hence, we propose an $n\times n$ adaptive convolution-based normalization whose weights are extracted from the style. We believe this can increase the capacity for transferring a given style.
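The convolutional view described above can be checked numerically with a small NumPy sketch. It assumes a naive, stride-1, zero-padded convolution written only for illustration, and uses random arrays as stand-ins for $g(c)$, the style statistics, and the coloring matrix. The $3\times 3$ weights at the end are random placeholders as well: how AdaCoN actually predicts its weights from the style encoding is not reproduced here.

```python
import numpy as np

def conv2d(x, weight, bias):
    """Naive stride-1, zero-padded convolution (cross-correlation).

    x: (C_in, H, W), weight: (C_out, C_in, n, n) with odd n, bias: (C_out,).
    """
    c_out, _, n, _ = weight.shape
    pad = n // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, w = x.shape
    out = np.zeros((c_out, h, w))
    for i in range(n):
        for j in range(n):
            # Kernel tap (i, j) mixes the input channels at a shifted location.
            patch = xp[:, i:i + h, j:j + w]
            out += np.einsum('oc,chw->ohw', weight[:, :, i, j], patch)
    return out + bias[:, None, None]

rng = np.random.default_rng(0)
C, H, W = 4, 6, 6
g_c = rng.normal(size=(C, H, W))       # stand-in for the normalized content g(c)
sigma_s = rng.uniform(0.5, 2.0, C)     # hypothetical style standard deviations
mu_s = rng.normal(size=C)              # hypothetical style means
A = rng.normal(size=(C, C))
coloring = A @ A.T                     # stand-in for Q_s Lambda_s^{1/2} Q_s^T

# f_AdaIN: a 1x1 convolution with a diagonal weight matrix, i.e. a depthwise one.
w_adain = np.diag(sigma_s)[:, :, None, None]
out_adain = conv2d(g_c, w_adain, mu_s)
ref_adain = sigma_s[:, None, None] * g_c + mu_s[:, None, None]

# f_GDWCT: a 1x1 convolution with a full C x C weight matrix.
w_gdwct = coloring[:, :, None, None]
out_gdwct = conv2d(g_c, w_gdwct, mu_s)
ref_gdwct = (coloring @ g_c.reshape(C, -1)).reshape(C, H, W) + mu_s[:, None, None]

# n x n adaptive convolution: each output also mixes spatial neighbours.
w_adaptive = rng.normal(size=(C, C, 3, 3))   # placeholder for style-derived weights
out_adaptive = conv2d(g_c, w_adaptive, mu_s)
```

Under these assumptions, `out_adain` and `out_gdwct` coincide with the closed-form references `ref_adain` and `ref_gdwct`, so the only structural change in the $n\times n$ case is that each output position additionally depends on a spatial neighbourhood of the input.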
6.2 Discussion on branch separation
Fully exploiting the adaptive convolution-based normalization at the intermediate layers may cause considerable distortion of the content information, because both the spatial and the channel information of the output features of AdaCoN differ entirely from those of the input features. Considering that one of the task objectives is to maintain the identity of the input, we posit that combining the adaptive convolution-based normalization with a general convolution layer is a reasonable choice for performing the translation. Moreover, by separating the branches, we can control the amount of style injection by changing $O$, the number of style dimensions. That is, a small $O$ results in a weak injection of the style.