Guided Image Generation with Conditional Invertible Neural Networks

  • 2019-07-04 13:20:57
  • Lynton Ardizzone, Carsten Lüth, Jakob Kruse, Carsten Rother, Ullrich Köthe
  • 19

Abstract

In this work, we address the task of natural image generation guided by a conditioning input. We introduce a new architecture called conditional invertible neural network (cINN). The cINN combines the purely generative INN model with an unconstrained feed-forward network, which efficiently preprocesses the conditioning input into useful features. All parameters of the cINN are jointly optimized with a stable, maximum likelihood-based training procedure. By construction, the cINN does not experience mode collapse and generates diverse samples, in contrast to e.g. cGANs. At the same time our model produces sharp images since no reconstruction loss is required, in contrast to e.g. VAEs. We demonstrate these properties for the tasks of MNIST digit generation and image colorization. Furthermore, we take advantage of our bi-directional cINN architecture to explore and manipulate emergent properties of the latent space, such as changing the image style in an intuitive way.

 

Visual Learning Lab Heidelberg

Code and appendix available at
github.com/VLL-HD/FrEIA
Correspondence to
[email protected]

Quiz solution for Figure 1: Bottom row, center image.

1 Introduction

Figure 1: Diverse colorizations, which our network created for the same grayscale image. One of them shows ground truth colors, but which? Solution at the bottom of the page.

Generative adversarial networks (GANs) produce ever larger and more realistic samples [20, 3]. Hence they have become the primary choice for a majority of image generation tasks. As such, their conditional variants (cGANs) would appear to be the natural tool for conditional image generation as well, and they have successfully been applied in many scenarios [28, 31]. Despite numerous improvements, finding a training configuration for large GANs that is stable and produces diverse images still requires significant expertise and computational resources. A lack of diversity is especially common when the condition itself is an image, and special precautions have to be taken to avoid mode collapse.

Conditional variational autoencoders (cVAEs) do not suffer from the same problems. Training is generally stable, and since every data point is assigned a region in latent space, sampling yields the full variety of data seen during training. However cVAEs come with drawbacks of their own: The assumption of a Gaussian posterior on the decoder side implies an L2 reconstruction loss, which is known to cause blurriness. In addition, the partition of the latent space into diagonal Gaussians leads to either mode-mixing issues or regions of poor sample quality [22]. There has also been some success in combining aspects of both approaches for certain tasks, such as [17, 43, 32].

We propose a third approach: we extend Invertible Neural Networks (INNs, [8, 21, 1]) to the task of conditional image generation by adding conditioning inputs to their core building blocks. INNs are neural networks which are by construction bijective, efficiently invertible, and have a tractable Jacobian determinant. They represent transport maps between the input distribution p(𝐱) and a prescribed, easy-to-sample-from latent distribution p(𝐳). During training, the likelihood of training samples from p(𝐱) is maximized in latent space, while at inference time, 𝐳-samples can trivially be transformed back to the data domain. Previously, INNs have been used successfully for unconditional image generation, e.g. by [8] and [21].

Unconditional INN training is related to that of VAEs, but it compensates for some key disadvantages: Firstly, since reconstructions are perfect by design, no reconstruction loss is needed, and generated images do not become blurry. Secondly, each 𝐱 maps to exactly one 𝐳 in latent space, and there is no need for posteriors p(𝐳|𝐱). This avoids the VAE problem of disjoint or overlapping regions in latent space. In terms of training stability and sample diversity, INNs show the same strengths as autoencoder architectures, but with superior image quality. We find that these positive aspects apply to conditional INNs (cINNs) as well.

One limitation of INNs is that their design restricts the use of some standard components of neural networks, such as pooling and batch normalization layers. Our conditional architecture alleviates this problem, as the conditional inputs can be preprocessed by a conditioning network with a standard feed-forward architecture, which can be learned jointly with the cINN to greatly improve its generative capabilities. We demonstrate the qualities of cINNs for conditional image generation, and uncover emergent properties of the latent space, for the tasks of conditional MNIST generation and diverse colorization of ImageNet.

Our work makes the following contributions:

  • We propose a new architecture called conditional invertible neural network (cINN), which combines an INN with an unconstrained feed-forward network for conditioning. It generates diverse images with high realism and thus overcomes limitations of existing approaches.

  • We demonstrate a stable, maximum likelihood-based training procedure for jointly optimizing the parameters of the INN and the conditioning network.

  • We take advantage of our bidirectional cINN architecture to explore and manipulate emergent properties of the latent space. We illustrate this for MNIST digit generation and image colorization.

2 Related work

Conditional Generative Modeling. Modern generative models learn to transform noise (usually sampled from multivariate Gaussians) into desired target distributions. Methods differ by the model-family these transformations are picked from and by the losses determining optimal solutions.

Conditional generative adversarial networks (cGANs) [30] train a pair of neural networks: a generator transforms a pair of conditioning and noise vectors to images, and a discriminator penalizes unrealistic looking images. The conditioning information is either concatenated to the noise [30], or fed into the network via conditional batch-norm layers [9, 15, 32]. Ensuring diversity of the generated images (for fixed conditioning) appears to be challenging in this approach. Recent BigGANs [3] successfully address this problem by using very large networks and batch sizes, but require parallel training on up to 512 TPUs. PacGANs [29] employ augmented discriminators, which evaluate entire batches of real or generated images together rather than one image at a time. CausalGANs [24] train two additional discriminator networks, called “labeler” and “anti-labeler”, with the latter explicitly penalizing the lack of diversity. Pix2pix [17] addresses the important special case when the target is conditioned on an image in a different modality, e.g. to generate satellite images from maps. In addition to the discriminator loss, it minimizes the L1 distance between generated and ground-truth targets using a paired training set, which contains corresponding images from both modalities. This leads to impressive image quality, but lack of diversity seems to be an especially hard problem in this case. In contrast, our method does not need explicit precautions to promote diversity.

Bidirectional architectures augment generator networks with complementary encoder networks that learn the generator’s inverse and enable reconstruction losses, which exploit cycle consistency requirements. Conditional variational autoencoders (cVAEs) [37] replace all distributions in a standard VAE [23] by the appropriate conditional distributions, and are trained to minimize the evidence lower bound (ELBO loss). Since variational distributions are typically Gaussian, the reconstruction penalty is equivalent to squared loss, resulting in rather blurry generated images. This is avoided by AGE networks [38] and CycleGANs [42], which combine standard cGAN discriminators with L1 reconstruction loss in the data domain, and bidirectional conditional GANs [19], which extend the GAN discriminator to act on the distributions in data and latent space jointly. SPADE [32], building upon pix2pix and pix2pixHD [39], augments cGANs with additional VAE encoders to shape the latent space such that diversity is ensured.

Instead of enforcing bijectivity through cycle losses, invertible neural networks are bidirectional by design, since encoder and generator are realized by forward and backward processing within a single bijective model. We focus on architectures whose forward and backward pass require the same computational effort. The coupling layer designs pioneered by NICE [7] and RealNVP [8] emerged as very powerful and flexible model families under this restriction. Using additive coupling layers, i-RevNets [18] demonstrated that the lack of information reduction from data space to latent space does not cause overfitting. The Glow architecture [21] combines affine coupling layers with invertible 1x1 convolutions and achieves impressive attribute manipulations (e.g. age, hair color) in generated face images. This approach was recently generalized to video [26].

Thanks to tractable Jacobian determinants, the coupling layer architecture enables maximum likelihood training [7, 8], but experimental comparisons with other training methods are inconclusive so far. For instance, [5] found minimization of an adversarial loss to be superior to maximum likelihood training in RealNVPs, [35] trained i-RevNets in the same manner as adversarial auto-encoders, i.e. with a discriminator acting in latent rather than data space, and Flow-GANs [11] performed best using bidirectional training, a combination of maximum likelihood and adversarial loss. On the other hand, maximum likelihood training worked well within Glow [21], and i-ResNets [2] could even be trained with approximated Jacobian determinants. In this work we reinforce the view that high-quality generative models can be trained by maximum likelihood loss alone. To the best of our knowledge, we are the first to apply the coupling layer design for conditional generative models, with the exception of [1], who use it to compute posteriors for (relatively small) inverse problems, but do not consider image generation.

Colorization. State-of-the-art regression models for colorization produce visually near-perfect images [16], but do not account for the ambiguity inherent in this inverse problem. To address this, models would ideally define a conditional distribution of plausible color images for a given grayscale input, instead of just returning a single “best” solution.

Popular existing approaches for diverse colorization predict per-pixel color histograms from a U-Net [41] or from hypercolumns of an adapted VGG network [27]. However, sampling from these local histograms independently cannot lead to a spatially consistent colorization, so additional heuristic post-processing steps are required to avoid artefacts.

In terms of generative models, both VAEs [6] and cGANs [17, 4] have been proposed for the task. However, their solutions do not reach the quality of the regression-based models, and cGANs in particular often lack diversity. To compensate, modifications and extensions to generative approaches have been developed, such as auto-regressive models [12] and CRFs [33]. However, these methods are computationally very expensive and often unable to scale to realistic image sizes.

Conceptually closest to our proposed method is the work of [38], where an encoder network maps color information to a latent space and a generator network learns the inverse transform, both conditioned on the grayscale image. Their experiments, however, are limited to a data set containing only cars and to just three latent dimensions, leading to global but no local diversity.

In contrast to the above, our flow-based cINN generates diverse colorizations in one standard feed-forward pass. It models the distribution of all pixels jointly, and allows for meaningful latent space manipulations.

3 Method

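Our method is an extension of the affine coupling block architecture established in [8]. There, each network block splits its input $\mathbf{u}$ into two parts $[\mathbf{u}_1, \mathbf{u}_2]$ and applies affine transformations between them that have strictly upper or lower triangular Jacobians:

$$\begin{aligned}
\mathbf{v}_1 &= \mathbf{u}_1 \odot \exp\big(s_1(\mathbf{u}_2)\big) + t_1(\mathbf{u}_2) \\
\mathbf{v}_2 &= \mathbf{u}_2 \odot \exp\big(s_2(\mathbf{v}_1)\big) + t_2(\mathbf{v}_1).
\end{aligned} \tag{1}$$

The outputs $[\mathbf{v}_1, \mathbf{v}_2]$ are concatenated again and passed to the next coupling block. The internal functions $s_j$ and $t_j$ can be represented by arbitrary neural networks, and are only ever evaluated in the forward direction, even when the coupling block is inverted:

$$\begin{aligned}
\mathbf{u}_2 &= \big(\mathbf{v}_2 - t_2(\mathbf{v}_1)\big) \oslash \exp\big(s_2(\mathbf{v}_1)\big) \\
\mathbf{u}_1 &= \big(\mathbf{v}_1 - t_1(\mathbf{u}_2)\big) \oslash \exp\big(s_1(\mathbf{u}_2)\big).
\end{aligned} \tag{2}$$

As shown in [8], the logarithm of the Jacobian determinant for such a coupling block is simply the sum of $s_1$ and $s_2$ over image dimensions.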
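The coupling block and its inverse can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the sub-networks are small random MLPs (`make_subnet` is a hypothetical helper), the input is a flat vector of assumed size 6, and the analytic log-determinant (sum of the scale outputs) is compared against a finite-difference Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 6            # total input dimension (assumed for this demo)
d1 = D // 2      # size of u1; u2 takes the remaining entries

def make_subnet(n_in, n_out):
    """Tiny random one-hidden-layer MLP standing in for s_j or t_j."""
    W1 = rng.normal(scale=0.5, size=(n_in, 16))
    W2 = rng.normal(scale=0.5, size=(16, n_out))
    return lambda x: np.tanh(x @ W1) @ W2

s1, t1 = make_subnet(D - d1, d1), make_subnet(D - d1, d1)
s2, t2 = make_subnet(d1, D - d1), make_subnet(d1, D - d1)

def forward(u):
    """Eq. 1, returning the output and the log-determinant of the Jacobian."""
    u1, u2 = u[:d1], u[d1:]
    a1 = s1(u2)
    v1 = u1 * np.exp(a1) + t1(u2)
    a2 = s2(v1)
    v2 = u2 * np.exp(a2) + t2(v1)
    return np.concatenate([v1, v2]), a1.sum() + a2.sum()

def inverse(v):
    """Eq. 2: s_j and t_j are still only evaluated in the forward direction."""
    v1, v2 = v[:d1], v[d1:]
    u2 = (v2 - t2(v1)) / np.exp(s2(v1))
    u1 = (v1 - t1(u2)) / np.exp(s1(u2))
    return np.concatenate([u1, u2])

def numerical_jacobian(fn, x, eps=1e-6):
    """Central finite differences, one input coordinate at a time."""
    J = np.zeros((x.size, x.size))
    for k in range(x.size):
        dx = np.zeros_like(x)
        dx[k] = eps
        J[:, k] = (fn(x + dx) - fn(x - dx)) / (2 * eps)
    return J

u = rng.normal(size=D)
v, log_det = forward(u)
u_rec = inverse(v)
sign, num_log_det = np.linalg.slogdet(numerical_jacobian(lambda x: forward(x)[0], u))
```

In this sketch `u_rec` reproduces `u` up to floating-point error, and `num_log_det` agrees with `log_det` up to the finite-difference error.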

Figure 2: One conditional affine coupling block (CC).

3.1 Conditional invertible transformations

We adapt the design of Eqs. 1 and 2 to produce a conditional version of the coupling block. Because the subnetworks $s_j$ and $t_j$ are never inverted, we can concatenate conditioning data $\mathbf{c}$ to their inputs without losing the invertibility, replacing $s_1(\mathbf{u}_2)$ with $s_1(\mathbf{u}_2, \mathbf{c})$ etc. Our conditional coupling block design is illustrated in Fig. 2.

In general, we will refer to a cINN with network parameters $\theta$ as $f(\mathbf{x}; \mathbf{c}, \theta)$, and the inverse as $g(\mathbf{z}; \mathbf{c}, \theta)$. For any fixed condition $\mathbf{c}$, the invertibility is given as

$$f^{-1}(\,\cdot\,; \mathbf{c}, \theta) = g(\,\cdot\,; \mathbf{c}, \theta). \tag{3}$$
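To illustrate the conditional coupling block, here is a minimal NumPy sketch under stated assumptions: a single block acts as the whole network, the sub-networks are random MLPs (`make_subnet` is a hypothetical helper), and the condition is a one-hot vector of assumed length 10. The only change relative to the unconditional block is that $\mathbf{c}$ is concatenated to every sub-network input.

```python
import numpy as np

rng = np.random.default_rng(1)
D, C = 4, 10     # data and condition sizes (assumed for this demo)
d1 = D // 2

def make_subnet(n_in, n_out):
    """Tiny random MLP standing in for a learned s_j or t_j."""
    W1 = rng.normal(scale=0.5, size=(n_in, 16))
    W2 = rng.normal(scale=0.5, size=(16, n_out))
    return lambda x: np.tanh(x @ W1) @ W2

s1, t1 = make_subnet(D - d1 + C, d1), make_subnet(D - d1 + C, d1)
s2, t2 = make_subnet(d1 + C, D - d1), make_subnet(d1 + C, D - d1)

def f(x, c):
    """Forward pass: data x to latent z, given condition c."""
    u1, u2 = x[:d1], x[d1:]
    h = np.concatenate([u2, c])          # condition joins the sub-network input
    v1 = u1 * np.exp(s1(h)) + t1(h)
    h = np.concatenate([v1, c])
    v2 = u2 * np.exp(s2(h)) + t2(h)
    return np.concatenate([v1, v2])

def g(z, c):
    """Inverse pass: latent z back to data x, given condition c."""
    v1, v2 = z[:d1], z[d1:]
    h = np.concatenate([v1, c])
    u2 = (v2 - t2(h)) / np.exp(s2(h))
    h = np.concatenate([u2, c])
    u1 = (v1 - t1(h)) / np.exp(s1(h))
    return np.concatenate([u1, u2])

c_a, c_b = np.eye(C)[3], np.eye(C)[7]    # two different one-hot conditions
x = rng.normal(size=D)
z = f(x, c_a)
x_same = g(z, c_a)     # same condition: recovers x (Eq. 3)
x_other = g(z, c_b)    # different condition: a different sample for the same z
```

Inverting with the same condition recovers the input, while inverting the same latent code under a different condition yields a different output. This is the mechanism that conditional sampling relies on.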

3.2 Maximum likelihood training of cINNs

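By prescribing a probability distribution $p_Z(\mathbf{z})$ on latent space $Z$, the model $f$ assigns any input $\mathbf{x}$ a probability, dependent on both the network parameters $\theta$ and the conditioning $\mathbf{c}$, through the change-of-variables formula:

$$p_X(\mathbf{x}; \mathbf{c}, \theta) = p_Z\big(f(\mathbf{x}; \mathbf{c}, \theta)\big) \left| \det\!\left( \frac{\partial f}{\partial \mathbf{x}} \right) \right|. \tag{4}$$

Here, we use the Jacobian matrix $\partial f / \partial \mathbf{x}$. We will denote the determinant of the Jacobian, evaluated at some training sample $\mathbf{x}_i$, as $J_i \equiv \det\big(\partial f / \partial \mathbf{x}\,|_{\mathbf{x}_i}\big)$. Bayes' theorem gives us the posterior over model parameters as $p(\theta; \mathbf{x}, \mathbf{c}) \propto p_X(\mathbf{x}; \mathbf{c}, \theta)\, p_\theta(\theta)$. Our goal is to find network parameters that maximize its logarithm, i.e. we minimize the loss

$$\mathcal{L} = \mathbb{E}_i\big[ -\log p_X(\mathbf{x}_i; \mathbf{c}_i, \theta) \big] - \log p_\theta(\theta), \tag{5}$$

which is the same as in classical Bayesian model fitting.

Inserting Eq. 4 with a standard normal distribution for $p_Z(\mathbf{z})$, as well as a Gaussian prior on the weights $\theta$ with $1/(2\sigma_\theta^2) \equiv \tau$, we obtain

$$\mathcal{L} = \mathbb{E}_i\left[ \frac{\| f(\mathbf{x}_i; \mathbf{c}_i, \theta) \|_2^2}{2} - \log |J_i| \right] + \tau \|\theta\|_2^2. \tag{6}$$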

The latter term represents L2 weight regularization, while the former is the maximum likelihood loss.

Training a network with this loss yields an estimate of the maximum likelihood network parameters $\hat{\theta}_{\mathrm{ML}}$. From there, we can perform conditional generation for a fixed $\mathbf{c}$ by sampling $\mathbf{z}$ and using the inverted network $g$: $\mathbf{x}_{\mathrm{gen}} = g(\mathbf{z}; \mathbf{c}, \hat{\theta}_{\mathrm{ML}})$, with $\mathbf{z} \sim p_Z(\mathbf{z})$.

Training with the maximum likelihood method makes it virtually impossible for mode collapse to occur: If any mode in the training set has low probability under the current guess $p_X(\mathbf{x}; \mathbf{c}, \theta)$, the corresponding latent vectors will lie far outside the normal distribution $p_Z$ and incur a large loss from the first L2-term in Eq. 6. In contrast, the discriminator of a GAN only supplies a weak signal, proportional to the mode's relative frequency in the training data, so that the generator is not penalized much for ignoring a mode completely.
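The loss of Eq. 6 and the sampling step can be sketched on a toy model. The one-layer conditional flow below, z = (x - cB) * exp(a), is a hypothetical stand-in chosen so that the Jacobian is trivial. It is not the cINN architecture, and the synthetic data, `tau` value and parameter names are assumptions for illustration. As in Eq. 6, additive constants of the Gaussian log-density are dropped.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, c, params):
    """Toy conditional flow (assumption): z = (x - c @ B) * exp(a)."""
    a, B = params
    z = (x - c @ B) * np.exp(a)
    log_abs_jac = np.full(x.shape[0], a.sum())   # log|det dz/dx| = sum(a)
    return z, log_abs_jac

def g(z, c, params):
    """Inverse of f for the same condition c."""
    a, B = params
    return z * np.exp(-a) + c @ B

def cinn_loss(x, c, params, tau):
    """Eq. 6: E_i[ ||z_i||^2 / 2 - log|J_i| ] + tau * ||theta||^2."""
    z, log_abs_jac = f(x, c, params)
    nll = 0.5 * np.sum(z ** 2, axis=1) - log_abs_jac
    l2 = sum(np.sum(p ** 2) for p in params)
    return nll.mean() + tau * l2

# Synthetic data: three classes with different means, noise std 0.5.
B_true = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]])
c = np.eye(3)[rng.integers(0, 3, size=1000)]
x = c @ B_true + 0.5 * rng.normal(size=(1000, 2))

good = (np.full(2, np.log(2.0)), B_true)     # matches the data distribution
bad = (np.zeros(2), np.zeros((3, 2)))        # ignores the condition entirely
loss_good = cinn_loss(x, c, good, tau=1e-3)
loss_bad = cinn_loss(x, c, bad, tau=1e-3)

# Conditional generation for a fixed condition: sample z, then invert.
z = rng.normal(size=(5, 2))
c_fixed = np.tile(np.eye(3)[1], (5, 1))
x_gen = g(z, c_fixed, good)
z_back, _ = f(x_gen, c_fixed, good)
```

Parameters that ignore the condition map class-specific data far from the origin in latent space, so the quadratic term dominates and `loss_bad` is much larger than `loss_good`. This mirrors the argument for why ignored modes are penalized heavily.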

3.3 Conditioning network

In complex settings, we expect that higher-level features of $\mathbf{c}$ need to be extracted for the conditioning to be effective, e.g. global semantic information from an image as in Sec. 4.2. In such cases, feeding the condition $\mathbf{c}$ directly into the cINN would place an unreasonable burden on the $s$ and $t$ networks, as higher-level features would have to be re-learned in each coupling block.

To address this issue, we introduce an additional feed-forward conditioning network $h$, which transforms the condition $\mathbf{c}$ to some intermediate representation $\tilde{\mathbf{c}} = h(\mathbf{c})$, and replace $\mathbf{c}_i$ in Eq. 6 with $\tilde{\mathbf{c}}_i = h(\mathbf{c}_i)$. The network $h$ can be pretrained, e.g. by using features from a VGG architecture trained for image classification. Alternatively or additionally, $h$ can be trained jointly with the cINN by propagating gradients from the maximum likelihood loss through the conditioning $\tilde{\mathbf{c}}$. In this case, the conditioning network will learn to extract features which are particularly useful for embedding the cINN inputs $\mathbf{x}$ into latent variables $\mathbf{z}$.

3.4 Important details

For cINNs to match the performance of well-established architectures for conditional generation, we introduce a number of minor modifications and adjustments to the architecture and training procedure. With these adaptations, training was very stable and converged in all of our runs. Ablation results are presented in Sec. 4.4.

Noise as data augmentation. We add a small amount of noise to the inputs 𝐱 as part of the standard data augmentation. This helps to smooth out quantization artifacts in the input, and prevents sparse gradients when large parts of the image are completely flat (as e.g. in MNIST).

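Soft clamping of scale coefficients. We apply an additional nonlinear function to the scale coefficients $s$, of the form

$$s_{\mathrm{clamp}} = \frac{2\alpha}{\pi} \arctan\!\left( \frac{s}{\alpha} \right), \tag{7}$$

which is approximately linear in $s$ for $|s| \ll \alpha$ (with slope $2/\pi$, i.e. $s_{\mathrm{clamp}} \approx 2s/\pi$) and saturates at $s_{\mathrm{clamp}} \approx \pm\alpha$ for $|s| \gg \alpha$. This prevents any instabilities stemming from exploding magnitude of the exponential $\exp(s_{\mathrm{clamp}})$. We find $\alpha = 1.9$ to be a good value for most architectures.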
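The clamping function of Eq. 7 takes only a few lines. This is a minimal NumPy sketch; the function name `soft_clamp` and the test inputs are chosen here for illustration.

```python
import numpy as np

def soft_clamp(s, alpha=1.9):
    """Eq. 7: smooth, monotonic squashing of s into the interval (-alpha, alpha)."""
    return (2.0 * alpha / np.pi) * np.arctan(s / alpha)

s = np.array([-1e6, -1.0, 0.0, 1e-3, 1.0, 1e6])
s_clamped = soft_clamp(s)
max_scale = np.exp(s_clamped).max()   # bounded by exp(alpha), about 6.7 for alpha = 1.9
```

With this bound, a single coupling block can rescale any activation by a factor of at most exp(α), however large the raw sub-network output becomes. Near zero the function has slope 2/π, so small raw values are shrunk by that constant factor.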

Initialization. Heuristically, we find that Xavier initialization [10] leads to stable training from the start. We experienced training instability when initial parameter values were too high. Similar to [21], we also initialize the last convolution in all s and t subnetworks to zero, so training starts from an identity transform.

Soft channel permutations. We use random orthogonal matrices to mix the information between the channels. This allows for more interaction between the two information streams 𝐮1,𝐮2 in the coupling blocks. A similar technique was used in [21], but our matrices stay fixed throughout training and are guaranteed to be cheaply invertible.
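A fixed random orthogonal channel mixing can be sketched as follows. Drawing the matrix from the QR decomposition of a Gaussian matrix is one common construction and an assumption here; the text does not specify how the matrices are sampled. Because the matrix is orthogonal, its inverse is its transpose and its log-determinant is zero.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Random orthogonal n x n matrix via QR of a Gaussian matrix (assumed construction)."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))    # fix column signs; q stays orthogonal

def mix_channels(x, Q):
    """Apply Q across the channel axis of a (channels, height, width) array."""
    return np.einsum('ij,jhw->ihw', Q, x)

rng = np.random.default_rng(0)
Q = random_orthogonal(8, rng)          # sampled once, then kept fixed during training
x = rng.normal(size=(8, 4, 4))
y = mix_channels(x, Q)
x_rec = mix_channels(y, Q.T)           # inverse is just the transpose
```

Since |det Q| = 1, this layer contributes nothing to the log-Jacobian term of the loss.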

[Diagram: each c×2×2 input block with entries 1 to 4 is mapped by four 2×2 masks with entries of magnitude 1/2 (average, horizontal, vertical, diagonal) to a 4c×1×1 output with channels a, h, v, d.]

Figure 3: Haar wavelet downsampling reduces spatial dimensions & separates lower frequencies (a) from high (h,v,d).

Haar wavelet downsampling. All prior INN architectures use checkerboard patterns for reshaping to lower spatial resolutions. We find it helpful to instead perform downsampling with Haar wavelets [13], which essentially decompose images into an average pooling channel as well as vertical, horizontal and diagonal derivatives, see Fig. 3. The three derivative channels contain high resolution information which we can split off early, transforming only the remaining information further in later stages of the cINN. This also contributes to mixing the variables between layers, complementing the soft permutations.
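The Haar downsampling step can be written as an orthogonal 4×4 transform applied to every 2×2 patch. The sketch below is an illustration in NumPy; the sign convention of the three derivative masks and the output channel ordering (a, h, v, d per input channel) are assumptions, and the function names are hypothetical.

```python
import numpy as np

# Rows: average, horizontal, vertical, diagonal. Columns: the patch entries
# [top-left, top-right, bottom-left, bottom-right]. Sign convention is assumed.
HAAR = 0.5 * np.array([[1,  1,  1,  1],
                       [1, -1,  1, -1],
                       [1,  1, -1, -1],
                       [1, -1, -1,  1]], dtype=float)

def haar_down(x):
    """(c, h, w) -> (4c, h/2, w/2); h and w must be even."""
    c, h, w = x.shape
    patches = (x.reshape(c, h // 2, 2, w // 2, 2)
                .transpose(0, 1, 3, 2, 4)
                .reshape(c, h // 2, w // 2, 4))
    y = patches @ HAAR.T                       # last axis: a, h, v, d
    return y.transpose(0, 3, 1, 2).reshape(4 * c, h // 2, w // 2)

def haar_up(y):
    """Exact inverse of haar_down, using the orthogonality of HAAR."""
    c4, h2, w2 = y.shape
    c = c4 // 4
    coeffs = y.reshape(c, 4, h2, w2).transpose(0, 2, 3, 1)
    patches = coeffs @ HAAR                    # (p @ HAAR.T) @ HAAR = p
    return (patches.reshape(c, h2, w2, 2, 2)
                   .transpose(0, 1, 3, 2, 4)
                   .reshape(c, 2 * h2, 2 * w2))

block = np.array([[[1.0, 2.0], [3.0, 4.0]]])   # one channel, one 2x2 patch
coeffs = haar_down(block).ravel()              # a=5, h=-1, v=-2, d=0

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))
y = haar_down(x)
x_rec = haar_up(y)
```

Because the transform is orthogonal, it preserves the squared norm of the input and adds nothing to the log-Jacobian, while moving the spatial resolution into the channel dimension.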

Figure 4: Axes in our MNIST model’s latent space, which linearly encode the style attributes width, thickness and slant.

4 Experiments

We present results and explore the latent space of our models for two conditional image generation tasks: MNIST digit generation and image colorization.

4.1 Class-conditional generation for MNIST

Figure 5: cINN model for conditional MNIST generation.

As a first experiment, we perform simple class-conditional generation of MNIST digits. We construct a cINN of 24 coupling blocks using fully connected subnetworks $s$ and $t$, which receive the conditioning directly as a one-hot vector (Fig. 5). No conditioning network $h$ is used. For data augmentation we only add a small amount of noise to the images ($\sigma = 0.02$), as described in Sec. 3.4.

Samples generated by the model are shown in Fig. 6. We find that the cINN learns latent representations that are shared across conditions 𝐜. Keeping the latent vector 𝐳 fixed while varying 𝐜 produces different digits in the same style. This property, in conjunction with our network’s invertibility, can directly be used for style transfer, as demonstrated in Fig. 7. This outcome is not obvious – the trained cINN could also decompose into 10 essentially separate subnetworks, one for each condition. In this case, the latent space of each class would be structured differently, and inter-class transfer of latent vectors would be meaningless. The structure of the latent space is further illustrated in Fig. 4, where we identify three latent axes with interpretable meanings. Note that while the latent space is learned without supervision, we found the axes in a semi-automatic fashion: We perform PCA on the latent vectors of the test set, without the noise augmentation, and manually identify meaningful directions in the subspace of the first four principal components.
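The PCA step used to find candidate style axes can be sketched as follows. This is only an illustration: the latent codes here are synthetic Gaussian vectors with one planted high-variance direction, standing in for codes $\mathbf{z} = f(\mathbf{x}; \mathbf{c}, \theta)$ of real test images, and `latent_pca` is a hypothetical helper. The manual selection of meaningful directions described above is not automated here.

```python
import numpy as np

def latent_pca(z, k=4):
    """Top-k principal directions (rows) and their variances for latent codes z of shape (n, d)."""
    z_centered = z - z.mean(axis=0)
    _, s, vt = np.linalg.svd(z_centered, full_matrices=False)
    variances = s ** 2 / (z.shape[0] - 1)
    return vt[:k], variances[:k]

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 20))     # stand-in for latent codes of a test set
z[:, 0] *= 5.0                     # plant one dominant direction along axis 0

components, variances = latent_pca(z, k=4)

# Moving a latent code along a principal direction; decoding the result with
# the inverse network g(z; c) would then vary the corresponding attribute.
z_shifted = z[0] + 2.0 * components[0]
```

In this synthetic setup the first principal direction aligns with the planted axis. With a trained model, one would decode several shifted codes per direction and inspect the resulting images by eye.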

[Figure row labels: Tidy; Slanted, narrow; Slanted left, wide; Messy; Faint; Bold.]
Figure 6: MNIST samples from our cINN conditioned on digit labels. All ten digits within one row (0, ..., 9) were generated using the same latent code $\mathbf{z}$, but changing condition $\mathbf{c}$. We see that each $\mathbf{z}$ encodes a single style consistently across digits, while varying $\mathbf{z}$ between rows leads to strong differences in writing style.
Figure 7: To perform style transfer, we determine the latent code $\mathbf{z} = f(\mathbf{x}; \mathbf{c}, \theta)$ of a validation image (left), then use the inverse network $g = f^{-1}$ with different conditions $\hat{\mathbf{c}}$ to generate the other digits in the same style, $\hat{\mathbf{x}} = g(\mathbf{z}; \hat{\mathbf{c}}, \theta)$.
conv; \node[block4] (cc45) at ([xshift=36mm, yshift=-0.9mm]cc3out) CC
conv; \node[block4] (cc46) at ([xshift=42mm, yshift=-1.5mm]cc3out) CC
conv; \draw[doublearrow] (cc3out) to (in -| cc41.west); \coordinate(cc4) at (0.5*(cc41)+0.5*(cc46)); \coordinate(cc4out) at (in -| cc46.east); \node[anchor=north west, xshift=1mm, yshift=-1.5mm, scale=0.5] at (cc41.south west) 32×8×8;

\node

[block_fc] (cc51) at ([xshift=7mm]cc4out) ; \node[block_fc] (cc52) at ([xshift=11mm]cc4out) ; \node[block_fc] (cc53) at ([xshift=15mm]cc4out) ; \node[block_fc] (cc54) at ([xshift=19mm]cc4out) ; \node[block_fc] (cc55) at ([xshift=23mm]cc4out) ; \node[block_fc] (cc56) at ([xshift=27mm]cc4out) ; \node[block_fc] (cc57) at ([xshift=31mm]cc4out) ; \node[block_fc] (cc58) at ([xshift=35mm]cc4out) ; \draw[doublearrow] (cc4out) to (in -| cc51.west); \coordinate(cc5) at (0.5*(cc51)+0.5*(cc58)); \coordinate(cc5out) at (in -| cc58.east); \node[block4, minimum width = 4em, opacity = 0.67, text opacity = 1, text = black!25] at (cc5) CC
fully connected; \node[anchor=south east, xshift=-1mm, yshift=-1mm, scale=0.5, rotate=90] at (cc51.north west) 512;

\node

[anchor = west, draw = vll-dark, line width = 1pt] (out) at ([xshift=6mm]cc5out) ; \draw[doublearrow] (cc5out) to (out); \node[anchor=south east, xshift=0mm, yshift=1mm, scale=0.5, align=right] at (out.north east) 8192
(2×64×64); \node[anchor=west, xshift=2mm, yshift=-0.5mm, scale=1.5] at (out.east) 𝐳;

\coordinate

(z1) at ([xshift=2mm, yshift=-11mm]out.south); \draw[doublearrow] (out.south -| z1) – (z1) – (z1 -| cc26) – (cc26.south); \node[anchor=south west, xshift=1mm, yshift=0.5mm, scale=0.5] at (cc26 |- z1) 16×16×16; \coordinate(z2) at ([xshift=0mm, yshift=-10mm]out.south); \draw[doublearrow] (out.south -| z2) – (z2) – (z2 -| cc36) – (cc36.south); \node[anchor=south west, xshift=1mm, yshift=0.5mm, scale=0.5] at (cc36 |- z2) 32×8×8; \coordinate(z3) at ([xshift=-2mm, yshift=-9mm]out.south); \draw[doublearrow] (out.south -| z3) – (z3) – (z3 -| cc46) – (cc46.south); \node[anchor=south west, xshift=1mm, yshift=0.5mm, scale=0.5] at (cc46 |- z3) 96×4×4;

\coordinate

(mid) at (0.5*(in)+0.5*(out.west)); \node[nn, scale=1.5] (c) at ([shift = (mid)] 90:3.05) h; \node[anchor = east, draw = vll-dark, line width = 1pt] (L) at ([xshift=-6mm]c.west) \adjincludegraphics[height=11mm, trim=0 0.50pt 0 .250pt, clip]figures/Mandrill-yuv.jpg; \draw[dottedarrow, shorten > = 0pt] (L) – (c); \node[anchor=south west, xshift=1mm, yshift=1mm, scale=0.5] at (L.north west) 256×256; \node[anchor=east, xshift=-2mm, yshift=0.5mm, scale=1.5] at (L.west) 𝐋;

\coordinate

(h1) at ([xshift=-3mm, yshift=-4mm]c.south); \draw[dottedarrow] (c.south -| h1) – (h1) – (h1 -| cc11) – (cc11.north); \node[mininn] at ([yshift=-3mm]h1 -| cc11.north) ; \node[anchor=south, scale=0.8] at ([yshift=1.5mm]h1 -| cc11.north) h1; \draw[dottedarrow] (h1 -| cc12) – (cc12.north); \node[mininn] at ([yshift=-3mm]h1 -| cc12.north) ; \node[anchor=south, scale=0.8] at ([yshift=1.5mm]h1 -| cc12.north) h2; \draw[dottedarrow] (h1 -| cc13) – (cc13.north); \node[mininn] at ([yshift=-3mm]h1 -| cc13.north) ; \node[anchor=south, scale=0.8] at ([yshift=1.5mm]h1 -| cc13.north) h3; \draw[dottedarrow] (h1 -| cc14) – (cc14.north); \node[mininn] at ([yshift=-3mm]h1 -| cc14.north) ; \node[anchor=south, scale=0.8] at ([yshift=1.5mm]h1 -| cc14.north) ; \coordinate(h2) at ([xshift=-1.5mm, yshift=-5mm]c.south); \draw[dottedarrow] (c.south -| h2) – (h2) – (h2 -| cc21) – (cc21.north); \node[mininn] at ([yshift=-3mm]h2 -| cc21.north) ; \draw[dottedarrow] (h2 -| cc22) – (cc22.north); \node[mininn] at ([yshift=-3mm]h2 -| cc22.north) ; \draw[dottedarrow] (h2 -| cc23) – (cc23.north); \node[mininn] at ([yshift=-3mm]h2 -| cc23.north) ; \draw[dottedarrow] (h2 -| cc24) – (cc24.north); \node[mininn] at ([yshift=-3mm]h2 -| cc24.north) ; \draw[dottedarrow] (h2 -| cc25) – (cc25.north); \node[mininn] at ([yshift=-3mm]h2 -| cc25.north) ; \draw[dottedarrow] (h2 -| cc26) – (cc26.north); \node[mininn] at ([yshift=-3mm]h2 -| cc26.north) ; \coordinate(h3) at ([xshift=0mm, yshift=-6mm]c.south); \draw[dottedarrow] (c.south -| h3) – (h3) – (h3 -| cc31) – (cc31.north); \node[mininn] at ([yshift=-3mm]h3 -| cc31.north) ; \draw[dottedarrow] (h3 -| cc32) – (cc32.north); \node[mininn] at ([yshift=-3mm]h3 -| cc32.north) ; \draw[dottedarrow] (h3 -| cc33) – (cc33.north); \node[mininn] at ([yshift=-3mm]h3 -| cc33.north) ; \draw[dottedarrow] (h3 -| cc34) – (cc34.north); \node[mininn] at ([yshift=-3mm]h3 -| cc34.north) ; \draw[dottedarrow] (h3 -| cc35) – (cc35.north); \node[mininn] at ([yshift=-3mm]h3 -| cc35.north) ; 
\draw[dottedarrow] (h3) – (h3 -| cc36) – (cc36.north); \node[mininn] at ([yshift=-3mm]h3 -| cc36.north) ; \coordinate(h4) at ([xshift=1.5mm, yshift=-5mm]c.south); \draw[dottedarrow] (h4 -| cc41) – (cc41.north); \node[mininn] at ([yshift=-3mm]h4 -| cc41.north) ; \draw[dottedarrow] (h4 -| cc42) – (cc42.north); \node[mininn] at ([yshift=-3mm]h4 -| cc42.north) ; \draw[dottedarrow] (h4 -| cc43) – (cc43.north); \node[mininn] at ([yshift=-3mm]h4 -| cc43.north) ; \draw[dottedarrow] (h4 -| cc44) – (cc44.north); \node[mininn] at ([yshift=-3mm]h4 -| cc44.north) ; \draw[dottedarrow] (h4 -| cc45) – (cc45.north); \node[mininn] at ([yshift=-3mm]h4 -| cc45.north) ; \draw[dottedarrow] (c.south -| h4) – (h4) – (h4 -| cc46) – (cc46.north); \node[mininn] at ([yshift=-3mm]h4 -| cc46.north) ; \coordinate(h5) at ([xshift=3mm, yshift=-4mm]c.south); \coordinate(cc4cc5mid) at (0.5*(cc46.north)+0.5*(cc51.north)); \node[mininn] at (h5 -| cc4cc5mid) ; \draw[dottedarrow] (h5 -| cc51) – (cc51.north); \draw[dottedarrow] (h5 -| cc52) – (cc52.north); \draw[dottedarrow] (h5 -| cc53) – (cc53.north); \draw[dottedarrow] (h5 -| cc54) – (cc54.north); \draw[dottedarrow] (h5 -| cc55) – (cc55.north); \draw[dottedarrow] (h5 -| cc56) – (cc56.north); \draw[dottedarrow] (h5 -| cc57) – (cc57.north); \draw[dottedarrow] (c.south -| h5) – (h5) – (h5 -| cc58) – (cc58.north);

Figure 8: cINN model for diverse colorization. The conditioning network h consists of a truncated VGG [36] pretrained to predict colors on ImageNet, with separate convolutional heads $h_1, h_2, h_3$, tailoring the extracted features to each individual conditional coupling block (CC). After each group of coupling blocks, we apply Haar wavelet downsampling (Fig. 3) to reduce the spatial dimensions and, where indicated by arrows, split off parts of the latent code 𝐳 early.

4.2 Diverse ImageNet colorization

For a more challenging task, we turn to colorization of natural images. The common approach for this task is to represent images in Lab color space and generate color channels 𝐚,𝐛 by a model conditioned on the luminance channel 𝐋.

We train on the ImageNet dataset [34], again adding low noise to the 𝐚,𝐛 channels (σ=0.05). As the color channels do not require as much resolution as the luminance channel, we condition on 256×256 pixel grayscale images, but generate 64×64 pixel color information. This is in accordance with the majority of existing colorization methods.

As with most generative INN architectures, we do not keep the resolution and channels fixed throughout the network, for the sake of computational cost. Instead, we use 4 resolution stages, as illustrated in Fig. 8. At each stage, the data is reshaped to a lower resolution with more channels, after which a fraction of the channels is split off as one part of the latent code. As the high resolution stages have a smaller receptive field and less expressive power, the corresponding parts of the latent vector encode local structures and noise. More global information is passed on to the lower resolution sections of the cINN.
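The shape bookkeeping of the four resolution stages can be sketched as follows. This is a minimal NumPy illustration, assuming the tensor sizes annotated in Fig. 8 and an orthonormal 2×2 Haar transform; the coupling blocks themselves are omitted, and the helper names `haar_downsample` and `split_channels`, as well as the choice of which channels are split off, are hypothetical.

```python
import numpy as np

def haar_downsample(x):
    """Orthonormal 2x2 Haar transform: (C, H, W) -> (4C, H/2, W/2)."""
    a, b = x[:, 0::2, 0::2], x[:, 0::2, 1::2]   # top-left, top-right pixels
    c, d = x[:, 1::2, 0::2], x[:, 1::2, 1::2]   # bottom-left, bottom-right pixels
    return np.concatenate([
        (a + b + c + d) / 2,   # local average (low frequencies)
        (a - b + c - d) / 2,   # difference along the width
        (a + b - c - d) / 2,   # difference along the height
        (a - b - c + d) / 2,   # diagonal difference
    ], axis=0)

def split_channels(x, n_out):
    """Split off the first n_out channels as latent code (assumed channel choice)."""
    return x[:n_out], x[n_out:]

rng = np.random.default_rng(0)
ab = rng.normal(size=(2, 64, 64))   # stand-in for the a, b color channels

h = ab                              # stage 1 operates on 2 x 64 x 64
h = haar_downsample(h)              # stage 2 operates on 8 x 32 x 32
h = haar_downsample(h)              # 32 x 16 x 16
z1, h = split_channels(h, 16)       # split 16 x 16 x 16; stage 3 keeps 16 x 16 x 16
h = haar_downsample(h)              # 64 x 8 x 8
z2, h = split_channels(h, 32)       # split 32 x 8 x 8; stage 4 keeps 32 x 8 x 8
h = haar_downsample(h)              # 128 x 4 x 4
z3, h = split_channels(h, 96)       # split 96 x 4 x 4; 32 x 4 x 4 = 512 values remain
z4 = h.reshape(-1)                  # input to the fully connected blocks

z = np.concatenate([z1.ravel(), z2.ravel(), z3.ravel(), z4])
print(z.shape)                      # total latent dimension
```

Because each Haar step is an orthonormal reshaping and the splits only partition channels, the concatenated latent code has exactly as many entries as the input (2×64×64), which is what invertibility requires.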

For the conditioning network h, we start with the same VGG-like architecture and pretraining as [41], i.e. we pre-train the network to classify each pixel of the gray image into color bins. By cutting off the network before the second-to-last convolution, we extract 256 feature maps of size 64×64 from the grayscale image 𝐋. We then add independent heads on top of this for each conditional coupling block in the cINN, indicated by small hexagons in Fig. 8. Thus each coupling block k receives its own specialized conditioning $\tilde{\mathbf{c}}_i^{(k)} = h_k(h(\mathbf{c}_i))$. Each head consists of up to five strided convolutions, depending on its required output resolution, and a batch normalization layer. The ablation study in Fig. 16 confirms that the conditioning network is necessary to capture semantic information.

We initially train the cINN and the heads $h_k$ for 30 000 iterations, keeping the parameters of the conditioning network h fixed. After this, we train both jointly until convergence, which takes 3 days on 3 Nvidia GTX 1080 GPUs. The Adam optimizer is essential for fast convergence, and we lower the learning rate when the maximum likelihood loss levels off.

At inference time, we use joint bilateral upsampling [25] to match the resolution of the generated color channels 𝐚^,𝐛^ to that of the luminance channel 𝐋. This produces visually slightly more pleasing edges than bicubic upsampling, but has little to no impact on the results. It was not used in the quantitative results table, to ensure an unbiased comparison.

The cINN compares favourably to existing methods, as shown in Table 1, and has the best diversity and best-of-8 accuracy of the compared methods. The cGAN apparently ignores the latent code and relies only on the condition. As a result, we do not measure any significant diversity for it, in line with results from [17].

In terms of FID score, the cGAN performs best, although its results do not appear more realistic to the human eye, cf. Fig. 13. This may be due to the fact that FID is sensitive to outliers, which are unavoidable for a truly diverse method (see Fig. 12), or because the discriminator loss implicitly optimizes for the similarity of deep CNN activations. The VGG classification accuracy of the generative methods is lower than that of the CNN, because occasional outliers may lead to misclassification. Latent space interpolations and color transfer are shown in Figs. 14 and 15.

4.3 Diverse bedrooms colorization

To provide a simpler model for more in-depth experiments and ablations, we additionally train a cINN for colorization on the LSUN bedrooms dataset [40]. We use a smaller model than for ImageNet, and train the conditioning network jointly from scratch, without pretraining. Both the conditioning input and the generated color channels have a resolution of 64×64 pixels. The entire model trains in under 4 hours on a single GTX 1080Ti GPU.

To our knowledge, the only diversity-enforcing cGAN architecture previously used for colorization is the colorGAN [4], which is also trained exclusively on the bedrooms dataset. Training the colorGAN for comparison, we find it requires over 24 hours to converge stably, after multiple restarts. The results are generally worse than those of the cINN, as shown in the table and samples of Fig. 9. While the resulting pixel-wise color variance is slightly higher for the colorGAN, it is not clear whether this captures the true variance, or whether it is due to unrealistically colorful outputs, such as in the second row in Fig. 9.

Metric           cINN     colorGAN
MSE best-of-8    6.14     6.43
Variance         33.69    39.46
FID              26.48    28.31
[Image panels: sample colorizations from the cINN and the colorGAN]
Figure 9: Quantitative and qualitative comparison between smaller cINN and colorGAN on LSUN bedrooms. The metrics used are explained in Table 1.

4.4 Ablation of training improvements

To demonstrate the gains in stability and training speed from the improvements in Sec. 3.4, we perform ablations (see Fig. 10). The colorization ablations were performed on the LSUN bedrooms task, because it trains faster.

We find that for stable training at Adam learning rates of $10^{-3}$, the clamping and Haar wavelet downsampling are strictly necessary. Without these, the network has to be trained with much lower learning rates and more careful and specialized initialization, as used e.g. in [21]. Beyond this, the noise augmentation and permutations lead to the largest improvement in the final result. The effect of the noise is more pronounced for MNIST, as large parts of the image are completely black otherwise. For natural images, dequantization of the data is likely to be the main advantage of the added noise. The initialization improves the final result only by a small margin, but leads to noticeably faster convergence.
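To illustrate why clamping helps at larger learning rates, the following sketch shows one possible soft clamp for the log-scale output of an affine coupling block. The arctan form and the value `alpha = 2.0` are assumptions made for illustration; the clamping actually used is defined in Sec. 3.4 and is not reproduced here.

```python
import numpy as np

def soft_clamp(s, alpha=2.0):
    """Smoothly restrict s to the open interval (-alpha, alpha).

    Assumed form: (2 * alpha / pi) * arctan(s / alpha).
    """
    return (2.0 * alpha / np.pi) * np.arctan(s / alpha)

# Raw log-scales as a subnetwork might emit them, including extreme values.
s = np.array([-1e3, -1.0, 0.0, 1.0, 1e3])

s_clamped = soft_clamp(s)
scale = np.exp(s_clamped)   # multiplicative factor applied in the coupling block

print(s_clamped)
print(scale)
```

With such a clamp, the multiplicative factor stays between exp(-alpha) and exp(alpha) no matter how large the raw subnetwork output becomes, so a single bad update cannot produce exploding activations or an overflowing log-determinant.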

Figure 10: Training curves for each task, ablating the different improvements.

5 Conclusion and Outlook

We have proposed a conditional invertible neural network architecture which enables guided generation of diverse images with high realism. For image colorization, we believe that even better results can be achieved by employing the latest tricks from large-scale GAN frameworks. In particular, the non-invertible nature of the conditioning network makes cINNs a suitable method for other computer vision tasks such as diverse semantic segmentation.

6 Acknowledgments

LA received funding from the Federal Ministry of Education and Research of Germany, project ‘High Performance Deep Learning Framework’ (No 01IH17002). JK, CR and UK received financial support from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 647769). Computations were performed on an HPC Cluster at the Center for Information Services and High Performance Computing (ZIH) at TU Dresden.

                  cINN (ours)   VAE-MDN [6]   cGAN [17]     CNN [16]      BW            Ground truth
MSE best of 8     3.53±0.04     4.06±0.04     9.75±0.06     6.77±0.05     -             -
Variance          35.2±0.3      21.1±0.2      0.0±0.0       -             -             -
FID [14]          25.13±0.30    25.98±0.28    24.41±0.27    24.95±0.27    30.91±0.27    14.69±0.18
VGG top 5 acc.    85.00±0.48    85.00±0.48    84.62±0.53    86.86±0.41    86.02±0.43    91.66±0.43
Table 1: Comparison of conditional generative models for diverse colorization. We additionally compare to a state-of-the-art regression model (‘CNN’, no diversity), and the grayscale images alone (‘BW’). For each of 5k ImageNet validation images, we compare the best pixel-wise MSE of 8 generated colorization samples, the pixel-wise variance between the 8 samples as an approximation of the diversity, the Fréchet Inception Distance [14] as a measure of realism, and the top 5 accuracy of ImageNet classification performed on the colorized images, to check if semantic content is preserved by the colorization.
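The two sample-based metrics in Table 1 can be sketched for a single image as follows. This is an illustration under assumptions: the function names are hypothetical, the toy data is synthetic, and the value scale and the averaging over the 5k validation images used for the reported numbers are not reproduced.

```python
import numpy as np

def best_of_n_mse(samples, target):
    """Lowest pixel-wise MSE among N samples.

    samples: array of shape (N, C, H, W); target: array of shape (C, H, W).
    """
    per_sample = ((samples - target[None]) ** 2).mean(axis=(1, 2, 3))
    return per_sample.min()

def sample_variance(samples):
    """Pixel-wise variance across the N samples, averaged over channels and pixels."""
    return samples.var(axis=0).mean()

# Toy data: 8 constant "colorizations" with values 0..7 and an all-zero target.
target = np.zeros((2, 4, 4))
samples = np.stack([np.full((2, 4, 4), float(v)) for v in range(8)])

mse = best_of_n_mse(samples, target)   # the sample equal to the target gives 0.0
var = sample_variance(samples)         # variance of the values 0..7 is 5.25

# A model that ignores its latent code produces identical samples: zero variance.
collapsed = np.repeat(samples[:1], 8, axis=0)
var_collapsed = sample_variance(collapsed)
```

The last case mirrors the cGAN column of the table: when all 8 samples coincide, the measured variance is zero regardless of how good the single output is.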
Figure 11: Diverse colorizations produced by our cINN.

Figure 12: Failure cases of our method. Top: Sampling outliers. Bottom: cINN did not recognize an object’s semantic class or the connectivity of occluded regions.
Figure 13: Alternative methods have lower diversity and lower quality, suffering from inconsistencies within objects, or color blurriness and bleeding (compare Fig. 11, bottom).
[Figure 14 panels: grayscale input; 𝐳 = 0.0·𝐳*; 𝐳 = 0.7·𝐳*; 𝐳 = 0.9·𝐳*; 𝐳 = 1.0·𝐳*; 𝐳 = 1.25·𝐳*]
Figure 14: Effects of linearly scaling the latent code 𝐳 while keeping the condition fixed. Vector 𝐳* is “typical” in the sense that $\|\mathbf{z}^*\|^2 = \mathbb{E}[\|\mathbf{z}\|^2]$, and results in natural colors. As we move closer to the center of the latent space ($\|\mathbf{z}\| < \|\mathbf{z}^*\|$), regions with ambiguous colors become desaturated, while less ambiguous regions (e.g. sky, vegetation) revert to their prototypical colors. In the opposite direction ($\|\mathbf{z}\| > \|\mathbf{z}^*\|$), colors are enhanced to the point of oversaturation.
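The construction of a typical latent vector and its scaled versions can be sketched as below. This assumes a standard normal latent distribution of dimension 8192 (the size shown in Fig. 8), for which the expected squared norm equals the dimension; the variable names are illustrative only, and decoding the scaled codes would require the trained cINN.

```python
import numpy as np

d = 8192                          # latent dimension, taken from Fig. 8
rng = np.random.default_rng(0)
z = rng.standard_normal(d)        # assumed prior: z ~ N(0, I_d)

# For z ~ N(0, I_d), E[||z||^2] = d, so a "typical" code has norm sqrt(d).
z_star = z * np.sqrt(d) / np.linalg.norm(z)

# Scaling factors as used in the panels of Fig. 14.
factors = [0.0, 0.7, 0.9, 1.0, 1.25]
scaled = {t: t * z_star for t in factors}

for t, z_t in scaled.items():
    print(t, np.linalg.norm(z_t))
```

Each scaled code keeps the direction of 𝐳* and only changes its distance from the center of the latent space, which is the quantity the figure varies.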
Figure 15: For color transfer, we first compute the latent vectors 𝐳 for different color images (𝐋,𝐚,𝐛) (top row). We then send the same 𝐳 vectors through the inverse network with a new grayscale condition 𝐋* (far left) to produce transferred colorizations 𝐚*,𝐛* (bottom row). Differences between reference and output color (e.g. pink rose) can arise from mismatches between the reference colors 𝐚,𝐛 and the intensity prescribed by the new condition 𝐋*.
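The color transfer procedure (encode with one condition, decode with another) can be illustrated with a toy model. The sketch below uses a single conditional affine coupling layer with random, untrained weights in place of the trained cINN; all sizes, weights and function names are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 4, 3   # toy data and condition sizes (hypothetical)
W_s = rng.normal(scale=0.1, size=(D // 2, D // 2 + C))
W_t = rng.normal(scale=0.1, size=(D // 2, D // 2 + C))

def subnet(u, c):
    """Toy stand-in for a learned subnetwork: log-scale and shift from (u, c)."""
    inp = np.concatenate([u, c])
    return np.tanh(W_s @ inp), W_t @ inp

def forward(x, c):
    """Map data x to latent z under condition c."""
    u1, u2 = x[:D // 2], x[D // 2:]
    s, t = subnet(u1, c)
    return np.concatenate([u1, u2 * np.exp(s) + t])

def inverse(z, c):
    """Map latent z back to data under condition c."""
    v1, v2 = z[:D // 2], z[D // 2:]
    s, t = subnet(v1, c)
    return np.concatenate([v1, (v2 - t) * np.exp(-s)])

x_ref = rng.normal(size=D)   # stand-in for reference colors (a, b)
c_ref = rng.normal(size=C)   # stand-in for the reference grayscale image L
c_new = rng.normal(size=C)   # stand-in for the new grayscale condition L*

z = forward(x_ref, c_ref)        # encode the reference colors
x_back = inverse(z, c_ref)       # same condition: recovers x_ref
x_transfer = inverse(z, c_new)   # new condition: transferred output
```

With a single coupling layer only the second half of the vector depends on the condition; a deep network with permutations between blocks, as in the cINN, lets the condition influence every output dimension.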
Figure 16: In an ablation study, we train a cINN using the grayscale image directly as conditional input, without a conditioning network h. The resulting colorizations largely ignore semantic content which leads to exaggerated diversity. More ablations are found in the appendix.

References

  • [1] L. Ardizzone, J. Kruse, C. Rother, and U. Köthe. Analyzing inverse problems with invertible neural networks. In Intl. Conf. on Learning Representations, 2019.
  • [2] J. Behrmann, D. Duvenaud, and J.-H. Jacobsen. Invertible residual networks. arXiv:1811.00995, 2018.
  • [3] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In Intl. Conf. on Learning Representations, 2019.
  • [4] Y. Cao, Z. Zhou, W. Zhang, and Y. Yu. Unsupervised diverse colorization via generative adversarial networks. In Joint Europ. Conf. on Machine Learning and Knowledge Discovery in Databases, pages 151–166. Springer, 2017.
  • [5] I. Danihelka, B. Lakshminarayanan, B. Uria, D. Wierstra, and P. Dayan. Comparison of maximum likelihood and GAN-based training of RealNVPs. arXiv:1705.05263, 2017.
  • [6] A. Deshpande, J. Lu, M.-C. Yeh, M. Jin Chong, and D. Forsyth. Learning diverse image colorization. In Conf. on Computer Vision and Pattern Recognition (CVPR), pages 6837–6845, 2017.
  • [7] L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear independent components estimation. arXiv:1410.8516, 2014.
  • [8] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. arXiv:1605.08803, 2016.
  • [9] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In Intl. Conf. on Learning Representations, 2017.
  • [10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proc. 13. Intl. Conf. Artificial Intelligence and Statistics, pages 249–256, 2010.
  • [11] A. Grover, M. Dhar, and S. Ermon. Flow-GAN: combining maximum likelihood and adversarial learning in generative models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [12] S. Guadarrama, R. Dahl, D. Bieber, M. Norouzi, J. Shlens, and K. Murphy. Pixcolor: Pixel recursive colorization. arXiv:1705.07208, 2017.
  • [13] A. Haar. Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen, 69(3):331–371, 1910.
  • [14] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
  • [15] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV’17, pages 1501–1510, 2017.
  • [16] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Let there be color! joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics (TOG), 35(4):110, 2016.
  • [17] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR’17, pages 1125–1134, 2017.
  • [18] J.-H. Jacobsen, A. W. Smeulders, and E. Oyallon. i-RevNet: deep invertible networks. In International Conference on Learning Representations, 2018.
  • [19] A. Jaiswal, W. AbdAlmageed, Y. Wu, and P. Natarajan. Bidirectional conditional generative adversarial networks. arXiv:1711.07461, 2017.
  • [20] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv:1710.10196, 2017.
  • [21] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. arXiv:1807.03039, 2018.
  • [22] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
  • [23] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv:1312.6114, 2013.
  • [24] M. Kocaoglu, C. Snyder, A. G. Dimakis, and S. Vishwanath. CausalGAN: Learning causal implicit generative models with adversarial training. arXiv:1709.02023, 2017.
  • [25] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele. Joint bilateral upsampling. In ACM Transactions on Graphics (ToG), volume 26, page 96. ACM, 2007.
  • [26] M. Kumar, M. Babaeizadeh, D. Erhan, C. Finn, S. Levine, L. Dinh, and D. Kingma. Videoflow: A flow-based generative model for video. arXiv:1903.01434, 2019.
  • [27] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. In Europ. Conf. on Computer Vision, pages 577–593. Springer, 2016.
  • [28] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Intl. Conf. on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.
  • [29] Z. Lin, A. Khetan, G. Fanti, and S. Oh. PacGAN: The power of two samples in generative adversarial networks. In Advances in Neural Information Processing Systems, pages 1498–1507, 2018.
  • [30] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv:1411.1784, 2014.
  • [31] T. Miyato and M. Koyama. cGANs with projection discriminator. In International Conference on Learning Representations, 2018.
  • [32] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu. Semantic image synthesis with spatially-adaptive normalization. arXiv:1903.07291, 2019.
  • [33] A. Royer, A. Kolesnikov, and C. H. Lampert. Probabilistic image colorization. In British Machine Vision Conference (BMVC), 2017.
  • [34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [35] R. T. Schirrmeister, P. Chrabaszcz, F. Hutter, and T. Ball. Training generative reversible networks. arXiv:1806.01610, 2018.
  • [36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • [37] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3483–3491. 2015.
  • [38] D. Ulyanov, A. Vedaldi, and V. Lempitsky. It takes (only) two: Adversarial generator-encoder networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [39] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8798–8807, 2018.
  • [40] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv:1506.03365, 2015.
  • [41] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In Europ. Conf. on Computer Vision, pages 649–666, 2016.
  • [42] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV’17, pages 2223–2232, 2017.
  • [43] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.