Multimodal Unsupervised Image-to-Image Translation

Abstract

Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any pairs of corresponding images. While this conditional distribution is inherently multimodal, existing approaches make an overly simplified assumption, modeling it as a deterministic one-to-one mapping. As a result, they fail to generate diverse outputs from a given source domain image. To address this limitation, we propose a Multimodal Unsupervised Image-to-image Translation (MUNIT) framework. We assume that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties. To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain. We analyze the proposed framework and establish several theoretical results. Extensive experiments with comparisons to state-of-the-art approaches further demonstrate the advantage of the proposed framework. Moreover, our framework allows users to control the style of translation outputs by providing an example style image. Code and pretrained models are available at https://github.com/nvlabs/MUNIT.
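
The translation step described above can be written compactly. The following is only a sketch; the notation is assumed here for illustration and does not appear in the abstract: $E_1^{c}$ denotes a content encoder for the source domain, $G_2$ a decoder for the target domain, and $q(s_2)$ a prior over the target style space (a unit Gaussian is assumed).

```latex
\begin{align}
c_1 &= E_1^{c}(x_1)
  && \text{content code of the source image } x_1 \\
s_2 &\sim q(s_2) = \mathcal{N}(0, I)
  && \text{random style code for the target domain (assumed prior)} \\
x_{1 \to 2} &= G_2(c_1, s_2)
  && \text{translated image}
\end{align}
```

Under the same assumed notation, drawing several samples of $s_2$ for a fixed $c_1$ gives multiple distinct translations of one input, while example-guided translation would replace the sampled $s_2$ with a style code $E_2^{s}(x_2)$ extracted from a user-provided target-domain image $x_2$.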
