Abstract
Object segmentation is a crucial problem that is usually solved by usingsupervised learning approaches over very large datasets composed of both imagesand corresponding object masks. Since the masks have to be provided at pixellevel, building such a dataset for any new domain can be very timeconsuming.We present ReDO, a new model able to extract objects from images without anyannotation in an unsupervised way. It relies on the idea that it should bepossible to change the textures or colors of the objects without changing theoverall distribution of the dataset. Following this assumption, our approach isbased on an adversarial architecture where the generator is guided by an inputsample: given an image, it extracts the object mask, then redraws a new objectat the same location. The generator is controlled by a discriminator thatensures that the distribution of generated images is aligned to the originalone. We experiment with this method on different datasets and demonstrate thegood quality of extracted masks.
Quick Read (beta)
Unsupervised Object Segmentation by Redrawing
Abstract
Object segmentation is a crucial problem that is usually solved by using supervised learning approaches over very large datasets composed of both images and corresponding object masks. Since the masks have to be provided at pixel level, building such a dataset for any new domain can be very timeconsuming. We present ReDO, a new model able to extract objects from images without any annotation in an unsupervised way. It relies on the idea that it should be possible to change the textures or colors of the objects without changing the overall distribution of the dataset. Following this assumption, our approach is based on an adversarial architecture where the generator is guided by an input sample: given an image, it extracts the object mask, then redraws a new object at the same location. The generator is controlled by a discriminator that ensures that the distribution of generated images is aligned to the original one. We experiment with this method on different datasets and demonstrate the good quality of extracted masks.
Unsupervised Object Segmentation by Redrawing
Mickaël Chen Sorbonne Université, CNRS, LIP6, F75005, Paris, France [email protected] Thierry Artières Aix Marseille Univ, Université de Toulon, CNRS, LIS, Marseille, France Ecole Centrale Marseille [email protected] Ludovic Denoyer Facebook Artificial Intelligence Research [email protected]
noticebox[b]33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\[email protected]
1 Introduction
Image segmentation aims at splitting a given image into a set of nonoverlapping regions corresponding to the main components in the image. It has been studied for a long time in an unsupervised setting using prior knowledge on the nature of the region one wants to detect using e.g. normalized cuts and graphbased methods. Recently the rise of deep neural networks and their spectacular performances on many difficult computer vision tasks have led to revisit the image segmentation problem using deep networks in a fully supervised setting [5, 20, 58], a problem referred as semantic image segmentation.
Although such modern methods allowed learning successful semantic segmentation systems, their training requires largescale labeled datasets with usually a need for pixellevel annotations. This feature limits the use of such techniques for many image segmentation tasks for which no such large scale supervision is available. To overcome this drawback, we follow here a very recent trend that aims at revisiting the unsupervised image segmentation problem with new tools and new ideas from the recent history and success of deep learning [55] and from the recent results of supervised semantic segmentation [5, 20, 58].
Building on the idea of scene composition [4, 14, 18, 56] and on the adversarial learning principle [17], we propose to address the unsupervised segmentation problem in a new way. We start by postulating an underlying generative process for images that relies on an assumption of independence between regions of an image we want to detect. This means that replacing one object in the image with another one, e.g. a generated one, should yield a realistic image. We use such a generative model as a backbone for designing an object segmentation model we call ReDO (ReDrawing of Objects), which outputs are then used to modify the input image by redrawing detected objects. Following ideas from adversarial learning, the supervision of the whole system is provided by a discriminator that is trained to distinguish between real images and fake images generated accordingly to the generative process. Despite being a simplified model for images, we find this generative process effective for learning a segmentation model.
The paper is organized as follows. We present related work in Section 2, then we describe our method in Section 3. We first define the underlying generative model that we consider in Section 3.2 and detail how we translate this hypothesis into a neural network architecture to learn a segmentation module in Section 3.3. Then we give implementation details in Section 4. Finally, we present experimental results on three datasets in Section 5 that explore the feasibility of unsupervised segmentation within our framework and compare its performance against a baseline supervised with few labeled examples.
2 Related Work
Image segmentation is a very active topic in deep learning that boasts impressive results when using largescale labeled datasets. Those approaches can effectively parse highresolution images depicting complex and diverse realworld scenes into informative semantics or instance maps. Stateoftheart methods use clever architectural choices or pipelines tailored to the challenges of the task [5, 20, 58].
However, most of those models use pixellevel supervision, which can be unavailable in some settings, or timeconsuming to acquire in any case. Some works tackle this problem by using fewer labeled images or weaker overall supervision. One common strategy is to use imagelevel annotations to train a classifier from which class saliency maps can be obtained. Those saliency maps can then be exploited with other means to produce segmentation maps. For instance, WILDCAT [13] uses a Conditional Random Field (CRF) for spatial prediction in order to postprocess class saliency maps for semantic segmentation. PRM [59], instead, finds pixels that provoke peaks in saliency maps and uses these as a reference to choose the best regions out of a large set of proposals previously obtained using MCG [2], an unsupervised region proposal algorithm. Both pipelines use a combination of a deep classifier and a method that take advantage of spatial and visual handcrafted image priors.
Cosegmentation, introduced by Rother et al. 2006 [46], addresses the related problem of segmenting objects that are shared by multiple images by looking for similar data patterns in all those images. Like the aforementioned models, in addition to prior image knowledge, deep saliency maps are often used to localize those objects [23]. Unsupervised cosegmentation [22], i.e. the task of covering objects of a specific category without additional data annotations, is a setup that resembles ours. However, unsupervised cosegmentation systems are built on the idea of exploiting features similarity and can’t easily be extended to a classagnostic system. As we aim to ultimately be able to segment very different objects, our approach instead relies on independence between the contents of different regions of an image which is a more general concept.
Fully unsupervised approaches have traditionally been more focused on designing handcrafted features or energy functions to define the desired property of objectness. Impressive results have been obtained when making full use of depth maps in addition to usual RGB images [44, 49] but it is much harder to specify good energy functions for purely RGB images. WNET [55] extracts latent representations via a deep autoencoder that can then be used by a more classic CRF algorithm. Kanezaki 2018 [28] further incorporate deep priors and train a neural network to directly minimize their chosen intraregion pixel distance. A different approach is proposed by Ji et al. 2019 [26] whose method finds clusters of pixels using a learned distance invariant to some known properties. Unlike ours, none of these approaches are learned entirely from data.
Our work instead follows a more recent trend by inferring scene decomposition directly from data. Stemming from DRAW [19], many of those approaches [4, 14] use an attention network to read a region of an image and a Variational Autoencoder (VAE) to partially reconstruct the image in an iterative process in order to flesh out a meaningful decomposition. LRGAN [56] is able to generate simple scenes recursively, building object after object, and Sbai et al. 2018 [48] decompose an image into singlecolored strokes for vector graphics. While iterative processes have the advantage of being able to handle an arbitrary number of objects, they are also more unstable and difficult to train. Most of those can either only be used in generation [56], or only handle very simple objects [4, 14, 18]. As a proof of concept, we decided to first ignore this additional difficulty by only handling a set number of objects but our model can naturally be extended with an iterative composition process. This choice is common among works that, like ours, focus on other aspects of image compositionality. Van Steenkiste et al. 2018 [52] advocates for a generative framework that accounts for relationship between objects. While they do produce masks as part of their generative process, they cannot segment a given image. Closer to our setup, the very recent IODINE [18] propose a VAE adapted for multiobjects representations. Their learned representations include a scene decomposition, but they need a costly iterative refinement process whose performance have only been demonstrated on simulated datasets and not real images. Like ours, some prior work have tried to find segmentation mask by recomposing new images. SEIGAN [42] and Cut & Paste [45] learns to separate object and background by moving the region corresponding to the object to another background and making sure the image is still realistic. These methods however, need to have access to background images without objects, which might not be easy to obtain.
Our work also ties to recent research in disentangled representation learning. Multiple techniques have been used to separate information in factored latent representations. One line of work focuses on understanding and exploiting the innate disentangling properties of Variational AutoEncoders. It was first observed by $\beta $VAE [21] that VAEs can be constrained to produce disentangled representations by imposing a stronger penalty on the KullbackLeibler divergence term on the VAE loss. FactorVAE [30] and $\beta $TCVAE [7] extract a total correlation term from the KL term of the VAE objective and specifically reweight it instead of the whole KL term. In a similar fashion, HFVAE [15] introduces a hierarchical decomposition of the KL term to impose a structure on the latent space. A similar property can be observed with GANbased models, as shown by InfoGAN [9] which forces a generator to map a code to interpretable features by maximizing the mutual information between the code and the output. Using adversarial training is also a good way to split and control information in latent embeddings. Fader Networks [32] uses adversarial training to remove specific class information from a vector. This technique is also used in adversarial domain adaptation [16, 36, 51] to align embeddings from different domains. Similar methods can be used to build factorial representations instead of simply removing information [6, 10, 11, 38]. Like our work, they use adversarial learning to match an implicitly predefined generative model but for purposes unrelated to segmentation.
3 Method
3.1 Overview
A segmentation process $\U0001d675$ splits a given image $\mathbf{I}\in {\mathbb{R}}^{W\times H\times C}$ into a set of nonoverlapping regions. $\U0001d675$ can be described as a function that assigns to each pixel coordinate of $\mathbf{I}$ one of $n$ regions. The problem is then to find a correct partition $\U0001d675$ for any given image $\mathbf{I}$. Lacking supervision, a common strategy is to define properties one wants the regions to have, and then to find a partition that produces regions with such properties. This can be done by defining an energy function and then finding an optimal split. The challenge is then to accurately describe and model the statistical properties of meaningful regions as a function one can optimize.
We address this problem differently. Instead of trying to define the right properties of regions at the level of each image, we make assumptions about the underlying generative process of images in which the different regions are explicitly modeled. Then, by using an adversarial approach, we learn the parameters of the different components of our model so that the overall distribution of the generated images matches the distribution of the dataset. We detail the generative process in the section 3.2, while the way we learn $\U0001d675$ is detailed in Section 3.3.
3.2 Generative Process
We consider that images are produced by a generative process that operates in three steps: first, it defines the different regions in the image i.e the organization of the scene (composition step). Then, given this segmentation, the process generates the pixels for each of the regions independently (drawing step). At last, the resulting regions are assembled into the final image (assembling step).
Let us consider a scene composed of $n1$ objects and one background we refer to as object $n$. Let us denote ${\mathbf{M}}^{k}\in {\{0,1\}}^{W\times H}$ the mask corresponding to object $k$ which associates one binary value to each pixels in the final image so that ${\mathbf{M}}_{x,y}^{k}=1$ iff the pixel of coordinate $(x,y)$ belongs to object $k$. Note that, since one pixel can only belong to one object, the masks have to satisfy $\sum _{k=1}^{n}{\mathbf{M}}_{x,y}^{k}=1$ and the background mask ${\mathbf{M}}^{n}$ can therefore easily be retrieved computed from the object masks as ${\mathbf{M}}^{n}=1\sum _{k=1}^{n1}{\mathbf{M}}^{k}$.
The pixel values of each object $k$ are denoted ${\mathbf{V}}^{k}\in {\mathbb{R}}^{W\times H\times C}$. Given that the image we generate is of size $W\times H\times C$, each object is associated with an image of the same size but only the pixels selected by the mask will be used to compose the output image. The final composition of the objects into an image is computed as follows:
$$\mathbf{I}\leftarrow \sum _{k=1}^{n}{\mathbf{M}}^{k}\odot {\mathbf{V}}^{k}.$$ 
To recap, the underlying generative process described previously can be summarized as follow: i) first, the masks ${\mathbf{M}}^{k}$ are chosen together based on a mask prior $p(\mathbf{M})$. ii) Then, for each object independently, the pixel values are chosen based on a distribution $p({\mathbf{V}}^{k}{\mathbf{M}}^{k},k)$. iii) Finally, the objects are assembled into a complete image.
This process makes an assumption of independence between the colors and textures of the different objects composing a scene. While this is a naive assumption, as colorimetric values such as exposition, brightness, or even the real colors of two objects, are often related, this simple model still serves as a good prior for our purposes.
3.3 From Generative Process to Object Segmentation
Now, instead of considering a purely generative process where the masks are generated following a prior $p(\mathbf{M})$, we consider the inductive process where the masks are extracted directly from any input image $\mathbf{I}$ through the function $\U0001d675$ which is the object segmentation function described previously. The role of $\U0001d675$ is thus to output a set of masks given any input $\mathbf{I}$. The new generative process acts as follows: i) it takes a random image in the dataset and computes the masks using $\U0001d675(\mathbf{I})\to {\mathbf{M}}_{1},\mathrm{\dots},{\mathbf{M}}_{n}$, and ii) it generates new pixel values for the regions in the image according to a distribution $p({\mathbf{V}}^{k}{\mathbf{M}}^{k},k)$. iii) It aggregates the objects as before.
In order for output images to match the distribution of the training dataset, all the components (i.e $\U0001d675$ and $p({\mathbf{V}}^{k}{\mathbf{M}}^{k},k)$) are learned adversarially following the GAN approach. Let us define $\U0001d673:{\mathbb{R}}^{W\times H\times C}\to \mathbb{R}$ a discriminator function able to classify images as fake or real. Let us denote ${\U0001d676}_{\U0001d675}(\mathbf{I},{\mathbf{z}}_{1},\mathrm{\dots},{\mathbf{z}}_{n})$ our generator function able to compose a new image given an input image $\mathbf{I}$, an object segmentation function $\U0001d675$, and a set of vectors ${\mathbf{z}}_{1},\mathrm{\dots},{\mathbf{z}}_{n}$ each sampled independently following a prior $p(\mathbf{z})$ for each object $k$, background included. Since the pixel values of the different regions are considered as independent given the segmentation, our generator can be decomposed in $n$ generators denoted ${\U0001d676}_{k}({\mathbf{M}}^{k},{\mathbf{z}}_{k})$, each one being in charge of deciding the pixel values for one specific region. The complete image generation process thus operates in three steps:
1)  ${\mathbf{M}}^{1},\mathrm{\dots},{\mathbf{M}}^{n}\leftarrow \U0001d675(\mathbf{I})$  (composition step)  
2)  ${\mathbf{V}}^{k}\leftarrow {\U0001d676}_{k}({\mathbf{M}}^{k},{\mathbf{z}}_{k})\text{for}k\in \{1,\mathrm{\dots},n\}$  (drawing step)  
3)  ${\U0001d676}_{\U0001d675}(\mathbf{I},{\mathbf{z}}_{1},\mathrm{\dots},{\mathbf{z}}_{n})={\displaystyle \sum _{k=1}^{n}}{\mathbf{M}}^{k}\odot {\mathbf{V}}^{k}$  (assembling step). 
Provided the functions $\U0001d675$ and ${\U0001d676}_{k}$ are differentiable, they can thus be learned by solving the following adversarial problem:
$$\underset{{\U0001d676}_{\U0001d675}}{\mathrm{min}}\underset{\U0001d673}{\mathrm{max}}\mathcal{L}={\mathbb{E}}_{\mathbf{I}\sim {p}_{data}}\left[\mathrm{log}\U0001d673(\mathbf{I})\right]+{\mathbb{E}}_{\mathbf{I}\sim {p}_{data},{\mathbf{z}}_{1},\mathrm{\dots}{\mathbf{z}}_{n}\sim p(\mathbf{z})}\left[\mathrm{log}(1\U0001d673({\U0001d676}_{\U0001d675}(\mathbf{I},{\mathbf{z}}_{1},\mathrm{\dots},{\mathbf{z}}_{n})))\right].$$ 
Therefore, in practice we have $\U0001d675$ output soft masks in $[0,1]$ instead of binary masks. Also, in line with recent GAN literature [3, 39, 50, 57], we choose to use the hinge version of the adversarial loss [35, 50] instead, and obtain the following formulation:
$$\begin{array}{cc}\hfill \underset{{\U0001d676}_{\U0001d675}}{\mathrm{max}}{\mathcal{L}}_{\U0001d676}=& {\mathbb{E}}_{\mathbf{I}\sim {p}_{data},{\mathbf{z}}_{1},\mathrm{\dots},{\mathbf{z}}_{n}\sim p(\mathbf{z})}\left[\U0001d673({\U0001d676}_{\U0001d675}(\mathbf{I},{\mathbf{z}}_{1},\mathrm{\dots},{\mathbf{z}}_{n}))\right]\hfill \\ \hfill \underset{\U0001d673}{\mathrm{max}}{\mathcal{L}}_{\U0001d673}=& {\mathbb{E}}_{\mathbf{I}\sim {p}_{data}}\left[\mathrm{min}(0,1+\U0001d673(\mathbf{I}))\right]\hfill \\ & +{\mathbb{E}}_{\mathbf{I}\sim {p}_{data},{\mathbf{z}}_{1},\mathrm{\dots},{\mathbf{z}}_{n}\sim p(\mathbf{z})}\left[\mathrm{min}(0,1\U0001d673({\U0001d676}_{\U0001d675}(\mathbf{I},{\mathbf{z}}_{1},\mathrm{\dots},{\mathbf{z}}_{n})))\right].\hfill \end{array}$$ 
Still, as it stands, the learning process of this model may fail for two reasons. First, it does not have to extract a meaningful segmentation in regards to the input $\mathbf{I}$. Indeed, since the values of all the output pixels will be generated, $\mathbf{I}$ can be ignored entirely to generate plausible pictures. For instance, the segmentation could be the same for all the inputs regardless of input $\mathbf{I}$. Second, it naturally converges to a trivial extractor $\U0001d675$ that puts the whole image into a single region, the other regions being empty. We thus have to add additional constraints to our model.
Constraining mask extraction by redrawing a single region.
The first constraint aims at forcing the model to extract meaningful region masks instead of ignoring the image. To this end, we take advantage of the assumption that the different objects are independently generated. We can, therefore, replace only one region at each iteration instead of regenerating all the regions. Since the generator now has to use original pixel values from the image in the reassembled image, it cannot make arbitrary splits. The generation process becomes as follows:
1)  ${\mathbf{M}}^{1},\mathrm{\dots},{\mathbf{M}}^{n}\leftarrow \U0001d675(\mathbf{I})$  (composition step)  
2)  ${\mathbf{V}}^{k}\leftarrow \mathbf{I}\text{for}k\in \{1,\mathrm{\dots},n\}\setminus \{\mathbf{i}\}$  
${\mathbf{V}}^{\mathbf{i}}\leftarrow {\U0001d676}_{\mathbf{i}}({\mathbf{M}}^{\mathbf{i}},{\mathbf{z}}_{\mathbf{i}})$  (drawing step)  
3)  ${\U0001d676}_{\U0001d675}(\mathbf{I},{\mathbf{z}}_{\mathbf{i}},\mathbf{i})={\displaystyle \sum _{k=1}^{n}}{\mathbf{M}}^{k}\odot {\mathbf{V}}^{k}$  (assembling step), 
where $\mathbf{i}$ designates the index of the only region to redraw and is sampled from $\mathcal{U}(n)$, the discrete uniform distribution on $\{1,\mathrm{\dots},n\}$. The new learning objectives are as follows:
$$\begin{array}{cc}\hfill \underset{{\U0001d676}_{\U0001d675}}{\mathrm{max}}{\mathcal{L}}_{\U0001d676}& ={\mathbb{E}}_{\mathbf{I}\sim {p}_{data},\mathbf{i}\sim \mathcal{U}(n),{\mathbf{z}}_{\mathbf{i}}\sim p(\mathbf{z})}\left[\U0001d673({\U0001d676}_{\U0001d675}(\mathbf{I},{\mathbf{z}}_{\mathbf{i}},\mathbf{i}))\right]\hfill \\ \hfill \underset{\U0001d673}{\mathrm{max}}{\mathcal{L}}_{\U0001d673}& ={\mathbb{E}}_{\mathbf{I}\sim {p}_{data}}[\mathrm{min}(0,1+\U0001d673(\mathbf{I}))]+{\mathbb{E}}_{\mathbf{I}\sim {p}_{data},\mathbf{i}\sim \mathcal{U}(n),{\mathbf{z}}_{\mathbf{i}}\sim p(\mathbf{z})}[\mathrm{min}(0,1\U0001d673({\U0001d676}_{\U0001d675}(\mathbf{I},{\mathbf{z}}_{\mathbf{i}},\mathbf{i})))].\hfill \end{array}$$ 
Conservation of Region Information.
The second constraint is that given a region $\mathbf{i}$ generated from a latent vector ${\mathbf{z}}_{\mathbf{i}}$, the final image ${\U0001d676}_{\U0001d675}(\mathbf{I},{\mathbf{z}}_{\mathbf{i}},\mathbf{i})$ must contain information about ${\mathbf{z}}_{\mathbf{i}}$. This constraint is designed to prevent the mask extractor $\U0001d675$ to produce empty regions. Indeed, if region $\mathbf{i}$ is empty, i.e. ${\mathbf{M}}_{x,y}^{\mathbf{i}}=0$ for all $x,y$, then ${\mathbf{z}}_{\mathbf{i}}$ cannot be retrieved from the final image. Equivalently, if ${\mathbf{z}}_{\mathbf{i}}$ can be retrieved, then region $\mathbf{i}$ is not empty. This information conservation constraint is implemented through an additional term in the loss function. Let us denote ${\delta}_{k}$ a function which objective is to infer the value of ${\mathbf{z}}_{k}$ given any image $\mathbf{I}$. One can learn such a function simultaneously to promote conservation of information by the generator. This strategy is similar to the mutual information maximization used in InfoGAN. [9].
The final complete process is illustrated in Figure 1 and correspond to the following learning objectives:
$$\begin{array}{cc}\hfill \underset{{\U0001d676}_{\U0001d675},\delta}{\mathrm{max}}{\mathcal{L}}_{\U0001d676}& ={\mathbb{E}}_{\mathbf{I}\sim {p}_{data},\mathbf{i}\sim \mathcal{U}(n),{z}_{\mathbf{i}}\sim p(z)}\left[\U0001d673({\U0001d676}_{\U0001d675}(\mathbf{I},{\mathbf{z}}_{\mathbf{i}},\mathbf{i})){\lambda}_{z}{{\delta}_{\mathbf{i}}({\U0001d676}_{\U0001d675}(\mathbf{I},{\mathbf{z}}_{\mathbf{i}},\mathbf{i})){\mathbf{z}}_{\mathbf{i}}}_{2}^{2}\right]\hfill \\ \hfill \underset{\U0001d673}{\mathrm{max}}{\mathcal{L}}_{\U0001d673}& ={\mathbb{E}}_{\mathbf{I}\sim {p}_{data}}[\mathrm{min}(0,1+\U0001d673(\mathbf{I})]+{\mathbb{E}}_{\mathbf{I}\sim {p}_{data},\mathbf{i}\sim \mathcal{U}(n),{\mathbf{z}}_{\mathbf{i}}\sim p(\mathbf{z})}[\mathrm{min}(0,1\U0001d673({\U0001d676}_{\U0001d675}(\mathbf{I},{\mathbf{z}}_{\mathbf{i}},\mathbf{i}))],\hfill \end{array}$$ 
where ${\lambda}_{z}$ is a fixed hyperparameter that controls the strength of the information conservation constraint. Note that the constraint is necessary for our model to find non trivial solutions, as otherwise, putting the whole image into a single region is both optimal and easy to discover for the neural networks. The final learning algorithm follows classical GAN schema [3, 17, 39, 57] by updating the generator and the discriminator alternatively with the update functions presented in Algorithm 1.
{algorithmic}[1] \ProcedureGeneratorUpdate \Statesample data $\mathbf{I}\sim {p}_{data}$, \Statesample region $\mathbf{i}\sim \mathrm{\U0001d684\U0001d697\U0001d692\U0001d68f\U0001d698\U0001d69b\U0001d696}(\{1,\mathrm{\dots},n\})$ \Statesample noise vector ${\mathbf{z}}_{\mathbf{i}}\sim p(\mathbf{z})$ \State${\mathbf{I}}_{gen}\leftarrow {\U0001d676}_{\mathbf{f}}(\mathbf{I},{\mathbf{z}}_{\mathbf{i}},\mathbf{i})$ \Commentgenerate image \State${\mathcal{L}}_{z}\leftarrow {{\delta}_{\mathbf{i}}({\mathbf{I}}_{gen}){\mathbf{z}}_{\mathbf{i}}}_{2}$ \Commentcompute information conservation loss \State${\mathcal{L}}_{G}\leftarrow \U0001d673({\mathbf{I}}_{gen})$ \Commentcompute adversarial loss \Stateupdate ${\U0001d676}_{\mathbf{f}}$ with ${\nabla}_{{\U0001d676}_{\mathbf{f}}}\left[{\mathcal{L}}_{G}+{\mathcal{L}}_{z}\right]$ \Stateupdate ${\delta}_{\mathbf{i}}$ with ${\nabla}_{{\delta}_{\mathbf{i}}}{\mathcal{L}}_{z}$ \EndProcedure\ProcedureDiscriminatorUpdate \Statesample datapoints ${\mathbf{I}}_{real},{\mathbf{I}}_{input}\sim {p}_{data}$ \Statesample region $\mathbf{i}\sim \mathrm{\U0001d684\U0001d697\U0001d692\U0001d68f\U0001d698\U0001d69b\U0001d696}(\{1,\mathrm{\dots},n\})$ \Statesample noise vector ${\mathbf{z}}_{\mathbf{i}}\sim p(\mathbf{z})$ \State${\mathbf{I}}_{gen}\leftarrow {\U0001d676}_{\mathbf{f}}({\mathbf{I}}_{input},{\mathbf{z}}_{\mathbf{i}},\mathbf{i})$ \Commentgenerate image \State${\mathcal{L}}_{D}\leftarrow \mathrm{min}(0,1+\U0001d673({\mathbf{I}}_{real}))+\mathrm{min}(0,1\U0001d673({\mathbf{I}}_{gen})$ \Commentcompute adversarial loss \Stateupdate $\U0001d673$ with ${\nabla}_{\U0001d673}{\mathcal{L}}_{D}$ \EndProcedure
4 Implementation
We now provide some information about the architecture of the different components (additional details are given in Supplementary materials). As usual with GANbased methods, the choice of a good architecture is crucial. We have chosen to build on the GAN and the image segmentation literature and to take inspiration from the neural network architectures they propose.
For the mask generator $\U0001d675$, we use an architecture inspired by PSPNet [58]. The proposed architecture is a fully convolutional neural network similar to one used in imagetoimage translation [60], to which we add a Pyramid Pooling Module [58] whose goal is to gather information on different scales via pooling layers. The final representation of a given pixel is thus encouraged to contain local, regional, and global information at the same time.
The region generators ${\U0001d676}_{k}$, the discriminator $\U0001d673$ and the network $\delta $ that reconstructs $\mathbf{z}$ are based on SAGAN [57] that is frequently used in recent GAN literature [3, 37]. Notably, we use spectral normalization [39] for weight regularization for all networks except for the mask provider F, and we use selfattention [57] in ${\U0001d676}_{k}$ and $\U0001d673$ to handle nonlocal relations. To both promote stochasticity in our generators and encourage our latent code $\mathbf{z}$ to encode for texture and colors, we also use conditional batchnormalization in ${\U0001d676}_{k}$. The technique has emerged from style modeling for style transfer tasks [12, 43] and has since been used for GANs as a mean to encode for style and to improve stochasticity [1, 8, 54]. All parameters of the different ${\delta}_{k}$ functions are shared except for their last layers.
As it is standard practice for GANs [3], we use orthogonal initialization [47] for our networks and ADAM [31] with $\beta =(0,.9)$ as optimizer. Learning rates are set to ${10}^{4}$ except for the mask network $\U0001d675$ which uses a smaller value of ${10}^{5}$. We sample noise vectors ${\mathbf{z}}_{\mathbf{i}}$ of size $32$ (except for MNIST where we used vectors of size $16$) from $\mathcal{N}(0,{I}_{d})$ distribution. We used minibatches of size 25 and ran each experiment on a single NVidia Tesla P100 GPU. Despite our conservation of information loss, the model can still collapse into generating empty masks at the early steps of the training. While the regularization does alleviate the problem, we suppose that the mask generator $\U0001d675$ can collapse even before the network $\delta $ learns anything relevant and can act as a stabilizer. As the failures happen early and are easy to detect, we automatically restart the training should the case arise.
We identified ${\lambda}_{z}$ and the initialization scheme as critical hyperparameters and focus our hyperparameters search on those. More details, along with specifics of the implementation used in our experiments are provided as Supplementary materials. The code, dataset splits and pretrained models are also available opensource ^{1}^{1} 1 https://github.com/mickaelChen/ReDO.
5 Experiments
5.1 Datasets
We present results on three natural image datasets and one toy dataset. All images have been resized and then cropped to $128\times 128$.
Flowers dataset [40, 41] is composed of 8189 images of flowers. The dataset is provided with a set of masks obtained via an automated method built specifically for flowers [40]. We split into sets of 6149 training images, 1020 validation and 1020 test images and use the provided masks as ground truth for evaluation purpose only.
Labeled Faces in the Wild dataset [25, 33] is a dataset of 13233 faces. A subpart of the funneled version [24] has been segmented and manually annotated [27], providing 2927 groundtruth masks. We use the nonannotated images for our training set. We split the annotated images between validation and testing sets so that there is no overlap in the identity of the persons between both sets. The test set is composed of 1600 images, and the validation set of 1327 images.
The CaltechUCSD Birds 200 2011 (CUB2002011) dataset [53] is a dataset containing 11788 photographs of birds. We use 10000 images for our training split, 1000 for the test split, and the rest for validation.
As a sanity check, we also build a toy dataset colored2MNIST in which each sample is composed of an uniform background on which we draw two colored MNIST [34] numbers: one odd number and one even number. Odd and even numbers have colors sampled from different distributions so that our model can learn to differentiate them. For this dataset, we set $n=3$ as there are three components.
As an additional experiment, we also build a new dataset by fusing Flowers and LFW datasets. This new Flowers+LFW dataset has more variability, and contains different type of objects. We used this dataset to demonstrate that ReDO can work without label information on problems with multiple categories of objects.
5.2 Results
To evaluate our method ReDO, we use two metrics commonly used for segmentation tasks. The pixel classification accuracy (Acc) measures the proportion of pixels that have been assigned to the correct region. The intersection over union (IoU) is the ratio between the area of the intersection between the inferred mask and the ground truth over the area of their union. In both cases, higher is better. Because ReDO is unsupervised and we can’t control which output region corresponds to which object or background in the image, we compute our evaluation based on the regions permutation that matches the ground truth the best. For model selection, we used IoU computed on a held out labeled validation set. When available, we present our evaluation on both the training set and a test set as, in an unsupervised setting, both can be relevant depending on the specific use case. Results are presented in Table 1 and show that ReDO achieves reasonable performance on the three realworld datasets.
We also compared the performance of ReDO, which is unsupervised, with a supervised method, keeping the same architecture for $\U0001d675$ in both cases. We analyze how many training samples are needed to reach the performance of the unsupervised model (see Figure 4). One can see that the unsupervised results are in the range of the ones obtained with a supervised method, and usually outperform supervised models trained with less than around 50 or 100 examples depending on the dataset. For instance, on the LFW Dataset, the unsupervised model obtains about $92\%$ of accuracy and $79\%$ IoU and the supervised model needs 5060 labeled examples to reach similar performance.
We provide random samples of extracted masks (Figure 2) and the corresponding generated images with a redrawn object or background. Note that our objective is not to generate appealing images but to learn an object segmentation function. Therefore, ReDO generates images that are less realistic than the ones generated by stateoftheart GANs. Focus is, instead, put on the extracted masks, and we can see the good quality of the obtained segmentation in many cases. Best and worst masks, as well as more random samples, are displayed in Supplementary materials.
We also trained ReDO on the fused Flowers+LFW dataset without labels. We reused directly the hyperparameters we have used to fit the Flowers dataset without further tuning and obtained, as preliminary results, a reasonable accuracy of 0.856 and an IoU of 0.691. This shows that ReDO is able to infer class information from masks even in a fully unsupervised setup. Samples are displayed in Figure 3.
Dataset  Train Acc  Train IoU  Test Acc  Test IoU 

LFW      0.917 $\pm $ 0.002  0.781 $\pm $ 0.005 
CUB  0.840 $\pm $ 0.012  0.423 $\pm $ 0.023  0.845 $\pm $ 0.012  0.426 $\pm $ 0.025 
Flowers*  0.886 $\pm $ 0.008  0.780 $\pm $ 0.012  0.879 $\pm $ 0.008  0.764 $\pm $ 0.012 
Flowers+LFW      0.856  0.691 
6 Conclusion
We presented a novel method called ReDO for unsupervised learning to segment images. Our proposal is based on the assumption that if a segmentation model is accurate, then one could edit any real image by replacing any segmented object in a scene by another one, randomly generated, and the result would still be a realistic image. This principle allows casting the unsupervised learning of image segmentation as an adversarial learning problem. Our experimental results obtained on three datasets show that this principle works. In particular, our segmentation model is competitive with supervised approaches trained on a few hundred labeled examples.
Our future work will focus on handling more complex and diverse scenes. As mentioned in Section 2, our model could generalize to an arbitrary number of objects and objects of unknown classes via iterative design and/or class agnostic generators. Currently, we are mostly limited by our ability to effectively train GANs on those more complicated settings but rapid advances in image generation [3, 29, 37] make it a reasonable goal to pursue in a near future. Meanwhile, we will be investigating the use of the model in a semisupervised or weaklysupervised setup. Indeed, additional information would allow us to guide our model for harder datasets while requiring fewer labels than fully supervised approaches. Conversely, our model could act as a regularizer by providing a prior for any segmentation tasks.
Acknowledgments
This work was supported by the French National Research Agency projects LIVES (grant number ANR15CE23002603) and "Deep in France" (grant number ANR16CE230006).
References
 [1] (2018) Augmented cyclegan: learning manytomany mappings from unpaired data. In International Conference on Machine Learning, pp. 195–204. Cited by: §4.
 [2] (2014) Multiscale combinatorial grouping. In Computer Vision and Pattern Recognition, Cited by: §2.
 [3] (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, External Links: Link Cited by: §3.3, §3.3, §4, §4, §6.
 [4] (2019) MONet: unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390. Cited by: §1, §2.
 [5] (2018) Encoderdecoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818. Cited by: §1, §1, §2.
 [6] (2018) Multiview data generation without view supervision. In International Conference on Learning Representations, External Links: Link Cited by: §2.
 [7] (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620. Cited by: §2.
 [8] (2019) On self modulation for generative adversarial networks. In International Conference on Learning Representations, External Links: Link Cited by: §4.
 [9] (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180. Cited by: §2, §3.3.
 [10] (2017) Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 4414–4423. External Links: Link Cited by: §2.
 [11] (2018) Semantically decomposing the latent spaces of generative adversarial networks. In International Conference on Learning Representations, External Links: Link Cited by: §2.
 [12] (2017) A learned representation for artistic style. Proc. of ICLR 2. Cited by: §4.
 [13] (2017) WILDCAT: weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
 [14] (2016) Attend, infer, repeat: fast scene understanding with generative models. In Advances in Neural Information Processing Systems, pp. 3225–3233. Cited by: §1, §2.
 [15] (2019) Structured disentangled representations. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2525–2534. Cited by: §2.
 [16] (2015) Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on International Conference on Machine LearningVolume 37, pp. 1180–1189. Cited by: §2.
 [17] (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §3.3.
 [18] (2019) Multiobject representation learning with iterative variational inference. arXiv preprint arXiv:1903.00450. Cited by: §1, §2.
 [19] (2015) DRAW: a recurrent neural network for image generation. In Proceedings of the 32nd International Conference on International Conference on Machine LearningVolume 37, pp. 1462–1471. Cited by: §2.
 [20] (2017) Mask rcnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1, §1, §2.
 [21] (2017) Betavae: learning basic visual concepts with a constrained variational framework. In 5th International Conference on Learning Representations, ICLR 2017, External Links: Link Cited by: §2.
 [22] (2018) Coattention cnns for unsupervised object cosegmentation.. In IJCAI, pp. 748–756. Cited by: §2.
 [23] (2019) DeepCO 3: deep instance cosegmentation by copeak search and cosaliency detection. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
 [24] (2007) Unsupervised joint alignment of complex images. In ICCV, Cited by: §5.1.
 [25] (200710) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical report Technical Report 0749, University of Massachusetts, Amherst. Cited by: §5.1.
 [26] (2019) Invariant information distillation for unsupervised image segmentation and clustering. In International Conference on Computer Vision (ICCV), Cited by: §2.
 [27] (2013) Augmenting CRFs with Boltzmann machine shape priors for image labeling. In CVPR, University of Massachusetts Amherst and University of Michigan Ann Arbor. Cited by: §5.1.
 [28] (2018) Unsupervised image segmentation by backpropagation. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Cited by: §2.
 [29] (2018) A stylebased generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948. Cited by: §6.
 [30] (2018) Disentangling by factorising. In International Conference on Machine Learning, pp. 2654–2663. Cited by: §2.
 [31] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
 [32] (2017) Fader networks: manipulating images by sliding attributes. In Advances in Neural Information Processing Systems, pp. 5967–5976. Cited by: §2.
 [33] (201405) Labeled faces in the wild: updates and new reporting procedures. Technical report Technical Report UMCS2014003, University of Massachusetts, Amherst. Cited by: §5.1.
 [34] (1998) Gradientbased learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §5.1.
 [35] (2017) Geometric gan. arXiv preprint arXiv:1705.02894. Cited by: §3.3.
 [36] (2018) Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pp. 1640–1650. Cited by: §2.
 [37] (2019) Highfidelity image generation with fewer labels. arXiv preprint arXiv:1903.02271. Cited by: §4, §6.
 [38] (2016) Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pp. 5040–5048. Cited by: §2.
 [39] (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, External Links: Link Cited by: §3.3, §3.3, §4.
 [40] (2007) Delving into the whorl of flower segmentation.. In BMVC, Vol. 2007, pp. 1–10. Cited by: §5.1.
 [41] (2008) Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. Cited by: §5.1, Table 1.
 [42] (2018) SEIGAN: towards compositional image generation by simultaneously learning to segment, enhance, and inpaint. arXiv preprint arXiv:1811.07630. Cited by: §2.
 [43] (2018) Film: visual reasoning with a general conditioning layer. In ThirtySecond AAAI Conference on Artificial Intelligence, Cited by: §4.
 [44] (2018) SceneCut: joint geometric and object segmentation for indoor scenes. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9. Cited by: §2.
 [45] (2018) Learning to segment via cutandpaste. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 37–52. Cited by: §2.
 [46] (2006) Cosegmentation of image pairs by histogram matchingincorporating a global constraint into mrfs. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 1, pp. 993–1000. Cited by: §2.
 [47] (2014) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, External Links: Link Cited by: §4.
 [48] (2018) Vector image generation by learning parametric layer decomposition. arXiv preprint arXiv:1812.05484. Cited by: §2.
 [49] (2012) Indoor segmentation and support inference from rgbd images. In European Conference on Computer Vision, pp. 746–760. Cited by: §2.
 [50] (2017) Deep and hierarchical implicit models. arXiv preprint arXiv:1702.08896 7. Cited by: §3.3.
 [51] (2017) Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176. Cited by: §2.
 [52] (2018) A case for object compositionality in deep generative models of images. arXiv preprint arXiv:1810.10340. Cited by: §2.
 [53] (2011) The CaltechUCSD Birds2002011 Dataset. Technical report Technical Report CNSTR2011001, California Institute of Technology. Cited by: §5.1.
 [54] (2016) Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision, pp. 318–335. Cited by: §4.
 [55] (2017) Wnet: a deep model for fully unsupervised image segmentation. arXiv preprint arXiv:1711.08506. Cited by: §1, §2.
 [56] (2017) LRGAN: layered recursive generative adversarial networks for image generation. In 5th International Conference on Learning Representations, ICLR 2017, External Links: Link Cited by: §1, §2.
 [57] (2018) Selfattention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: §3.3, §3.3, §4.
 [58] (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §1, §1, §2, §4.
 [59] (201806) Weakly supervised instance segmentation using class peak response. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
 [60] (2017) Unpaired imagetoimage translation using cycleconsistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §4.
Supplementary Material for Unsupervised Object Segmentation by Redrawing
We provide architectural details, hyperparameters discussion and additional samples output of our model.
Architectural details
mask network $\U0001d68f(\mathbf{I})$  nonlinearities  output size 

Image $\mathbf{I}$  3x128x128  
Conv 7x7 (reflect. pad 3)  Instance Norm, ReLU  16x128x128 
Conv 3x3 (stride 2, pad 1)  Instance Norm, ReLU  32x64x64 
Conv 3x3 (stride 2, pad 1)  Instance Norm, ReLU  64x32x32 
Residual Bloc (Instance Norm, ReLU)  64x32x32  
Residual Bloc (Instance Norm, ReLU)  64x32x32  
Residual Bloc (Instance Norm, ReLU)  64x32x32  
Pyramid Pooling Module  68x32x32  
Upsample  68x64x64  
Conv 3x3 (pad 1)  Instance Norm, ReLU  34x64x64 
Upsample  34x128x128  
Conv 3x3 (pad 1)  Instance Norm, ReLU  17x128x128 
Conv 3x3 (reflect. pad 3)  sigmoid (if $n=2$) or softmax  $n$x128x128 
discriminator network $\U0001d673$ and encoder $\delta $  output size 

image input $\mathbf{I}$  3x128x128 
Down Res Bloc (ReLU)  64x64x64 
SelfAttention Bloc  
Down Res Bloc (ReLU)  128x32x32 
Down Res Bloc (ReLU)  256x16x16 
Down Res Bloc (ReLU)  512x8x8 
Down Res Bloc (ReLU)  1024x4x4 
Res Bloc (ReLU)  1024x4x4 
Spatial sum pooling  1024x1x1 
Linear  1 for $\U0001d673$, 32 for $\delta $ 
Hyperparameters
We discuss some notable hyperparameters and architectural choices. We identified ${\lambda}_{z}$ and the initialization scheme as critical hyperparameters and focus our search on those, in addition to the standard learning rates search. The number of channels in our generators is also important.

•
Learning rates are set to ${10}^{4}$ except for mask network for which we use a smaller learning rate (${10}^{5}$). We searched for learning rates independently for each neural network.

•
Batch size were set between 20 and 25, as much as we could fit on a single Nvidia Tesla P100 GPU.

•
We chose smaller $siz{e}_{z}$ for each ${\mathbf{z}}_{k}$ (16 for c2MNIST and 32 for the other datasets) than what is usually found in GAN literature so that the vectors could be reasonably retrieved by $\delta $.

•
As expected, we found that ${\lambda}_{z}$ were important to tune for the stability of our training procedure. We use ${\lambda}_{z}=\frac{5}{n*siz{e}_{z}}$ for all datasets except LFW where ${\lambda}_{z}=\frac{15}{n*siz{e}_{z}}$.

•
Adequate orthogonal initialization is critical. We found that our model worked best when initialized with a gain at 0.8, and not at all when set at .2 and 1.4.

•
We tested smaller numbers of channels for our generators ${\U0001d676}_{k}$ since each generator only have to model a specific type of object. We still use the same number ($ch=64$) as the other networks for LFW and colored2MNIST but reduced to $ch=32$ for CUB and Flowers.

•
We used spectral normalization for all our networks except the mask network on which a weight decay of ${10}^{4}$ is applied instead.

•
The use of pyramid pooling produce masks of significantly better quality that standard residual network.

•
The use of selfattention in both D and G might not be mandatory but more experiments would be needed to conclude.

•
We found that having different ratio of discriminator updates and generator updates didn’t help.