Unsupervised Object Segmentation by Redrawing

  • 2019-11-29 17:00:48
  • Mickaël Chen, Thierry Artières, Ludovic Denoyer
  • 0

Abstract

Object segmentation is a crucial problem that is usually solved by usingsupervised learning approaches over very large datasets composed of both imagesand corresponding object masks. Since the masks have to be provided at pixellevel, building such a dataset for any new domain can be very time-consuming.We present ReDO, a new model able to extract objects from images without anyannotation in an unsupervised way. It relies on the idea that it should bepossible to change the textures or colors of the objects without changing theoverall distribution of the dataset. Following this assumption, our approach isbased on an adversarial architecture where the generator is guided by an inputsample: given an image, it extracts the object mask, then redraws a new objectat the same location. The generator is controlled by a discriminator thatensures that the distribution of generated images is aligned to the originalone. We experiment with this method on different datasets and demonstrate thegood quality of extracted masks.

 

Quick Read (beta)

Unsupervised Object Segmentation by Redrawing

Mickaël Chen
Sorbonne Université, CNRS, LIP6, F-75005, Paris, France
[email protected]
&Thierry Artières
Aix Marseille Univ, Université de Toulon, CNRS, LIS, Marseille, France
Ecole Centrale Marseille
[email protected]
&Ludovic Denoyer
Facebook Artificial Intelligence Research
[email protected]
Abstract

Object segmentation is a crucial problem that is usually solved by using supervised learning approaches over very large datasets composed of both images and corresponding object masks. Since the masks have to be provided at pixel level, building such a dataset for any new domain can be very time-consuming. We present ReDO, a new model able to extract objects from images without any annotation in an unsupervised way. It relies on the idea that it should be possible to change the textures or colors of the objects without changing the overall distribution of the dataset. Following this assumption, our approach is based on an adversarial architecture where the generator is guided by an input sample: given an image, it extracts the object mask, then redraws a new object at the same location. The generator is controlled by a discriminator that ensures that the distribution of generated images is aligned to the original one. We experiment with this method on different datasets and demonstrate the good quality of extracted masks.

 

Unsupervised Object Segmentation by Redrawing


  Mickaël Chen Sorbonne Université, CNRS, LIP6, F-75005, Paris, France [email protected] Thierry Artières Aix Marseille Univ, Université de Toulon, CNRS, LIS, Marseille, France Ecole Centrale Marseille [email protected] Ludovic Denoyer Facebook Artificial Intelligence Research [email protected]

\@float

noticebox[b]33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\[email protected]

1 Introduction

Image segmentation aims at splitting a given image into a set of non-overlapping regions corresponding to the main components in the image. It has been studied for a long time in an unsupervised setting using prior knowledge on the nature of the region one wants to detect using e.g. normalized cuts and graph-based methods. Recently the rise of deep neural networks and their spectacular performances on many difficult computer vision tasks have led to revisit the image segmentation problem using deep networks in a fully supervised setting [5, 20, 58], a problem referred as semantic image segmentation.

Although such modern methods allowed learning successful semantic segmentation systems, their training requires large-scale labeled datasets with usually a need for pixel-level annotations. This feature limits the use of such techniques for many image segmentation tasks for which no such large scale supervision is available. To overcome this drawback, we follow here a very recent trend that aims at revisiting the unsupervised image segmentation problem with new tools and new ideas from the recent history and success of deep learning [55] and from the recent results of supervised semantic segmentation [5, 20, 58].

Building on the idea of scene composition [4, 14, 18, 56] and on the adversarial learning principle [17], we propose to address the unsupervised segmentation problem in a new way. We start by postulating an underlying generative process for images that relies on an assumption of independence between regions of an image we want to detect. This means that replacing one object in the image with another one, e.g. a generated one, should yield a realistic image. We use such a generative model as a backbone for designing an object segmentation model we call ReDO (ReDrawing of Objects), which outputs are then used to modify the input image by redrawing detected objects. Following ideas from adversarial learning, the supervision of the whole system is provided by a discriminator that is trained to distinguish between real images and fake images generated accordingly to the generative process. Despite being a simplified model for images, we find this generative process effective for learning a segmentation model.

The paper is organized as follows. We present related work in Section 2, then we describe our method in Section 3. We first define the underlying generative model that we consider in Section 3.2 and detail how we translate this hypothesis into a neural network architecture to learn a segmentation module in Section 3.3. Then we give implementation details in Section 4. Finally, we present experimental results on three datasets in Section 5 that explore the feasibility of unsupervised segmentation within our framework and compare its performance against a baseline supervised with few labeled examples.

2 Related Work

Image segmentation is a very active topic in deep learning that boasts impressive results when using large-scale labeled datasets. Those approaches can effectively parse high-resolution images depicting complex and diverse real-world scenes into informative semantics or instance maps. State-of-the-art methods use clever architectural choices or pipelines tailored to the challenges of the task [5, 20, 58].

However, most of those models use pixel-level supervision, which can be unavailable in some settings, or time-consuming to acquire in any case. Some works tackle this problem by using fewer labeled images or weaker overall supervision. One common strategy is to use image-level annotations to train a classifier from which class saliency maps can be obtained. Those saliency maps can then be exploited with other means to produce segmentation maps. For instance, WILDCAT [13] uses a Conditional Random Field (CRF) for spatial prediction in order to post-process class saliency maps for semantic segmentation. PRM [59], instead, finds pixels that provoke peaks in saliency maps and uses these as a reference to choose the best regions out of a large set of proposals previously obtained using MCG [2], an unsupervised region proposal algorithm. Both pipelines use a combination of a deep classifier and a method that take advantage of spatial and visual handcrafted image priors.

Co-segmentation, introduced by Rother et al. 2006 [46], addresses the related problem of segmenting objects that are shared by multiple images by looking for similar data patterns in all those images. Like the aforementioned models, in addition to prior image knowledge, deep saliency maps are often used to localize those objects [23]. Unsupervised co-segmentation [22], i.e. the task of covering objects of a specific category without additional data annotations, is a setup that resembles ours. However, unsupervised co-segmentation systems are built on the idea of exploiting features similarity and can’t easily be extended to a class-agnostic system. As we aim to ultimately be able to segment very different objects, our approach instead relies on independence between the contents of different regions of an image which is a more general concept.

Fully unsupervised approaches have traditionally been more focused on designing handcrafted features or energy functions to define the desired property of objectness. Impressive results have been obtained when making full use of depth maps in addition to usual RGB images [44, 49] but it is much harder to specify good energy functions for purely RGB images. W-NET [55] extracts latent representations via a deep auto-encoder that can then be used by a more classic CRF algorithm. Kanezaki 2018 [28] further incorporate deep priors and train a neural network to directly minimize their chosen intra-region pixel distance. A different approach is proposed by Ji et al. 2019 [26] whose method finds clusters of pixels using a learned distance invariant to some known properties. Unlike ours, none of these approaches are learned entirely from data.

Our work instead follows a more recent trend by inferring scene decomposition directly from data. Stemming from DRAW [19], many of those approaches [4, 14] use an attention network to read a region of an image and a Variational Auto-encoder (VAE) to partially reconstruct the image in an iterative process in order to flesh out a meaningful decomposition. LR-GAN [56] is able to generate simple scenes recursively, building object after object, and Sbai et al. 2018 [48] decompose an image into single-colored strokes for vector graphics. While iterative processes have the advantage of being able to handle an arbitrary number of objects, they are also more unstable and difficult to train. Most of those can either only be used in generation [56], or only handle very simple objects [4, 14, 18]. As a proof of concept, we decided to first ignore this additional difficulty by only handling a set number of objects but our model can naturally be extended with an iterative composition process. This choice is common among works that, like ours, focus on other aspects of image compositionality. Van Steenkiste et al. 2018 [52] advocates for a generative framework that accounts for relationship between objects. While they do produce masks as part of their generative process, they cannot segment a given image. Closer to our setup, the very recent IODINE [18] propose a VAE adapted for multi-objects representations. Their learned representations include a scene decomposition, but they need a costly iterative refinement process whose performance have only been demonstrated on simulated datasets and not real images. Like ours, some prior work have tried to find segmentation mask by recomposing new images. SEIGAN [42] and Cut & Paste [45] learns to separate object and background by moving the region corresponding to the object to another background and making sure the image is still realistic. These methods however, need to have access to background images without objects, which might not be easy to obtain.

Our work also ties to recent research in disentangled representation learning. Multiple techniques have been used to separate information in factored latent representations. One line of work focuses on understanding and exploiting the innate disentangling properties of Variational Auto-Encoders. It was first observed by β-VAE [21] that VAEs can be constrained to produce disentangled representations by imposing a stronger penalty on the Kullback-Leibler divergence term on the VAE loss. FactorVAE [30] and β-TCVAE [7] extract a total correlation term from the KL term of the VAE objective and specifically re-weight it instead of the whole KL term. In a similar fashion, HFVAE [15] introduces a hierarchical decomposition of the KL term to impose a structure on the latent space. A similar property can be observed with GAN-based models, as shown by InfoGAN [9] which forces a generator to map a code to interpretable features by maximizing the mutual information between the code and the output. Using adversarial training is also a good way to split and control information in latent embeddings. Fader Networks [32] uses adversarial training to remove specific class information from a vector. This technique is also used in adversarial domain adaptation [16, 36, 51] to align embeddings from different domains. Similar methods can be used to build factorial representations instead of simply removing information [6, 10, 11, 38]. Like our work, they use adversarial learning to match an implicitly predefined generative model but for purposes unrelated to segmentation.

3 Method

3.1 Overview

A segmentation process 𝙵 splits a given image 𝐈W×H×C into a set of non-overlapping regions. 𝙵 can be described as a function that assigns to each pixel coordinate of 𝐈 one of n regions. The problem is then to find a correct partition 𝙵 for any given image 𝐈. Lacking supervision, a common strategy is to define properties one wants the regions to have, and then to find a partition that produces regions with such properties. This can be done by defining an energy function and then finding an optimal split. The challenge is then to accurately describe and model the statistical properties of meaningful regions as a function one can optimize.

We address this problem differently. Instead of trying to define the right properties of regions at the level of each image, we make assumptions about the underlying generative process of images in which the different regions are explicitly modeled. Then, by using an adversarial approach, we learn the parameters of the different components of our model so that the overall distribution of the generated images matches the distribution of the dataset. We detail the generative process in the section 3.2, while the way we learn 𝙵 is detailed in Section 3.3.

3.2 Generative Process

We consider that images are produced by a generative process that operates in three steps: first, it defines the different regions in the image i.e the organization of the scene (composition step). Then, given this segmentation, the process generates the pixels for each of the regions independently (drawing step). At last, the resulting regions are assembled into the final image (assembling step).

Let us consider a scene composed of n-1 objects and one background we refer to as object n. Let us denote 𝐌k{0,1}W×H the mask corresponding to object k which associates one binary value to each pixels in the final image so that 𝐌x,yk=1 iff the pixel of coordinate (x,y) belongs to object k. Note that, since one pixel can only belong to one object, the masks have to satisfy k=1n𝐌x,yk=1 and the background mask 𝐌n can therefore easily be retrieved computed from the object masks as 𝐌n=1-k=1n-1𝐌k.

The pixel values of each object k are denoted 𝐕kW×H×C. Given that the image we generate is of size W×H×C, each object is associated with an image of the same size but only the pixels selected by the mask will be used to compose the output image. The final composition of the objects into an image is computed as follows:

𝐈k=1n𝐌k𝐕k.

To recap, the underlying generative process described previously can be summarized as follow: i) first, the masks 𝐌k are chosen together based on a mask prior p(𝐌). ii) Then, for each object independently, the pixel values are chosen based on a distribution p(𝐕k|𝐌k,k). iii) Finally, the objects are assembled into a complete image.

This process makes an assumption of independence between the colors and textures of the different objects composing a scene. While this is a naive assumption, as colorimetric values such as exposition, brightness, or even the real colors of two objects, are often related, this simple model still serves as a good prior for our purposes.

3.3 From Generative Process to Object Segmentation

Now, instead of considering a purely generative process where the masks are generated following a prior p(𝐌), we consider the inductive process where the masks are extracted directly from any input image 𝐈 through the function 𝙵 which is the object segmentation function described previously. The role of 𝙵 is thus to output a set of masks given any input 𝐈. The new generative process acts as follows: i) it takes a random image in the dataset and computes the masks using 𝙵(𝐈)𝐌1,,𝐌n, and ii) it generates new pixel values for the regions in the image according to a distribution p(𝐕k|𝐌k,k). iii) It aggregates the objects as before.

In order for output images to match the distribution of the training dataset, all the components (i.e 𝙵 and p(𝐕k|𝐌k,k)) are learned adversarially following the GAN approach. Let us define 𝙳:W×H×C a discriminator function able to classify images as fake or real. Let us denote 𝙶𝙵(𝐈,𝐳1,,𝐳n) our generator function able to compose a new image given an input image 𝐈, an object segmentation function 𝙵, and a set of vectors 𝐳1,,𝐳n each sampled independently following a prior p(𝐳) for each object k, background included. Since the pixel values of the different regions are considered as independent given the segmentation, our generator can be decomposed in n generators denoted 𝙶k(𝐌k,𝐳k), each one being in charge of deciding the pixel values for one specific region. The complete image generation process thus operates in three steps:

1) 𝐌1,,𝐌n𝙵(𝐈) (composition step)
2) 𝐕k𝙶k(𝐌k,𝐳k) for k{1,,n} (drawing step)
3) 𝙶𝙵(𝐈,𝐳1,,𝐳n)=k=1n𝐌k𝐕k (assembling step).

Provided the functions 𝙵 and 𝙶k are differentiable, they can thus be learned by solving the following adversarial problem:

min𝙶𝙵max𝙳=𝔼𝐈pdata[log𝙳(𝐈)]+𝔼𝐈pdata,𝐳1,𝐳np(𝐳)[log(1-𝙳(𝙶𝙵(𝐈,𝐳1,,𝐳n)))].

Therefore, in practice we have 𝙵 output soft masks in [0,1] instead of binary masks. Also, in line with recent GAN literature [3, 39, 50, 57], we choose to use the hinge version of the adversarial loss [35, 50] instead, and obtain the following formulation:

max𝙶𝙵𝙶=𝔼𝐈pdata,𝐳1,,𝐳np(𝐳)[𝙳(𝙶𝙵(𝐈,𝐳1,,𝐳n))]max𝙳𝙳=𝔼𝐈pdata[min(0,-1+𝙳(𝐈))]+𝔼𝐈pdata,𝐳1,,𝐳np(𝐳)[min(0,-1-𝙳(𝙶𝙵(𝐈,𝐳1,,𝐳n)))].

Still, as it stands, the learning process of this model may fail for two reasons. First, it does not have to extract a meaningful segmentation in regards to the input 𝐈. Indeed, since the values of all the output pixels will be generated, 𝐈 can be ignored entirely to generate plausible pictures. For instance, the segmentation could be the same for all the inputs regardless of input 𝐈. Second, it naturally converges to a trivial extractor 𝙵 that puts the whole image into a single region, the other regions being empty. We thus have to add additional constraints to our model.

Constraining mask extraction by redrawing a single region.

The first constraint aims at forcing the model to extract meaningful region masks instead of ignoring the image. To this end, we take advantage of the assumption that the different objects are independently generated. We can, therefore, replace only one region at each iteration instead of regenerating all the regions. Since the generator now has to use original pixel values from the image in the reassembled image, it cannot make arbitrary splits. The generation process becomes as follows:

1) 𝐌1,,𝐌n𝙵(𝐈) (composition step)
2) 𝐕k𝐈 for k{1,,n}{𝐢}
𝐕𝐢𝙶𝐢(𝐌𝐢,𝐳𝐢) (drawing step)
3) 𝙶𝙵(𝐈,𝐳𝐢,𝐢)=k=1n𝐌k𝐕k (assembling step),

where 𝐢 designates the index of the only region to redraw and is sampled from 𝒰(n), the discrete uniform distribution on {1,,n}. The new learning objectives are as follows:

max𝙶𝙵𝙶=𝔼𝐈pdata,𝐢𝒰(n),𝐳𝐢p(𝐳)[𝙳(𝙶𝙵(𝐈,𝐳𝐢,𝐢))]max𝙳𝙳=𝔼𝐈pdata[min(0,-1+𝙳(𝐈))]+𝔼𝐈pdata,𝐢𝒰(n),𝐳𝐢p(𝐳)[min(0,-1-𝙳(𝙶𝙵(𝐈,𝐳𝐢,𝐢)))].

Conservation of Region Information.

The second constraint is that given a region 𝐢 generated from a latent vector 𝐳𝐢, the final image 𝙶𝙵(𝐈,𝐳𝐢,𝐢) must contain information about 𝐳𝐢. This constraint is designed to prevent the mask extractor 𝙵 to produce empty regions. Indeed, if region 𝐢 is empty, i.e. 𝐌x,y𝐢=0 for all x,y, then 𝐳𝐢 cannot be retrieved from the final image. Equivalently, if 𝐳𝐢 can be retrieved, then region 𝐢 is not empty. This information conservation constraint is implemented through an additional term in the loss function. Let us denote δk a function which objective is to infer the value of 𝐳k given any image 𝐈. One can learn such a function simultaneously to promote conservation of information by the generator. This strategy is similar to the mutual information maximization used in InfoGAN. [9].

The final complete process is illustrated in Figure 1 and correspond to the following learning objectives:

max𝙶𝙵,δ𝙶=𝔼𝐈pdata,𝐢𝒰(n),z𝐢p(z)[𝙳(𝙶𝙵(𝐈,𝐳𝐢,𝐢))-λz||δ𝐢(𝙶𝙵(𝐈,𝐳𝐢,𝐢))-𝐳𝐢||22]max𝙳𝙳=𝔼𝐈pdata[min(0,-1+𝙳(𝐈)]+𝔼𝐈pdata,𝐢𝒰(n),𝐳𝐢p(𝐳)[min(0,-1-𝙳(𝙶𝙵(𝐈,𝐳𝐢,𝐢))],

where λz is a fixed hyper-parameter that controls the strength of the information conservation constraint. Note that the constraint is necessary for our model to find non trivial solutions, as otherwise, putting the whole image into a single region is both optimal and easy to discover for the neural networks. The final learning algorithm follows classical GAN schema [3, 17, 39, 57] by updating the generator and the discriminator alternatively with the update functions presented in Algorithm 1.

Figure 1: Example generation with 𝙶𝐟(𝐈,𝐳𝐢,𝐢) with 𝐢=1 and n=2. Learned functions are in color.
{algorithm}

Networks update functions {algorithmic}[1] \ProcedureGeneratorUpdate \Statesample data 𝐈pdata, \Statesample region 𝐢𝚄𝚗𝚒𝚏𝚘𝚛𝚖({1,,n}) \Statesample noise vector 𝐳𝐢p(𝐳) \State𝐈gen𝙶𝐟(𝐈,𝐳𝐢,𝐢) \Commentgenerate image \Statez-||δ𝐢(𝐈gen)-𝐳𝐢||2 \Commentcompute information conservation loss \StateG𝙳(𝐈gen) \Commentcompute adversarial loss \Stateupdate 𝙶𝐟 with 𝙶𝐟[G+z] \Stateupdate δ𝐢 with δ𝐢z \EndProcedure\ProcedureDiscriminatorUpdate \Statesample datapoints 𝐈real,𝐈inputpdata \Statesample region 𝐢𝚄𝚗𝚒𝚏𝚘𝚛𝚖({1,,n}) \Statesample noise vector 𝐳𝐢p(𝐳) \State𝐈gen𝙶𝐟(𝐈input,𝐳𝐢,𝐢) \Commentgenerate image \StateDmin(0,-1+𝙳(𝐈real))+min(0,-1-𝙳(𝐈gen) \Commentcompute adversarial loss \Stateupdate 𝙳 with 𝙳D \EndProcedure

4 Implementation

We now provide some information about the architecture of the different components (additional details are given in Supplementary materials). As usual with GAN-based methods, the choice of a good architecture is crucial. We have chosen to build on the GAN and the image segmentation literature and to take inspiration from the neural network architectures they propose.

For the mask generator 𝙵, we use an architecture inspired by PSPNet [58]. The proposed architecture is a fully convolutional neural network similar to one used in image-to-image translation [60], to which we add a Pyramid Pooling Module [58] whose goal is to gather information on different scales via pooling layers. The final representation of a given pixel is thus encouraged to contain local, regional, and global information at the same time.

The region generators 𝙶k, the discriminator 𝙳 and the network δ that reconstructs 𝐳 are based on SAGAN [57] that is frequently used in recent GAN literature [3, 37]. Notably, we use spectral normalization [39] for weight regularization for all networks except for the mask provider F, and we use self-attention [57] in 𝙶k and 𝙳 to handle non-local relations. To both promote stochasticity in our generators and encourage our latent code 𝐳 to encode for texture and colors, we also use conditional batch-normalization in 𝙶k. The technique has emerged from style modeling for style transfer tasks [12, 43] and has since been used for GANs as a mean to encode for style and to improve stochasticity [1, 8, 54]. All parameters of the different δk functions are shared except for their last layers.

As it is standard practice for GANs [3], we use orthogonal initialization [47] for our networks and ADAM [31] with β=(0,.9) as optimizer. Learning rates are set to 10-4 except for the mask network 𝙵 which uses a smaller value of 10-5. We sample noise vectors 𝐳𝐢 of size 32 (except for MNIST where we used vectors of size 16) from 𝒩(0,Id) distribution. We used mini-batches of size 25 and ran each experiment on a single NVidia Tesla P100 GPU. Despite our conservation of information loss, the model can still collapse into generating empty masks at the early steps of the training. While the regularization does alleviate the problem, we suppose that the mask generator 𝙵 can collapse even before the network δ learns anything relevant and can act as a stabilizer. As the failures happen early and are easy to detect, we automatically restart the training should the case arise.

We identified λz and the initialization scheme as critical hyper-parameters and focus our hyper-parameters search on those. More details, along with specifics of the implementation used in our experiments are provided as Supplementary materials. The code, dataset splits and pre-trained models are also available open-source 11 1 https://github.com/mickaelChen/ReDO.

5 Experiments

5.1 Datasets

We present results on three natural image datasets and one toy dataset. All images have been resized and then cropped to 128×128.

Flowers dataset [40, 41] is composed of 8189 images of flowers. The dataset is provided with a set of masks obtained via an automated method built specifically for flowers [40]. We split into sets of 6149 training images, 1020 validation and 1020 test images and use the provided masks as ground truth for evaluation purpose only.

Labeled Faces in the Wild dataset [25, 33] is a dataset of 13233 faces. A subpart of the funneled version [24] has been segmented and manually annotated [27], providing 2927 groundtruth masks. We use the non-annotated images for our training set. We split the annotated images between validation and testing sets so that there is no overlap in the identity of the persons between both sets. The test set is composed of 1600 images, and the validation set of 1327 images.

The Caltech-UCSD Birds 200 2011 (CUB-200-2011) dataset [53] is a dataset containing 11788 photographs of birds. We use 10000 images for our training split, 1000 for the test split, and the rest for validation.

As a sanity check, we also build a toy dataset colored-2-MNIST in which each sample is composed of an uniform background on which we draw two colored MNIST [34] numbers: one odd number and one even number. Odd and even numbers have colors sampled from different distributions so that our model can learn to differentiate them. For this dataset, we set n=3 as there are three components.

As an additional experiment, we also build a new dataset by fusing Flowers and LFW datasets. This new Flowers+LFW dataset has more variability, and contains different type of objects. We used this dataset to demonstrate that ReDO can work without label information on problems with multiple categories of objects.

Figure 2: Generated samples (not cherry-picked, zoom in for better visibility). For each dataset, the columns are from left to right: 1) input images, 2) ground truth masks, 3) masks inferred by the model for object one, 4-7) generation by redrawing object one, 8-11) generation by redrawing object two. As we keep the same 𝐳𝐢 on any given column, the color and texture of the redrawn object is kept constant across rows. More samples are provided in Supplementary materials. Faces from the LFW dataset have been anonymized, in vizualisations only, to protect personality rights.

5.2 Results

To evaluate our method ReDO, we use two metrics commonly used for segmentation tasks. The pixel classification accuracy (Acc) measures the proportion of pixels that have been assigned to the correct region. The intersection over union (IoU) is the ratio between the area of the intersection between the inferred mask and the ground truth over the area of their union. In both cases, higher is better. Because ReDO is unsupervised and we can’t control which output region corresponds to which object or background in the image, we compute our evaluation based on the regions permutation that matches the ground truth the best. For model selection, we used IoU computed on a held out labeled validation set. When available, we present our evaluation on both the training set and a test set as, in an unsupervised setting, both can be relevant depending on the specific use case. Results are presented in Table 1 and show that ReDO achieves reasonable performance on the three real-world datasets.

We also compared the performance of ReDO, which is unsupervised, with a supervised method, keeping the same architecture for 𝙵 in both cases. We analyze how many training samples are needed to reach the performance of the unsupervised model (see Figure 4). One can see that the unsupervised results are in the range of the ones obtained with a supervised method, and usually outperform supervised models trained with less than around 50 or 100 examples depending on the dataset. For instance, on the LFW Dataset, the unsupervised model obtains about 92% of accuracy and 79% IoU and the supervised model needs 50-60 labeled examples to reach similar performance.

We provide random samples of extracted masks (Figure 2) and the corresponding generated images with a redrawn object or background. Note that our objective is not to generate appealing images but to learn an object segmentation function. Therefore, ReDO generates images that are less realistic than the ones generated by state-of-the-art GANs. Focus is, instead, put on the extracted masks, and we can see the good quality of the obtained segmentation in many cases. Best and worst masks, as well as more random samples, are displayed in Supplementary materials.

We also trained ReDO on the fused Flowers+LFW dataset without labels. We re-used directly the hyper-parameters we have used to fit the Flowers dataset without further tuning and obtained, as preliminary results, a reasonable accuracy of 0.856 and an IoU of 0.691. This shows that ReDO is able to infer class information from masks even in a fully unsupervised setup. Samples are displayed in Figure 3.

Figure 3: Results on LFW + Flowers dataset, arranged as in Figure 2. As 𝐳 is kept constant on a column across all rows, we can observe that 𝐳 codes for different textures depending on the class of the image even though the generator is never given this information explicitly. Faces from the LFW dataset have been anonymized, in vizualisations only, to protect personality rights.
Dataset Train Acc Train IoU Test Acc Test IoU
LFW - - 0.917 ± 0.002 0.781 ± 0.005
CUB 0.840 ± 0.012 0.423 ± 0.023 0.845 ± 0.012 0.426 ± 0.025
Flowers* 0.886 ± 0.008 0.780 ± 0.012 0.879 ± 0.008 0.764 ± 0.012
Flowers+LFW - - 0.856 0.691
Table 1: Performance of ReDO in accuracy (Acc) and intersection over union (IoU) on retrieved masks. Means and standard deviations are based on five runs with fixed hyper-parameters. LWF train set scores are not available since we trained on unlabeled images. *Please note that segmentations provided along the original Flowers dataset [41] have been obtained using an automated method. We display samples with top disagreement masks between ReDO and ground truth in Supplementary materials. In those cases, we find ours to provide better masks.
Figure 4: Comparison with supervised baseline as a function of the number of available training samples.

6 Conclusion

We presented a novel method called ReDO for unsupervised learning to segment images. Our proposal is based on the assumption that if a segmentation model is accurate, then one could edit any real image by replacing any segmented object in a scene by another one, randomly generated, and the result would still be a realistic image. This principle allows casting the unsupervised learning of image segmentation as an adversarial learning problem. Our experimental results obtained on three datasets show that this principle works. In particular, our segmentation model is competitive with supervised approaches trained on a few hundred labeled examples.

Our future work will focus on handling more complex and diverse scenes. As mentioned in Section 2, our model could generalize to an arbitrary number of objects and objects of unknown classes via iterative design and/or class agnostic generators. Currently, we are mostly limited by our ability to effectively train GANs on those more complicated settings but rapid advances in image generation [3, 29, 37] make it a reasonable goal to pursue in a near future. Meanwhile, we will be investigating the use of the model in a semi-supervised or weakly-supervised setup. Indeed, additional information would allow us to guide our model for harder datasets while requiring fewer labels than fully supervised approaches. Conversely, our model could act as a regularizer by providing a prior for any segmentation tasks.

Acknowledgments

This work was supported by the French National Research Agency projects LIVES (grant number ANR-15-CE23-0026-03) and "Deep in France" (grant number ANR-16-CE23-0006).

References

  • [1] A. Almahairi, S. Rajeshwar, A. Sordoni, P. Bachman, and A. Courville (2018) Augmented cyclegan: learning many-to-many mappings from unpaired data. In International Conference on Machine Learning, pp. 195–204. Cited by: §4.
  • [2] P. Arbeláez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik (2014) Multiscale combinatorial grouping. In Computer Vision and Pattern Recognition, Cited by: §2.
  • [3] A. Brock, J. Donahue, and K. Simonyan (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, External Links: Link Cited by: §3.3, §3.3, §4, §4, §6.
  • [4] C. P. Burgess, L. Matthey, N. Watters, R. Kabra, I. Higgins, M. Botvinick, and A. Lerchner (2019) MONet: unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390. Cited by: §1, §2.
  • [5] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818. Cited by: §1, §1, §2.
  • [6] M. Chen, L. Denoyer, and T. A. res (2018) Multi-view data generation without view supervision. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • [7] T. Q. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620. Cited by: §2.
  • [8] T. Chen, M. Lucic, N. Houlsby, and S. Gelly (2019) On self modulation for generative adversarial networks. In International Conference on Learning Representations, External Links: Link Cited by: §4.
  • [9] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180. Cited by: §2, §3.3.
  • [10] E. L. Denton and v. Birodkar (2017) Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 4414–4423. External Links: Link Cited by: §2.
  • [11] C. Donahue, A. Balsubramani, J. McAuley, and Z. C. Lipton (2018) Semantically decomposing the latent spaces of generative adversarial networks. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • [12] V. Dumoulin, J. Shlens, and M. Kudlur (2017) A learned representation for artistic style. Proc. of ICLR 2. Cited by: §4.
  • [13] T. Durand, T. Mordan, N. Thome, and M. Cord (2017) WILDCAT: weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [14] S. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, G. E. Hinton, et al. (2016) Attend, infer, repeat: fast scene understanding with generative models. In Advances in Neural Information Processing Systems, pp. 3225–3233. Cited by: §1, §2.
  • [15] B. Esmaeili, H. Wu, S. Jain, A. Bozkurt, N. Siddharth, B. Paige, D. H. Brooks, J. Dy, and J. Meent (2019) Structured disentangled representations. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2525–2534. Cited by: §2.
  • [16] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on International Conference on Machine Learning-Volume 37, pp. 1180–1189. Cited by: §2.
  • [17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §3.3.
  • [18] K. Greff, R. L. Kaufmann, R. Kabra, N. Watters, C. Burgess, D. Zoran, L. Matthey, M. Botvinick, and A. Lerchner (2019) Multi-object representation learning with iterative variational inference. arXiv preprint arXiv:1903.00450. Cited by: §1, §2.
  • [19] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra (2015) DRAW: a recurrent neural network for image generation. In Proceedings of the 32nd International Conference on International Conference on Machine Learning-Volume 37, pp. 1462–1471. Cited by: §2.
  • [20] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1, §1, §2.
  • [21] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework. In 5th International Conference on Learning Representations, ICLR 2017, External Links: Link Cited by: §2.
  • [22] K. Hsu, Y. Lin, and Y. Chuang (2018) Co-attention cnns for unsupervised object co-segmentation.. In IJCAI, pp. 748–756. Cited by: §2.
  • [23] K. Hsu, Y. Lin, and Y. Chuang (2019) DeepCO 3: deep instance co-segmentation by co-peak search and co-saliency detection. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [24] G. B. Huang, V. Jain, and E. Learned-Miller (2007) Unsupervised joint alignment of complex images. In ICCV, Cited by: §5.1.
  • [25] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller (2007-10) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical report Technical Report 07-49, University of Massachusetts, Amherst. Cited by: §5.1.
  • [26] X. Ji, J. F. Henriques, and A. Vedaldi (2019) Invariant information distillation for unsupervised image segmentation and clustering. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • [27] A. Kae, K. Sohn, H. Lee, and E. Learned-Miller (2013) Augmenting CRFs with Boltzmann machine shape priors for image labeling. In CVPR, University of Massachusetts Amherst and University of Michigan Ann Arbor. Cited by: §5.1.
  • [28] A. Kanezaki (2018) Unsupervised image segmentation by backpropagation. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Cited by: §2.
  • [29] T. Karras, S. Laine, and T. Aila (2018) A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948. Cited by: §6.
  • [30] H. Kim and A. Mnih (2018) Disentangling by factorising. In International Conference on Machine Learning, pp. 2654–2663. Cited by: §2.
  • [31] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
  • [32] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, et al. (2017) Fader networks: manipulating images by sliding attributes. In Advances in Neural Information Processing Systems, pp. 5967–5976. Cited by: §2.
  • [33] G. B. H. E. Learned-Miller (2014-05) Labeled faces in the wild: updates and new reporting procedures. Technical report Technical Report UM-CS-2014-003, University of Massachusetts, Amherst. Cited by: §5.1.
  • [34] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §5.1.
  • [35] J. H. Lim and J. C. Ye (2017) Geometric gan. arXiv preprint arXiv:1705.02894. Cited by: §3.3.
  • [36] M. Long, Z. Cao, J. Wang, and M. I. Jordan (2018) Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pp. 1640–1650. Cited by: §2.
  • [37] M. Lucic, M. Tschannen, M. Ritter, X. Zhai, O. Bachem, and S. Gelly (2019) High-fidelity image generation with fewer labels. arXiv preprint arXiv:1903.02271. Cited by: §4, §6.
  • [38] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun (2016) Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pp. 5040–5048. Cited by: §2.
  • [39] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, External Links: Link Cited by: §3.3, §3.3, §4.
  • [40] M. Nilsback and A. Zisserman (2007) Delving into the whorl of flower segmentation.. In BMVC, Vol. 2007, pp. 1–10. Cited by: §5.1.
  • [41] M. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. Cited by: §5.1, Table 1.
  • [42] P. Ostyakov, R. Suvorov, E. Logacheva, O. Khomenko, and S. I. Nikolenko (2018) SEIGAN: towards compositional image generation by simultaneously learning to segment, enhance, and inpaint. arXiv preprint arXiv:1811.07630. Cited by: §2.
  • [43] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville (2018) Film: visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §4.
  • [44] T. T. Pham, T. Do, N. Sünderhauf, and I. Reid (2018) SceneCut: joint geometric and object segmentation for indoor scenes. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9. Cited by: §2.
  • [45] T. Remez, J. Huang, and M. Brown (2018) Learning to segment via cut-and-paste. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 37–52. Cited by: §2.
  • [46] C. Rother, T. Minka, A. Blake, and V. Kolmogorov (2006) Cosegmentation of image pairs by histogram matching-incorporating a global constraint into mrfs. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 1, pp. 993–1000. Cited by: §2.
  • [47] A. M. Saxe, J. L. McClelland, and S. Ganguli (2014) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, External Links: Link Cited by: §4.
  • [48] O. Sbai, C. Couprie, and M. Aubry (2018) Vector image generation by learning parametric layer decomposition. arXiv preprint arXiv:1812.05484. Cited by: §2.
  • [49] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from rgbd images. In European Conference on Computer Vision, pp. 746–760. Cited by: §2.
  • [50] D. Tran, R. Ranganath, and D. M. Blei (2017) Deep and hierarchical implicit models. arXiv preprint arXiv:1702.08896 7. Cited by: §3.3.
  • [51] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176. Cited by: §2.
  • [52] S. van Steenkiste, K. Kurach, and S. Gelly (2018) A case for object compositionality in deep generative models of images. arXiv preprint arXiv:1810.10340. Cited by: §2.
  • [53] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical report Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: §5.1.
  • [54] X. Wang and A. Gupta (2016) Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision, pp. 318–335. Cited by: §4.
  • [55] X. Xia and B. Kulis (2017) W-net: a deep model for fully unsupervised image segmentation. arXiv preprint arXiv:1711.08506. Cited by: §1, §2.
  • [56] J. Yang, A. Kannan, D. Batra, and D. Parikh (2017) LR-GAN: layered recursive generative adversarial networks for image generation. In 5th International Conference on Learning Representations, ICLR 2017, External Links: Link Cited by: §1, §2.
  • [57] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: §3.3, §3.3, §4.
  • [58] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §1, §1, §2, §4.
  • [59] Y. Zhou, Y. Zhu, Q. Ye, Q. Qiu, and J. Jiao (2018-06) Weakly supervised instance segmentation using class peak response. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [60] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §4.

Supplementary Material for Unsupervised Object Segmentation by Redrawing

We provide architectural details, hyper-parameters discussion and additional samples output of our model.

Architectural details

mask network 𝚏(𝐈) nonlinearities output size
Image 𝐈 3x128x128
Conv 7x7 (reflect. pad 3) Instance Norm, ReLU 16x128x128
Conv 3x3 (stride 2, pad 1) Instance Norm, ReLU 32x64x64
Conv 3x3 (stride 2, pad 1) Instance Norm, ReLU 64x32x32
Residual Bloc (Instance Norm, ReLU) 64x32x32
Residual Bloc (Instance Norm, ReLU) 64x32x32
Residual Bloc (Instance Norm, ReLU) 64x32x32
Pyramid Pooling Module 68x32x32
Upsample 68x64x64
Conv 3x3 (pad 1) Instance Norm, ReLU 34x64x64
Upsample 34x128x128
Conv 3x3 (pad 1) Instance Norm, ReLU 17x128x128
Conv 3x3 (reflect. pad 3) sigmoid (if n=2) or softmax nx128x128
Table 2: Architecture of mask network f. Overall architecture is similar to Cycle-GAN for image translation, except with less residual blocs but a Pyramid Pooling Module introduced by PSPNet.
discriminator network 𝙳 and encoder δ output size
image input 𝐈 3x128x128
Down Res Bloc (ReLU) 64x64x64
Self-Attention Bloc
Down Res Bloc (ReLU) 128x32x32
Down Res Bloc (ReLU) 256x16x16
Down Res Bloc (ReLU) 512x8x8
Down Res Bloc (ReLU) 1024x4x4
Res Bloc (ReLU) 1024x4x4
Spatial sum pooling 1024x1x1
Linear 1 for 𝙳, 32 for δ
Table 3: Architecture of discriminator network 𝙳 and encoder δ.

Hyperparameters

We discuss some notable hyper-parameters and architectural choices. We identified λz and the initialization scheme as critical hyper-parameters and focus our search on those, in addition to the standard learning rates search. The number of channels in our generators is also important.

  • Learning rates are set to 10-4 except for mask network for which we use a smaller learning rate (10-5). We searched for learning rates independently for each neural network.

  • Batch size were set between 20 and 25, as much as we could fit on a single Nvidia Tesla P100 GPU.

  • We chose smaller sizez for each 𝐳k (16 for c2-MNIST and 32 for the other datasets) than what is usually found in GAN literature so that the vectors could be reasonably retrieved by δ.

  • As expected, we found that λz were important to tune for the stability of our training procedure. We use λz=5n*sizez for all datasets except LFW where λz=15n*sizez.

  • Adequate orthogonal initialization is critical. We found that our model worked best when initialized with a gain at 0.8, and not at all when set at .2 and 1.4.

  • We tested smaller numbers of channels for our generators 𝙶k since each generator only have to model a specific type of object. We still use the same number (ch=64) as the other networks for LFW and colored-2-MNIST but reduced to ch=32 for CUB and Flowers.

  • We used spectral normalization for all our networks except the mask network on which a weight decay of 10-4 is applied instead.

  • The use of pyramid pooling produce masks of significantly better quality that standard residual network.

  • The use of self-attention in both D and G might not be mandatory but more experiments would be needed to conclude.

  • We found that having different ratio of discriminator updates and generator updates didn’t help.

Additional output masks

Figure 5: Masks obtained for images from the Flowers test set. Each bloc of three rows depict from top to bottom input image, ground truth and output of our model. Each bloc from top to bottom: 1) top masks according to accuracy 2) top masks according to IoU 3-4) randomly sampled masks 5) worst masks for accuracy 6) worst masks for IoU. Because the ground truth for Flowers were obtained via an automated process, our model actually provide better predictions in worst agreement case.
Figure 6: Masks obtained for images from the CUB test set. Each bloc of three rows depict from top to bottom input image, ground truth and output of our model. Each bloc from top to bottom: 1) top masks according to accuracy 2) top masks according to IoU 3-4) randomly sampled masks 5) worst masks for accuracy 6) worst masks for IoU.
Figure 7: More randomly sampled output masks on Flower dataset.
Figure 8: More randomly sampled output masks on CUB dataset.