Bi-Directional Domain Translation for Zero-Shot Sketch-Based Image Retrieval

  • 2019-11-29 17:43:45
  • Jiangtong Li, Zhixin Ling, Li Niu, Liqing Zhang

Jiangtong Li
Shanghai Jiao Tong University
   Zhixin Ling
Shanghai Jiao Tong University
   Li Niu
Shanghai Jiao Tong University
   Liqing Zhang
Shanghai Jiao Tong University
Abstract

The goal of Sketch-Based Image Retrieval (SBIR) is to use free-hand sketches to retrieve images of the same category from a natural image gallery. However, SBIR requires all categories to be seen during training, which cannot be guaranteed in real-world applications. We therefore investigate the more challenging Zero-Shot SBIR (ZS-SBIR), in which test categories do not appear in the training stage. Traditional SBIR methods are prone to category-based retrieval and cannot generalize well from seen categories to unseen ones. In contrast, we disentangle image features into structure features and appearance features to facilitate structure-based retrieval. To assist feature disentanglement and take full advantage of the disentangled information, we propose a Bi-directional Domain Translation (BDT) framework for ZS-SBIR, in which the image domain and sketch domain can be translated to each other through disentangled structure and appearance features. Finally, we perform retrieval in both the structure feature space and the image feature space. Extensive experiments demonstrate that our proposed approach outperforms state-of-the-art approaches by about 8% on the Sketchy dataset and over 5% on the TU-Berlin dataset.

1 Introduction

In recent years, with the rapid growth of multimedia data on the internet, image retrieval has come to play an increasingly important role in many fields, such as remote sensing and e-commerce. Since sketches can be drawn easily and reveal the characteristics of the target images, sketch-based image retrieval (SBIR), which uses a sketch to retrieve images of the same category, has become widely accepted among users. SBIR has therefore attracted widespread attention in the research community [9, 3, 14, 15, 2, 21, 68, 20, 1, 46, 24, 58, 50, 34, 63, 48, 51, 40]. In the conventional setting, it is assumed that the images and sketches in the training and test sets share the same set of categories. However, in real-world applications, the categories of test sketches/images may fall outside the scope of the training categories.

Figure 1: For both seen categories and unseen categories, we visualize the feature of a query sketch (red star) and image features from different categories (points of different colors) obtained by SBIR method SaN [64], together with the query sketch and an image from the same category.

In this paper, we focus on a more challenging task called zero-shot sketch-based image retrieval (ZS-SBIR) [52], which assumes that test categories do not appear in the training stage. In the remainder of this paper, we refer to training (resp., test) categories as seen (resp., unseen) categories [11]. Traditional SBIR methods suffer a sharp performance drop in the ZS-SBIR setting [62], probably because they are prone to learning category-based retrieval. Specifically, based on the analysis in [62], since the evaluation methodology is category-based, traditional SBIR methods may take a shortcut by correlating sketches/images with their category labels and retrieving the images from the same category as the query sketch, which is very effective when test data share the same categories as training data. However, this shortcut often fails when the test categories are not present in the training stage. As illustrated in Figure 1, based on the pairwise distance in feature space, the SBIR method SaN [64] succeeds in retrieving images from the seen category “giraffe” when given a “giraffe” query sketch, but fails on the unseen category “church”. We conjecture that, to generalize well from seen categories to unseen categories, a model should learn to align the structure information (e.g., outline, shape) of sketches with the corresponding structure information of images (e.g., the structure of the church spire in Figure 1). We refer to this as structure-based retrieval, which is ignored by traditional SBIR methods like [64].

Existing ZS-SBIR methods can be categorized into three groups: (1) using a generative model based on aligned sketch-image pairs (a sketch is drawn based on a given image and thus has roughly the same outline as this image) to reduce the gap between seen and unseen categories [62]; (2) employing semantic information to reduce the intra-class variance in sketches to stabilize the training process [60, 59, 12, 52]; (3) fine-tuning the pre-trained model in ZS-SBIR task with semantic-aware knowledge preservation to prevent catastrophic forgetting [37]. However, the aligned sketch-image pairs and semantic information are not always available. Moreover, most of the above methods did not achieve the goal of structure-based retrieval. The method in [62] made an attempt at structure-based retrieval but did not explicitly extract structure information from images. In terms of the extraction of image structure information, some prior works relied on sketch tokens, which are obtained by extracting the outlines of images [36, 58, 63]. However, the sketch tokens obtained in this way are not very reliable due to the noisy and redundant information, which significantly limits the performance of these methods.

In this work, to facilitate structure-based retrieval, we disentangle image features into structure features and appearance features, where the former encode the structure information (e.g., outline, shape) and the latter encode the additional detailed information (e.g., color, texture). To assist feature disentanglement and take full advantage of the disentangled information, we propose a Bi-directional Domain Translation (BDT) framework, where sketches and images are deemed to be two domains. As shown in Figure 2, we first use a pre-trained model to extract features from sketches (resp., images), which are dubbed sketch (resp., image) features. Then, the image features are disentangled into structure features and appearance features, while the sketch features are also projected to the shared structure feature space. Furthermore, bi-directional domain translation is performed through the structure features and appearance features. Concretely, for image-to-sketch translation, we project image features to structure features and then generate sketch features. For sketch-to-image translation, we project sketch features to structure features, which are combined with variational appearance features to compensate for the uncertainty when we generate image features from sketch features.

Finally, we perform retrieval in both structure feature space and image feature space, to combine the best of two worlds. The effectiveness of our proposed BDT framework is verified by comprehensive experimental results on two benchmark datasets. Our main contributions are summarized as follows:

  • To the best of our knowledge, we are the first to disentangle image features into structure features and appearance features to facilitate structure-based retrieval.

  • We propose a bi-directional domain translation framework for the zero-shot sketch-based image retrieval task.

  • Comprehensive results on two popular large-scale datasets show that our framework significantly outperforms the state-of-the-art methods.

2 Related Work

2.1 SBIR and ZS-SBIR

The main goal of sketch-based image retrieval (SBIR) is to bridge the gap between the image domain and the sketch domain. Previous SBIR methods can be categorized into hand-crafted feature based methods and deep learning based methods. Before deep learning was introduced to this task, hand-crafted feature based methods generally extracted edge maps from natural images and then matched them with sketches using hand-crafted features [50, 20, 15, 21, 14]. In recent years, deep learning based methods have become popular in this area. To reduce the gap between the image domain and the sketch domain, variants of the siamese network [48, 51, 56] and the ranking loss [8, 51] were adapted to this task. Besides, semantic information and adversarial loss were also introduced to preserve the domain-invariant information [4].

Zero-shot sketch-based image retrieval (ZS-SBIR) was proposed by [52] and then followed by [62, 60, 59, 37, 12]. To reduce the intra-class variance in sketches and stabilize the training process, semantic information was leveraged in [59, 52, 60, 12]. To reduce the gap between seen and unseen categories, a generative model trained on aligned data pairs was proposed in [62]. To adapt the pre-trained model to ZS-SBIR without forgetting the knowledge of ImageNet [10], a semantic-aware knowledge preservation mechanism was used in [37]. However, none of the above methods attempted to disentangle images into structure information and appearance information, which is explored in this work.

2.2 Disentangled Representation

Disentangled representation learning aims to divide the latent representation into multiple units, with each unit corresponding to one latent factor (e.g., position, scale, identity). Each unit is only affected by its corresponding latent factor, but not influenced by other latent factors. Disentangled representations are more generalizable and semantically meaningful, and thus useful for a variety of tasks.

Disentangled representation learning methods can be categorized into unsupervised methods and supervised methods according to whether supervision for latent factors is available. For unsupervised disentanglement, abundant methods have been developed, including InfoGAN [6], MTAN [38], β-VAE [19], JointVAE [11], FactorVAE [26], InfoVAE [66] and TCVAE [5]. Most of them encouraged statistical independence across different dimensions of the latent representation while maintaining the mutual information between input data and latent representations. For supervised disentanglement, Kingma et al. [30] used disentangled representation to enhance semi-supervised learning. Zheng et al. [67] proposed DG-Net to integrate discriminative and generative learning using disentangled representation. Besides, supervised disentanglement has been applied to different tasks, like person re-id [67], face recognition [35, 39, 53, 57], and image generation [41, 61, 43, 25]. Our work is the first to apply disentangled representation learning to sketch-based image retrieval task.

2.3 Domain Translation

Many domain translation approaches, such as Pix2Pix [23], CycleGAN [69], BiCycleGAN [70], StarGAN [7], and DiscoGAN [27], have been proposed to translate between two domains (e.g., sketch domain and image domain). In this subsection, we mainly discuss the domain translation methods [32, 33, 22, 17] based on disentangled representation. Overall, they disentangle the latent representation into a domain-specific representation and a domain-invariant representation. In our problem, structure (resp., appearance) features can be treated as the domain-invariant (resp., domain-specific) representation. The translation between two domains in previous works [32, 33, 22, 17] is generally symmetric. In contrast, the translation between the sketch domain and the image domain is asymmetric, because the image domain has an additional domain-specific representation compared with the sketch domain.

Figure 2: An overview of our framework. We first adopt VGG-16 [54] to extract features from image and sketch. Then we disentangle image feature into appearance feature and structure feature, through which bi-directional domain translation is performed between image feature space and sketch feature space.

3 Methodology

In this section, we introduce our proposed Bi-directional Domain Translation (BDT) framework for zero-shot sketch-based image retrieval. In Sec 3.1, we state the problem definition. In Sec 3.2, we elaborate disentangled representation and bi-directional domain translation in detail. In Sec 3.3, we discuss the strategy during training and retrieval.

3.1 Problem Definition

In this paper, we focus on sketch-based image retrieval under zero-shot setting, where only the sketches and images from seen categories are used in the training stage. In the test stage, our proposed framework is expected to use the sketches to retrieve the images, the categories of which are unseen during training.

Formally, we are given a sketch dataset $S_{sk}=\{(\mathbf{x}_i^{sk},y_i)\mid y_i\in\mathcal{Y}\}$ and an image dataset $S_{im}=\{(\mathbf{x}_j^{im},y_j)\mid y_j\in\mathcal{Y}\}$, where $\mathcal{Y}$ is the category label set, and $(\mathbf{x}_i^{sk},y_i)$ and $(\mathbf{x}_j^{im},y_j)$ represent the sketches and images with their corresponding category labels, respectively. Following the zero-shot setting in [62, 59], we split all categories $\mathcal{Y}$ into $\mathcal{Y}^{tr}$ and $\mathcal{Y}^{te}$, with no overlap between the two label sets, i.e., $\mathcal{Y}^{tr}\cap\mathcal{Y}^{te}=\emptyset$. Based on this partition of the label set $\mathcal{Y}$, we can split the sketch (resp., image) dataset into $S_{sk}^{tr}$ and $S_{sk}^{te}$ (resp., $S_{im}^{tr}$ and $S_{im}^{te}$). In the training stage, our model can only process the data in $S_{sk}^{tr}$ and $S_{im}^{tr}$. During testing, given a sketch $\mathbf{x}^{sk}$ from $S_{sk}^{te}$, our model needs to retrieve the images belonging to the same category as $\mathbf{x}^{sk}$ from the test image gallery $S_{im}^{te}$.
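The split described above can be sketched in a few lines. This is only an illustration: the function name `split_by_category` and the toy file names and labels are hypothetical, not taken from the paper.

```python
def split_by_category(samples, unseen_categories):
    """Split (item, label) pairs into training and test sets with disjoint labels.

    Samples whose label is in `unseen_categories` go to the test set;
    all other samples go to the training set.
    """
    unseen = set(unseen_categories)
    train = [(x, y) for x, y in samples if y not in unseen]
    test = [(x, y) for x, y in samples if y in unseen]
    return train, test


# Toy data (hypothetical file names and labels).
sketches = [("s1.png", "giraffe"), ("s2.png", "church"), ("s3.png", "cat")]
images = [("i1.jpg", "giraffe"), ("i2.jpg", "church"), ("i3.jpg", "cat")]
unseen_labels = {"church"}

sk_train, sk_test = split_by_category(sketches, unseen_labels)
im_train, im_test = split_by_category(images, unseen_labels)

train_labels = {y for _, y in sk_train} | {y for _, y in im_train}
test_labels = {y for _, y in sk_test} | {y for _, y in im_test}
```

Applying the same unseen label set to both the sketch and the image data keeps the training and test label sets disjoint in both domains.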

The overall framework of our method is illustrated in Figure 2. We input a triplet containing a sketch $\mathbf{x}^{sk}$ and an image $\mathbf{x}^{im}$ from the same category, together with another image $\mathbf{x}^{im-}$ from a different category. First, a pre-trained model extracts features $\mathbf{f}_{sk}$ (resp., $\mathbf{f}_{im}$ and $\mathbf{f}_{im^-}$) from $\mathbf{x}^{sk}$ (resp., $\mathbf{x}^{im}$ and $\mathbf{x}^{im-}$). Then, image features $\mathbf{f}_{im}$ (resp., $\mathbf{f}_{im^-}$) are disentangled into appearance features $\mathbf{f}_{im}^{ap}$ (resp., $\mathbf{f}_{im^-}^{ap}$) and structure features $\mathbf{f}_{im}^{st}$ (resp., $\mathbf{f}_{im^-}^{st}$). We employ a ranking loss on ($\mathbf{f}_{im^-}^{st}$, $\mathbf{f}_{im}^{st}$, $\mathbf{f}_{sk}^{st}$) as well as an orthogonal loss on ($\mathbf{f}_{im}^{st}$, $\mathbf{f}_{im}^{ap}$) to disentangle appearance features and structure features. Furthermore, we use image structure features $\mathbf{f}_{im}^{st}$ to reconstruct sketch features $\mathbf{f}_{sk}$ with a reconstruction loss and an adversarial loss, because $\mathbf{x}^{sk}$ and $\mathbf{x}^{im}$ belong to the same category. Similarly, we can use sketch structure features $\mathbf{f}_{sk}^{st}$ along with $\mathbf{f}_{im}^{ap}$ to reconstruct $\mathbf{f}_{im}$. To support stochastic sampling in the test stage, we use $\mathbf{f}_{im}^{ap}$ to infer variational appearance features $\mathbf{z}_{im}^{ap}$, which are combined with $\mathbf{f}_{sk}^{st}$ to reconstruct $\mathbf{f}_{im}$. In the test stage, given an image (resp., sketch), we can obtain its structure feature as well as its reconstructed (resp., generated) image feature, so that an image and a sketch can be compared in both the structure feature space and the image feature space.

3.2 Our Framework

3.2.1 Feature Extractor

Since sketches are highly abstract and visually sparse compared with natural images, it is hard to obtain adequate information from sketches when using a pre-trained model as feature extractor. To tackle this problem without using more parameters, we adopt the fusion strategy in [59] to concatenate the features extracted from multiple layers of the pre-trained model for both images and sketches.

In detail, we first use a pre-trained backbone model, i.e., VGG-16 pre-trained on ImageNet [10], to process each sketch and image. Let $\mathbf{F}_i$ be the output feature of the $i$-th convolution layer and $\mathbf{f}_{fc}$ the output feature of the last fully connected layer. The final feature $\mathbf{f}$ is then obtained by concatenating $\mathbf{f}_{fc}$ with the global average pooling (GAP) of $\mathbf{F}_i$:

$$\mathbf{f}=[\mathbf{f}_{fc},\mathrm{GAP}(\mathbf{F}_5),\mathrm{GAP}(\mathbf{F}_4),\mathrm{GAP}(\mathbf{F}_3)]. \tag{1}$$
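The concatenation in Eq. (1) can be illustrated with a minimal pure-Python sketch. The helper names and the toy shapes below are assumptions made for illustration (real VGG-16 feature maps are far larger), and feature maps are represented as nested lists of shape C x H x W.

```python
def global_average_pool(feature_map):
    """Average a C x H x W feature map (nested lists) over H and W."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def fuse_features(f_fc, conv_maps):
    """Concatenate f_fc with GAP(F_5), GAP(F_4), GAP(F_3), in that order.

    `conv_maps` maps an index (3, 4, 5) to the corresponding feature map.
    """
    fused = list(f_fc)
    for index in (5, 4, 3):
        fused.extend(global_average_pool(conv_maps[index]))
    return fused


# Toy shapes (hypothetical): a 2-dim FC output and 1, 2, 3 channels
# for the feature maps with index 5, 4, 3.
f_fc = [0.5, 1.5]
conv_maps = {
    5: [[[1.0, 3.0], [5.0, 7.0]]],
    4: [[[2.0, 2.0]], [[0.0, 4.0]]],
    3: [[[1.0]], [[2.0]], [[3.0]]],
}
fused = fuse_features(f_fc, conv_maps)
```

The fused vector has one entry per FC dimension plus one entry per channel of each pooled feature map, so no extra learnable parameters are introduced.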

3.2.2 Disentangled Representation

To achieve the goal of structure-based retrieval, we aim to disentangle structure information from the image feature. Given an image feature $\mathbf{f}_{im}$, we adopt two image encoders $E_{im}^{ap}$ and $E_{im}^{st}$ to disentangle $\mathbf{f}_{im}$ into an image appearance feature $\mathbf{f}_{im}^{ap}$ and an image structure feature $\mathbf{f}_{im}^{st}$. Besides, to project the sketch feature $\mathbf{f}_{sk}$ to the same structure feature space as $\mathbf{f}_{im}^{st}$, a sketch encoder $E_{sk}^{st}$ is adopted to obtain the sketch structure feature $\mathbf{f}_{sk}^{st}$. The above process is formulated as follows,

$$\mathbf{f}_{im}^{ap}=E_{im}^{ap}(\mathbf{f}_{im});\quad \mathbf{f}_{im}^{st}=E_{im}^{st}(\mathbf{f}_{im});\quad \mathbf{f}_{sk}^{st}=E_{sk}^{st}(\mathbf{f}_{sk}). \tag{2}$$

In each training step, apart from sampling a positive sketch-image pair ($\mathbf{f}_{sk}$, $\mathbf{f}_{im}$) of the same category, we also sample a negative image $\mathbf{f}_{im^-}$, which belongs to a different category. Therefore, a triplet ($\mathbf{f}_{sk}$, $\mathbf{f}_{im}$, $\mathbf{f}_{im^-}$) is fed into the network. We expect the structure features of images and sketches to lie in the same feature space. Moreover, in the structure feature space shared by sketches and images, we expect intra-class coherence and inter-class separability across different domains (i.e., sketch domain and image domain). Specifically, we aim to pull sketches close to the images of the same category and push sketches apart from the images of a different category. To this end, we employ a ranking loss with L2 distance:

$$\mathcal{L}_{rk}=\|\mathbf{f}_{sk}^{st}-\mathbf{f}_{im}^{st}\|_2+\max\left(0,\,m-\|\mathbf{f}_{sk}^{st}-\mathbf{f}_{im^-}^{st}\|_2\right), \tag{3}$$

in which the margin $m$ is empirically set to 10.0 in our experiments.
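As a small illustration of Eq. (3), the sketch below evaluates the ranking loss on toy two-dimensional structure features. The function names and the toy vectors are hypothetical; only the formula and the margin value come from the text.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ranking_loss(f_sk_st, f_im_st, f_neg_st, margin=10.0):
    """Positive-pair distance plus a hinge on the negative-pair distance."""
    positive_term = l2_distance(f_sk_st, f_im_st)
    negative_term = max(0.0, margin - l2_distance(f_sk_st, f_neg_st))
    return positive_term + negative_term


sketch = [0.0, 0.0]
positive = [3.0, 4.0]        # distance 5 from the sketch
near_negative = [0.0, 6.0]   # distance 6, inside the margin
far_negative = [0.0, 12.0]   # distance 12, outside the margin

loss_near = ranking_loss(sketch, positive, near_negative)
loss_far = ranking_loss(sketch, positive, far_negative)
```

With these toy vectors, the negative that lies inside the margin adds a hinge penalty, while the negative beyond the margin contributes nothing, so only the positive-pair distance remains.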

After enforcing the structure features of images to share the same structure feature space as sketches, we further expect the appearance features of images to contain only information that is complementary to the structure features (e.g., color, texture). To ensure that the image features are disentangled into the structure feature space and the appearance feature space, we impose an orthogonal constraint on the structure features and appearance features of images based on cosine similarity:

$$\mathcal{L}_{or}=\cos(\mathbf{f}_{im}^{ap},\mathbf{f}_{im}^{st})=\frac{\mathbf{f}_{im}^{ap}\cdot\mathbf{f}_{im}^{st}}{\|\mathbf{f}_{im}^{ap}\|_2\,\|\mathbf{f}_{im}^{st}\|_2}, \tag{4}$$

where $\cdot$ denotes the dot product between two vectors. Note that $\mathbf{f}_{im}^{ap}$ and $\mathbf{f}_{im}^{st}$ are outputs of a ReLU activation, so $\cos(\cdot,\cdot)$ is always non-negative and minimizing (4) pushes $\cos(\cdot,\cdot)$ towards zero.
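A minimal sketch of the orthogonal loss in Eq. (4) follows. The function name, the `eps` guard against all-zero vectors, and the toy inputs are assumptions added for illustration; the inputs are non-negative to mimic ReLU outputs.

```python
import math


def orthogonal_loss(f_ap, f_st, eps=1e-12):
    """Cosine similarity between appearance and structure features.

    `eps` is an added safeguard (not from the paper) against division
    by zero when one of the vectors is all zeros.
    """
    dot = sum(a * s for a, s in zip(f_ap, f_st))
    norm_ap = math.sqrt(sum(a * a for a in f_ap))
    norm_st = math.sqrt(sum(s * s for s in f_st))
    return dot / max(norm_ap * norm_st, eps)


# Non-negative toy features, as would come out of a ReLU.
overlapping = orthogonal_loss([1.0, 1.0, 0.0], [0.0, 1.0, 1.0])  # share one active dim
disjoint = orthogonal_loss([2.0, 0.0, 0.0], [0.0, 3.0, 1.0])     # no shared active dim
```

Because all entries are non-negative, the loss is bounded below by zero, and it reaches zero exactly when the two features have no active dimension in common.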

3.2.3 Bi-directional Domain Translation

To further help learn disentangled representations and fully utilize the disentangled image features, we perform bi-directional domain translation between sketch domain and image domain.

For image-to-sketch translation, we employ a decoder $G_{sk}$ to reconstruct the sketch feature $\mathbf{f}_{sk}$ based on $\mathbf{f}_{im}^{st}$, considering that $\mathbf{f}_{sk}$ and $\mathbf{f}_{im}$ belong to the same category. Denoting $\hat{\mathbf{f}}_{sk}=G_{sk}(\mathbf{f}_{im}^{st})$, we adopt a reconstruction loss $\|\mathbf{f}_{sk}-\hat{\mathbf{f}}_{sk}\|_2$. Furthermore, we employ an adversarial loss to encourage the distribution of generated sketch features to be close to that of real sketch features. The adversarial loss is implemented with a discriminator $D_{sk}$, which distinguishes generated sketch features from real ones. Thus, the total loss of image-to-sketch translation can be written as

$$\mathcal{L}_{tl}^{sk}=\mathcal{L}_{ad}^{sk}+\mathcal{L}_{re}^{sk}=-\log(D_{sk}(\hat{\mathbf{f}}_{sk}))+\|\mathbf{f}_{sk}-\hat{\mathbf{f}}_{sk}\|_2. \tag{5}$$

For sketch-to-image translation, we aim to use the sketch structure features to reconstruct image features from the same category. However, images contain extra appearance information (e.g., color, texture) compared with sketches, so it is necessary to compensate for the appearance uncertainty when translating from structure features to image features. Therefore, image appearance features should be integrated with sketch structure features to reconstruct image features.

In the test stage, given a sketch, we also hope to generate its imaginary image feature to enable retrieval in the image feature space. Nevertheless, we do not have the corresponding image appearance features in this case. One commonly used solution is stochastic sampling during testing. We introduce a variational estimator $V_{im}^{ap}$ to approximate the variational Gaussian distribution $Q(\mathbf{z}_{im}^{ap}\mid\mathbf{f}_{im}^{ap})$ based on $\mathbf{f}_{im}^{ap}$, that is, $(\boldsymbol{\mu}_{im}^{ap},\boldsymbol{\sigma}_{im}^{ap})=V_{im}^{ap}(\mathbf{f}_{im}^{ap})$. Then, we use the Kullback-Leibler divergence to force $Q(\mathbf{z}_{im}^{ap}\mid\mathbf{f}_{im}^{ap})$ to be close to the prior distribution $\mathcal{N}(\mathbf{0},\mathbf{1})$:

$$\mathcal{L}_{kl}=D_{KL}\left(\mathcal{N}(\boldsymbol{\mu}_{im}^{ap},\boldsymbol{\sigma}_{im}^{ap})\,\|\,\mathcal{N}(\mathbf{0},\mathbf{1})\right). \tag{6}$$
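For reference, the divergence in Eq. (6) has a standard closed form when the approximate posterior is a Gaussian with diagonal covariance. The expression below assumes that $\sigma_d$ denotes the per-dimension standard deviation and that $d$ indexes the dimensions of the appearance code; both are interpretations, since the text does not spell out the parameterization.

```latex
\mathcal{L}_{kl}
  = \frac{1}{2}\sum_{d}\left(\mu_d^{2}+\sigma_d^{2}-\log\sigma_d^{2}-1\right)
```

Each summand is zero when $\mu_d=0$ and $\sigma_d=1$, so the loss vanishes exactly when the estimated distribution matches the standard normal prior.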

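We use the reparameterization trick [29] to sample the variational appearance feature $\mathbf{z}_{im}^{ap}$, i.e., $\mathbf{z}_{im}^{ap}=\boldsymbol{\mu}_{im}^{ap}+\boldsymbol{\epsilon}\odot\boldsymbol{\sigma}_{im}^{ap}$, where $\boldsymbol{\epsilon}$ is sampled from $\mathcal{N}(\mathbf{0},\mathbf{1})$ and $\odot$ denotes the element-wise product. We then employ a decoder $G_{im}$ to reconstruct $\mathbf{f}_{im}$ based on the concatenation of $\mathbf{z}_{im}^{ap}$ and $\mathbf{f}_{sk}^{st}$. Denoting $\hat{\mathbf{f}}_{im}=G_{im}([\mathbf{z}_{im}^{ap},\mathbf{f}_{sk}^{st}])$, we employ a reconstruction loss $\|\mathbf{f}_{im}-\hat{\mathbf{f}}_{im}\|_2$ and an adversarial loss implemented with the discriminator $D_{im}$, which distinguishes generated image features from real ones, leading to the following loss function:

$$\mathcal{L}_{tl}^{im}=\mathcal{L}_{ad}^{im}+\mathcal{L}_{re}^{im}=-\log(D_{im}(\hat{\mathbf{f}}_{im}))+\|\mathbf{f}_{im}-\hat{\mathbf{f}}_{im}\|_2. \tag{7}$$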

By performing image-to-sketch translation, we expect that the image structure features contain the necessary structure information to reconstruct the sketch features of the same category. By performing sketch-to-image translation, we expect that the image appearance features contain the necessary appearance information to compensate for the sketch structure features when reconstructing image features. Therefore, bi-directional domain translation could cooperate with ranking loss and orthogonal loss to assist feature disentanglement.

Finally, recall that the discriminator $D_{im}$ (resp., $D_{sk}$) is trained to distinguish the generated image (resp., sketch) features from the real ones. The loss functions for the discriminators can therefore be written as

$$\mathcal{L}_{D_{im}}=-\log(1-D_{im}(\hat{\mathbf{f}}_{im}))-\log(D_{im}(\mathbf{f}_{im})), \tag{8}$$
$$\mathcal{L}_{D_{sk}}=-\log(1-D_{sk}(\hat{\mathbf{f}}_{sk}))-\log(D_{sk}(\mathbf{f}_{sk})). \tag{9}$$
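The opposing roles of the adversarial terms can be seen in a short numeric sketch. It assumes that each discriminator outputs a probability strictly between 0 and 1 (for example via a sigmoid), which the text does not state explicitly; the function names are hypothetical.

```python
import math


def generator_adversarial_loss(d_fake):
    """-log D(f_hat): small when the discriminator scores the generated feature as real."""
    return -math.log(d_fake)


def discriminator_loss(d_real, d_fake):
    """-log(1 - D(f_hat)) - log D(f): small when real and generated are told apart."""
    return -math.log(1.0 - d_fake) - math.log(d_real)


# A confident discriminator (real scored 0.9, generated scored 0.1)
# versus an undecided one (both scored 0.5).
confident = discriminator_loss(d_real=0.9, d_fake=0.1)
undecided = discriminator_loss(d_real=0.5, d_fake=0.5)

# The generator term moves in the opposite direction: it is low when
# the generated feature is scored as real.
fooled = generator_adversarial_loss(0.9)
caught = generator_adversarial_loss(0.1)
```

In practice the probabilities would be clamped away from 0 and 1 (or the losses computed from logits) to avoid taking the logarithm of zero; that detail is omitted here for brevity.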

3.3 Training and Retrieval

The full objective function can be divided into the generation loss and the discrimination loss, which can be expressed as

$$\mathcal{L}_G=\mathcal{L}_{or}+\mathcal{L}_{kl}+\lambda_1\mathcal{L}_{rk}+\lambda_2\mathcal{L}_{tl}^{im}+\mathcal{L}_{tl}^{sk}, \tag{10}$$
$$\mathcal{L}_D=\mathcal{L}_{D_{im}}+\mathcal{L}_{D_{sk}}, \tag{11}$$

in which $\lambda_1$ and $\lambda_2$ are empirically set to 0.5 and 2.0, respectively. Since our model consists of generators and discriminators, we follow the GAN training strategy [18] to stabilize training: the two groups are updated alternately, with $N_G$ iterations minimizing $\mathcal{L}_G$ and $N_D$ iterations minimizing $\mathcal{L}_D$.
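The weighting of the generation loss and the alternating update pattern can be sketched as follows. The function names are hypothetical, and starting each cycle with the generator phase is an assumption, since the text does not specify which group is updated first.

```python
def generator_objective(l_or, l_kl, l_rk, l_tl_im, l_tl_sk, lam1=0.5, lam2=2.0):
    """Weighted sum of the generation loss terms with the reported weights."""
    return l_or + l_kl + lam1 * l_rk + lam2 * l_tl_im + l_tl_sk


def alternating_schedule(total_steps, n_g=100, n_d=50):
    """Yield 'G' or 'D' for each step: n_g generator steps, then n_d discriminator steps.

    The cycle repeats until `total_steps` steps have been produced.
    Beginning with the generator phase is an assumption.
    """
    cycle = n_g + n_d
    for step in range(total_steps):
        yield "G" if step % cycle < n_g else "D"


total = generator_objective(l_or=1.0, l_kl=1.0, l_rk=2.0, l_tl_im=1.0, l_tl_sk=1.0)
phases = list(alternating_schedule(6, n_g=2, n_d=1))
```

In a training loop, each yielded phase would select which optimizer takes a step while the parameters of the other group are held fixed.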

In the test stage, we perform retrieval in both structure feature space and image feature space. Specifically, given a sketch 𝐱sk and an image 𝐱im, we compare them in both feature spaces.

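1) Structure feature space: We project the image feature $\mathbf{f}_{im}$ and the sketch feature $\mathbf{f}_{sk}$ into the shared structure feature space by $\mathbf{f}_{im}^{st}=E_{im}^{st}(\mathbf{f}_{im})$ and $\mathbf{f}_{sk}^{st}=E_{sk}^{st}(\mathbf{f}_{sk})$, respectively. Then, we calculate the cosine distance $1-\cos(\mathbf{f}_{im}^{st},\mathbf{f}_{sk}^{st})$.

2) Image feature space: Based on the sketch structure feature $\mathbf{f}_{sk}^{st}$ and a variational appearance feature sampled from $\mathcal{N}(\mathbf{0},\mathbf{1})$, we can employ the decoder $G_{im}$ to generate an image feature. We generate $N$ image feature vectors by sampling $N$ times ($N=200$ in our experiments) and average them to obtain the final image feature $\hat{\mathbf{f}}_{im}$:

$$\hat{\mathbf{f}}_{im}=\frac{1}{N}\sum_{i=1}^{N}G_{im}([\mathbf{z}_i,\mathbf{f}_{sk}^{st}]), \tag{12}$$

where $\mathbf{z}_i$ is sampled from $\mathcal{N}(\mathbf{0},\mathbf{1})$. Then, we calculate the cosine distance $1-\cos(\hat{\mathbf{f}}_{im},\mathbf{f}_{im})$.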

Finally, we calculate the weighted average of two distances for retrieval:

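$$\mathcal{D}_{fusion}=\omega\,(1-\cos(\hat{\mathbf{f}}_{im},\mathbf{f}_{im}))+(1-\omega)\,(1-\cos(\mathbf{f}_{im}^{st},\mathbf{f}_{sk}^{st})), \tag{13}$$

where $\omega$ is a hyper-parameter that balances the two feature spaces and is set to 0.5 by default.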
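The two-space retrieval procedure can be sketched end to end in plain Python. Everything named here is illustrative: `toy_decoder` is a hypothetical stand-in for the trained decoder, the features are two-dimensional toys, and the concatenation order passed to the decoder is a convention that simply has to match whatever was used in training.

```python
import math
import random


def cosine_distance(a, b):
    """Return 1 - cos(a, b) for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def average_generated_feature(decoder, f_sk_st, z_dim, n_samples=200, seed=0):
    """Average decoder outputs over n_samples appearance codes drawn from N(0, 1)."""
    rng = random.Random(seed)
    total = None
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(z_dim)]
        generated = decoder(z + list(f_sk_st))  # concatenate code and structure
        if total is None:
            total = list(generated)
        else:
            total = [t + g for t, g in zip(total, generated)]
    return [t / n_samples for t in total]


def fused_distance(f_hat_im, f_im, f_im_st, f_sk_st, omega=0.5):
    """Weighted average of the image-space and structure-space cosine distances."""
    image_term = cosine_distance(f_hat_im, f_im)
    structure_term = cosine_distance(f_im_st, f_sk_st)
    return omega * image_term + (1.0 - omega) * structure_term


def toy_decoder(v):
    """Hypothetical decoder: v = [z0, z1, s0, s1]; structure plus small appearance noise."""
    return [v[2] + 0.1 * v[0], v[3] + 0.1 * v[1]]


f_sk_st = [1.0, 2.0]
f_hat_im = average_generated_feature(toy_decoder, f_sk_st, z_dim=2)
score = fused_distance(f_hat_im, [1.0, 2.0], [1.0, 2.0], f_sk_st)
```

To rank a gallery, the fused distance would be computed between the query sketch and every gallery image, and the images sorted in ascending order of that distance.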

                              Sketchy Ext. (aligned)     Sketchy Ext. (unaligned)   TU-Berlin Ext.
Type      Method              Prec@200(%)  mAP@200(%)    Prec@200(%)  mAP@200(%)    Prec@200(%)  mAP@200(%)
SBIR      Cosine              9.0          5.1           9.0          5.1           4.6          2.0
          3D shape [58]       6.1          1.0           7.0          1.8           3.6          0.5
          SaN [64]            15.3         5.8           18.9         8.5           10.1         4.2
          Siamese [48]        24.4         14.6          25.6         15.3          8.3          3.7
ZSL       ESZSL [49]          16.0         8.3           17.2         9.5           4.8          1.7
          SAE [31]            24.4         14.6          27.1         17.5          11.6         5.5
          CMT [55]            26.9         17.6          27.5         17.7          10.0         4.3
          SSE [65]            6.9          2.3           7.3          3.3           4.1          1.2
          DeViSE [16]         14.3         4.7           15.4         5.4           8.0          2.2
ZS-SBIR   CVAE [62]           33.4         22.6          31.2         19.9          10.2         4.9
          SEM-PCYC [12]       28.0         17.7          30.0         19.4          12.4         5.7
          Xu et al. [60]      20.4         12.0          20.8         12.6          7.4          2.9
          BDT-St              36.1         25.5          36.9         25.8          15.2         7.9
          BDT-Im              37.2         26.8          35.1         24.9          14.7         7.1
          BDT                 41.2         29.9          39.7         28.1          17.6         10.2
Table 1: Comparison of our BDT method and baselines on Sketchy and TU-Berlin. Best results are denoted in boldface.

4 Experiment

4.1 Experiment Setup

4.1.1 Dataset

We evaluate our BDT framework and baselines on two large-scale sketch-image datasets: TU-Berlin [13] and Sketchy [51] with extended images obtained from [36].

Sketchy (Extended) [51] originally comprises 75,479 sketches and 12,500 images from 125 categories, where the images and sketches are aligned pairs. Liu et al. [36] extended the image retrieval gallery by collecting an extra 60,502 images, so that the total number of images in extended Sketchy reaches 73,002. Following the standard zero-shot setting in [62], we partition the 125 categories into 104 seen categories and 21 unseen categories according to whether the category appears in the 1,000 categories of ImageNet [10], which avoids violating the zero-shot assumption when utilizing models pre-trained on ImageNet. Previous work uses the training data in two ways: 1) use aligned pairs without extended training images [62], which is referred to as “aligned” in Table 1; 2) do not use the information of aligned pairs but use all training data including extended images [52], which is referred to as “unaligned” in Table 1.

TU-Berlin (Extended) [13] contains 250 categories with a total of 20,000 sketches, extended by [36] with 204,489 natural images based on the sketch categories. Following the same split criterion as for Sketchy, we first split TU-Berlin into 165 seen categories and 85 unseen categories according to whether the category appears in the 1,000 categories of ImageNet [10]. As Shen et al. [52] suggest, we then keep as unseen only those of the 85 categories that have more than 400 images. In the end, there are 186 seen and 64 unseen categories (footnote 1: the detailed category split can be found in the Appendix). Compared with the Sketchy dataset, TU-Berlin is much more challenging because it has more unseen categories and fewer training sketches.

4.1.2 Implementation Details

We implement our method and all baselines in PyTorch [47] and train them on a single GTX 1080Ti GPU. We use VGG-16 (pre-trained on ImageNet) to extract the image and sketch features. As mentioned in Sec. 3.2, we concatenate the outputs of multiple layers, leading to a 5568-dim vector for each image and sketch. For each encoder, we use two fully-connected (FC) layers with Batch Normalization and ReLU activation. For the variational estimator, we use two separate FC layers to obtain the mean and variance of the approximated $\mathbf{z}_{im}^{ap}$. For each decoder, we use two FC layers with ReLU activation. For the discriminators, we use two FC layers with Batch Normalization and LeakyReLU activation. The dimensionality of $\mathbf{f}_{sk}^{st}$, $\mathbf{f}_{im}^{st}$, $\mathbf{f}_{im}^{ap}$, and $\mathbf{z}_{im}^{ap}$ is 1024 in each case.

We use the Adam [28] optimizer with learning rate $2\times10^{-4}$, $\beta_1=0.5$, $\beta_2=0.999$ for the bi-directional translation model, and the SGD optimizer with learning rate $1\times10^{-2}$ and momentum 0.9 for the discriminators. The batch size for Sketchy (resp., TU-Berlin) is 128 (resp., 64) and the maximum number of training epochs is 30. The numbers of iterations for training the generator ($N_G$) and the discriminator ($N_D$) are 100 and 50, respectively. For the Sketchy dataset, we conduct experiments in both the “unaligned” and “aligned” settings (see Table 1), whereas there is only the “unaligned” setting for the TU-Berlin dataset because TU-Berlin does not have aligned pairs. Following [62], we use mean average precision and precision over the top 200 retrievals (mAP@200 and Prec@200) as the evaluation metrics.
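The two metrics can be computed from a ranked list of relevance flags, as in the sketch below. Several conventions for average precision at a cutoff exist; this one normalizes by the number of relevant items found within the top k, which is an assumption and may differ from the exact protocol of [62]. The function names are hypothetical.

```python
def precision_at_k(relevance, k=200):
    """Fraction of relevant items among the top-k results.

    `relevance` holds 0/1 flags for the ranked gallery, best match first.
    """
    top = relevance[:k]
    return sum(top) / len(top)


def average_precision_at_k(relevance, k=200):
    """Mean of precision values at the ranks of relevant items within the top k."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision_at_k(all_relevance, k=200):
    """Average AP@k over all query sketches."""
    return sum(average_precision_at_k(r, k) for r in all_relevance) / len(all_relevance)


# Toy ranking for one query: relevant items at ranks 1 and 3.
ranking = [1, 0, 1, 0]
prec = precision_at_k(ranking, k=4)
ap = average_precision_at_k(ranking, k=4)
```

A flag is 1 when the retrieved image shares the category of the query sketch, matching the category-based evaluation described in the text.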

4.2 Comparison with Existing Methods

We compare our model with 12 prior methods, which can be divided into three categories: sketch-based image retrieval (SBIR) baselines, zero-shot learning (ZSL) baselines, and zero-shot sketch-based image retrieval (ZS-SBIR) baselines. The SBIR baselines include Siamese [48], SaN [64], and 3D shape [58]. A cosine baseline is also added, which conducts nearest neighbor search based on 4096-dim VGG-16 [54] feature vectors. The ZSL baselines include ESZSL [49], SAE [31], CMT [55], SSE [65], and DeViSE [16]. The ZS-SBIR baselines include CVAE [62], SEM-PCYC [12], and Xu et al. [60]. For a fair comparison, we replace the backbone of all previous models with VGG-16, except for SaN, which designs a new backbone to extract sketch and image features. All the backbones are pre-trained on ImageNet. Note that our method does not rely on semantic information obtained from a large textual corpus (e.g., word vectors [44] and WordNet [45]). To make a fair comparison, for those baselines which require additional semantic information, we remove the semantic information [12] or replace the semantic information with the average of image features within each category [58, 49, 31, 55, 65, 16, 60] (footnote 2: the results of ZSIH [52] become much worse after using this strategy, and thus we omit its results in Table 1). In fact, we have tried both the “remove” and “replace” strategies for all these baselines where applicable, and select the better one for each baseline. Besides, we do not compare with the methods that fine-tune the pre-trained backbone during training, like SAKE [37] and EMS [40], because they learn four times more model parameters than ours.

                     Sketchy Ext. (aligned)     Sketchy Ext. (unaligned)
Variant              Prec@200(%)  mAP@200(%)    Prec@200(%)  mAP@200(%)
w/o L_rk             35.1         23.2          31.7         20.3
w/o L_or             40.3         29.1          38.4         26.9
w/o L_re^im          31.7         19.8          32.1         20.5
w/o L_re^sk          39.9         28.3          38.3         27.8
w/o L_ad^im          40.0         28.3          35.5         24.9
w/o L_ad^sk          40.7         29.6          39.3         27.9
alternative L_or     39.1         27.4          37.9         26.6
w/o appearance       37.2         26.0          36.5         25.9
Table 2: Ablation Studies of our method on Sketchy dataset.

Based on Table 1, we can see that most of the SBIR and ZSL baselines underperform the ZS-SBIR baselines. 3D shape [58] and SSE [65] perform even worse than Cosine, which indicates that these methods heavily overfit to the seen categories. On the Sketchy dataset, we observe that the results in the “unaligned” setting are usually better than the corresponding results in the “aligned” setting, mainly because the amount of unaligned data is five times larger than that of aligned data. However, CVAE exhibits the opposite tendency, because the aligned sketch-image pairs help reconstruct images from their paired sketches. On the TU-Berlin dataset, the overall results are worse than those reported in previous works [37, 12] due to different seen/unseen splits. In particular, the number of unseen categories under our split is two times larger than that in [37, 12], and our split criterion also guarantees no information leak from ImageNet to unseen categories.

In terms of Prec@200, our proposed BDT outperforms the state-of-the-art methods by 7.8% on the Sketchy (aligned) dataset, 8.5% on the Sketchy (unaligned) dataset, and 5.2% on the TU-Berlin dataset. To better understand our method, we also list the results obtained by performing retrieval only in the image feature space or only in the structure feature space, as BDT-Im and BDT-St, respectively. The comparison between BDT-Im and CVAE, as well as the comparison between BDT-St and Siamese, shows that the disentangled representations indeed help the model generalize from seen to unseen categories. Besides, by comparing BDT with BDT-Im and BDT-St, we can see that combining the image feature space and the structure feature space boosts the performance by a large margin, which indicates that the two feature spaces are complementary.
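The reported margins can be checked against the first score column of each dataset in Table 1, taking the strongest prior baseline in that column (CVAE on both Sketchy settings, SEM-PCYC on TU-Berlin). The dictionary layout below is only an illustrative way to organize the numbers.

```python
# First-column scores (%) from Table 1: strongest prior baseline vs. BDT.
scores = {
    "Sketchy (aligned)": {"best_baseline": 33.4, "bdt": 41.2},    # CVAE [62]
    "Sketchy (unaligned)": {"best_baseline": 31.2, "bdt": 39.7},  # CVAE [62]
    "TU-Berlin": {"best_baseline": 12.4, "bdt": 17.6},            # SEM-PCYC [12]
}

# Absolute improvement in percentage points, rounded to one decimal place.
gains = {
    name: round(entry["bdt"] - entry["best_baseline"], 1)
    for name, entry in scores.items()
}
```

These are absolute differences in percentage points, not relative improvements.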

Figure 3: (a) The performance variation of our method when setting ω in the range of [0,1], where Sk (a), Sk (u) and TU represent Sketchy (aligned), Sketchy (unaligned) and TU-Berlin, respectively. (b) The performance and orthogonality variation of our method along with the training epoch.
Figure 4: The top-5 images retrieved by BDT, BDT-St, BDT-Im, CVAE methods on Sketchy test set. The green (resp., red) border indicates the correct (resp., incorrect) retrieval results.

4.3 Ablation Study

By taking the Sketchy dataset as an example, we analyze the effect of different loss functions and alternative model designs as well as the effect of ω.

Study on loss terms: We ablate each loss term in (3), (4), (5) and (7), and report the results in Table 2. As expected, the ranking loss and the image reconstruction loss are the most important losses, because these two losses mainly control the image-sketch distance in their corresponding feature spaces. Besides, the image reconstruction loss has a larger impact in the “aligned” setting than in the “unaligned” setting, which implies that the reconstruction loss is sensitive to the pose variance in unaligned data. In contrast, the image adversarial loss has a larger impact in the “unaligned” setting, which shows that the adversarial loss can enhance the robustness of our model in this setting.

Study on alternative model designs: In the last two rows of Table 2, we report the results of two alternative designs: (1) moving the orthogonal loss from (𝐟imap,𝐟imst) to (𝐳imap,𝐟imst); (2) directly translating from the sketch structure feature to the image feature without using the image appearance feature 𝐳imap. The performance drops in both cases, which demonstrates that we have placed the orthogonal loss at the proper position and that the appearance compensation is crucial for generating image features.

Study on retrieval strategy: In Figure 3a, we plot the ω[email protected] curve. It can be seen that our method can generally achieve competitive results when setting ω in a proper range, e.g., [0.4,0.6].
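As an illustration of how a single weight ω can trade off the two retrieval spaces, the following minimal sketch fuses two query-gallery distance matrices with a convex combination and ranks the gallery by the fused distance. The cosine distance, the function names, the random toy features, and the choice that ω weights the structure space (with 1−ω on the image space) are assumptions made for illustration, not the exact formulation of our method.

```python
import numpy as np


def cosine_distance(queries, gallery):
    """Pairwise cosine distance between rows of `queries` and rows of `gallery`."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return 1.0 - q @ g.T


def fused_ranking(d_structure, d_image, omega):
    """Rank gallery items for each query by a weighted sum of two distances.

    Assumed form (hypothetical): omega * d_structure + (1 - omega) * d_image.
    Returns, per query, gallery indices sorted from nearest to farthest.
    """
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    fused = omega * d_structure + (1.0 - omega) * d_image
    return np.argsort(fused, axis=1, kind="stable")


# Toy data: 2 sketch queries and 5 gallery images with random features.
rng = np.random.default_rng(0)
sk_st = rng.normal(size=(2, 8))    # sketch structure features
im_st = rng.normal(size=(5, 8))    # image structure features
sk_im = rng.normal(size=(2, 16))   # sketch features translated to image space
im_im = rng.normal(size=(5, 16))   # image features

d_st = cosine_distance(sk_st, im_st)
d_im = cosine_distance(sk_im, im_im)
ranking = fused_ranking(d_st, d_im, omega=0.5)
print(ranking)
```

Under this assumed form, ω = 1 reduces to retrieval in the structure feature space alone and ω = 0 to retrieval in the image feature space alone, so sweeping ω between the two extremes traces a curve of the kind shown in Figure 3a.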

4.4 Disentanglement Analysis

To demonstrate the ability of our model to disentangle the image features, we first plot the orthogonal loss and [email protected] along with the training epoch in Figure 3b. It can be seen that the orthogonal loss decreases as [email protected] increases, which suggests that our method benefits from the disentanglement of image features.
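To make the notion of an orthogonality measure concrete, the sketch below computes one common penalty between paired structure and appearance features: the mean squared cosine similarity, which is 0 when every pair is orthogonal and 1 when every pair is parallel. This particular form, the function name, and the requirement that both features share the same dimension are assumptions for illustration; it is not claimed to be identical to the orthogonal loss defined earlier in the paper.

```python
import numpy as np


def orthogonality_penalty(f_structure, f_appearance, eps=1e-8):
    """Mean squared cosine similarity between paired feature rows.

    Hypothetical stand-in for an orthogonal loss: lower values mean the
    structure and appearance features are closer to orthogonal.
    """
    s = f_structure / (np.linalg.norm(f_structure, axis=1, keepdims=True) + eps)
    a = f_appearance / (np.linalg.norm(f_appearance, axis=1, keepdims=True) + eps)
    cos = np.sum(s * a, axis=1)  # one cosine value per sample
    return float(np.mean(cos ** 2))


# Perfectly orthogonal pairs give a penalty of zero.
f_st = np.array([[1.0, 0.0], [0.0, 2.0]])
f_ap = np.array([[0.0, 3.0], [4.0, 0.0]])
print(orthogonality_penalty(f_st, f_ap))

# Identical (fully entangled) pairs give a penalty close to one.
print(orthogonality_penalty(f_st, f_st))
```

Tracking such a quantity per epoch next to the retrieval metric gives a plot of the same kind as Figure 3b.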

Figure 5: The t-SNE visualization of six types of features on Sketchy test set. Best viewed in color.

Then, in Figure 5, we visualize six types of features from 10 randomly selected unseen categories using t-SNE [42]: image features, image appearance features, image structure features, sketch features, sketch structure features, and sketch-translated image features. According to Figure 5, we have the following observations: 1) Different categories can be separated very well in “image structure” and “sketch structure”, which significantly facilitates the retrieval in structure feature space; 2) The results in “image structure” and “image appearance” are complementary, in accordance with the disentanglement between structure features and appearance features; 3) The results in “image features” and “translated image features” are similar, which shows the effectiveness of image feature reconstruction; 4) The results in “sketch features” show the relatively poor separability of sketch features, which makes the sketch feature space ill-suited for image retrieval.

4.5 Case Study

In Figure 4, we show the retrieval results of BDT, BDT-St, BDT-Im, and CVAE [62]. One interesting observation is that BDT-St can capture the correspondence of local structure information, while BDT-Im behaves like CVAE and focuses on global pose/structure information. For example, given a “door” sketch, the images retrieved by both CVAE and BDT-Im have a global grid structure similar to the given sketch, but BDT-St can capture the correspondence between the retrieved images and the given sketch w.r.t. certain local structure information like the door-case. One possible explanation is that the structure features are trained by aligning different domains into a shared space, whereas the reconstructed image features are trained by aligning the sketch features to image features, which makes the former more flexible and tolerant to the difference between image and sketch. Moreover, BDT combines the strengths of both BDT-St and BDT-Im, producing better retrieval results.

5 Conclusion

We have studied zero-shot sketch-based image retrieval (ZS-SBIR) from a new viewpoint, i.e., using disentangled representation to facilitate structure-based retrieval. We have proposed our Bi-directional Domain Translation (BDT) framework, which performs retrieval in two feature spaces. Comprehensive experiments on Sketchy (aligned/unaligned) and TU-Berlin datasets have demonstrated the generalization ability of our framework from seen categories to unseen categories.

References

  • [1] X. Cao, H. Zhang, S. Liu, X. Guo, and L. Lin (2013) Sym-fish: a symmetry-aware flip invariant sketch histogram shape descriptor. In ICCV, Cited by: §1.
  • [2] Y. Cao, C. Wang, L. Zhang, and L. Zhang (2011) Edgel index for large-scale sketch-based image search. In CVPR, Cited by: §1.
  • [3] Y. Cao, H. Wang, C. Wang, Z. Li, L. Zhang, and L. Zhang (2010) Mindfinder: interactive sketch-based image search on millions of images. In ACM MM, Cited by: §1.
  • [4] J. Chen and Y. Fang (2018) Deep cross-modality adaptation via semantics preserving adversarial learning for sketch-based 3d shape retrieval. In ECCV, Cited by: §2.1.
  • [5] T. Q. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. In NeurIPS, Cited by: §2.2.
  • [6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, Cited by: §2.2.
  • [7] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, Cited by: §2.3.
  • [8] S. Chopra, R. Hadsell, Y. LeCun, et al. (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR, Cited by: §2.1.
  • [9] A. Del Bimbo and P. Pala (1997) Visual image retrieval by elastic matching of user sketches. IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (2), pp. 121–132. Cited by: §1.
  • [10] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, Cited by: §2.1, §3.2.1, §4.1.1, §4.1.1.
  • [11] E. Dupont (2018) Learning disentangled joint continuous and discrete representations. In NeurIPS, Cited by: §1, §2.2.
  • [12] A. Dutta and Z. Akata (2019) Semantically tied paired cycle consistency for zero-shot sketch-based image retrieval. In CVPR, Cited by: §1, §2.1, Table 1, §4.2, §4.2.
  • [13] M. Eitz, J. Hays, and M. Alexa (2012) How do humans sketch objects? In SIGGRAPH, Cited by: §4.1.1, §4.1.1.
  • [14] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa (2010) An evaluation of descriptors for large-scale image retrieval from sketched feature lines. Computers & Graphics 34 (5), pp. 482–498. Cited by: §1, §2.1.
  • [15] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa (2010) Sketch-based image retrieval: benchmark and bag-of-features descriptors. IEEE transactions on visualization and computer graphics 17 (11), pp. 1624–1636. Cited by: §1, §2.1.
  • [16] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov (2013) Devise: A deep visual-semantic embedding model. In NeurIPS, Cited by: Table 1, §4.2.
  • [17] A. Gonzalez-Garcia, J. van de Weijer, and Y. Bengio (2018) Image-to-image translation for cross-domain disentanglement. In NeurIPS, Cited by: §2.3.
  • [18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, Cited by: §3.3.
  • [19] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, Cited by: §2.2.
  • [20] R. Hu and J. Collomosse (2013) A performance evaluation of gradient field hog descriptor for sketch based image retrieval. Computer Vision and Image Understanding 117 (7), pp. 790–806. Cited by: §1, §2.1.
  • [21] R. Hu, T. Wang, and J. Collomosse (2011) A bag-of-regions approach to sketch-based image retrieval. In ICIP, Cited by: §1, §2.1.
  • [22] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In ECCV, Cited by: §2.3.
  • [23] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, Cited by: §2.3.
  • [24] S. James, M. J. Fonseca, and J. Collomosse Reenact: sketch based choreographic design from archival dance footage. In ICMR, Cited by: §1.
  • [25] A. H. Jha, S. Anand, M. Singh, and V. Veeravasarapu (2018) Disentangling factors of variation with cycle-consistent variational auto-encoders. In ECCV, Cited by: §2.2.
  • [26] H. Kim and A. Mnih (2018) Disentangling by factorising. In ICML, Cited by: §2.2.
  • [27] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim (2017) Learning to discover cross-domain relations with generative adversarial networks. In ICML, Cited by: §2.3.
  • [28] D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. In ICLR, Cited by: §4.1.2.
  • [29] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In ICLR, Cited by: §3.2.3.
  • [30] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling (2014) Semi-supervised learning with deep generative models. In NeurIPS, Cited by: §2.2.
  • [31] E. Kodirov, T. Xiang, and S. Gong (2017) Semantic autoencoder for zero-shot learning. In CVPR, Cited by: Table 1, §4.2.
  • [32] H. Lee, H. Tseng, J. Huang, M. Singh, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. In ECCV, Cited by: §2.3.
  • [33] H. Lee, H. Tseng, Q. Mao, J. Huang, Y. Lu, M. Singh, and M. Yang (2019) DRIT++: Diverse image-to-image translation via disentangled representations. arXiv preprint arXiv:1905.01270. Cited by: §2.3.
  • [34] K. Li, K. Pang, Y. Song, T. Hospedales, H. Zhang, and Y. Hu (2016) Fine-grained sketch-based image retrieval: the role of part-aware attributes. In WACV, Cited by: §1.
  • [35] A. H. Liu, Y. Liu, Y. Yeh, and Y. F. Wang (2018) A unified feature disentangler for multi-domain image translation and manipulation. In NeurIPS, Cited by: §2.2.
  • [36] L. Liu, F. Shen, Y. Shen, X. Liu, and L. Shao (2017) Deep sketch hashing: fast free-hand sketch-based image retrieval. In CVPR, Cited by: §1, §4.1.1, §4.1.1, §4.1.1.
  • [37] Q. Liu, L. Xie, H. Wang, and A. Yuille (2019) Semantic-aware knowledge preservation for zero-shot sketch-based image retrieval. In ICCV, Cited by: §1, §2.1, §4.2, §4.2.
  • [38] Y. Liu, Z. Wang, H. Jin, and I. Wassell (2018) Multi-task adversarial network for disentangled feature learning. In CVPR, Cited by: §2.2.
  • [39] Y. Liu, F. Wei, J. Shao, L. Sheng, J. Yan, and X. Wang (2018) Exploring disentangled feature representation beyond face identification. In CVPR, Cited by: §2.2.
  • [40] P. Lu, G. Huang, Y. Fu, G. Guo, and H. Lin (2018) Learning large euclidean margin for sketch-based image retrieval. arXiv preprint arXiv:1812.04275. Cited by: §1, §4.2.
  • [41] L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele, and M. Fritz (2018) Disentangled person image generation. In CVPR, Cited by: §2.2.
  • [42] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9. Cited by: §4.4.
  • [43] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun (2016) Disentangling factors of variation in deep representation using adversarial training. In NeurIPS, Cited by: §2.2.
  • [44] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NeurIPS, Cited by: §4.2.
  • [45] G. A. Miller (1998) WordNet: An electronic lexical database. Cited by: §4.2.
  • [46] S. Parui and A. Mittal (2014) Similarity-invariant sketch-based image retrieval in large databases. In ECCV, Cited by: §1.
  • [47] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, Cited by: §4.1.2.
  • [48] Y. Qi, Y. Song, H. Zhang, and J. Liu (2016) Sketch-based image retrieval via siamese convolutional neural network. In ICIP, Cited by: §1, §2.1, Table 1, §4.2.
  • [49] B. Romera-Paredes and P. Torr (2015) An embarrassingly simple approach to zero-shot learning. In ICML, Cited by: Table 1, §4.2.
  • [50] J. M. Saavedra, J. M. Barrios, and S. Orand (2015) Sketch based image retrieval using learned keyshapes (LKS). In BMVC, Cited by: §1, §2.1.
  • [51] P. Sangkloy, N. Burnell, C. Ham, and J. Hays (2016) The sketchy database: learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (TOG) 35 (4), pp. 119. Cited by: §1, §2.1, §4.1.1, §4.1.1.
  • [52] Y. Shen, L. Liu, F. Shen, and L. Shao (2018) Zero-shot sketch-image hashing. In CVPR, Cited by: §1, §1, §2.1, §4.1.1, §4.1.1, footnote 2.
  • [53] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras (2017) Neural face editing with intrinsic image disentangling. In CVPR, Cited by: §2.2.
  • [54] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: Figure 2, §4.2.
  • [55] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng (2013) Zero-shot learning through cross-modal transfer. In NeurIPS, Cited by: Table 1, §4.2.
  • [56] J. Song, Q. Yu, Y. Song, T. Xiang, and T. M. Hospedales (2017) Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In CVPR, Cited by: §2.1.
  • [57] L. Tran, X. Yin, and X. Liu (2017) Disentangled representation learning gan for pose-invariant face recognition. In CVPR, Cited by: §2.2.
  • [58] F. Wang, L. Kang, and Y. Li (2015) Sketch-based 3d shape retrieval using convolutional neural networks. In CVPR, Cited by: §1, §1, Table 1, §4.2, §4.2.
  • [59] H. Wang, C. Deng, X. Xu, W. Liu, X. Gao, and D. Tao (2019) Stacked semantic-guided network for zero-shot sketch-based image retrieval. arXiv preprint arXiv:1904.01971. Cited by: §1, §2.1, §3.1, §3.2.1.
  • [60] X. Xu, H. Wang, L. Li, and C. Deng (2019) Semantic adversarial network for zero-shot sketch-based image retrieval. arXiv preprint arXiv:1905.02327. Cited by: §1, §2.1, Table 1, §4.2.
  • [61] X. Yan, J. Yang, K. Sohn, and H. Lee (2016) Attribute2Image: Conditional image generation from visual attributes. In ECCV, Cited by: §2.2.
  • [62] S. K. Yelamarthi, S. K. Reddy, A. Mishra, and A. Mittal (2018) A zero-shot framework for sketch based image retrieval. In ECCV, Cited by: §1, §1, §2.1, §3.1, Table 1, §4.1.1, §4.1.2, §4.2, §4.5.
  • [63] Q. Yu, F. Liu, Y. Song, T. Xiang, T. M. Hospedales, and C. Loy (2016) Sketch me that shoe. In CVPR, Cited by: §1, §1.
  • [64] Q. Yu, Y. Yang, F. Liu, Y. Song, T. Xiang, and T. M. Hospedales (2017) Sketch-a-net: a deep neural network that beats humans. International journal of computer vision 122 (3), pp. 411–425. Cited by: Figure 1, §1, Table 1, §4.2.
  • [65] Z. Zhang and V. Saligrama (2015) Zero-shot learning via semantic similarity embedding. In ICCV, Cited by: Table 1, §4.2, §4.2.
  • [66] S. Zhao, J. Song, and S. Ermon (2017) InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262. Cited by: §2.2.
  • [67] Z. Zheng, X. Yang, Z. Yu, L. Zheng, Y. Yang, and J. Kautz (2019) Joint discriminative and generative learning for person re-identification. In CVPR, Cited by: §2.2.
  • [68] R. Zhou, L. Chen, and L. Zhang (2012) Sketch-based image retrieval on a large scale database. In ACM MM, Cited by: §1.
  • [69] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In CVPR, Cited by: §2.3.
  • [70] J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman (2017) Toward multimodal image-to-image translation. In NeurIPS, Cited by: §2.3.

Appendix

In the following, we provide the training (seen) and testing (unseen) category split on two datasets used in our experiments.

1) Split for Sketchy dataset

  • Training Categories: squirrel, turtle, tiger, bicycle, crocodilian, frog, bread, hedgehog, hot-air_balloon, ape, elephant, geyser, chicken, ray, fan, hotdog, pizza, duck, piano, armor, axe, hammer, camel, horse, spider, kangaroo, mushroom, owl, seal, table, hermit_crab, zebra, car_(sedan), shark, flower, guitar, bench, wine_bottle, fish, snail, deer, knife, airplane, sea_turtle, hat, eyeglasses, parrot, bee, tank, lion, swan, penguin, violin, rabbit, motorcycle, lobster, sheep, snake, shoe, hamburger, teddy_bear, pretzel, alarm_clock, church, ant, trumpet, candle, chair, hourglass, cat, scorpion, bear, dog, beetle, cannon, pig, cup, crab, pickup_truck, pineapple, apple, lizard, sailboat, spoon, umbrella, rocket, teapot, couch, butterfly, blimp, jellyfish, rifle, starfish, banana, wading_bird, bell, pistol, saxophone, strawberry, jack-o-lantern, castle, racket, harp, volcano

  • Test Categories: bat, cabin, cow, dolphin, door, giraffe, helicopter, mouse, pear, raccoon, rhinoceros, saw, scissors, seagull, skyscraper, songbird, sword, tree, wheelchair, windmill, window
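A zero-shot split such as the one above is only valid if no test category also appears among the training categories. The following minimal sketch checks this property; the helper name `check_zero_shot_split` is hypothetical, the test list is the full Sketchy test list given above, and the training list is deliberately abbreviated to a few of the listed training categories to keep the example short.

```python
# Abbreviated subset of the Sketchy training categories listed above
# (not the full list; used here only to keep the example short).
train_subset = ["squirrel", "turtle", "tiger", "bicycle", "frog", "bread"]

# Full Sketchy test (unseen) category list from the split above.
test_categories = [
    "bat", "cabin", "cow", "dolphin", "door", "giraffe", "helicopter",
    "mouse", "pear", "raccoon", "rhinoceros", "saw", "scissors", "seagull",
    "skyscraper", "songbird", "sword", "tree", "wheelchair", "windmill",
    "window",
]


def check_zero_shot_split(seen, unseen):
    """Return (num_seen, num_unseen) if the two category lists are disjoint
    and the unseen list has no duplicates; raise ValueError otherwise."""
    seen_set, unseen_set = set(seen), set(unseen)
    if len(unseen_set) != len(unseen):
        raise ValueError("duplicate category in the unseen list")
    overlap = seen_set & unseen_set
    if overlap:
        raise ValueError(f"categories in both splits: {sorted(overlap)}")
    return len(seen_set), len(unseen_set)


print(check_zero_shot_split(train_subset, test_categories))
```

The same check can be applied to the full training list and to the TU-Berlin split.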

2) Split for TU-Berlin dataset

  • Training Categories: arm, ashtray, axe, baseball bat, blimp, brain, bulldozer, bush, cake, chandelier, cloud, cow, crown, dolphin, donut, dragon, duck, eyeglasses, giraffe, grapes, grenade, head, head-phones, helicopter, horse, lightbulb, megaphone, microscope, mosquito, octopus, paper clip, pear, person walking, pigeon, pipe (for smoking), pumpkin, rainbow, rooster, satellite, satellite dish, scissors, seagull, skateboard, skyscraper, snowboard, stapler, suitcase, sun, sword, tire, toilet, tomato, toothbrush, trousers, walkie talkie, windmill, wrist-watch, carrot, key, palm tree, parrot, rollerblades, suv, tree

  • Test Categories: airplane, alarm clock, angel, ant, apple, armchair, backpack, banana, barn, basket, bathtub, bear (animal), bed, bee, beer-mug, bell, bench, bicycle, binoculars, book, bookshelf, boomerang, bottle opener, bowl, bread, bridge, bus, butterfly, cabinet, cactus, calculator, camel, camera, candle, cannon, canoe, car (sedan), castle, cat, cell phone, chair, church, cigarette, comb, computer monitor, computer-mouse, couch, crab, crane (machine), crocodile, cup, diamond, dog, door, door handle, ear, elephant, envelope, eye, face, fan, feather, fire hydrant, fish, flashlight, floor lamp, flower with stem, flying bird, flying saucer, foot, fork, frog, frying-pan, guitar, hamburger, hammer, hand, harp, hat, hedgehog, helmet, hot air balloon, hot-dog, hourglass, house, human-skeleton, ice-cream-cone, ipod, kangaroo, keyboard, microphone, monkey, moon, motorbike, mouse (animal), mouth, mug, mushroom, nose, owl, panda, parachute, parking meter, pen, penguin, person sitting, piano, pickup truck, pig, pineapple, pizza, potted plant, power outlet, present, pretzel, purse, rabbit, race car, radio, revolver, rifle, sailboat, santa claus, saxophone, scorpion, screwdriver, sea turtle, shark, sheep, ship, shoe, shovel, skull, snail, snake, snowman, socks, space shuttle, speed-boat, spider, sponge bob, spoon, squirrel, standing bird, strawberry, streetlight, submarine, swan, syringe, t-shirt, table, tablelamp, teacup, teapot, teddy-bear, telephone, tennis-racket, tent, tiger, tooth, tractor, traffic light, train, trombone, truck, trumpet, tv, umbrella, van, vase, violin, wheel, wheelbarrow, wine-bottle, wineglass, zebra