Domain-Specific Embedding Network for Zero-Shot Recognition

  • 2019-08-12 14:32:50
  • Shaobo Min, Hantao Yao, Hongtao Xie, Zheng-Jun Zha, Yongdong Zhang

Abstract

Zero-Shot Learning (ZSL) seeks to recognize a sample from either the seen or the unseen domain by projecting the image data and semantic labels into a joint embedding space. However, most existing methods directly adapt a well-trained projection from one domain to another, thereby ignoring the serious bias problem caused by domain differences. To address this issue, we propose a novel Domain-Specific Embedding Network (DSEN) that can apply specific projections to different domains for unbiased embedding, as well as several domain constraints. In contrast to previous methods, the DSEN decomposes the domain-shared projection function into one domain-invariant and two domain-specific sub-functions to explore the similarities and differences between two domains. To prevent the two specific projections from breaking the semantic relationship, a semantic reconstruction constraint is proposed by applying the same decoder function to them in a cycle-consistent way. Furthermore, a domain division constraint is developed to directly penalize the margin between real and pseudo image features in the respective seen and unseen domains, which can enlarge the inter-domain difference of visual features. Extensive experiments on four public benchmarks demonstrate the effectiveness of DSEN with an average improvement of 9.2% in terms of harmonic mean. The code is available at https://github.com/mboboGO/DSEN-for-GZSL.

 


  • Shaobo Min, University of Science and Technology of China
  • Hantao Yao, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
  • Hongtao Xie, University of Science and Technology of China
  • Zheng-Jun Zha, University of Science and Technology of China
  • Yongdong Zhang, University of Science and Technology of China

zero-shot learning, categorization, joint embedding, neural networks
journalyear: 2019
conference: Proceedings of the 27th ACM International Conference on Multimedia; October 21–25, 2019; Nice, France
booktitle: Proceedings of the 27th ACM International Conference on Multimedia (MM ’19), October 21–25, 2019, Nice, France
price: 15.00
doi: 10.1145/3343031.3351092
isbn: 978-1-4503-6889-6/19/10
ccs: Computing methodologies, Object recognition
ccs: Computing methodologies, Neural networks
ccs: Computing methodologies, Image representations
ccs: Computing methodologies, Learning latent representations
Figure 1. A diagram of generalized zero-shot recognition, which associates images and semantic labels in a joint embedding space.

1. Introduction

Traditional recognition tasks have progressed with the help of massive labeled images and deep models (Simonyan and Zisserman, 2014; He et al., 2016; He and Peng, 2018; Fang et al., 2018; Zheng et al., 2018; Wang et al., 2018; Xie et al., 2019; Wang et al., 2019). However, their major disadvantage is that they cannot recognize images belonging to unseen categories, and it is laborious to collect sufficient labeled images for various tasks. To tackle this problem, generalized Zero-Shot Learning (ZSL) (Palatucci et al., 2009; Akata et al., 2013; Lampert et al., 2014; Romera-Paredes and Torr, 2015; Yang et al., 2016; Morgado and Vasconcelos, 2017; Xian et al., 2018a; Long et al., 2018) has attracted a lot of attention in recent years. Generalized zero-shot recognition is defined as recognizing a sample from either the seen or the unseen domain, where the two domains contain disjoint categories. A general paradigm is to project the image data and semantic labels, e.g., category attributes (Farhadi et al., 2009; Lampert et al., 2009), into a joint embedding space, where recognition becomes a nearest neighbor search problem (Song et al., 2018). A visual diagram is shown in Figure 1. The major challenge of this paradigm is that the different data distributions of the two domains lead to serious domain shift problems (Fu et al., 2014; Kodirov et al., 2015), which bias the embedding features towards the seen domain.

Figure 2. Comparison of DSEN with related GZSL paradigms. a) The embedding space is spanned by visual features. b) Preserving the category relationship in the embedding space. c) The proposed DSEN, which introduces two extra projections ϕs and ϕt to model domain-specific knowledge and a domain division constraint Lddc to enlarge the inter-domain difference.

To address this issue, existing methods focus on learning a robust projection between visual representations and semantic labels. The related methods can be coarsely classified into two classes: the embedding-based framework and the semantic-preserving framework. The embedding-based framework (Tomasev et al., 2014; Romera-Paredes and Torr, 2015; Akata et al., 2016; Jiang et al., 2017; Kumar Verma et al., 2018) aims to establish a discriminative embedding space that is shared across two domains. The two commonly used embedding spaces are spanned by visual representations (Zhang et al., 2017) and semantic labels (Kumar Verma et al., 2018), respectively. Taking Figure 2 (a) as an example, the semantic labels are projected into the visual space to match the corresponding visual representations, which has been shown to be robust to Hubness problems (Tomasev et al., 2014; Lazaridou et al., 2015). Different from the embedding-based framework, the semantic-preserving framework (Kodirov et al., 2017; Annadani and Biswas, 2018; Chen et al., 2018; Xian et al., 2018a) focuses on preserving the semantic prototype in an embedding space through an auto-encoder architecture. An example is shown in Figure 2 (b). Although the above methods are effective for zero-shot problems, they all employ a single shared projection for both seen and unseen domains. Due to the broad domain gap, a shared projection function cannot model the full specialty of each domain, which leads to biased recognition.

In this paper, we propose a novel Domain-Specific Embedding Network (DSEN) to alleviate the domain shift problem in ZSL by applying specific projections to different domains, as well as several domain constraints. The novelties of DSEN over the previous methods are shown in Figure 2 (c). Instead of a single shared projection, DSEN decomposes the projection function into three components: a domain-invariant projection ϕc, a seen domain-specific projection ϕs, and an unseen domain-specific projection ϕt. ϕc aims to capture the projection knowledge common to the two domains, while ϕs and ϕt capture the domain-specific projection knowledge. Notably, ϕs and ϕt should project the semantic labels of different domains into a shared embedding space E in Figure 2 (c) for cross-domain recognition. To this end, a semantic reconstruction constraint is designed, by applying the same decoder function to both ϕs and ϕt in a cycle-consistent way, to preserve the shared semantic relationship in E. Compared to using a single shared projection, our domain-specific projections can generate less biased embedding features because they model domain specialty.

Furthermore, a domain division constraint is developed to enlarge both intra- and inter-domain discrimination of visual features, based on pseudo visual data in the unseen domain. Besides fully supervised learning in the seen domain, our domain division constraint restricts the noisy pseudo data to have a uniform label distribution over the seen categories. The advantages of this constraint mechanism are as follows: a) it is robust to the noisy pseudo visual features; b) it directly enlarges the visual margin between two domains; and c) it can be trained end-to-end. Consequently, the decision boundary of visual features between two domains becomes clearer, which allows us to apply a specific classifier within a determinate search space for unbiased recognition.

Our contributions are threefold:

  • We propose a novel Domain-Specific Embedding Network (DSEN) by applying specific projections to two domains, which can better capture domain similarities and differences for unbiased embedding.

  • A domain division constraint is designed to effectively enhance both intra- and inter-domain discrimination, based on real and pseudo visual data in two domains. Besides, it also enables DSEN to be trained end-to-end.

  • The proposed DSEN obtains state-of-the-art performance on four public datasets with an average improvement of 9.2% in terms of harmonic mean.

2. Related Work

Three types of related techniques are discussed in this section.

2.1. Embedding-based Zero-Shot Learning

A general paradigm of zero-shot recognition is to project the image representations and semantic labels into a joint embedding space, where recognition becomes a nearest neighbor search problem (Song et al., 2018). This approach is called an embedding-based method, and it is one of the most popular ZSL strategies. As the seen and unseen domains have disjoint categories, additional semantic information, such as attributes (Farhadi et al., 2009; Lampert et al., 2009) and word vectors (Pennington et al., 2014; Niu et al., 2017), is used to construct a relationship between these two domains. Among these methods, Frome et al. (Frome et al., 2013) and Akata et al. (Akata et al., 2016) use a bilinear embedding model trained with a pairwise ranking loss. The ESZSL model (Romera-Paredes and Torr, 2015) constructs an embedding space with a Frobenius norm regularization, and Qiao et al. (Qiao et al., 2016) extend this work to online documents by suppressing the noise with an extra l1,2 norm. In addition, Akata et al. (Akata et al., 2015) build a joint embedding space with several compatibility functions, which is improved in (Xian et al., 2016) by incorporating latent variables. Zhang et al. (Zhang and Koniusz, 2018) employ a non-linear kernel to generate a mapping between visual representations and attributes. In spite of their promising performance, the above methods directly project the visual representations into the space spanned by semantic labels, and therefore suffer from Hubness problems (Radovanović et al., 2010; Tomasev et al., 2014; Lazaridou et al., 2015). The Hubness problem is defined as a few points being the nearest neighbors of most of the other points. It arises because projecting a high-dimensional visual feature into a low-dimensional attribute space shrinks the variance of the projected data points (Zhang et al., 2017). Therefore, a few methods (Shigeto et al., 2015; Zhang et al., 2017; Xian et al., 2018a) use an embedding space spanned by visual features, which is defined as a semantic-visual embedding. Although the previous methods are effective, insufficient semantic embedding limits their further application due to serious domain shift problems. For example, a testing sample from the unseen domain tends to be recognized as one of the seen categories.

2.2. Semantic-Preserving Framework

To alleviate the domain shift problems, many recent works aim to preserve the semantic prototype in an embedding space. The motivation behind this strategy is that the semantic prototype is robust to domain change, which is beneficial for training a robust projection function. Among these methods, SAE (Kodirov et al., 2017) and SP_AEN (Chen et al., 2018) use an auto-encoder architecture on the embedding space to make their embedding features discriminative. Jiang et al. (Jiang et al., 2018) propose a coupled dictionary learning model to preserve the visual-semantic structures with semantic prototypes. In particular, Annadani et al. (Annadani and Biswas, 2018) preserve the semantic relationships in visual space by decomposing the relation between categories into three groups. Although the embedding-based methods and the semantic-preserving framework can effectively solve ZSL problems, they mainly rely on a single shared projection across two domains, which ignores the large domain differences.

2.3. Synthetic Data-Based Methods

Recently, synthetic data-based methods (Bucher et al., 2017; Mishra et al., 2018; Long et al., 2018; Kumar Verma et al., 2018; Xian et al., 2018b) have been proposed, and they have obtained state-of-the-art performance. In contrast to embedding-based methods, they train a softmax classifier with full supervision on the union of real and synthetic visual data from seen and unseen domains. The synthetic visual data is obtained by a specific generator, such as a GAN (Goodfellow et al., 2014) or one of its variants, conditioned on the unseen domain attributes. Because they are trained on both seen and unseen visual features, their models are more robust to domain shifts than embedding-based methods. However, fully-supervised learning is sensitive to noisy synthetic visual data, an issue that has not been fully explored.

Figure 3. The pipeline of training the Domain-Specific Embedding Network. Besides the domain-shared projection ϕc, DSEN trains two extra domain-specific projections ϕs and ϕt to better capture domain specialties. Furthermore, the domain division constraint Lddc makes the visual embedding features in two domains distinguishable. The whole network is trained end-to-end.

3. Domain-Specific Embedding Network

We first describe the problem formulation of the Domain-Specific Embedding Network in Sec. 3.1 and then provide a detailed implementation of DSEN. The pipeline is shown in Figure 3.

3.1. Problem Formulation

Let 𝒮 = {(xs, ys, 𝒂s) | xs ∈ 𝒳s, ys ∈ 𝒴s, 𝒂s ∈ 𝒜s} represent the seen domain dataset, where ys and 𝒂s are the class label and semantic attributes of each image xs, respectively. 𝒯 = {(xt, yt, 𝒂t) | xt ∈ 𝒳t, yt ∈ 𝒴t, 𝒂t ∈ 𝒜t} is similarly defined as the unseen domain dataset, where 𝒴s ∩ 𝒴t = ∅. Given the seen domain data 𝒮 and unseen domain labels 𝒴t with attributes 𝒜t, the target of a generalized ZSL task is to recognize an image from either 𝒳s or 𝒳t.

Based on the above definition, a basic objective for our DSEN is to:

(1) $\min_{W_\phi} \sum_{x_s \in \mathcal{X}_s} d\big(f(x_s), \phi(\boldsymbol{a}_s)\big)$,

where f(·) is the visual feature extractor for images, and ϕ(·) is a semantic-visual projection function with trainable weights Wϕ. Notably, ϕ(𝒂s) is the semantic embedding. The distance function d(·,·) computes the negative cosine similarity between two features 𝒗1 and 𝒗2 by:

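(2) $d(\boldsymbol{v}_1, \boldsymbol{v}_2) = -\frac{\langle \boldsymbol{v}_1, \boldsymbol{v}_2 \rangle}{\|\boldsymbol{v}_1\|_2 \, \|\boldsymbol{v}_2\|_2}$.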
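As a minimal sketch of Eq. (2) and of the nearest neighbor search it supports, the snippet below uses plain Python lists as feature vectors. The helper names (`neg_cosine_distance`, `nearest_class`) and the toy embeddings are assumptions made for illustration; they are not taken from the paper or its released code.

```python
import math

def neg_cosine_distance(v1, v2):
    """Eq. (2): negative cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return -dot / (norm1 * norm2)

def nearest_class(visual_feature, class_embeddings):
    """Return the class whose semantic embedding minimizes d(f(x), phi(a))."""
    return min(
        class_embeddings,
        key=lambda c: neg_cosine_distance(visual_feature, class_embeddings[c]),
    )

# Toy embeddings (hypothetical values, not from the paper).
embeddings = {"sparrow": [1.0, 0.1], "zebra": [0.0, 1.0]}
print(nearest_class([0.9, 0.2], embeddings))
```

Because d is the negated cosine similarity, minimizing d is the same as picking the embedding with the smallest angle to the visual feature.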

In most existing methods (Shigeto et al., 2015; Zhang et al., 2017; Annadani and Biswas, 2018), ϕ is trained on 𝒮 and directly adapted to 𝒯, and f(·) is kept fixed by using a pre-trained visual feature extractor.

3.2. Domain-Specific Projections

One leading cause of the domain shift problem is that a shared ϕc cannot model the full differences between two domains, which biases the generated embedding features towards the seen domain. To model the differences between two domains, we decompose the projection function into three parts: the domain-invariant projection ϕc, the seen domain-specific projection ϕs, and the unseen domain-specific projection ϕt. Thus, the embedding features from the seen and unseen domains become the combination of two sub-features:

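(3) $\phi(\boldsymbol{a}) = \begin{cases} \phi_s(\boldsymbol{a}) + \phi_c(\boldsymbol{a}) & \text{if } \boldsymbol{a} \in \mathcal{A}_s, \\ \phi_t(\boldsymbol{a}) + \phi_c(\boldsymbol{a}) & \text{if } \boldsymbol{a} \in \mathcal{A}_t, \end{cases}$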

where ϕc is used to capture the common knowledge between two domains, and ϕs and ϕt capture the specific characteristics for seen and unseen domains, respectively. Compared to existing methods that use a single shared projection, the additional domain-specific projections can better accommodate domain differences, yielding more discriminative embedding features. By taking the ϕc and ϕs into consideration, Eq. (1) becomes minimizing:

(4) $\mathcal{L}_{svs} = \sum_{x_s \in \mathcal{X}_s} d\big(f(x_s), \phi_s(\boldsymbol{a}_s) + \phi_c(\boldsymbol{a}_s)\big)$.
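The composition in Eq. (3) can be sketched as follows. The projections here are toy linear maps with hypothetical hand-picked weights, chosen only to make the arithmetic easy to follow; in DSEN each projection is a small trainable network, and the `linear` helper and the `domain` string argument are assumptions of this sketch.

```python
def linear(weights):
    """Build a toy projection a -> W a from a list of weight rows."""
    return lambda a: [sum(w * x for w, x in zip(row, a)) for row in weights]

# Hypothetical weights, for illustration only.
phi_c = linear([[1.0, 0.0], [0.0, 1.0]])  # domain-invariant projection
phi_s = linear([[0.5, 0.0], [0.0, 0.0]])  # seen domain-specific projection
phi_t = linear([[0.0, 0.0], [0.0, 0.5]])  # unseen domain-specific projection

def phi(a, domain):
    """Eq. (3): add the domain-specific output to the shared output."""
    specific = phi_s if domain == "seen" else phi_t
    return [u + v for u, v in zip(specific(a), phi_c(a))]

print(phi([2.0, 4.0], "seen"))    # shared part plus seen-specific part
print(phi([2.0, 4.0], "unseen"))  # shared part plus unseen-specific part
```

The same attribute vector lands at different points of the embedding space depending on its domain, while the shared term keeps the two embeddings related.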

Unlike ϕs, ϕt is difficult to train because f(xt) is unavailable. Specifically, it is hard to constrain 𝒂s and 𝒂t to be projected into a shared embedding space using different projections ϕs and ϕt for cross-domain recognition. To achieve this goal, a semantic reconstruction constraint Lsr is designed by applying the same decoder function to both ϕs+ϕc and ϕt+ϕc for semantic label reconstruction. The motivation is that the semantic labels are shared across two domains; thus, Lsr can constrain the semantic embeddings from the two specific projections to be associated in a shared embedding space. First, ϕt is initialized from the well-trained ϕs for projection knowledge transfer. Then, the semantic information in 𝒂s and 𝒂t is simultaneously encoded into ϕs and ϕt in a semantic cycle-consistent way. Notably, initializing ϕt with ϕs facilitates convergence. Consequently, Lsr enables ϕt to capture effective projection knowledge in the unseen domain based on 𝒜t, as described below.

Inspired by the applications of auto-encoding architecture in unsupervised representation learning, a domain-specific auto-encoder architecture is used to encode the semantic information in both ϕt and ϕs by:

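(5) $\mathcal{L}_{sr} = \sum_{\boldsymbol{a}_s \in \mathcal{A}_s} \big\| \phi_{sr}\big(\phi_s(\boldsymbol{a}_s) + \phi_c(\boldsymbol{a}_s)\big) - \boldsymbol{a}_s \big\|_2^2 + \sum_{\boldsymbol{a}_t \in \mathcal{A}_t} \big\| \phi_{sr}\big(\phi_t(\boldsymbol{a}_t) + \phi_c(\boldsymbol{a}_t)\big) - \boldsymbol{a}_t \big\|_2^2$,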

where ϕsr is a decoder function shared by both domains. From Eq. (5), ϕc has access to the semantic information in both domains, so it can capture the domain similarity information. ϕs and ϕt only have access to domain-specific information, which drives them to capture the specific characteristics of their respective domains.
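A minimal sketch of the reconstruction term in Eq. (5) is given below, assuming the projections and the decoder are passed in as plain Python callables on lists. The function names and the toy callables in the usage lines are hypothetical; the point is only that one decoder is applied to the embeddings of both domains.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def reconstruction_loss(seen_attrs, unseen_attrs, phi_c, phi_s, phi_t, decoder):
    """Eq. (5): decode both domains' embeddings with the same decoder
    and compare each reconstruction with its input attribute vector."""
    def add(u, v):
        return [x + y for x, y in zip(u, v)]

    loss = 0.0
    for a in seen_attrs:
        loss += squared_l2(decoder(add(phi_s(a), phi_c(a))), a)
    for a in unseen_attrs:
        loss += squared_l2(decoder(add(phi_t(a), phi_c(a))), a)
    return loss

# Toy callables (hypothetical): identity shared projection and decoder,
# a zero seen-specific projection, and an identity unseen-specific projection.
identity = lambda a: list(a)
zero = lambda a: [0.0 for _ in a]
loss = reconstruction_loss([[1.0, 2.0]], [[1.0, 2.0]], identity, zero, identity, identity)
print(loss)
```

In this toy setting the seen embedding decodes back to its attributes exactly, while the unseen embedding is doubled, so only the unseen term contributes to the loss.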

Finally, the objective function for domain-specific projections becomes:

(6) $\min_{W_{\phi_c}, W_{\phi_s}, W_{\phi_t}, W_{\phi_{sr}}} \mathcal{L}_{svs} + \lambda_1 \mathcal{L}_{sr}$,

where λ1 is a hyper-parameter used to balance the two constraints. The encoders ϕc, ϕs, and ϕt and the decoder ϕsr are each implemented with two fully connected layers followed by ReLU activation.

Consequently, the domain-specific projections ϕs and ϕt assist ϕc in generating less biased embedding features during semantic-visual projection. The detailed architecture of our domain-specific projections is shown in Figure 3.

3.3. Domain Division Constraint

Based on the embedding features from semantic attributes, we further propose a domain division constraint to make the embedding features between two domains distinguishable.

To achieve this goal, we first generate pseudo visual features from category attributes 𝒂t for the unseen domain. Specifically, we regard ϕ(𝒂t) as pseudo visual features, because ϕ(𝒂t) and f(xt) have similar distributions under well-trained semantic-visual projections. With the real visual features f(xs) and pseudo visual features ϕ(𝒂t), it is intuitive to train a |𝒴s ∪ 𝒴t|-way softmax classifier, which can recognize visual samples from either the seen or the unseen domain. However, ϕ(𝒂t) is usually too noisy for fully supervised learning, which may deteriorate the model performance in the seen domain. Therefore, the domain division constraint only constrains the noisy pseudo features ϕ(𝒂t) to be far away from the seen categories, because 𝒴s and 𝒴t are disjoint. Thus, a |𝒴s|-way softmax classifier p is trained by minimizing:

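(7) $\mathcal{L}_{ddc} = -\sum_{x_s \in \mathcal{X}_s} \ln p_{y^*}\big(f(x_s)\big) + \alpha \sum_{\boldsymbol{a}_t \in \mathcal{A}_t} \ln \hat{p}\big(\phi(\boldsymbol{a}_t)\big)$,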

where p_{y*}(·) is the classification score for the ground truth label y*, and p̂(·) is the maximum classification score over 𝒴s. The first term in Eq. (7) is a general cross-entropy softmax loss in the seen domain. The second term pushes ϕ(𝒂t) towards a uniform label distribution over 𝒴s, which means that ϕ(𝒂t) should not be recognized as a seen category.

Table 1. The details of the experimental datasets. |𝒴s| and |𝒴t| indicate the numbers of classes in the seen and unseen domains. The train/val/test columns indicate the number of images in the respective split.
Datasets  Attributes  |𝒴s|  |𝒴t|  train   val    test
CUB       312         150   50    7,057   1,764  2,967
SUN       102         645   72    10,320  2,580  1,440
AWA2      85          40    10    23,527  5,882  7,913
aPY       64          20    12    5,932   1,483  7,924

α is a hyper-parameter that is used to balance the training effects between real and pseudo visual features on classifier p.
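The two terms of Eq. (7) can be sketched on raw classifier logits as below. The function names and the list-of-logits interface are assumptions of this sketch, and α defaults to the 0.1 reported in the implementation details. Since the maximum softmax score over K seen classes is at least 1/K, the second term is smallest when the pseudo feature receives a uniform score distribution.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ddc_loss(seen_logits, seen_labels, pseudo_logits, alpha=0.1):
    """Eq. (7): cross-entropy on real seen features plus
    alpha * ln(max seen-class score) on pseudo unseen features."""
    cross_entropy = -sum(
        math.log(softmax(z)[y]) for z, y in zip(seen_logits, seen_labels)
    )
    penalty = sum(math.log(max(softmax(z))) for z in pseudo_logits)
    return cross_entropy + alpha * penalty

# Hypothetical logits over three seen classes.
seen = [[5.0, 0.0, 0.0]]
flat_pseudo = [[0.0, 0.0, 0.0]]     # uniform scores over seen classes
peaked_pseudo = [[10.0, 0.0, 0.0]]  # confidently mistaken for a seen class
print(ddc_loss(seen, [0], flat_pseudo) < ddc_loss(seen, [0], peaked_pseudo))
```

The comparison shows the intended behavior: a pseudo unseen feature that looks like a seen class is penalized more than one with a flat score distribution.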

Based on Eq. (7), the decision boundary between f(xs) and f(xt) can be determined by judging whether the label distribution of an input sample is smooth over 𝒴s. For an input image from the seen domain, p̂(f(x)) should be large and attained at the true label. Conversely, for an image from the unseen domain, p̂(f(x)) should be small, indicating a near-uniform label distribution. Furthermore, since the well-trained classifier p can only categorize within the seen domain, we apply a ranking-based classifier, built on nearest neighbor search, to those samples that are suspected to come from the unseen domain. Thus, the final inference of our DSEN can be expressed by:

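(8) $\hat{y} = \begin{cases} \arg\max_{y \in \mathcal{Y}_s} p_y\big(f(x)\big) & \text{if } \hat{p}\big(f(x)\big) > \tau, \\ \arg\min_{y \in \mathcal{Y}_t} d\big(f(x), \phi(\boldsymbol{a}_t)\big) & \text{otherwise}, \end{cases}$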

where x ∈ 𝒳s ∪ 𝒳t, and ŷ is the final prediction. τ is a threshold used to determine the domain of an input sample.

With Eq. (8), we can divide the search space for any sample into two sub-spaces. When a sample is judged to come from the unseen domain, the ranking-based classifier is used for recognition, and its search space is reduced to the unseen categories. For samples from the seen domain, the softmax classifier p directly gives confident category predictions. This reduction of the search space, which is made possible by our domain division constraint Lddc, improves recognition in both domains.
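The two-branch inference of Eq. (8) can be sketched as follows. The dictionary-based interface, the class names, and the toy scores are assumptions made for illustration: `seen_probs` stands in for the softmax output p_y(f(x)) and `unseen_embeddings` for the embeddings ϕ(𝒂t) of the unseen classes.

```python
import math

def neg_cosine_distance(u, v):
    """Negative cosine similarity, as in Eq. (2)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return -dot / (norm_u * norm_v)

def predict(seen_probs, feature, unseen_embeddings, tau):
    """Eq. (8): choose the branch by comparing the maximum seen-class
    score with the threshold tau, then classify within that domain."""
    p_hat = max(seen_probs.values())
    if p_hat > tau:
        # Confident seen-domain sample: use the softmax classifier.
        return max(seen_probs, key=seen_probs.get)
    # Otherwise search only among unseen classes by nearest neighbor.
    return min(
        unseen_embeddings,
        key=lambda c: neg_cosine_distance(feature, unseen_embeddings[c]),
    )

# Hypothetical toy inputs.
unseen = {"okapi": [1.0, 0.0], "quokka": [0.0, 1.0]}
confident = {"cat": 0.95, "dog": 0.05}
flat = {"cat": 0.55, "dog": 0.45}
print(predict(confident, [0.2, 0.9], unseen, tau=0.8))
print(predict(flat, [0.2, 0.9], unseen, tau=0.8))
```

Lowering τ routes more samples to the softmax branch and raising it routes more to the nearest neighbor branch, which is the trade-off studied in the τ ablation.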

3.4. Overall Objective

Finally, the overall objective function of DSEN becomes:

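(9) $\min_{W_{\phi_c}, W_{\phi_s}, W_{\phi_t}, W_{\phi_{sr}}, W_f} \mathcal{L}_{svs} + \lambda_1 \mathcal{L}_{sr} + \lambda_2 \mathcal{L}_{ddc}$,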

where Wf denotes the trainable parameters of the visual feature extraction function f(·), and λ1 and λ2 balance the different constraints. Notably, in many existing ZSL methods (Jiang et al., 2018; Chen et al., 2018; Xian et al., 2018b), f(·) is fixed across different datasets, leading to a weak visual representation f(x). Instead, Lddc enables DSEN to be trained end-to-end with a trainable f(·). Therefore, both the visual representations and the embedding features from our DSEN are discriminative.

4. Experiments

In this section, experimental analysis on four benchmarks is given to evaluate the proposed DSEN.

4.1. Experimental Settings

Datasets. We evaluate the proposed method on four widely used benchmarks: Caltech-UCSD Birds-200-2011 (CUB) (Welinder et al., 2010), SUN (Patterson and Hays, 2012), Animals with Attributes 2 (AwA2) (Xian et al., 2018a), and Attribute Pascal and Yahoo (aPY) (Farhadi et al., 2009). All the datasets provide annotated attributes. The newly proposed splits of seen/unseen classes in (Xian et al., 2018a) are used for fair comparison; they ensure that the test categories are strictly unseen by the pretrained visual projection network (Russakovsky et al., 2015). The details of the datasets are listed in Table 1.

Table 2. The detailed implementations of three baselines.
Setting Lsvs Lsr Lddc
S2V
DSP
DDC
DSEN (DSP+DDC)

Implementation details. The input images are resized to 480 along the short side, with data augmentation of 448×448 random cropping and horizontal flipping. The visual feature extraction network f(·) is based on the ResNet-101 architecture, which is pre-trained on the ImageNet dataset. The rest of the networks use the MSRA random initializer (He et al., 2016). In this work, we employ a two-stage training strategy to train the proposed DSEN. The first stage fixes f(·) and trains the rest with a large learning rate lr = 1×10^-3, and the second stage uses a small lr = 1×10^-5 to train the whole DSEN. The Adam optimizer is used with β = (0.5, 0.999) and weight decay 5×10^-5. For the hyper-parameters in DSEN, we set λ1 = 5 and λ2 = 1 to balance Lsvs, Lsr, and Lddc, and α = 0.1 in Lddc. The above hyper-parameter settings are determined empirically, and they are used for all of our experimental datasets. τ will be analyzed in the ablation study.

Evaluation metrics. Similar to (Xian et al., 2018a), the harmonic mean (H), defined in Eq. (10), is used to evaluate a model:

(10) $H = \frac{2 \times MCA_t \times MCA_s}{MCA_t + MCA_s}$,

where MCAs and MCAt are the Mean Class top-1 Accuracy for the validation (seen) and testing (unseen) sets, respectively.
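As a quick worked instance of Eq. (10), the snippet below recomputes H for the CMT row on CUB in Table 4 (MCAt = 7.2, MCAs = 49.8, reported H = 12.6). The function name and the zero-denominator guard are assumptions of this sketch.

```python
def harmonic_mean(mca_t, mca_s):
    """Eq. (10): harmonic mean of unseen (MCAt) and seen (MCAs) accuracy."""
    if mca_t + mca_s == 0:
        return 0.0  # guard against division by zero when both are zero
    return 2 * mca_t * mca_s / (mca_t + mca_s)

# CMT on CUB from Table 4: MCAt = 7.2, MCAs = 49.8.
print(round(harmonic_mean(7.2, 49.8), 1))
```

The harmonic mean stays close to the smaller of the two accuracies, so a model cannot score well on H by being accurate on the seen domain alone.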

In the following parts, the experiments are mainly conducted under generalized ZSL settings, where the testing images come from either the seen or unseen domain.

Baselines. To demonstrate the effectiveness of different components in DSEN, three baselines are defined:

  • S2V is a general semantic-visual structure with a shared projection function ϕc. The visual feature extraction function f(·) is fixed.

  • DSP adds two extra domain-specific projections ϕs and ϕt to S2V.

  • DDC applies the domain division constraint Lddc to S2V, which makes the visual feature extractor f(·) trainable.

Finally, DSEN uses both the domain-specific projections and Lddc with a trainable f(·). The details of each baseline are listed in Table 2.

Figure 4. Distributions of maximum classification score on four datasets. The vertical axis indicates the percentage of unseen domain samples. FSL and DDC represent the fully-supervised learning and our domain division constraint, respectively.
Table 3. The effects of each domain-specific projection on CUB.
Baseline  ϕc  ϕs  ϕt  MCAt  MCAs
S2V       ✓           25.6  56.6
          ✓   ✓       27.5  61.9
          ✓       ✓   28.3  57.4
              ✓   ✓   29.7  60.1
          ✓   ✓   ✓   30.8  62.7

4.2. Ablation Studies

Effects of ϕc, ϕs, and ϕt. As the domain-specific projections consist of one domain-shared ϕc and two domain-specific ϕs and ϕt, we explore their effects by individually applying them to the baseline S2V. Table 3 shows the results. Adding ϕs or ϕt individually to ϕc yields improvements of 5.3% in MCAs and 2.7% in MCAt, respectively. This shows that the domain-specific projections effectively capture their characteristic domain information via the semantic reconstruction constraint Lsr. Then, we further explore the effect of using totally separated ϕs and ϕt without ϕc. The results show only a slight improvement of 1.5% on MCAt. The reason is that the connection between ϕs and ϕt is too weak to guarantee that the projected embedding features lie in the same embedding space. Finally, combining ϕs, ϕt, and ϕc offers the best performance, which indicates that ϕc successfully captures the domain similarities during semantic-visual projection. These experiments demonstrate the effectiveness of domain-specific projections in generating discriminative embedding features. In addition, random initialization of ϕt yields relatively slow convergence.

Effects of domain-specific projections. As the domain-specific projections play a critical role in the proposed DSEN, we analyze the effect of applying domain-specific projections to different baselines. The related results are summarized in Table 4. From Table 4, we observe that applying domain-specific projections achieves better performance than using a single shared projection on all datasets, e.g., DSP and DSEN achieve 6.0% and 1.9% improvements over the S2V and DDC baselines in terms of H on CUB, respectively. Table 4 further shows that the domain-specific projections improve the recognition performance on both seen and unseen domains by about 2% to 6% on CUB. These results demonstrate the effectiveness of the proposed domain-specific projections.

Figure 5. The performance of DSEN with varying τ on different datasets.

Comparison of fully-supervised learning and Lddc. We further analyze the superiority of our domain division constraint Lddc over fully-supervised learning. The analysis is performed by using noisy pseudo visual data for supervised training. Given a ZSL model, we denote p̂(f(x)) as the maximum score of an image among the seen categories. Thus, for an image from the seen domain, p̂(f(x)) should be large and attained at the true label. Conversely, for an image from the unseen domain, p̂(f(x)) should be small, indicating a near-uniform label distribution. To this end, we compare p̂(f(x)) for all unseen domain samples by individually applying fully-supervised learning and our Lddc to the baseline DSP with noisy pseudo data. The results are reported in Figure 4.

From Figure 4, it can be observed that, compared to fully supervised learning, Lddc raises the percentage of samples with a small p̂(f(x)) < 0.5 from approximately 50% to 70% on CUB, 24% to 42% on AWA2, and 30% to 50% on aPY. On SUN, the percentage of samples with p̂(f(x)) < 0.3 rises from 22% to 43%. Notably, more samples with a small p̂(f(x)) in Figure 4 means that more unseen domain samples can be distinguished from the seen domain samples. Therefore, compared to fully supervised learning, Lddc makes the embedding features of the two domains more distinguishable, which accounts for a large part of our performance. Furthermore, it also shows that the softmax classifier p can model the decision boundary between the two domains. Thus, using domain-specific classifiers is reasonable.

Effects of varying τ values. τ is a critical parameter for judging whether a testing sample is from the unseen domain. The results of varying τ are shown in Figure 5. For the seen domain samples, a higher τ leads to a lower MCAs. The reason is that a higher τ may mistakenly send some seen domain samples to the ranking-based classifier, which degrades MCAs. Conversely, in the unseen domain, the larger τ is, the higher the MCAt that DSEN obtains. The reason is that a higher τ feeds more unseen domain samples to the ranking-based classifier, which is good at unseen domain categorization and consequently improves MCAt. As the metric H combines MCAs and MCAt, H does not follow a monotonic trend. As τ increases, H first rises to its optimal value and then drops. From Figure 5, it can be found that the optimal value of τ differs across datasets, e.g., the optimal values are 0.8, 0.5, 0.9, and 0.8 for CUB, SUN, AWA2, and aPY, respectively.

Effects of domain division constraint Lddc. We then verify the effectiveness of the domain division constraint Lddc. As shown in Table 4, the baseline DDC obtains a higher harmonic mean (H) than S2V on all four datasets. For example, on the AWA2 dataset, DDC raises H from 39.8% to 60.1% over S2V, which is mainly attributed to the significant 24.5% improvement in MCAt. Further, with the domain division constraint Lddc, DSEN obtains higher performance than DSP. These improvements confirm that Lddc can effectively make the embedding features more distinguishable, thereby improving recognition in the challenging unseen domain.

Furthermore, the two domain-specific classifiers, enabled by Lddc, also contribute significantly to our performance. As shown in Table 4, the main improvements of our DDC come from the high MCAt in the unseen domain, which shows that reducing the search space of the ranking-based classifier is an important factor in the performance improvements. Notably, on the CUB dataset, the H of DDC is 12.9% higher than that of FGN (Kumar Verma et al., 2018), where FGN uses a single softmax classifier for the two domains. This also indicates that using two domain-specific classifiers based on Lddc is superior to using a single shared classifier.

Feature visualizations of DSEN. Figure 6 shows t-SNE visualizations of the visual features generated by DSEN on the CUB and AWA2 datasets. For each dataset, 10 categories are randomly selected from the unseen domain. The results show that DSEN not only preserves the semantic relationship in the embedding space but also obtains large inter-class discrimination. This is attributed to DSP, which captures accurate domain differences, and DDC, which enlarges the domain difference.

Figure 6. The t-SNE of visual features from DSEN on CUB and AWA2, respectively.

4.3. Comparison with existing methods

Comparison with generalized zero-shot learning. Table 4 compares DSEN with previous methods under the generalized ZSL setting. As shown in Table 4, DSEN significantly outperforms existing methods on the four datasets, e.g., DSEN obtains 15.0%, 1%, 3.5%, and 16.8% improvements in terms of metric H on CUB, SUN, AWA2, and aPY, respectively.

Table 4. Evaluation performance under generalized zero-shot learning. NG indicates non-generative methods, and G indicates generative methods that use GAN. Datasets: CUB (Welinder et al., 2010), SUN (Patterson and Hays, 2012), AWA2 (Xian et al., 2018a), and aPY (Farhadi et al., 2009).

| Type | Methods | CUB MCAt | CUB MCAs | CUB H | SUN MCAt | SUN MCAs | SUN H | AWA2 MCAt | AWA2 MCAs | AWA2 H | aPY MCAt | aPY MCAs | aPY H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NG | CMT (Socher et al., 2013) | 7.2 | 49.8 | 12.6 | 8.1 | 21.8 | 11.8 | 0.5 | 90.0 | 1.0 | 1.4 | 85.2 | 2.8 |
| NG | SYNC (Changpinyo et al., 2016) | 11.5 | 70.9 | 19.8 | 7.9 | 43.3 | 13.4 | 10.0 | 90.5 | 18.0 | 7.4 | 66.3 | 13.3 |
| NG | SAE (Kodirov et al., 2017) | 7.8 | 54.0 | 13.6 | 8.8 | 18.0 | 11.8 | 1.1 | 82.2 | 2.2 | 0.4 | 80.9 | 0.9 |
| NG | KL (Zhang and Koniusz, 2018) | 19.9 | 52.5 | 28.9 | 19.8 | 29.1 | 23.6 | 17.6 | 80.9 | 29.0 | 11.9 | 76.3 | 20.5 |
| NG | PTZSL (Long et al., 2018) | 23.0 | 51.6 | 31.8 | 19.0 | 32.7 | 24.0 | - | - | - | 15.4 | 71.3 | 25.4 |
| NG | CDL (Jiang et al., 2018) | 23.5 | 55.2 | 32.9 | 21.5 | 34.7 | 26.5 | - | - | - | 19.8 | 48.6 | 28.1 |
| NG | PSR-ZSL (Annadani and Biswas, 2018) | 24.6 | 54.3 | 33.9 | 20.8 | 37.2 | 26.7 | 20.7 | 73.8 | 32.2 | 13.5 | 51.4 | 21.4 |
| NG | SP-AEN (Chen et al., 2018) | 34.7 | 70.6 | 46.6 | 24.9 | 38.6 | 30.3 | 23.3 | 90.9 | 37.1 | 13.7 | 63.4 | 22.6 |
| G | SE-ZSL (Kumar Verma et al., 2018) | 41.5 | 53.3 | 46.7 | 40.9 | 30.5 | 34.9 | 58.3 | 68.1 | 62.8 | - | - | - |
| G | FGN (Xian et al., 2018b) | 43.7 | 57.7 | 49.7 | 42.6 | 36.6 | 39.4 | - | - | - | - | - | - |
| | S2V | 25.6 | 56.6 | 35.3 | 20.1 | 35.3 | 26.2 | 25.6 | 88.9 | 39.8 | 15.5 | 73.6 | 25.7 |
| | DSP | 30.8 | 62.7 | 41.3 | 30.0 | 40.3 | 34.4 | 31.2 | 87.9 | 46.1 | 18.1 | 73.1 | 29.0 |
| | DDC | 57.1 | 69.2 | 62.6 | 40.1 | 39.2 | 39.6 | 51.3 | 75.2 | 61.0 | 30.9 | 44.9 | 36.6 |
| | DSEN | 59.1 | 71.1 | 64.5 | 39.4 | 41.4 | 40.4 | 56.4 | 80.4 | 66.3 | 31.6 | 52.1 | 39.4 |

Table 5. Conventional zero-shot learning. The MCA (%) metric is used for comparison.

| Methods | CUB | SUN | AWA2 | aPY |
|---|---|---|---|---|
| CAV (Zhang et al., 2017) | 52.1 | 61.7 | 65.8 | - |
| FGN (Xian et al., 2018b) | 61.5 | 62.1 | - | - |
| SE-ZSL (Kumar Verma et al., 2018) | 59.6 | 63.4 | 69.2 | - |
| PSR-ZSL (Annadani and Biswas, 2018) | 56.0 | 61.4 | 63.8 | 38.4 |
| CDL (Jiang et al., 2018) | 54.5 | 63.6 | - | 43.0 |
| SP-AEN (Chen et al., 2018) | 55.4 | 59.2 | 58.5 | 24.1 |
| LDF (Li et al., 2018) | 70.4 | - | - | - |
| S2V | 52.4 | 58.2 | 65.8 | 40.5 |
| DSP | 56.2 | 62.6 | 69.1 | 41.7 |
| DDC | 71.8 | 64.0 | 71.2 | 43.1 |
| DSEN | 71.8 | 62.2 | 72.3 | 43.5 |

To evaluate the effectiveness of domain-specific projections, we compare the DSP baseline with two representative methods (Annadani and Biswas, 2018; Jiang et al., 2018), both of which employ a single shared semantic-visual projection. From Table 4, we see that the DSP baseline performs best on all four datasets in terms of metric H, which demonstrates the superiority of domain-specific projections over a single shared semantic projection. Compared with (Annadani and Biswas, 2018; Jiang et al., 2018), the other advantage of DSEN is that it makes the visual features more discriminative. In this work, we define the domain shift degree as |MCAs - MCAt|. Taking CUB as an example, the domain shift degrees of PSR-ZSL (Annadani and Biswas, 2018) and CDL (Jiang et al., 2018) are both about 30% or more, whereas DSEN has a domain shift degree of only 12%. This low domain shift degree indicates that our domain-specific projections can generate domain-robust embedding features.
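As a quick check of the domain shift degree |MCAs - MCAt| defined above, the following short Python sketch recomputes it from the CUB columns of Table 4. The dictionary layout and the function name `domain_shift_degree` are illustrative choices, not part of the original method.

```python
# (MCAt, MCAs) on CUB, copied from Table 4.
CUB_RESULTS = {
    "PSR-ZSL": (24.6, 54.3),
    "CDL": (23.5, 55.2),
    "DSP": (30.8, 62.7),
    "DSEN": (59.1, 71.1),
}


def domain_shift_degree(mca_t, mca_s):
    """Absolute gap between seen and unseen mean class accuracy."""
    return abs(mca_s - mca_t)


# Round to one decimal place to match the precision of the table.
shift = {
    name: round(domain_shift_degree(mca_t, mca_s), 1)
    for name, (mca_t, mca_s) in CUB_RESULTS.items()
}

for name, value in sorted(shift.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value}")
```

With the Table 4 entries this gives roughly 29.7 for PSR-ZSL, 31.7 for CDL, and 12.0 for DSEN.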

Different from the embedding-based PSR-ZSL and CDL, SE-ZSL (Kumar Verma et al., 2018) and FGN (Xian et al., 2018b) obtain state-of-the-art performance by alleviating the domain shift problem with synthetic visual data in the unseen domain. However, both employ the widely used fully supervised learning, which can degrade recognition performance on real seen-domain data, e.g., FGN (Xian et al., 2018b) suffers a 10% drop in MCAs with synthetic data. In contrast, the Lddc used in our DDC can reduce the influence of noisy synthesized data. For example, on the CUB dataset, DDC obtains an MCAs of 69.2%, which is higher than the values of 55.7% and 46.7% for FGN and SE-ZSL, respectively. As a consequence, the high MCAt and MCAs give DSEN the highest H on all datasets, which also demonstrates its effectiveness in generalized ZSL.

Comparison with conventional zero-shot learning. The comparison under the conventional ZSL setting, where the testing images come only from the unseen domain, is shown in Table 5. Notably, the conventional ZSL setting is easier than generalized ZSL because it excludes the seen-domain categories from the search space. From Table 5, we can observe that the proposed DSEN obtains the best performance on four datasets. Also, DDC achieves higher performance than the existing methods on four datasets. This shows the importance of the powerful and discriminative visual representations produced by the end-to-end trainable visual network. Furthermore, compared with the MCAt in Table 4 under generalized zero-shot learning, all four baselines obtain higher performance. The reason is that conventional ZSL knows in advance which domain the testing images belong to, which mitigates the projection domain shift problem.
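The difference in search space between the two settings can be made concrete with a small Python sketch. It is only an illustration: the class names, the compatibility scores, and the helper `predict` are hypothetical, and the scoring model itself is left abstract.

```python
def predict(scores, candidate_classes):
    """Return the highest-scoring class among the allowed candidates."""
    return max(candidate_classes, key=lambda c: scores[c])


# Hypothetical label sets and compatibility scores for one unseen-domain image.
SEEN = ["horse", "tiger"]
UNSEEN = ["zebra", "okapi"]
scores = {"horse": 0.48, "tiger": 0.10, "zebra": 0.40, "okapi": 0.02}

# Conventional ZSL: the domain is known, so only unseen classes compete.
conventional = predict(scores, UNSEEN)

# Generalized ZSL: seen and unseen classes compete, so a seen class
# with a slightly higher score can win.
generalized = predict(scores, SEEN + UNSEEN)
```

In this toy case the restricted search picks the unseen class "zebra", while the joint search is pulled toward the seen class "horse", which is the kind of bias that lowers MCAt in the generalized setting.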

Discussion. As shown in Table 4, DSEN achieves impressive improvements on CUB, aPY, and AWA2. However, it does not obtain a comparable improvement on the SUN dataset. The reason is that the large number of categories in SUN makes it hard to generate good visual features from low-dimensional semantic attributes. More specifically, FGN uses a GAN to generate synthetic visual features for the unseen domain, which is much more powerful than our two-layer generator. With a total of 717 categories in SUN, the gap between the two generators is hard to close with domain-specific projections and classifiers. Nevertheless, DSEN still obtains a slightly higher H than FGN, owing to a clearly higher MCAs in the seen domain, which supports the robustness of DSEN.

5. Conclusion

To address the domain shift problem in generalized zero-shot learning, we propose a novel Domain-Specific Embedding Network that applies specific projections to the seen and unseen domains based on domain characteristics. In contrast to existing methods that use a single shared projection, we demonstrate that domain-specific projections can better capture domain similarities and differences, leading to more robust embedding features. To avoid a domain-separated embedding space, a semantic reconstruction constraint is designed that uses semantic labels to associate the two specific projections in a cycle-consistent manner. Furthermore, a domain division constraint is developed to make the generated embedding features more distinguishable. Experiments on four benchmarks demonstrate the effectiveness of the proposed method.

In the future, more powerful generators, e.g., GANs, will be explored to provide more reliable synthetic visual representations. Domain-specific projection architectures will also be explored using AutoML, which may yield further improvements.

6. Acknowledgement

This work is supported by the National Key Research and Development Program of China (2017YFC0820600), the National Defense Science and Technology Fund for Distinguished Young Scholars (2017-JCJQ-ZQ-022), the National Natural Science Foundation of China (61525206, 61771468, 61622211, 61620106009), the Youth Innovation Promotion Association Chinese Academy of Sciences (2017209), the National Postdoctoral Programme for Innovative Talents (BX20180358), and the Fundamental Research Funds for the Central Universities (WK2100100030).

References

  • Akata et al. (2013) Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. 2013. Label-embedding for attribute-based classification. In 2013 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 819–826.
  • Akata et al. (2016) Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. 2016. Label-embedding for image classification. IEEE transactions on pattern analysis and machine intelligence 38, 7 (2016), 1425–1438.
  • Akata et al. (2015) Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. 2015. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2927–2936.
  • Annadani and Biswas (2018) Yashas Annadani and Soma Biswas. 2018. Preserving Semantic Relations for Zero-Shot Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7603–7612.
  • Bucher et al. (2017) Maxime Bucher, Stéphane Herbin, and Frédéric Jurie. 2017. Generating visual representations for zero-shot classification. In Proceedings of the IEEE International Conference on Computer Vision. 2666–2673.
  • Changpinyo et al. (2016) Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. 2016. Synthesized classifiers for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5327–5336.
  • Chen et al. (2018) Long Chen, Hanwang Zhang, Jun Xiao, Wei Liu, and Shih-Fu Chang. 2018. Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2.
  • Fang et al. (2018) Shancheng Fang, Hongtao Xie, Zheng-Jun Zha, Nannan Sun, Jianlong Tan, and Yongdong Zhang. 2018. Attention and language ensemble for scene text recognition with convolutional sequence modeling. In 2018 ACM Multimedia Conference on Multimedia Conference. ACM, 248–256.
  • Farhadi et al. (2009) Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. 2009. Describing objects by their attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1778–1785.
  • Frome et al. (2013) Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. 2013. Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems. 2121–2129.
  • Fu et al. (2014) Yanwei Fu, Timothy M Hospedales, Tao Xiang, Zhenyong Fu, and Shaogang Gong. 2014. Transductive multi-view embedding for zero-shot recognition and annotation. In European Conference on Computer Vision. Springer, 584–599.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems. 2672–2680.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • He and Peng (2018) Xiangteng He and Yuxin Peng. 2018. Only Learn One Sample: Fine-Grained Visual Categorization with One Sample Training. In 2018 ACM Multimedia Conference on Multimedia Conference. ACM, 1372–1380.
  • Jiang et al. (2018) Huajie Jiang, Ruiping Wang, Shiguang Shan, and Xilin Chen. 2018. Learning class prototypes via structure alignment for zero-shot recognition. In Proceedings of the European conference on computer vision. 118–134.
  • Jiang et al. (2017) Huajie Jiang, Ruiping Wang, Shiguang Shan, Yi Yang, and Xilin Chen. 2017. Learning discriminative latent attributes for zero-shot classification. In Proceedings of the IEEE International Conference on Computer Vision. 4223–4232.
  • Kodirov et al. (2015) Elyor Kodirov, Tao Xiang, Zhenyong Fu, and Shaogang Gong. 2015. Unsupervised domain adaptation for zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision. 2452–2460.
  • Kodirov et al. (2017) Elyor Kodirov, Tao Xiang, and Shaogang Gong. 2017. Semantic autoencoder for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3174–3183.
  • Kumar Verma et al. (2018) Vinay Kumar Verma, Gundeep Arora, Ashish Mishra, and Piyush Rai. 2018. Generalized zero-shot learning via synthesized examples. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4281–4289.
  • Lampert et al. (2009) Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. 2009. Learning to detect unseen object classes by between-class attribute transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 951–958.
  • Lampert et al. (2014) Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. 2014. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 3 (2014), 453–465.
  • Lazaridou et al. (2015) Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. 2015. Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In the 7th International Joint Conference on Natural Language Processing, Vol. 1. 270–280.
  • Li et al. (2018) Yan Li, Junge Zhang, Jianguo Zhang, and Kaiqi Huang. 2018. Discriminative Learning of Latent Features for Zero-Shot Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7463–7471.
  • Long et al. (2018) Teng Long, Xing Xu, Youyou Li, Fumin Shen, Jingkuan Song, and Heng Tao Shen. 2018. Pseudo transfer with marginalized corrupted attribute for zero-shot learning. In 2018 ACM international conference on Multimedia. ACM, 1802–1810.
  • Mishra et al. (2018) Ashish Mishra, Shiva Krishna Reddy, Anurag Mittal, and Hema A Murthy. 2018. A generative model for zero shot learning using conditional variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2188–2196.
  • Morgado and Vasconcelos (2017) Pedro Morgado and Nuno Vasconcelos. 2017. Semantically consistent regularization for zero-shot recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 9. 10.
  • Niu et al. (2017) Yulei Niu, Zhiwu Lu, Songfang Huang, Xin Gao, and Ji-Rong Wen. 2017. FeaBoost: Joint Feature and Label Refinement for Semantic Segmentation. In AAAI. 1474–1480.
  • Palatucci et al. (2009) Mark Palatucci, Dean Pomerleau, Geoffrey E Hinton, and Tom M Mitchell. 2009. Zero-shot learning with semantic output codes. In Advances in neural information processing systems. 1410–1418.
  • Patterson and Hays (2012) Genevieve Patterson and James Hays. 2012. Sun attribute database: Discovering, annotating, and recognizing scene attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2751–2758.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing. 1532–1543.
  • Qiao et al. (2016) Ruizhi Qiao, Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. 2016. Less is more: zero-shot learning from online textual documents with noise suppression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2249–2257.
  • Radovanović et al. (2010) Miloš Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović. 2010. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research 11, Sep (2010), 2487–2531.
  • Romera-Paredes and Torr (2015) Bernardino Romera-Paredes and Philip Torr. 2015. An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning. 2152–2161.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
  • Shigeto et al. (2015) Yutaro Shigeto, Ikumi Suzuki, Kazuo Hara, Masashi Shimbo, and Yuji Matsumoto. 2015. Ridge regression, hubness, and zero-shot learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 135–151.
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  • Socher et al. (2013) Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. 2013. Zero-shot learning through cross-modal transfer. In Advances in neural information processing systems. 935–943.
  • Song et al. (2018) Jie Song, Chengchao Shen, Yezhou Yang, Yang Liu, and Mingli Song. 2018. Transductive Unbiased Embedding for Zero-Shot Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1024–1033.
  • Tomasev et al. (2014) Nenad Tomasev, Milos Radovanovic, Dunja Mladenic, and Mirjana Ivanovic. 2014. The role of hubness in clustering high-dimensional data. IEEE transactions on knowledge and data engineering 26, 3 (2014), 739–751.
  • Wang et al. (2019) Chaojie Wang, Bo Chen, Sucheng Xiao, and Mingyuan Zhou. 2019. Convolutional Poisson Gamma Belief Network. In ICML.
  • Wang et al. (2018) Chaojie Wang, Bo Chen, and Mingyuan Zhou. 2018. Multimodal Poisson gamma belief network. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Welinder et al. (2010) Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. 2010. Caltech-UCSD birds 200. (2010).
  • Xian et al. (2016) Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh Nguyen, Matthias Hein, and Bernt Schiele. 2016. Latent embeddings for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 69–77.
  • Xian et al. (2018a) Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. 2018a. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. IEEE transactions on pattern analysis and machine intelligence (2018).
  • Xian et al. (2018b) Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. 2018b. Feature generating networks for zero-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5542–5551.
  • Xie et al. (2019) Hongtao Xie, Dongbao Yang, Nannan Sun, Zhineng Chen, and Yongdong Zhang. 2019. Automated pulmonary nodule detection in CT images using deep convolutional neural networks. Pattern Recognition 85 (2019), 109–119.
  • Yang et al. (2016) Yang Yang, Yadan Luo, Weilun Chen, Fumin Shen, Jie Shao, and Heng Tao Shen. 2016. Zero-shot hashing via transferring supervised knowledge. In Proceedings of the 24th ACM international conference on Multimedia. ACM, 1286–1295.
  • Zhang and Koniusz (2018) Hongguang Zhang and Piotr Koniusz. 2018. Zero-shot kernel learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7670–7679.
  • Zhang et al. (2017) Li Zhang, Tao Xiang, and Shaogang Gong. 2017. Learning a deep embedding model for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2021–2030.
  • Zheng et al. (2018) Feng Zheng, Xin Miao, and Heng Huang. 2018. Fast vehicle identification via ranked semantic sampling based embedding. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press, 3697–3703.