Domain-Specific Embedding Network for Zero-Shot Recognition
Abstract.
Zero-Shot Learning (ZSL) seeks to recognize a sample from either the seen or the unseen domain by projecting the image data and semantic labels into a joint embedding space. However, most existing methods directly adapt a well-trained projection from one domain to another, thereby ignoring the serious bias problem caused by domain differences. To address this issue, we propose a novel Domain-Specific Embedding Network (DSEN) that can apply specific projections to different domains for unbiased embedding, as well as several domain constraints. In contrast to previous methods, the DSEN decomposes the domain-shared projection function into one domain-invariant and two domain-specific sub-functions to explore the similarities and differences between two domains. To prevent the two specific projections from breaking the semantic relationship, a semantic reconstruction constraint is proposed by applying the same decoder function to them in a cycle consistency way. Furthermore, a domain division constraint is developed to directly penalize the margin between real and pseudo image features in respective seen and unseen domains, which can enlarge the inter-domain difference of visual features. Extensive experiments on four public benchmarks demonstrate the effectiveness of DSEN with an average of $9.2\%$ improvement in terms of harmonic mean. The code is available at https://github.com/mboboGO/DSENforGZSL.
1. Introduction
Traditional recognition tasks have progressed with the help of massive labeled images and deep models (Simonyan and Zisserman, 2014; He et al., 2016; He and Peng, 2018; Fang et al., 2018; Zheng et al., 2018; Wang et al., 2018; Xie et al., 2019; Wang et al., 2019). However, their major disadvantage is that they cannot recognize images belonging to unseen categories, and it is laborious to collect sufficient labeled images for various tasks. To tackle this problem, generalized Zero-Shot Learning (ZSL) (Palatucci et al., 2009; Akata et al., 2013; Lampert et al., 2014; Romera-Paredes and Torr, 2015; Yang et al., 2016; Morgado and Vasconcelos, 2017; Xian et al., 2018a; Long et al., 2018) has attracted a lot of attention in recent years. Generalized zero-shot recognition is defined as recognizing a sample from either the seen or the unseen domain, where the two domains contain disjoint categories. A general paradigm is to project the image data and semantic labels, e.g., category attributes (Farhadi et al., 2009; Lampert et al., 2009), into a joint embedding space, where recognition becomes a nearest neighbor search problem (Song et al., 2018). A visual diagram is shown in Figure 1. The major challenge of this paradigm is that the different data distributions of the two domains lead to serious domain shift problems (Fu et al., 2014; Kodirov et al., 2015), which bias the embedding features towards the seen domain.
To address this issue, existing methods focus on learning a robust projection between visual representations and semantic labels. The related methods can be coarsely classified into two classes: the embedding-based framework and the semantic-preserving framework. The embedding-based framework (Tomasev et al., 2014; Romera-Paredes and Torr, 2015; Akata et al., 2016; Jiang et al., 2017; Kumar Verma et al., 2018) aims to establish a discriminative embedding space that is shared across two domains. Two commonly used embedding spaces are spanned by visual representations (Zhang et al., 2017) and semantic labels (Kumar Verma et al., 2018), respectively. Taking Figure 2 (a) as an example, the semantic labels are projected into the visual space to match the corresponding visual representations, which has been shown to be robust to the Hubness problem (Tomasev et al., 2014; Lazaridou et al., 2015). Different from the embedding-based framework, the semantic-preserving framework (Kodirov et al., 2017; Annadani and Biswas, 2018; Chen et al., 2018; Xian et al., 2018a) focuses on preserving the semantic prototype in an embedding space through an autoencoder architecture. An example is shown in Figure 2 (b). Although the above methods are effective for zero-shot problems, they all employ a single shared projection for both seen and unseen domains. Due to the broad domain gap, a shared projection function cannot model the full specialty of each domain, leading to biased recognition.
In this paper, we propose a novel Domain-Specific Embedding Network (DSEN) to alleviate the domain shift problem in ZSL by applying specific projections to different domains, as well as several domain constraints. The novelties of DSEN over the previous methods are shown in Figure 2 (c). Instead of a single shared projection, DSEN decomposes the projection function into three components: a domain-invariant projection ${\varphi}_{c}$, a seen domain-specific projection ${\varphi}_{s}$, and an unseen domain-specific projection ${\varphi}_{t}$. ${\varphi}_{c}$ aims to capture the common projection knowledge between two domains, and ${\varphi}_{s}$ and ${\varphi}_{t}$ are used to capture the domain-specific projection knowledge. Notably, ${\varphi}_{s}$ and ${\varphi}_{t}$ should project the semantic labels of different domains into a shared embedding space $E$ in Figure 2 (c) for cross-domain recognition. To this end, a semantic reconstruction constraint is designed, by applying the same decoder function to both ${\varphi}_{s}$ and ${\varphi}_{t}$ in a cycle consistency way, to preserve the shared semantic relationship in $E$. Compared to using a single shared projection, our domain-specific projections can generate less biased embedding features because they model the specialty of each domain.
Furthermore, a domain division constraint is developed to enlarge both intra- and inter-domain discrimination of visual features, based on pseudo visual data in the unseen domain. Besides fully supervised learning in the seen domain, our domain division constraint restricts the noisy pseudo data to have a uniform label distribution over seen categories. The advantages of this constraint mechanism are as follows: a) being robust to the noisy pseudo visual features; b) directly enlarging the visual margin between two domains; and c) enabling end-to-end training. Consequently, the decision boundary of visual features between two domains becomes clearer, which allows us to utilize specific classifiers in a determinate search space for unbiased recognition.
Our contributions are threefold:

• We propose a novel Domain-Specific Embedding Network (DSEN) by applying specific projections to two domains, which can better capture domain similarities and differences for unbiased embedding.

• A domain division constraint is designed to effectively enhance both intra- and inter-domain discrimination, based on real and pseudo visual data in two domains. Besides, it also enables DSEN to be trained end-to-end.

• The proposed DSEN obtains state-of-the-art performance on four public datasets with an average of 9.2% improvement in terms of harmonic mean.
2. Related Work
Three types of related techniques are discussed in this section.
2.1. Embedding-Based Zero-Shot Learning
A general paradigm of zero-shot recognition aims to project the image representations and semantic labels into a joint embedding space, where recognition becomes a nearest neighbor search problem (Song et al., 2018). This approach is called an embedding-based method, which is one of the most popular ZSL strategies. As the seen and unseen domains have disjoint categories, additional semantic information, such as attributes (Farhadi et al., 2009; Lampert et al., 2009) and word vectors (Pennington et al., 2014; Niu et al., 2017), is used to construct a relationship between these two domains. Among these methods, Frome et al. (Frome et al., 2013) and Akata et al. (Akata et al., 2016) use a bilinear embedding model trained with a pairwise ranking loss. The ESZSL model (Romera-Paredes and Torr, 2015) constructs an embedding space with a Frobenius norm regularization, and Qiao et al. (Qiao et al., 2016) extend this work to online documents by suppressing the noise with an extra ${l}_{1,2}$ norm. In addition, Akata et al. (Akata et al., 2015) build a joint embedding space with several compatibility functions, which is improved in (Xian et al., 2016) by incorporating latent variables. Zhang et al. (Zhang and Koniusz, 2018) employ a nonlinear kernel to generate a mapping between visual representations and attributes. In spite of their promising performance, the above methods directly project the visual representations into a space spanned by semantic labels, which suffers from the Hubness problem (Radovanović et al., 2010; Tomasev et al., 2014; Lazaridou et al., 2015). The Hubness problem is defined as a few points being the nearest neighbors of most of the other points. It arises because projecting a high-dimensional visual feature into a low-dimensional attribute space shrinks the variance of the projected data points (Zhang et al., 2017).
Therefore, a few methods (Shigeto et al., 2015; Zhang et al., 2017; Xian et al., 2018a) use an embedding space spanned by visual features, which is defined as a semantic-visual embedding. Although the previous methods are effective, insufficient semantic embedding limits their further application due to serious domain shift problems. For example, a testing sample from the unseen domain tends to be recognized as one of the seen categories.
2.2. Semantic-Preserving Framework
To alleviate the domain shift problem, many recent works aim to preserve the semantic prototype in an embedding space. The motivation behind this tactic is that the semantic prototype is robust to domain change, which is beneficial for training a robust projection function. Among these methods, SAE (Kodirov et al., 2017) and SP-AEN (Chen et al., 2018) use an autoencoder architecture on the embedded space to make their embedding features discriminative. Jiang et al. (Jiang et al., 2018) propose a coupled dictionary learning model to preserve the visual-semantic structures with semantic prototypes. In particular, Annadani et al. (Annadani and Biswas, 2018) preserve the semantic relationships in visual space by decomposing the relation between categories into three groups. Although the embedding-based methods and the semantic-preserving framework can effectively solve ZSL problems, they mainly rely on a single shared projection across two domains, which ignores the large domain differences.
2.3. Synthetic Data-Based Methods
Recently, synthetic data-based methods (Bucher et al., 2017; Mishra et al., 2018; Long et al., 2018; Kumar Verma et al., 2018; Xian et al., 2018b) have been proposed, and they have obtained state-of-the-art performance. In contrast to embedding-based methods, they train a softmax classifier with full supervision on the union of real and synthetic visual data from seen and unseen domains. The synthetic visual data is obtained by a specific generator, such as a GAN (Goodfellow et al., 2014) or one of its variants, conditioned on the unseen domain attributes. Because they train on both seen and unseen visual features, these models are more robust to domain shifts than embedding-based methods. However, fully supervised learning is sensitive to noisy synthetic visual data, an issue that has not been fully explored.
3. Domain-Specific Embedding Network
We first describe the problem formulation of the Domain-Specific Embedding Network in Sec. 3.1 and then provide a detailed implementation of DSEN. The pipeline is shown in Figure 3.
3.1. Problem Formulation
Let $\mathcal{S}=\{({x}_{s},{y}_{s},{\bm{a}}_{s})\mid {x}_{s}\in {\mathcal{X}}_{s},{y}_{s}\in {\mathcal{Y}}_{s},{\bm{a}}_{s}\in {\mathcal{A}}_{s}\}$ represent the seen domain dataset, where ${y}_{s}$ and ${\bm{a}}_{s}$ are the class label and semantic attributes for each image ${x}_{s}$, respectively. $\mathcal{T}=\{({x}_{t},{y}_{t},{\bm{a}}_{t})\mid {x}_{t}\in {\mathcal{X}}_{t},{y}_{t}\in {\mathcal{Y}}_{t},{\bm{a}}_{t}\in {\mathcal{A}}_{t}\}$ is similarly defined as the unseen domain dataset, where ${\mathcal{Y}}_{s}\cap {\mathcal{Y}}_{t}=\mathrm{\varnothing}$. Given the seen domain data $\mathcal{S}$ and unseen domain labels ${\mathcal{Y}}_{t}$ with attributes ${\mathcal{A}}_{t}$, the target of a generalized ZSL task is to recognize an image from either ${\mathcal{X}}_{s}$ or ${\mathcal{X}}_{t}$.
Based on the above definition, a basic objective for our DSEN is to:
(1)  $\underset{{W}_{\varphi}}{\mathrm{min}}{\displaystyle \sum _{{x}_{s}\in {\mathcal{X}}_{s}}}d(f({x}_{s}),\varphi ({\bm{a}}_{s})),$ 
where $f(\cdot )$ is the visual feature extractor for images and $\varphi (\cdot )$ is a semantic-visual projection function with trainable weights ${W}_{\varphi}$. Notably, $\varphi ({\bm{a}}_{s})$ is the semantic embedding. The distance function $d(\cdot ,\cdot )$ computes the negative cosine similarity between two features ${\bm{v}}_{1}$ and ${\bm{v}}_{2}$ by:
(2)  $d({\bm{v}}_{1},{\bm{v}}_{2})=-\frac{{\bm{v}}_{1}^{\top}{\bm{v}}_{2}}{{\|{\bm{v}}_{1}\|}_{2}\,{\|{\bm{v}}_{2}\|}_{2}}.$
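As an illustration of how Eq. (1) and the distance $d$ turn recognition into a nearest neighbor search, the following minimal sketch assumes that $d$ is the negative cosine similarity between two vectors. The function names, class names, and three-dimensional toy vectors are hypothetical and are not taken from the released code.

```python
import math

def neg_cosine(v1, v2):
    """Negative cosine similarity: smaller values mean closer features."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return -dot / (norm1 * norm2)

def nearest_class(visual_feature, class_embeddings):
    """Return the class whose semantic embedding minimizes d(f(x), phi(a))."""
    return min(class_embeddings,
               key=lambda c: neg_cosine(visual_feature, class_embeddings[c]))

# Toy stand-ins for phi(a) of two classes and f(x) of one image (made up).
class_embeddings = {"zebra": [1.0, 0.0, 0.2], "whale": [0.0, 1.0, 0.1]}
feature = [0.9, 0.1, 0.3]
prediction = nearest_class(feature, class_embeddings)
print(prediction)  # zebra
```

Because only the angle between vectors matters, the prediction is unchanged if either the visual feature or an embedding is rescaled by a positive constant.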
3.2. Domain-Specific Projections
One leading cause of the domain shift problem is that a shared ${\varphi}_{c}$ cannot model the full differences between two domains, thereby biasing the generated embedding features towards the seen domain. To model the differences between two domains, we decompose the projection function into three parts: a domain-invariant projection ${\varphi}_{c}$, a seen domain-specific projection ${\varphi}_{s}$, and an unseen domain-specific projection ${\varphi}_{t}$. Thus, the embedding features from seen and unseen domains become the combination of two sub-features:
(3)  $\varphi (\bm{a})=\left\{\begin{array}{ll}{\varphi}_{s}(\bm{a})+{\varphi}_{c}(\bm{a}) & \text{if } \bm{a}\in {\mathcal{A}}_{s},\\ {\varphi}_{t}(\bm{a})+{\varphi}_{c}(\bm{a}) & \text{if } \bm{a}\in {\mathcal{A}}_{t},\end{array}\right.$ 
where ${\varphi}_{c}$ is used to capture the common knowledge between two domains, and ${\varphi}_{s}$ and ${\varphi}_{t}$ capture the specific characteristics for seen and unseen domains, respectively. Compared to existing methods that use a single shared projection, the additional domain-specific projections can better accommodate domain differences, yielding more discriminative embedding features. By taking ${\varphi}_{c}$ and ${\varphi}_{s}$ into consideration, Eq. (1) becomes minimizing:
(4)  ${\mathcal{L}}_{svs}={\displaystyle \sum _{{x}_{s}\in {\mathcal{X}}_{s}}}d(f({x}_{s}),{\varphi}_{s}({\bm{a}}_{s})+{\varphi}_{c}({\bm{a}}_{s})).$ 
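The decomposition in Eq. (3) can be sketched as follows. This is a simplified illustration only: each projection is replaced by a toy linear map rather than the two-layer networks used in DSEN, and the helper names `linear` and `make_phi` are hypothetical.

```python
def linear(weights):
    """Build a toy linear projection a -> W a from a list of weight rows."""
    def project(a):
        return [sum(w * x for w, x in zip(row, a)) for row in weights]
    return project

def make_phi(phi_c, phi_s, phi_t, seen_attrs):
    """Eq. (3): add the matching domain-specific branch to the shared branch."""
    seen = {tuple(a) for a in seen_attrs}
    def phi(a):
        specific = phi_s if tuple(a) in seen else phi_t
        return [u + v for u, v in zip(specific(a), phi_c(a))]
    return phi

# Toy weights (made up): phi_c is the identity, the specific branches differ.
phi_c = linear([[1, 0], [0, 1]])
phi_s = linear([[0.5, 0], [0, 0]])
phi_t = linear([[0, 0], [0, 0.5]])
phi = make_phi(phi_c, phi_s, phi_t, seen_attrs=[[1.0, 0.0]])

seen_embedding = phi([1.0, 0.0])    # uses phi_s + phi_c
unseen_embedding = phi([0.0, 1.0])  # uses phi_t + phi_c
print(seen_embedding, unseen_embedding)  # [1.5, 0.0] [0.0, 1.5]
```

The shared branch is applied to every attribute vector, while the routing step selects which specific branch contributes, so only ${\varphi}_{c}$ sees attributes from both domains.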
Different from ${\varphi}_{s}$, ${\varphi}_{t}$ is difficult to train because $f({x}_{t})$ is unavailable. Specifically, it is hard to constrain ${\bm{a}}_{s}$ and ${\bm{a}}_{t}$ to be projected into a shared embedding space by different ${\varphi}_{s}$ and ${\varphi}_{t}$, as cross-domain recognition requires. To achieve this goal, a semantic reconstruction constraint ${\mathcal{L}}_{sr}$ is designed by applying the same decoder function to both ${\varphi}_{s}+{\varphi}_{c}$ and ${\varphi}_{t}+{\varphi}_{c}$ for semantic label reconstruction. The motivation is that the semantic labels are shared across two domains; thus ${\mathcal{L}}_{sr}$ can constrain the semantic embeddings from the two specific projections to be associated in a shared embedding space. First, ${\varphi}_{t}$ is initialized from the well-trained ${\varphi}_{s}$ for projection knowledge transfer, which also facilitates convergence. Then, the semantic information in ${\bm{a}}_{s}$ and ${\bm{a}}_{t}$ is simultaneously encoded into ${\varphi}_{s}$ and ${\varphi}_{t}$ in a semantic cycle consistency way. Consequently, ${\mathcal{L}}_{sr}$ enables ${\varphi}_{t}$ to capture the effective projection knowledge in the unseen domain based on ${\mathcal{A}}_{t}$, as detailed below.
Inspired by the applications of autoencoding architectures in unsupervised representation learning, a domain-specific autoencoder architecture is used to encode the semantic information in both ${\varphi}_{t}$ and ${\varphi}_{s}$ by:
(5)  $\begin{array}{rl}{\mathcal{L}}_{sr}= & {\displaystyle \sum _{{\bm{a}}_{s}\in {\mathcal{A}}_{s}}}{\left\|{\varphi}_{sr}({\varphi}_{s}({\bm{a}}_{s})+{\varphi}_{c}({\bm{a}}_{s}))-{\bm{a}}_{s}\right\|}_{2}^{2}\\ & +{\displaystyle \sum _{{\bm{a}}_{t}\in {\mathcal{A}}_{t}}}{\left\|{\varphi}_{sr}({\varphi}_{t}({\bm{a}}_{t})+{\varphi}_{c}({\bm{a}}_{t}))-{\bm{a}}_{t}\right\|}_{2}^{2},\end{array}$ 
where ${\varphi}_{sr}$ is a shared decoder function for both domains. From Eq. (5), ${\varphi}_{c}$ has access to the semantic information in two domains and can therefore capture the domain similarity information. ${\varphi}_{s}$ and ${\varphi}_{t}$ only have access to domain-specific information, which leads them to capture the specific characteristics of their respective domains.
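A minimal sketch of the semantic reconstruction constraint is given below, assuming that Eq. (5) sums a squared L2 reconstruction error over the seen and unseen attribute sets. The projections and the decoder are passed in as plain Python callables; the toy functions at the bottom are made up for illustration and do not reflect trained networks.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def semantic_reconstruction_loss(seen_attrs, unseen_attrs,
                                 phi_c, phi_s, phi_t, phi_sr):
    """Sum of reconstruction errors, using one shared decoder phi_sr."""
    loss = 0.0
    for attrs, phi_d in ((seen_attrs, phi_s), (unseen_attrs, phi_t)):
        for a in attrs:
            # Embedding = domain-specific branch + shared branch, as in Eq. (3).
            embedding = [u + v for u, v in zip(phi_d(a), phi_c(a))]
            loss += squared_l2(phi_sr(embedding), a)
    return loss

# Toy callables (made up): identity shared branch and identity decoder.
phi_c = lambda a: list(a)
phi_s = lambda a: [0.0 for _ in a]
phi_t = lambda a: [0.5 * x for x in a]
phi_sr = lambda e: list(e)

loss = semantic_reconstruction_loss([[1.0, 0.0]], [[0.0, 2.0]],
                                    phi_c, phi_s, phi_t, phi_sr)
print(loss)  # 1.0
```

In this toy case the seen attribute is reconstructed exactly, while the unseen branch adds an offset that the decoder does not undo, so the whole loss comes from the unseen term.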
Finally, the objective function for domainspecific projections becomes:
(6)  $\underset{{W}_{{\varphi}_{c}},{W}_{{\varphi}_{s}},{W}_{{\varphi}_{t}},{W}_{{\varphi}_{sr}}}{\mathrm{min}}{\mathcal{L}}_{svs}+{\lambda}_{1}{\mathcal{L}}_{sr},$ 
where ${\lambda}_{1}$ is a hyperparameter used to balance different constraints. All the encoders ${\varphi}_{c}$, ${\varphi}_{s}$, ${\varphi}_{t}$ and the decoder ${\varphi}_{sr}$ are implemented with two fully connected layers followed by ReLU activation.
Consequently, the domain-specific projections ${\varphi}_{s}$ and ${\varphi}_{t}$ assist ${\varphi}_{c}$ in generating less biased embedding features during semantic-visual projection. The detailed architecture of our domain-specific projections is shown in Figure 3.
3.3. Domain Division Constraint
Based on the embedding features from semantic attributes, we further propose a domain division constraint to make the embedding features between two domains distinguishable.
To achieve this goal, we first generate pseudo visual features from category attributes ${\bm{a}}_{t}$ for the unseen domain. Specifically, we regard $\varphi ({\bm{a}}_{t})$ as pseudo visual features, because $\varphi ({\bm{a}}_{t})$ and $f({x}_{t})$ have similar distributions under well-trained semantic-visual projections. With the real visual features $f({x}_{s})$ and pseudo visual features $\varphi ({\bm{a}}_{t})$, it is intuitive to train a $|{\mathcal{Y}}_{s}\cup {\mathcal{Y}}_{t}|$-way softmax classifier, which can recognize visual samples from either the seen or the unseen domain. However, $\varphi ({\bm{a}}_{t})$ is usually too noisy for fully supervised learning, which may deteriorate the model performance in the seen domain. Therefore, the domain division constraint (DDC) only constrains the noisy pseudo features $\varphi ({\bm{a}}_{t})$ to be far away from the seen categories, because ${\mathcal{Y}}_{s}$ and ${\mathcal{Y}}_{t}$ are disjoint. Thus, a $|{\mathcal{Y}}_{s}|$-way softmax classifier $p$ is trained by minimizing:
(7)  ${\mathcal{L}}_{ddc}=-{\displaystyle \sum _{{x}_{s}\in {\mathcal{X}}_{s}}}\ln {p}_{{y}^{*}}(f({x}_{s}))+\alpha {\displaystyle \sum _{{\bm{a}}_{t}\in {\mathcal{A}}_{t}}}\ln \widehat{p}(\varphi ({\bm{a}}_{t})),$ 
where ${p}_{{y}^{*}}(\cdot )$ is the classification score for the ground truth label ${y}^{*}$, and $\widehat{p}(\cdot )$ is the maximum classification score over ${\mathcal{Y}}_{s}$. The first term in Eq. (7) is a general cross-entropy softmax loss in the seen domain. The second term pushes $\varphi ({\bm{a}}_{t})$ towards a uniform label distribution over ${\mathcal{Y}}_{s}$, which means that $\varphi ({\bm{a}}_{t})$ should not be recognized as any seen category.
Table 1. Details of the datasets.
Datasets  Attributes  $|{\mathcal{Y}}_{s}|$  $|{\mathcal{Y}}_{t}|$  train  val  test
CUB  312  150  50  7,057  1,764  2,967
SUN  102  645  72  10,320  2,580  1,440
AWA2  85  40  10  23,527  5,882  7,913
aPY  64  20  12  5,932  1,483  7,924
$\alpha $ is a hyperparameter that is used to balance the training effects between real and pseudo visual features on classifier $p$.
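The two terms of Eq. (7) can be sketched in a few lines. This sketch assumes that the first term carries a negative sign (the usual cross-entropy convention) and that the classifier $p$ is represented by raw logits over the seen classes; the function names and toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ddc_loss(seen_logits, seen_labels, pseudo_logits, alpha=0.1):
    """Cross-entropy on real seen features plus a penalty on the
    maximum seen-class probability of pseudo unseen features."""
    cross_entropy = 0.0
    for logits, label in zip(seen_logits, seen_labels):
        cross_entropy -= math.log(softmax(logits)[label])
    penalty = 0.0
    for logits in pseudo_logits:
        penalty += math.log(max(softmax(logits)))
    return cross_entropy + alpha * penalty

# A pseudo feature with a flat distribution over 4 seen classes is
# penalized less than one that looks like a confident seen-class sample.
uniform_loss = ddc_loss([], [], [[0.0, 0.0, 0.0, 0.0]])
peaked_loss = ddc_loss([], [], [[5.0, 0.0, 0.0, 0.0]])
print(uniform_loss < peaked_loss)  # True
```

The penalty term is minimized when the largest seen-class probability is as small as possible, which for a softmax over $|{\mathcal{Y}}_{s}|$ classes is the uniform distribution.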
Based on Eq. (7), the decision boundary between $f({x}_{s})$ and $f({x}_{t})$ can be determined by judging whether the label distribution of an input sample is smooth over ${\mathcal{Y}}_{s}$. For an input image from the seen domain, $\widehat{p}(f(x))$ should be large and attained at the true label. Conversely, for an image from the unseen domain, $\widehat{p}(f(x))$ should be small, indicating a nearly uniform label distribution. Furthermore, since the well-trained classifier $p$ can only categorize within the seen domain, we apply a ranking-based classifier, built on nearest neighbor search, to those samples that are suspected to come from the unseen domain. Thus, the final inference of our DSEN can be expressed by:
(8)  $\widehat{y}=\left\{\begin{array}{ll}\mathrm{arg}\underset{y\in {\mathcal{Y}}_{s}}{\mathrm{max}}\,{p}_{y}(f(x)) & \text{if } \widehat{p}(f(x))>\tau ,\\ \mathrm{arg}\underset{y\in {\mathcal{Y}}_{t}}{\mathrm{min}}\,d(f(x),\varphi ({\bm{a}}_{t})) & \text{otherwise},\end{array}\right.$ 
where $x\in {\mathcal{X}}_{s}\cup {\mathcal{X}}_{t}$, and $\widehat{y}$ is the final prediction. $\tau $ is a threshold to determine the domain of an input sample.
With Eq. (8), we can divide the search space for any sample into two subspaces. When a sample is judged to come from the unseen domain, the ranking-based classifier is used for recognition, and its search space is reduced to the unseen categories. For samples from the seen domain, the softmax classifier $p$ directly gives confident category predictions. Reducing the search space in this way improves recognition in both domains, which is attributed to our domain division constraint ${\mathcal{L}}_{ddc}$.
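The two-branch inference of Eq. (8) can be sketched as below. The sketch assumes that the seen-class classifier is available as a vector of logits and that $d$ is the negative cosine similarity; all names, class labels, and toy values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def neg_cosine(v1, v2):
    """Negative cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return -dot / (norm1 * norm2)

def predict(feature, seen_logits, seen_classes, unseen_embeddings, tau):
    """Use the softmax classifier if it is confident (p_hat > tau),
    otherwise search only among the unseen-class embeddings."""
    probs = softmax(seen_logits)
    p_hat = max(probs)
    if p_hat > tau:
        return seen_classes[probs.index(p_hat)]
    return min(unseen_embeddings,
               key=lambda c: neg_cosine(feature, unseen_embeddings[c]))

seen_classes = ["horse", "dog"]
unseen_embeddings = {"zebra": [1.0, 0.2], "seal": [0.1, 1.0]}
feature = [0.9, 0.1]

confident = predict(feature, [6.0, 0.0], seen_classes, unseen_embeddings, tau=0.8)
uncertain = predict(feature, [0.2, 0.0], seen_classes, unseen_embeddings, tau=0.8)
print(confident, uncertain)  # horse zebra
```

The threshold only decides which classifier is consulted, so a sample routed to the wrong branch cannot be recovered by the other one. This is why the choice of $\tau$ matters.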
3.4. Overall Objective
Finally, the overall objective function of DSEN becomes:
(9)  $\underset{{W}_{{\varphi}_{c}},{W}_{{\varphi}_{s}},{W}_{{\varphi}_{t}},{W}_{{\varphi}_{sr}},{W}_{f}}{\mathrm{min}}{\mathcal{L}}_{svs}+{\lambda}_{1}{\mathcal{L}}_{sr}+{\lambda}_{2}{\mathcal{L}}_{ddc},$ 
where ${W}_{f}$ denotes the trainable parameters of the visual feature extraction function $f(\cdot )$, and ${\lambda}_{1}$ and ${\lambda}_{2}$ balance different constraints. Notably, in many existing ZSL methods (Jiang et al., 2018; Chen et al., 2018; Xian et al., 2018b), $f(\cdot )$ is fixed across different datasets, leading to a weak visual representation $f(x)$. Instead, ${\mathcal{L}}_{ddc}$ enables DSEN to be trained end-to-end with a trainable $f(\cdot )$. Therefore, both the visual representations and the embedding features from our DSEN are discriminative.
4. Experiments
In this section, experimental analysis on four benchmarks is given to evaluate the proposed DSEN.
4.1. Experimental Settings
Datasets. We evaluate the proposed method on four widely used benchmarks: Caltech-UCSD Birds-200-2011 (CUB) (Welinder et al., 2010), SUN (Patterson and Hays, 2012), Animals with Attributes 2 (AWA2) (Xian et al., 2018a), and Attribute Pascal and Yahoo (aPY) (Farhadi et al., 2009). All the datasets provide annotated attributes. The newly proposed splits of seen/unseen classes in (Xian et al., 2018a) are used for fair comparisons, which ensure that the test categories are strictly unseen by the pre-trained visual projection network (Russakovsky et al., 2015). The details of the datasets are listed in Table 1.
Table 2. Loss terms used by each baseline.
Setting  ${\mathcal{L}}_{svs}$  ${\mathcal{L}}_{sr}$  ${\mathcal{L}}_{ddc}$
S2V  $\surd $  -  -
DSP  $\surd $  $\surd $  -
DDC  $\surd $  -  $\surd $
DSEN (DSP+DDC)  $\surd $  $\surd $  $\surd $
Implementation details. The input images are resized to $480$ along the short side, with data augmentation of $448\times 448$ random cropping and horizontal flipping. The visual feature extraction network $f(\cdot )$ is based on the ResNet-101 architecture, which is pre-trained on the ImageNet dataset. The rest of the networks use the MSRA random initializer (He et al., 2016). In this work, we employ a two-stage training strategy to train the proposed DSEN. It first fixes $f(\cdot )$ and trains the rest with a large learning rate $lr=1\times {10}^{-3}$, and then it uses a small $lr=1\times {10}^{-5}$ to train the whole DSEN. The Adam optimizer is used with $\beta =(0.5,0.999)$ and weight decay $5\times {10}^{-5}$. For the hyperparameters in DSEN, we set ${\lambda}_{1}=5$ and ${\lambda}_{2}=1$ to balance ${\mathcal{L}}_{svs}$, ${\mathcal{L}}_{sr}$, and ${\mathcal{L}}_{ddc}$, and $\alpha =0.1$ in ${\mathcal{L}}_{ddc}$. The above hyperparameter settings are determined experimentally, and they are applied to all of our experimental datasets. $\tau $ will be analyzed in the ablation study.
Evaluation metrics. Similar to (Xian et al., 2018a), the harmonic mean ($H$) defined in Eq. (10) is used to evaluate a model:
(10)  $H={\displaystyle \frac{2\times {\mathit{MCA}}_{t}\times {\mathit{MCA}}_{s}}{{\mathit{MCA}}_{t}+{\mathit{MCA}}_{s}}},$ 
where MCA${}_{s}$ and MCA${}_{t}$ are the Mean Class top-1 Accuracy for the validation (seen) and testing (unseen) sets, respectively.
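As a worked instance of Eq. (10), the short sketch below computes $H$ for the full model in Table 3 (MCA${}_{t}=30.8$, MCA${}_{s}=62.7$) and for the S2V row (MCA${}_{t}=25.6$, MCA${}_{s}=56.6$). The function name is hypothetical, and the guard for a zero denominator is an added convention.

```python
def harmonic_mean(mca_t, mca_s):
    """Harmonic mean of unseen-domain and seen-domain accuracies, Eq. (10)."""
    if mca_t + mca_s == 0:
        return 0.0  # Convention chosen here for the degenerate case.
    return 2.0 * mca_t * mca_s / (mca_t + mca_s)

h_full = harmonic_mean(30.8, 62.7)
h_s2v = harmonic_mean(25.6, 56.6)
print(round(h_full, 1), round(h_s2v, 1))  # 41.3 35.3
```

Since the harmonic mean is dominated by the smaller of its two arguments, a model cannot obtain a high $H$ by performing well only on the seen domain.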
In the following parts, the experiments are mainly conducted under generalized ZSL settings, where the testing images come from either the seen or unseen domain.
Baselines. To demonstrate the effectiveness of different components in DSEN, three baselines are defined:

• S2V is a general semantic-visual structure with a shared projection function ${\varphi}_{c}$. The visual feature extraction function $f(\cdot )$ is fixed.

• DSP adds two extra domain-specific projections ${\varphi}_{s}$ and ${\varphi}_{t}$ to S2V.

• DDC applies the domain division constraint ${\mathcal{L}}_{ddc}$ to S2V, which makes the visual feature extractor $f(\cdot )$ trainable.
Finally, DSEN uses both domain-specific projections and ${\mathcal{L}}_{ddc}$ with a trainable $f(\cdot )$. The details of each baseline are listed in Table 2.
Table 3. Effects of ${\varphi}_{c}$, ${\varphi}_{s}$, and ${\varphi}_{t}$.
Baseline  ${\varphi}_{c}$  ${\varphi}_{s}$  ${\varphi}_{t}$  MCA${}_{t}$  MCA${}_{s}$
S2V  $\surd $  -  -  25.6  56.6
-  $\surd $  $\surd $  -  27.5  61.9
-  $\surd $  -  $\surd $  28.3  57.4
-  -  $\surd $  $\surd $  29.7  60.1
-  $\surd $  $\surd $  $\surd $  30.8  62.7
4.2. Ablation Studies
Effects of ${\varphi}_{c}$, ${\varphi}_{s}$, and ${\varphi}_{t}$. As the domain-specific projections consist of one domain-shared ${\varphi}_{c}$ and two domain-specific ${\varphi}_{s}$ and ${\varphi}_{t}$, we explore their effects by individually applying them to the baseline S2V. Table 3 shows the results. It is observed that adding ${\varphi}_{s}$ or ${\varphi}_{t}$ individually to ${\varphi}_{c}$ yields improvements of $5.3\%$ and $2.7\%$ in terms of MCA${}_{s}$ and MCA${}_{t}$, respectively. This indicates that the domain-specific projections effectively capture their characteristic domain information via the semantic reconstruction constraint ${\mathcal{L}}_{sr}$. Then, we further explore the effects of using totally separated ${\varphi}_{s}$ and ${\varphi}_{t}$ without ${\varphi}_{c}$. The results show a slight improvement of $1.5\%$ on MCA${}_{t}$. The reason is that, without ${\varphi}_{c}$, the connection between ${\varphi}_{s}$ and ${\varphi}_{t}$ is too weak to guarantee that the projected embedding features lie in the same embedding space. Finally, combining ${\varphi}_{s}$, ${\varphi}_{t}$, and ${\varphi}_{c}$ offers the best performance, which indicates that ${\varphi}_{c}$ successfully captures the domain similarities during semantic-visual projection. These experiments support the effectiveness of domain-specific projections in generating discriminative embedding features. In addition, random initialization of ${\varphi}_{t}$ yields relatively slow convergence.
Effects of domain-specific projections. As the domain-specific projections play a critical role in the proposed DSEN, we analyze the effect of applying domain-specific projections to different baselines. The related results are summarized in Table 4. We observe that applying domain-specific projections achieves better performance than using a single shared projection on all datasets, e.g., DSP and DSEN achieve $6.0\%$ and $1.9\%$ improvements over the S2V and DDC baselines in terms of $H$ on CUB, respectively. Table 4 further shows that the domain-specific projections improve the recognition performance in both seen and unseen domains by about $2\%\sim 6\%$ on CUB. These results demonstrate the effectiveness of the proposed domain-specific projections.
Comparison of fully supervised learning and ${\mathcal{L}}_{ddc}$. We further analyze the superiority of our domain division constraint ${\mathcal{L}}_{ddc}$ over fully supervised learning. The analysis is performed by using noisy pseudo visual data for supervised training. Given a ZSL model, we denote $\widehat{p}(f(x))$ as the maximum score of an image among seen categories. Thus, for an image from the seen domain, $\widehat{p}(f(x))$ should be large and attained at the true label. Conversely, for an image from the unseen domain, $\widehat{p}(f(x))$ should be small, indicating a nearly uniform label distribution. To this end, we compare $\widehat{p}(f(x))$ for all unseen domain samples by individually applying fully supervised learning and our ${\mathcal{L}}_{ddc}$ to the baseline DSP with noisy pseudo data. The results are reported in Figure 4.
From Figure 4, it can be observed that, compared to fully supervised learning, ${\mathcal{L}}_{ddc}$ improves the percentage of samples with a small $\widehat{p}(f(x))$ from approximately $50\%$ to $70\%$ on CUB, $24\%$ to $42\%$ on AWA2, and $30\%$ to $50\%$ on aPY. On SUN, the percentage of samples with a small $\widehat{p}(f(x))$ is improved from $22\%$ to $43\%$. Notably, more samples with a small $\widehat{p}(f(x))$ in Figure 4 mean that more unseen domain samples can be distinguished from the seen domain samples. Therefore, compared to fully supervised learning, ${\mathcal{L}}_{ddc}$ makes the embedding features of the two domains more distinguishable, which accounts for the performance gains of DSEN. Furthermore, it also shows that the softmax classifier $p$ can model the decision boundary between the two domains. Thus, using domain-specific classifiers is reasonable.
Effects of varying $\tau $ values. $\tau $ is a critical parameter to judge whether a testing sample is from the unseen domain. The results of varying $\tau $ values are shown in Figure 5. It can be found that, for the seen domain samples, a higher $\tau $ leads to a lower MCA${}_{s}$. The reason is that a higher $\tau $ may mistakenly send some seen domain samples to the ranking-based classifier, which degrades MCA${}_{s}$. Conversely, in the unseen domain, the larger $\tau $ is, the higher MCA${}_{t}$ our DSEN obtains. The reason is that a higher $\tau $ feeds a larger number of unseen domain samples to the ranking-based classifier, which is good at unseen domain categorization and consequently improves MCA${}_{t}$. As the metric $H$ combines MCA${}_{s}$ and MCA${}_{t}$, $H$ does not have a monotonic trend. With an increase of $\tau $, $H$ first increases to the optimal value and then drops. From Figure 5, it can be found that the optimal values for $\tau $ differ across datasets, e.g., the optimal values for $\tau $ are 0.8, 0.5, 0.9, and 0.8 for CUB, SUN, AWA2, and aPY, respectively.
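The trade-off described above suggests selecting $\tau $ by a simple sweep that maximizes $H$. The sketch below illustrates this with made-up accuracy pairs that mimic the described trend (MCA${}_{t}$ rising and MCA${}_{s}$ falling as $\tau $ grows); the numbers are not results from the paper, and the function names are hypothetical.

```python
def harmonic_mean(mca_t, mca_s):
    """Harmonic mean of unseen-domain and seen-domain accuracies."""
    if mca_t + mca_s == 0:
        return 0.0
    return 2.0 * mca_t * mca_s / (mca_t + mca_s)

def best_tau(taus, evaluate):
    """Return the threshold whose (MCA_t, MCA_s) pair gives the highest H.
    `evaluate` is any callable mapping tau to (mca_t, mca_s)."""
    return max(taus, key=lambda tau: harmonic_mean(*evaluate(tau)))

# Made-up validation results: tau -> (MCA_t, MCA_s).
toy_results = {0.5: (20.0, 70.0), 0.7: (35.0, 65.0), 0.9: (45.0, 40.0)}
selected = best_tau(sorted(toy_results), lambda tau: toy_results[tau])
print(selected)  # 0.7
```

In this toy sweep $H$ rises from about 31.1 to 45.5 and then falls to about 42.4, so the intermediate threshold is selected.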
Effects of domain division constraint ${\mathcal{L}}_{ddc}$. We then verify the effectiveness of the domain division constraint ${\mathcal{L}}_{ddc}$. As shown in Table 4, the baseline DDC obtains a higher harmonic mean ($H$) than S2V on all four datasets. For example, on the AWA2 dataset, DDC raises $H$ from $39.8\%$ to $61.0\%$ over S2V, which is mainly attributed to the significant $25.7\%$ improvement in $MC{A}_{t}$. Further, with the domain division constraint ${\mathcal{L}}_{ddc}$, DSEN obtains higher performance than DSP. These improvements confirm that ${\mathcal{L}}_{ddc}$ effectively makes the embedding features more distinguishable, thereby improving recognition in the challenging unseen domain.
Furthermore, the two domain-specific classifiers, based on ${\mathcal{L}}_{ddc}$, also contribute significantly to our performance. As shown in Table 4, the main improvements of DDC come from the high $MC{A}_{t}$ in the unseen domain, which shows that the reduced search space of the ranking-based classifier is an important factor in the performance improvements. In particular, on the CUB dataset, the $H$ of DDC is 12.9% higher than that of FGN (Xian et al., 2018b), which uses a single softmax classifier for both domains. This also supports the claim that using two domain-specific classifiers based on ${\mathcal{L}}_{ddc}$ is superior to using a single shared classifier.
Feature visualizations of DSEN. Figure 6 shows the t-SNE visualization of the visual features generated by DSEN on the CUB and AWA2 datasets. For each dataset, $10$ categories are randomly selected from the unseen domain. The results show that DSEN not only preserves the semantic relationship in the embedding space but also obtains large inter-class discrimination. This is attributed to DSP, which captures accurate domain differences, and DDC, which enlarges the domain difference.
4.3. Comparison with existing methods
Comparison with generalized zero-shot learning. Table 4 compares DSEN with previous methods on generalized ZSL. As shown in Table 4, our DSEN significantly outperforms existing methods on four datasets, e.g., DSEN obtains $15.0\%$, $1\%$, $3.5\%$, and $16.8\%$ improvement in terms of metric $H$ on CUB, SUN, AWA2, and aPY, respectively.
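Table 4. Comparison with previous methods on generalized ZSL. Datasets: CUB (Welinder et al., 2010), SUN (Patterson and Hays, 2012), AWA2 (Xian et al., 2018a), and aPY (Farhadi et al., 2009). All values are in %; "-" marks results that are not reported.

| Type | Methods | CUB MCA${}_{t}$ | CUB MCA${}_{s}$ | CUB $H$ | SUN MCA${}_{t}$ | SUN MCA${}_{s}$ | SUN $H$ | AWA2 MCA${}_{t}$ | AWA2 MCA${}_{s}$ | AWA2 $H$ | aPY MCA${}_{t}$ | aPY MCA${}_{s}$ | aPY $H$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NG | CMT (Socher et al., 2013) | 7.2 | 49.8 | 12.6 | 8.1 | 21.8 | 11.8 | 0.5 | 90.0 | 1.0 | 1.4 | 85.2 | 2.8 |
| NG | SYNC (Changpinyo et al., 2016) | 11.5 | 70.9 | 19.8 | 7.9 | 43.3 | 13.4 | 10.0 | 90.5 | 18.0 | 7.4 | 66.3 | 13.3 |
| NG | SAE (Kodirov et al., 2017) | 7.8 | 54.0 | 13.6 | 8.8 | 18.0 | 11.8 | 1.1 | 82.2 | 2.2 | 0.4 | 80.9 | 0.9 |
| NG | KL (Zhang and Koniusz, 2018) | 19.9 | 52.5 | 28.9 | 19.8 | 29.1 | 23.6 | 17.6 | 80.9 | 29.0 | 11.9 | 76.3 | 20.5 |
| NG | PTZSL (Long et al., 2018) | 23.0 | 51.6 | 31.8 | 19.0 | 32.7 | 24.0 | - | - | - | 15.4 | 71.3 | 25.4 |
| NG | CDL (Jiang et al., 2018) | 23.5 | 55.2 | 32.9 | 21.5 | 34.7 | 26.5 | - | - | - | 19.8 | 48.6 | 28.1 |
| NG | PSRZSL (Annadani and Biswas, 2018) | 24.6 | 54.3 | 33.9 | 20.8 | 37.2 | 26.7 | 20.7 | 73.8 | 32.2 | 13.5 | 51.4 | 21.4 |
| NG | SPAEN (Chen et al., 2018) | 34.7 | 70.6 | 46.6 | 24.9 | 38.6 | 30.3 | 23.3 | 90.9 | 37.1 | 13.7 | 63.4 | 22.6 |
| G | SEZSL (Kumar Verma et al., 2018) | 41.5 | 53.3 | 46.7 | 40.9 | 30.5 | 34.9 | 58.3 | 68.1 | 62.8 | - | - | - |
| G | FGN (Xian et al., 2018b) | 43.7 | 57.7 | 49.7 | 42.6 | 36.6 | 39.4 | - | - | - | - | - | - |
|  | S2V | 25.6 | 56.6 | 35.3 | 20.1 | 35.3 | 26.2 | 25.6 | 88.9 | 39.8 | 15.5 | 73.6 | 25.7 |
|  | DSP | 30.8 | 62.7 | 41.3 | 30.0 | 40.3 | 34.4 | 31.2 | 87.9 | 46.1 | 18.1 | 73.1 | 29.0 |
|  | DDC | 57.1 | 69.2 | 62.6 | 40.1 | 39.2 | 39.6 | 51.3 | 75.2 | 61.0 | 30.9 | 44.9 | 36.6 |
|  | DSEN | 59.1 | 71.1 | 64.5 | 39.4 | 41.4 | 40.4 | 56.4 | 80.4 | 66.3 | 31.6 | 52.1 | 39.4 |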

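Table 5. Comparison with previous methods on conventional ZSL. All values are in %; "-" marks results that are not reported.

| Methods | CUB | SUN | AWA2 | aPY |
|---|---|---|---|---|
| CAV (Zhang et al., 2017) | 52.1 | 61.7 | 65.8 | - |
| FGN (Xian et al., 2018b) | 61.5 | 62.1 | - | - |
| SEZSL (Kumar Verma et al., 2018) | 59.6 | 63.4 | 69.2 | - |
| PSRZSL (Annadani and Biswas, 2018) | 56.0 | 61.4 | 63.8 | 38.4 |
| CDL (Jiang et al., 2018) | 54.5 | 63.6 | - | 43.0 |
| SPAEN (Chen et al., 2018) | 55.4 | 59.2 | 58.5 | 24.1 |
| LDF (Li et al., 2018) | 70.4 | - | - | - |
| S2V | 52.4 | 58.2 | 65.8 | 40.5 |
| DSP | 56.2 | 62.6 | 69.1 | 41.7 |
| DDC | 71.8 | 64.0 | 71.2 | 43.1 |
| DSEN | 71.8 | 62.2 | 72.3 | 43.5 |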
To evaluate the effectiveness of domain-specific projections, we compare the DSP baseline with two representative methods (Annadani and Biswas, 2018; Jiang et al., 2018), both of which employ a single shared semantic-visual projection. From Table 4, we see that the DSP baseline outperforms both methods on all four datasets in terms of metric $H$. This demonstrates the superiority of our domain-specific projections over a single shared semantic projection. Compared with (Annadani and Biswas, 2018; Jiang et al., 2018), the other advantage of DSEN is that it makes the visual features more discriminative. In this work, we define the domain shift degree as ${\mathit{MCA}}_{s}-{\mathit{MCA}}_{t}$. Taking CUB as an example, the domain shift degrees of PSRZSL (Annadani and Biswas, 2018) and CDL (Jiang et al., 2018) are both about $30\%$ ($29.7\%$ and $31.7\%$ from Table 4, respectively), whereas our DSEN has a domain shift degree of only $12\%$. This low domain shift degree indicates that our domain-specific projections can generate domain-robust embedding features.
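The two quantities used in this comparison can be recomputed from the CUB columns of Table 4. The sketch below assumes that $H$ is the standard harmonic mean of MCA${}_{s}$ and MCA${}_{t}$ and reads the domain shift degree as the difference MCA${}_{s}$ minus MCA${}_{t}$, which matches the $12\%$ quoted for DSEN; the function names are illustrative only.

```python
def harmonic_mean(mca_s, mca_t):
    """Harmonic mean H of the seen and unseen mean class accuracies (in %)."""
    return 2 * mca_s * mca_t / (mca_s + mca_t)

def domain_shift_degree(mca_s, mca_t):
    """Gap between seen and unseen accuracy (assumed to be a plain difference)."""
    return mca_s - mca_t

# (MCA_t, MCA_s) on CUB, copied from Table 4.
cub = {
    "PSRZSL": (24.6, 54.3),
    "CDL": (23.5, 55.2),
    "DSEN": (59.1, 71.1),
}

for name, (mca_t, mca_s) in cub.items():
    h = round(harmonic_mean(mca_s, mca_t), 1)
    shift = round(domain_shift_degree(mca_s, mca_t), 1)
    print(name, h, shift)
```

Because the table entries are already rounded to one decimal, a recomputed $H$ may differ from the reported value in the last digit.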
Different from the embedding-based PSRZSL and CDL, SEZSL (Kumar Verma et al., 2018) and FGN (Xian et al., 2018b) obtain state-of-the-art performance by alleviating the domain shift problem with synthetic visual data in the unseen domain. However, they both employ the widely used fully supervised learning, which can degrade recognition performance on real seen-domain data, e.g., FGN (Xian et al., 2018b) suffers a 10% drop in ${\mathit{MCA}}_{s}$ with synthetic data. Instead, the ${\mathcal{L}}_{ddc}$ used in our DDC can reduce the influence of noisy synthesized data. For example, on the CUB dataset, DDC obtains an MCA${}_{s}$ of 69.2%, which is higher than the 57.7% and 53.3% of FGN and SEZSL, respectively. As a consequence, the high MCA${}_{t}$ and MCA${}_{s}$ allow DSEN to obtain the highest $H$ on all four datasets, which also demonstrates its effectiveness in generalized ZSL.
Comparison with conventional zero-shot learning. The comparison under the conventional ZSL setting, where the test images come only from the unseen domain, is shown in Table 5. Notably, the conventional ZSL setting is easier than generalized ZSL because it ignores the search space of the seen domain. From Table 5, we can observe that the proposed DSEN obtains the best performance on CUB, AWA2, and aPY, and that DDC outperforms the existing methods on all four datasets. This indicates that the powerful and discriminative visual representations produced by the end-to-end trainable visual network are significant. Furthermore, compared to the $MC{A}_{t}$ in Table 4 under generalized zero-shot learning, the four baselines all obtain higher performance. The reason is that, in conventional ZSL, the domain of each test image is known in advance, which mitigates the projection domain shift problem.
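The effect of the smaller search space can be illustrated with a toy nearest-prototype classifier. This is a stand-in chosen for brevity, not the ranking-based classifier of DSEN: the cosine-similarity rule, the 2-D prototypes, and the class names are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict(feature, prototypes, candidates):
    """Return the candidate class whose prototype is most similar to `feature`."""
    return max(candidates, key=lambda c: cosine(feature, prototypes[c]))

# Made-up class prototypes in a 2-D embedding space.
prototypes = {
    "seen_a": [1.0, 0.0],
    "seen_b": [0.0, 1.0],
    "unseen_c": [0.8, 0.6],
    "unseen_d": [-1.0, 0.2],
}
seen = ["seen_a", "seen_b"]
unseen = ["unseen_c", "unseen_d"]

# An image of class "unseen_c" whose embedding is biased toward "seen_a".
x = [0.95, 0.2]

print(predict(x, prototypes, seen + unseen))  # generalized ZSL: all classes compete
print(predict(x, prototypes, unseen))         # conventional ZSL: unseen classes only
```

In this toy case the biased embedding is pulled to a seen class when all classes compete, but is classified correctly once the candidates are restricted to the unseen domain.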
Discussion. As shown in Table 4, DSEN achieves impressive improvements on CUB, aPY, and AWA2, but not a comparably large improvement on the SUN dataset. The reason is that the large number of categories in SUN makes it hard to generate good visual features from low-dimensional semantic attributes. More specifically, FGN uses a GAN to generate synthetic visual features for the unseen domain, which is much more powerful than our two-layer generator. With a total of 717 categories in SUN, the gap between the two generators is hard to close with domain-specific projections and classifiers. Nevertheless, DSEN still obtains a slightly higher $H$ than FGN, owing to a clearly higher $MC{A}_{s}$ in the seen domain, which indicates the robustness of DSEN.
5. Conclusion
To address the domain shift problem in generalized zero-shot learning, we propose a novel Domain-Specific Embedding Network that applies specific projections to the seen and unseen domains based on domain characteristics. In contrast to existing methods that use a single shared projection, we demonstrate that domain-specific projections can better capture domain similarities and differences, leading to more robust embedding features. To avoid a domain-separated embedding space, a semantic reconstruction constraint is designed that uses semantic labels to associate the two specific projections in a cycle-consistent way. Furthermore, a domain division constraint is developed to make the generated embedding features more distinguishable. Experiments on four benchmarks demonstrate the effectiveness of the proposed method.
In the future, more powerful generators, e.g., GANs, will be explored to provide more reliable synthetic visual representations. Also, domain-specific projection architectures will be explored using AutoML, which may yield further improvements.
6. Acknowledgement
This work is supported by the National Key Research and Development Program of China (2017YFC0820600), the National Defense Science and Technology Fund for Distinguished Young Scholars (2017JCJQZQ022), the National Natural Science Foundation of China (61525206, 61771468, 61622211, 61620106009), the Youth Innovation Promotion Association Chinese Academy of Sciences (2017209), the National Postdoctoral Programme for Innovative Talents (BX20180358), and the Fundamental Research Funds for the Central Universities (WK2100100030).
References
 Akata et al. (2013) Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. 2013. Label-embedding for attribute-based classification. In 2013 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 819–826.
 Akata et al. (2016) Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. 2016. Label-embedding for image classification. IEEE transactions on pattern analysis and machine intelligence 38, 7 (2016), 1425–1438.
 Akata et al. (2015) Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. 2015. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2927–2936.
 Annadani and Biswas (2018) Yashas Annadani and Soma Biswas. 2018. Preserving Semantic Relations for Zero-Shot Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7603–7612.
 Bucher et al. (2017) Maxime Bucher, Stéphane Herbin, and Frédéric Jurie. 2017. Generating visual representations for zero-shot classification. In Proceedings of the IEEE International Conference on Computer Vision. 2666–2673.
 Changpinyo et al. (2016) Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. 2016. Synthesized classifiers for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5327–5336.
 Chen et al. (2018) Long Chen, Hanwang Zhang, Jun Xiao, Wei Liu, and Shih-Fu Chang. 2018. Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2.
 Fang et al. (2018) Shancheng Fang, Hongtao Xie, Zheng-Jun Zha, Nannan Sun, Jianlong Tan, and Yongdong Zhang. 2018. Attention and language ensemble for scene text recognition with convolutional sequence modeling. In 2018 ACM Multimedia Conference on Multimedia Conference. ACM, 248–256.
 Farhadi et al. (2009) Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. 2009. Describing objects by their attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1778–1785.
 Frome et al. (2013) Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. 2013. Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems. 2121–2129.
 Fu et al. (2014) Yanwei Fu, Timothy M Hospedales, Tao Xiang, Zhenyong Fu, and Shaogang Gong. 2014. Transductive multi-view embedding for zero-shot recognition and annotation. In European Conference on Computer Vision. Springer, 584–599.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems. 2672–2680.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
 He and Peng (2018) Xiangteng He and Yuxin Peng. 2018. Only Learn One Sample: Fine-Grained Visual Categorization with One Sample Training. In 2018 ACM Multimedia Conference on Multimedia Conference. ACM, 1372–1380.
 Jiang et al. (2018) Huajie Jiang, Ruiping Wang, Shiguang Shan, and Xilin Chen. 2018. Learning class prototypes via structure alignment for zero-shot recognition. In Proceedings of the European conference on computer vision. 118–134.
 Jiang et al. (2017) Huajie Jiang, Ruiping Wang, Shiguang Shan, Yi Yang, and Xilin Chen. 2017. Learning discriminative latent attributes for zero-shot classification. In Proceedings of the IEEE International Conference on Computer Vision. 4223–4232.
 Kodirov et al. (2015) Elyor Kodirov, Tao Xiang, Zhenyong Fu, and Shaogang Gong. 2015. Unsupervised domain adaptation for zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision. 2452–2460.
 Kodirov et al. (2017) Elyor Kodirov, Tao Xiang, and Shaogang Gong. 2017. Semantic autoencoder for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3174–3183.
 Kumar Verma et al. (2018) Vinay Kumar Verma, Gundeep Arora, Ashish Mishra, and Piyush Rai. 2018. Generalized zero-shot learning via synthesized examples. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4281–4289.
 Lampert et al. (2009) Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. 2009. Learning to detect unseen object classes by between-class attribute transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 951–958.
 Lampert et al. (2014) Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. 2014. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 3 (2014), 453–465.
 Lazaridou et al. (2015) Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. 2015. Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In the 7th International Joint Conference on Natural Language Processing, Vol. 1. 270–280.
 Li et al. (2018) Yan Li, Junge Zhang, Jianguo Zhang, and Kaiqi Huang. 2018. Discriminative Learning of Latent Features for Zero-Shot Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7463–7471.
 Long et al. (2018) Teng Long, Xing Xu, Youyou Li, Fumin Shen, Jingkuan Song, and Heng Tao Shen. 2018. Pseudo transfer with marginalized corrupted attribute for zero-shot learning. In 2018 ACM international conference on Multimedia. ACM, 1802–1810.
 Mishra et al. (2018) Ashish Mishra, Shiva Krishna Reddy, Anurag Mittal, and Hema A Murthy. 2018. A generative model for zero shot learning using conditional variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2188–2196.
 Morgado and Vasconcelos (2017) Pedro Morgado and Nuno Vasconcelos. 2017. Semantically consistent regularization for zero-shot recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 9. 10.
 Niu et al. (2017) Yulei Niu, Zhiwu Lu, Songfang Huang, Xin Gao, and Ji-Rong Wen. 2017. FeaBoost: Joint Feature and Label Refinement for Semantic Segmentation. In AAAI. 1474–1480.
 Palatucci et al. (2009) Mark Palatucci, Dean Pomerleau, Geoffrey E Hinton, and Tom M Mitchell. 2009. Zero-shot learning with semantic output codes. In Advances in neural information processing systems. 1410–1418.
 Patterson and Hays (2012) Genevieve Patterson and James Hays. 2012. Sun attribute database: Discovering, annotating, and recognizing scene attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2751–2758.
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing. 1532–1543.
 Qiao et al. (2016) Ruizhi Qiao, Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. 2016. Less is more: zero-shot learning from online textual documents with noise suppression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2249–2257.
 Radovanović et al. (2010) Miloš Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović. 2010. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research 11, Sep (2010), 2487–2531.
 Romera-Paredes and Torr (2015) Bernardino Romera-Paredes and Philip Torr. 2015. An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning. 2152–2161.
 Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
 Shigeto et al. (2015) Yutaro Shigeto, Ikumi Suzuki, Kazuo Hara, Masashi Shimbo, and Yuji Matsumoto. 2015. Ridge regression, hubness, and zero-shot learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 135–151.
 Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
 Socher et al. (2013) Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. 2013. Zero-shot learning through cross-modal transfer. In Advances in neural information processing systems. 935–943.
 Song et al. (2018) Jie Song, Chengchao Shen, Yezhou Yang, Yang Liu, and Mingli Song. 2018. Transductive Unbiased Embedding for Zero-Shot Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1024–1033.
 Tomasev et al. (2014) Nenad Tomasev, Milos Radovanovic, Dunja Mladenic, and Mirjana Ivanovic. 2014. The role of hubness in clustering high-dimensional data. IEEE transactions on knowledge and data engineering 26, 3 (2014), 739–751.
 Wang et al. (2019) Chaojie Wang, Bo Chen, Sucheng Xiao, and Mingyuan Zhou. 2019. Convolutional Poisson Gamma Belief Network. In ICML.
 Wang et al. (2018) Chaojie Wang, Bo Chen, and Mingyuan Zhou. 2018. Multimodal Poisson gamma belief network. In Thirty-Second AAAI Conference on Artificial Intelligence.
 Welinder et al. (2010) Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. 2010. Caltech-UCSD birds 200. (2010).
 Xian et al. (2016) Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh Nguyen, Matthias Hein, and Bernt Schiele. 2016. Latent embeddings for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 69–77.
 Xian et al. (2018a) Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. 2018a. Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. IEEE transactions on pattern analysis and machine intelligence (2018).
 Xian et al. (2018b) Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. 2018b. Feature generating networks for zero-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5542–5551.
 Xie et al. (2019) Hongtao Xie, Dongbao Yang, Nannan Sun, Zhineng Chen, and Yongdong Zhang. 2019. Automated pulmonary nodule detection in CT images using deep convolutional neural networks. Pattern Recognition 85 (2019), 109–119.
 Yang et al. (2016) Yang Yang, Yadan Luo, Weilun Chen, Fumin Shen, Jie Shao, and Heng Tao Shen. 2016. Zero-shot hashing via transferring supervised knowledge. In Proceedings of the 24th ACM international conference on Multimedia. ACM, 1286–1295.
 Zhang and Koniusz (2018) Hongguang Zhang and Piotr Koniusz. 2018. Zero-shot kernel learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7670–7679.
 Zhang et al. (2017) Li Zhang, Tao Xiang, and Shaogang Gong. 2017. Learning a deep embedding model for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2021–2030.
 Zheng et al. (2018) Feng Zheng, Xin Miao, and Heng Huang. 2018. Fast vehicle identification via ranked semantic sampling based embedding. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press, 3697–3703.