Hyperbolic Image Embeddings

  • 2019-04-03 21:10:12
  • Valentin Khrulkov, Leyla Mirvakhabova, Evgeniya Ustinova, Ivan Oseledets, Victor Lempitsky
  • 49

Abstract

Computer vision tasks such as image classification, image retrieval andfew-shot learning are currently dominated by Euclidean and sphericalembeddings, so that the final decisions about class belongings or the degree ofsimilarity are made using linear hyperplanes, Euclidean distances, or sphericalgeodesic distances (cosine similarity). In this work, we demonstrate that inmany practical scenarios hyperbolic embeddings provide a better alternative.

 

Quick Read (beta)

Hyperbolic Image Embeddings

Valentin Khrulkov1Leyla Mirvakhabova1Evgeniya Ustinova1
Ivan Oseledets1,2Victor Lempitsky1,3

Skolkovo Institute of Science and Technology1
Institute of Numerical Mathematics of the Russian Academy of Sciences2
Samsung AI Center, Moscow3
{valentin.khrulkov,leyla.mirvakhabova,evgeniya.ustinova,i.oseledets}@skoltech.ru
[email protected]
Abstract

Computer vision tasks such as image classification, image retrieval and few-shot learning are currently dominated by Euclidean and spherical embeddings, so that the final decisions about class belongings or the degree of similarity are made using linear hyperplanes, Euclidean distances, or spherical geodesic distances (cosine similarity). In this work, we demonstrate that in many practical scenarios hyperbolic embeddings provide a better alternative.

\useunder

\ul

1 Introduction

Figure 1: An example of two–dimensional Poincaré embeddings computed by a hyperbolic neural network trained on MNIST, and evaluated additionally on Omniglot. Ambiguous and unclear images from MNIST, as well as most of the images from Omniglot are embedded near the center, while samples with clear class labels (or characters from Omniglot similar to one of the digits) lie near the boundary. *For inference Omniglot was normalized to have the same background color as MNIST. Omniglot images are marked with black crosses, MNIST images with colored dots.

High-dimensional embeddings are ubiquitous in modern computer vision. Many, perhaps most, modern computer vision systems learn non-linear mappings (in the form of deep convolutional networks) from the space of images or image fragments into high-dimensional spaces. The operations at the end of deep networks imply a certain type of geometry of the embedding spaces. For example, image classification networks [14, 17] use linear operators (matrix multiplication) to map embeddings in the penultimate layer to class logits. The class boundaries in the embedding space are thus piecewise-linear, and pairs of classes are separated by Euclidean hyperplanes. The embeddings learned by the model in the penultimate layer, therefore, live in the Euclidean space. The same can be said about systems where Euclidean distances are used to perform image retrieval [22, 32, 43], face recognition [23, 42] or one-shot learning [31].

Alternatively, some few-shot learning [39], face recognition [30] and person re-identification methods [38, 44] learn spherical embeddings, so that sphere projection operator is applied at the end of a network that computes the embeddings. Cosine similarity (closely associated with sphere geodesic distance) is then used by such architectures to match images.

Euclidean spaces with their zero curvature and spherical spaces with their positive curvature have certain profound implications on the nature of embeddings that existing computer vision systems can learn. In this work, we argue that hyperbolic spaces with negative curvature might often be more appropriate for learning embedding of images. Towards this end, we add the recently-proposed hyperbolic network layers [7] to the end of several computer vision networks, and present a number of experiments corresponding to image classification, image retrieval, one-shot, and few-shot learning and person re-identification. We show that in many cases, the use of hyperbolic geometry improves the performance over Euclidean or spherical embeddings.

Our work is inspired by the recent body of works that demonstrate the advantage of learning hyperbolic embeddings for language entities such as taxonomy entries [20], common words [36], phrases [5] and for other NLP tasks, such as neural machine translation  [8]. Our results imply that hyperbolic spaces may be as valuable for improving the performance of computer vision systems.

Motivation for hyperbolic image embeddings.

The use of hyperbolic spaces in natural language processing is motivated by their natural ability to embed hierarchies (e.g., tree graphs) with low distortion  [29]. Hierarchies are ubiquitous in natural language processing. First, there are natural hierarchies corresponding to, e.g., biological taxonomies and linguistic ontologies. Likewise, a more generic short phrase can have many plausible continuations and is therefore semantically-related to a multitude of long phrases that are not necessarily closely related to each other (in the semantic sense). The innate suitability of hyperbolic spaces to embedding hierarchies [27, 29] explains the success of such spaces in natural language processing  [20].

Figure 2: In many computer vision tasks, we want to learn image embeddings that obey the hierarchical constraints as shown above. E.g., in image retrieval (left), we may want to learn embeddings that put nearby images of the whole scene and its fragments, while separating images of non-overlapping fragments. In face recognition (right), we may want to put blurry/ambiguous face images close to sharp images of people to which they may correspond, while face images of different people (even lookalikes that are indistinguishable in blurry photos) have to be separated. Hyperbolic spaces are more suitable for embedding data under such hierarchical constraints than Euclidean or spherical ones.

Here, we argue that similar hierarchical relations between images are common in computer vision tasks (Figure 2). One can observe the following example cases:

  • In image retrieval, an overview photograph is related to many images that correspond to the close-ups of different distinct details. Likewise, for classification tasks in-the-wild, an image containing the representatives of multiple classes is related to images that contain representatives of the classes in isolation. Embedding a dataset that contains composite images into continuous space is therefore similar to embedding a hierarchy.

  • In some tasks, more generic images may correspond to images that contain less information and are therefore more ambiguous. E.g., in face recognition, a blurry and/or low-resolution face image taken from afar can be related to many high-resolution images of faces that clearly belong to distinct people. Again natural embeddings for image datasets that have widely varying image quality/ambiguity calls for retaining such hierarchical structure.

  • Many of the natural hierarchies investigated in natural language processing transcend to visual domain. E.g., the visual concepts of different animal species have the same natural hierarchical groupings stemming from the similarity of their genotypes (most felines share visual similarity while being visually distinct from pinnipeds). Hyperbolic spaces thus remain a natural choice for embedding biological taxonomies irrespective of whether we deal with text documents or images.

In order to build deep learning models which operate on the embeddings to hyperbolic spaces, we capitalize on recent developments  [7], which construct the analogues of familiar layers (such as a feed–forward layer, or a multinomial regression layer) in hyperbolic spaces. We show that many standard architectures used for tasks of image classification, and in particular in the few–shot learning setting can be easily modified to operate on hyperbolic embeddings, which in many cases also leads to their improvement.

2 Poincaré ball model

Formally, n-dimensional hyperbolic space denoted as n is defined as the homogeneous, simply connected n-dimensional Riemannian manifold of constant negative sectional curvature. The property of constant negative curvature makes it analogous to the ordinary Euclidean sphere (which has constant positive curvature), however, the geometrical properties of the hyperbolic space are very different. It is known that hyperbolic space cannot be isometrically embedded into Euclidean space  [13, 18], but there exist several well–studied models of hyperbolic geometry. In every model a certain subset of Euclidean space is endowed with a hyperbolic metric, however, all these models are isomorphic to each other and we may easily move from one to another base on where the formulas of interest are easier. We follow the majority of NLP works and use the Poincaré ball model (see Figure 3). Investigating the alternative models that might provide better numerical stability remain future work (though already started in the NLP community [21, 28]). Here, we provide a very short summary of the model.

The Poincaré ball model (𝔻n,g𝔻) is defined by the manifold 𝔻n={𝐱n:𝐱<1} endowed with the Riemannian metric g𝔻(𝐱)=λ𝐱2gE, where λ𝐱=21-𝐱2 is the conformal factor and gE is the Euclidean metric tensor gE=𝐈n. In this model the geodesic distance between two points is given by the following expression:

d𝔻(𝐱,𝐲)=arccosh(1+2𝐱-𝐲2(1-𝐱2)(1-𝐲2)). (1)
Figure 3: Visualization of the two–dimensional Poincaré ball. Point 𝐳 represents the Möbius sum of points 𝐱 and 𝐲. HypAve stands for hyperbolic averaging, and operation useful for computing the mean of a point cluster. Gray lines represent geodesics, curves of shortest length connecting two points. In order to specify the hyperbolic hyperplanes (bottom), used for multiclass logistic regression, one has to provide an origin point 𝐩 and a normal vector 𝐚T𝐩𝔻2{𝟎}. For more details on hyperbolic operations see Section 4.

In order to define the hyperbolic average, we will make use of the Klein model of hyperbolic space. Similarly to the Poincaré model, it is defined on the set 𝕂n={𝐱n:𝐱<1}, however, with a different metric, not relevant for further discussion. In Klein coordinates, the hyperbolic average (generalizing the usual Euclidean mean) takes the most simple form, and we present the necessary formulas in Section 4.

From the viewpoint of hyperbolic geometry, all points of Poincaré ball are equivalent. The models that we consider below are, however, hybrid in the sense that most layers use Euclidean operators, such as standard generalized convolutions, while only the final layers operate within the hyperbolic geometry framework. The hybrid nature of our setups makes the origin a special point, since from the Euclidean viewpoint the local volumes in Poincare ball expand exponentially from the origin to the boundary. This leads to the useful tendency of the learned embeddings to place more generic/ambiguous objects closer to the origin, while moving more specific objects towards the boundary. The distance to the origin in our models therefore provides a natural estimate of uncertainty, that can be used in several ways, as we show below.

3 Related work

Hyperbolic language embeddings

Hyperbolic embeddings in the natural language processing field have recently been very successful [20, 21]. They are motivated by the innate ability of hyperbolic spaces to embed hierarchies (e.g.,  tree graphs) with low distortion [28, 29]. The main result in this area states that any tree can be embedded into (two dimensional) hyperbolic space with arbitrarily low distortion. Another direction of research, more relevant to the present work is based on imposing hyperbolic structure on activations of neural networks [7, 8].

Natural language naturally gives rises to hierarchies, as, for instance,  some words are more specific while others are more generic. We argue that in-the-wild computer vision datasets also have more or less specific images that have a varying amount of information (e.g.,  middle images within triplets in Figure 2 are more generic/less specific).

Few–shot learning

The task of few–shot learning, which has recently attracted a lot of attention, is concerned with the overall ability of the model to generalize to unseen data during training. A body of papers devoted to few–shot classification that focuses on metric learning methods includes Siamese Networks  [12], Matching Networks  [39], Prototypical Networks  [31], Relation Networks  [35]. In contrast, other models apply meta-learning to few-shot learning: e.g., MAML by  [6], Meta-Learner LSTM by  [24], SNAIL by  [19]. While these methods employ either Euclidean or spherical geometries (like in  [39]), there is no model extension to hyperbolic space.

Person re-identification

The task of person re-identification is to match pedestrian images captured by possibly non-overlapping surveillance cameras. Papers [1, 9, 41] adopt the pairwise models that accept pairs of images and output their similarity scores. The resulting similarity scores are used to classify the input pairs as being matching or non-matching. Another popular direction of work includes approaches that aim at learning a mapping of the pedestrian images to the Euclidean descriptor space. Several papers, e.g., [34, 44] use verification loss functions based on the Euclidean distance or cosine similarity. A number of methods utilize a simple classification approach for training [3, 33, 11, 45], and Euclidean distance is used in test time. Some methods use the combination of the aforementioned approaches: [4] adopts verification losses that are applied to the similarities predicted by the special sub-network instead of the simple Euclidean similarities.

4 Hyperbolic neural networks

In this section, we remind on recent work introducing hyperbolic neural networks  [7]. Hyperbolic networks are extensions of conventional neural networks in a sense that they generalize typical neural network operations to those in hyperbolic space using the formalism of Möbius gyrovector spaces. In this paper, the authors present the hyperbolic versions of feed-forward networks, multinomial logistic regression, and recurrent neural networks.

Further in this section, we briefly discuss the hyperbolic functions and layers crucial for the understanding of hyperbolic neural networks used in the remainder of the paper. Similarly to the paper  [7], we use an additional hyperparameter c corresponding to the radius of the Poincaré ball, which is then defined in the following manner: 𝔻cn={𝐱n:c𝐱2<1,c0}. The corresponding conformal factor is then modified as λ𝐱c=21-c𝐱2. In practice, the choice of c allows one to balance between hyperbolic and Euclidean geometries, which is made precise by noting that with c0 all the formulas discussed below take their usual Euclidean form.

Möbius addition

For a pair 𝐱,𝐲𝔻cn, the Möbius addition is defined as follows:

𝐱c𝐲(1+2c𝐱,𝐲+c𝐲2)𝐱+(1-c𝐱2)𝐲1+2c𝐱,𝐲+c2𝐱2𝐲2. (2)

Distance

The induced distance function is defined as

dc(𝐱,𝐲)2carctanh(c-𝐱c𝐲). (3)

Note that with c=1 one recovers the geodesic distance (1), while with c0 we obtain the Euclidean distance limc0dc(𝐱,𝐲)=2𝐱-𝐲.

Exponential and logarithmic maps

To perform operations in the hyperbolic space, one first needs to define a bijective map from n to 𝔻cn in order to map Euclidean vectors to the hyperbolic space, and vice versa. The so–called exponential and (inverse to it) logarithmic map serve as such a bijection.

The exponential map exp𝐱c is a function from T𝐱𝔻cnn to 𝔻cn, which is given by

exp𝐱c(𝐯)𝐱c(tanh(cλ𝐱c𝐯2)𝐯c𝐯). (4)

The inverse logarithmic map is defined as

log𝐱c(𝐲)2cλ𝐱carctanh(c-𝐱c𝐲)-𝐱c𝐲-𝐱c𝐲. (5)

In practice, we use the maps exp𝟎c and log𝟎c for transition between the Euclidean and Poincaré ball representations of a vector.

Linear layer

Assume we have a standard (Euclidean) linear layer 𝐱M𝐱+𝐛. In order to generalize it, one needs to define the Möbius matrix by vector product:

Mc(𝐱)1ctanh(M𝐱𝐱arctanh(c𝐱))M𝐱M𝐱, (6)

if M𝐱𝟎, and Mc(𝐱)𝟎 otherwise. Finally, for a bias vector 𝐛𝔻cn the operation underlying the hyperbolic linear layer is then given by Mc(𝐱)c𝐛.

Concatenation of input vectors

In several architectures (e.g., in siamese networks), it is needed to concatenate two vectors; such operation is obvious in Euclidean space. However, straightforward concatenation of two vectors from hyperbolic space does not necessarily remain in hyperbolic space. Thus, we have to use a generalized version of the concatenation operation, which is then defined in the following manner. For 𝐱𝔻cn1, 𝐲𝔻cn2 we define the mapping Concat:𝔻cn1×𝔻cn2𝔻cn3 as follows.

Concat(𝐱,𝐲)=M1c𝐱cM2c𝐲, (7)

where M1 and M2 are trainable matrices of sizes n3×n1 and n3×n2 correspondingly. The motivation for this definition is simple: usually, the Euclidean concatenation layer is followed by a linear map, which when written explicitly takes the (Euclidean) form of Equation (7).

Hyperbolic averaging

Another important operation common in image processing is averaging of feature vectors, used, e.g., in prototypical networks for few–shot learning  [31]. In the Euclidean setting this operation takes the form (𝐱1,,𝐱N)1Ni𝐱i. Extension of this operation to hyperbolic spaces is called the Einstein midpoint and takes the most simple form in Klein coordinates:

HypAve(𝐱1,,𝐱N)=i=1Nγi𝐱i/i=1Nγi, (8)

where γi=11-c𝐱i2 are the Lorentz factors. Recall from the discussion in Section 2 that the Klein model is supported on the same space as the Poincaré ball, however the same point has different coordinate representations in these models. Let 𝐱𝔻 and 𝐱𝕂 denote the coordinates of the same point in the Poincaré and Klein models correspondingly. Then the following transition formulas hold.

𝐱𝔻 =𝐱𝕂1+1-c𝐱𝕂2, (9)
𝐱𝕂 =2𝐱𝔻1+c𝐱𝔻2. (10)

Thus, given points in the Poincaré ball we can first map them to the Klein model, compute the average using Equation (8), and then move it back to the Poincaré model.

Multiclass logistic regression (MLR)

In our experiments, to perform the multiclass classification, we take advantage of the generalization of multiclass logistic regression to hyperbolic spaces. The idea of this generalization is based on the observation that in Euclidean space logits can be represented as the distances to certain hyperplanes, where each hyperplane can be specified with a point of origin and a normal vector. The same construction can be used in the Poincaré ball after a suitable analogue for hyperplanes is introduced. Given 𝐩𝔻cn and 𝐚T𝐩𝔻cn{𝟎}, such an analogue would be the union of all geodesics passing through 𝐩 and orthogonal to 𝐚.

The resulting formula for hyperbolic MLR for K classes is written below; here 𝐩k𝔻cn and 𝐚kT𝐩k𝔻cn{𝟎} are learnable parameters.

p(y=k|𝐱)exp(λ𝐩kc𝐚kcarcsinh(2c-𝐩kc𝐱,𝐚k(1-c-𝐩kc𝐱2)𝐚k)).

For a more thorough discussion of hyperbolic neural networks, we refer the reader to the paper  [7].

Practical aspects of implementation

While implementing most of the formulas described above is straightforward, we employ some tricks to make the training more stable.

  • To ensure numerical stability we perform clipping by norm after applying the exponential map, which constrains the norm to not exceed 1c(1-10-3).

  • Some of the parameters in the aforementioned layers are naturally elements of 𝔻nc. While in principle it is possible to apply Riemannian optimization techniques to them (e.g., previously proposed Riemannian Adam optimizer  [2]), we did not observe any significant improvement. Instead, we parametrized them via ordinary Euclidean parameters which were mapped to their hyperbolic counterparts with the exponential map and used the standard Adam optimizer.

  • We found that the value of c may affect the performance, especially for large dimensions of the Poincaré ball. In our experiments, the value c=0.05 showed good results for broad range of dimensions.

5 Experiments

Experimental setup

We start with a toy experiment supporting our abovementioned hypothesis that the distance to the center in Poincaré ball indicates a model uncertainty. To do so, we first train the MLR classifier in hyperbolic space on the MNIST dataset  [16] and evaluate it on the Omniglot dataset  [15]. We then investigate and compare the obtained distributions of distances to the origin of hyperbolic embeddings of the MNIST and Omniglot test sets.

In our further experiments, we concentrate on the few-shot classification and person re-identification tasks. The experiments on the Omniglot dataset serve as a starting point, and then we move towards more complex datasets. Afterwards, we consider two datasets, namely: MiniImageNet  [24] and Caltech-UCSD Birds-200-2011 (CUB)  [40]. Here, for each dataset, we train four models: for one-shot five-way and five-shot five-way classification tasks both in the Euclidean and hyperbolic spaces. Finally, we provide the re-identification results for the two popular datasets: Market-1501  [46] and DukeMTMD  [25, 47]. Further in this section, we provide a thorough description of each experiment.

Our code is available at github11 1 https://github.com/KhrulkovV/hyperbolic-image-embeddings.

5.1 Distance to the origin as the measure of uncertainty

In this subsection, we validate our hypothesis which claims that if one trains a hyperbolic classifier, then a distance of the Poincaré ball embedding of an image can serve as a good measure of confidence of a model. We start by training a simple hyperbolic convolutional neural network (ConvNet) on the MNIST dataset, consisting of three convolutional layers of size 5×5 with 64 filters, followed by three linear layers with hidden size 256, except for the last linear layer, where hidden size corresponded to the embedding dimensionality and was varied. Parametric ReLU nonlinearity was used between all the layers. The output of the last hidden layer was mapped to the Poincaré ball using the exponential map (4) and was followed by the hyperbolic MLR layer. After training the model to 99% test accuracy, we evaluate it on the Omniglot dataset (by resizing images to 28×28 and normalizing them to have the same background color as MNIST). We then evaluate the hyperbolic distance to the origin of embeddings produced by the network on both datasets. The closest Euclidean analogue to this approach would be comparing distributions of pmax, maximum class probability predicted by the network. For the same range of dimensions we train ordinary Euclidean classifiers on MNIST, and compare these distributions for the same sets. Our findings are summarized in Figure 4 and Table 1. We observe that distances to the origin present a more statistically significant indicator of the dataset dissimilarity in 3 cases.

We have visualized the learned MNIST and Omniglot embeddings on Figure 1. We observe that more ‘unclear’ images are located near the center, while the images that are easy to classify are located closer to the boundary.

Figure 4: Distributions of the hyperbolic distance to the origin of the MNIST and Omniglot datasets embedded into the Poincar\'́e ball. Embeddings are computed by a hyperbolic neural network trained for the MNIST classification task. We observe a significant difference between these distributions: embeddings of the Omniglot images are much closer to the origin. Table 1 provides the KS distances between the distributions.
Table 1: Kolmogorov-Smirnov distances between the distributions of distance to the origin of the MNIST and Omniglot datasets embedded into the Poincaré ball with the hyperbolic classifier trained on MNIST, and between the distributions of pmax (maximum probablity predicted for a class) for the Euclidean classifier trained on MNIST and evaluated on the same sets. See further description in Subsection 5.1 and visualization on Figure 4. We observe that distance to the origin mostly presents a more statistically significant indicator of the dataset dissimilarity.
n=2 n=8 n=16 n=32
d𝔻(𝐱,𝟎) 0.868 0.832 0.853 0.859
pmax(𝐱) 0.834 0.835 0.840 0.846

5.2 Omniglot few-shot classification

We hypothesize that a certain class of problems – namely the few-shot classification task can benefit greatly from hyperbolic embeddings.

The starting point for our analysis is the experiments on the Omniglot dataset for few-shot classification. This dataset consists of the images of 1623 characters sampled from 50 different alphabets; each character is supported by 20 examples. We test several few-shot learning algorithms to see how hyperbolic embeddings affect them. As a baseline approach, we chose a siamese net with the backbone consisting of 4 convolutional blocks followed by two fully connected layers, producing embeddings of size n. In order to produce the similarity score s(𝐱,𝐲) for two input images (which is then fed into sigmoid, predicting whether these two images are of the same class), we consider three approaches. First one is Euclidean, for which we follow [12], with s(𝐱,𝐲)=iαi|f(i)(𝐱)-f(i)(𝐲)|. The other two use hyperbolic layers, for which we as before extend the backbone net with the exponential map. We try the following two approaches to compute the similarity between two hyperbolic embeddings. The first one is straightforward and we define similarity as s(𝐱,𝐲)=-dc(f(𝐱),f(𝐲). For the second one, we first concatenate the embeddings using the hyperbolic Concat layer (7), mapping these two embeddings into one hyperbolic point of size n. This point is then moved to n via the logarithmic map (5), and mapped to a number with a fully connected layer of size n×1. We test these approaches for different values of n in the 1-shot 20-way setting and report the results in Table 2. We observe that in both cases we get an accuracy boost, but slightly smaller with the Concat architecture. Value c=1 was used for these experiments.

In order to validate if hyperbolic embeddings can further improve models performing on the state-of-the-art level, for the next architecture, we choose the prototype network (ProtoNet) introduced in the paper  [31] with four convolutional blocks in a backbone. Each convolutional block consists of 3×3 convolutional layer followed by batch normalization, ReLU nonlinearity and 2×2 max-pooling layer. The number of filters in the last convolutional layer corresponds to the value of the embedding dimension, for which we choose 64. The hyperbolic model differs from the baseline in the following aspects. First, the output of the last convolutional block is embedded into the Poincaré ball of dimension 64 using the exponential map. In ProtoNet, one uses a so-called prototype representation of a class, which is defined as a mean of the embedded support set of a class. Generalizing this concept to hyperbolic space, we substitute the Euclidean mean operation by HypAve, defined earlier in the Equation (8). We selected c=0.05 and trained the models for various few-shot scenarios. The initial value of learning rate equals to 10-3 and is multiplied by 0.5 every 20 epochs out of total 60 epochs. Results are presented in Table 3. We can see that in some scenarios, in particular for one–shot learning, hyperbolic embeddings are more beneficial, while in other cases results are slightly worse.

Table 2: Comparison of the Euclidean and hyperbolic siamese networks in the 1-shot 20-way setting on the Omniglot dataset. Hyperbolic architectures use Concat based (7) similarity, and hyperbolic distance similarity correspondingly. Note that hyperbolic architectures outperform the Euclidean variant with significant margin. c=1 was used for this experiment.
dim 128 256 512

Euclidean similarity
90.08 91.03 91.32
Hyperbolic Concat 93.63 94.23 94.62
Hyperbolic distance 95.54 95.00 19.25
Table 3: Few-shot classification accuracies on Omniglot. In order to obtain Hyperbolic ProtoNet, we augment the standard ProtoNet with a mapping to the Poincaré ball, use hyperbolic distance as the distance function, and as the averaging operator we use the HypAve operator defined by Equation (8).
ProtoNet Hyperbolic ProtoNet
1-shot 5-way 98.2 99.0
5-shot 5-way 99.4 99.4
1-shot 20-way 95.8 95.9
5-shot 20-way 98.6 98.15
Table 4: Experimental results on two datasets: MiniImageNet and CUB averaged over 10,000 test episodes and are reported with 95% confidence intervals. As the baseline model, we consider ProtoNet with the parameters described in Subsections 5.3 and 5.4. The hyperbolic model is the generalization of the baseline model made by adding the mapping to Poincaré ball and substituting Euclidean mean operation to hyperbolic averaging. The obtained accuracy values indicate better performance of the hyperbolic model. We may conclude that in these experiments transferring data into hyperbolic space and replacing the corresponding operations lead to higher accuracy values.
Dataset Model 1-shot 5-way 5-shot 5-way
MiniImageNet MatchNet  [39] 43.56±0.84 55.31±0.73
ProtoNet 48.29±0.19 66.11±0.16
RelationNet  [35] 50.44±0.82 65.32±0.70
Hyperbolic ProtoNet 51.57±0.2 66.27±0.17
CUB ProtoNet 54.58±0.24 68.04±0.19
Hyperbolic ProtoNet 60.52±0.25 72.22±0.19

5.3 MiniImageNet few-shot classification

MiniImageNet dataset is the subset of ImageNet dataset  [26], which contains of 100 classes represented by 600 examples per class. We use the following split provided in the paper  [24]: training dataset consists of 64 classes, validation dataset is represented by 16 classes, and the remaining 20 classes serve as a test dataset.

As a baseline model, we once again consider the ProtoNet model, mentioned earlier in the experiments on Omniglot dataset in Subsection 5.2. In these experiments, we set the value of the embedding dimension to 1024. We test the models on tasks for one-shot and five-shot classifications; the number of query points in each batch always equals to 15. Similarly, for the hyperbolic version of ProtoNet, we have the following experimental setup. We put the value of Poincaré ball radius to c=0.05, and consider the following learning rate decay scheme: the initial learning rate equals to 10-3 and is further multiplied by 0.2 every 10 epochs (out of total 200 epochs).

Table 4 illustrates the obtained results on MiniImageNet dataset. For MiniImageNet dataset, the results of the other models are available for the same classification tasks (i.e., for one-shot and five-shot learning). Therefore, we can compare our obtained results to those that were reported in the original papers. From these experimental results, we may observe a slight gain in model accuracy.

5.4 Caltech-UCSD Birds few-shot classification

The CUB dataset consists of 11,788 images of 200 bird species and was designed for fine-grained classification. We use the split introduced in  [37]: 100 classes out of 200 were used for training, 50 for validation and 50 for testing. Also, following  [37], we make the same pre-processing step by resizing each image to the size of 64×64.

Likewise, we use ProtoNet mentioned above with the following modifications. Here, we fix the embedding dimension to 512 and use a slightly different setup for learning rate scheduler: the initial learning rate of value 10-3 is multiplied by 0.7 every 20 epochs out of total 100 epochs. Remaining architecture and parameters both in baseline and hyperbolic models are identical to those in the experiments on the MiniImageNet dataset.

Our findings on the experiments on the CUB dataset are summarized in Table 4. Interestingly, for this dataset, the hyperbolic version significantly outperforms its Euclidean counterpart.

5.5 Person re-identification

Table 5: Person re-identification results for Market-1501 and DukeMTMC-reID for the classification baseline (bs) and its hyperbolic counterpart (hyp). (See 5.5 for the details). The results are shown for the three embedding dimensionalities (dim column) and for the two learning rate schedules. For each dataset and each embedding dimensionality value, the best results are bold, they are all given by the hyperbolic version of classification (either by the schedule sch#1 or sch#2). The second-best results are underlined.
{adjustwidth}

-.5in-.5in


Market-1501 DukeMTMC-reID
bs hyp bs hyp

dim
lr schedule r1 mAP r1 mAP r1 mAP r1 mAP
32 sch#1 \ul71.4 \ul49.7 69.8 45.9 56.1 35.6 56.5 34.9
sch#2 68.0 43.4 75.9 51.9 \ul57.2 \ul35.7 62.2 39.1
64 sch#1 80.3 60.3 \ul83.1 \ul60.1 69.9 48.5 70.8 48.6
sch#2 80.5 57.8 84.4 62.7 68.3 45.5 \ul70.7 \ul48.6
128 sch#1 86.0 67.3 87.8 68.4 \ul74.1 \ul53.3 76.5 55.4
sch#2 \ul86.5 \ul68.5 86.4 66.2 71.5 51.5 74.0 52.2

The DukeMTMC-reID dataset contains 16,522 training images of 702 identities, 2228 query images of 702 identities and 17,661 gallery images. Market1501 contains 12936 training images of 751 identities, 3368 queries of 750 identities and 15913 gallery images respectively. We report Rank1 of the Cumulative matching Charcteric Curve and Mean Average Precision for both datasets. We use ResNet-50  [10] architecture with one fully connected embedding layer following the global average pooling. Three embedding dimensionalities are used in our experiments: 32, 64 and 128. For the baseline experiments, we add the additional classification linear layer, followed by the cross-entropy loss. For the hyperbolic version of the experiments, we map the descriptors to the Poincaré ball and apply multiclass logistic regression as described in Section 4. We found that in both cases the results are very sensitive to the learning rate schedules. We tried four schedules for learning 32-dimensional descriptors for both baseline and hyperbolic versions. Two best performing schedules were applied for the 64 and 128-dimensional descriptors. In these experiments, we also found that smaller c values give better results. We finally set c to 10-5. Therefore, based on the discussion in 4, our hyperbolic setting is quite close to Euclidean. The results are compiled in Table 5. We set starting learning rates to 310-4 and 610-4 for sch#1 and sch#2 correspondingly and multiply them by 0.1 after each of the epochs 200 and 270. The results are reported after the 300 training epochs. As we can see in the Table 5, hyperbolic version generally performs better than the baseline, while the gap between the baseline and hyperbolic versions’ results is decreasing for larger dimensionalities.

6 Discussion and conclusion

We have investigated the use of hyperbolic spaces for image embeddings. The models that we have considered use Euclidean operations in most layers, and use the exponential map to move from the Euclidean to hyperbolic spaces at the end of the network (akin to the normalization layers that are used to map from the Euclidean space to Euclidean spheres). The approach that we investigate here is thus compatible with existing backbone networks trained in Euclidean geometry.

At the same time, we have shown that across a number of tasks, in particular in the few-shot image classification, learning hyperbolic embeddings can result in a substantial boost in accuracy. We speculate that the negative curvature of the hyperbolic spaces allows for embeddings that are better conforming to the intrinsic geometry of at least some image manifolds with their hierarchical structure.

Future work may include several potential modifications of the approach. We have observed that the use of hyperbolic embeddings improves performance for some problems and datasets, while not helping others. A better understanding of when and why the use of hyperbolic geometry is justified is therefore needed. Also, we note that while all hyperbolic geometry models are equivalent in the continuous setting, fixed-precision arithmetic used in real computers breaks this equivalence. In practice, we observed that care should be taken about numeric precision effects (following [7], we clip the embeddings to minimize numerical errors during learning). Using other models of hyperbolic geometry may result in more favourable floating point performance.

References

  • [1] E. Ahmed, M. J. Jones, and T. K. Marks. An improved deep learning architecture for person re-identification. In Conf. Computer Vision and Pattern Recognition, CVPR, pages 3908–3916, 2015.
  • [2] G. Becigneul and O.-E. Ganea. Riemannian Adaptive Optimization Methods. In International Conference on Learning Representations, 2019.
  • [3] X. Chang, T. M. Hospedales, and T. Xiang. Multi-level factorisation net for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2109–2118, 2018.
  • [4] W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 403–412, 2017.
  • [5] B. Dhingra, C. J. Shallue, M. Norouzi, A. M. Dai, and G. E. Dahl. Embedding text in hyperbolic spaces. arXiv preprint arXiv:1806.04313, 2018.
  • [6] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR. org, 2017.
  • [7] O. Ganea, G. Bécigneul, and T. Hofmann. Hyperbolic neural networks. In Advances in Neural Information Processing Systems, pages 5350–5360, 2018.
  • [8] C. Gulcehre, M. Denil, M. Malinowski, A. Razavi, R. Pascanu, K. M. Hermann, P. Battaglia, V. Bapst, D. Raposo, A. Santoro, and N. de Freitas. Hyperbolic attention networks. In International Conference on Learning Representations, 2019.
  • [9] Y. Guo and N.-M. Cheung. Efficient and deep person re-identification using multi-level similarity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2335–2344, 2018.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [11] M. M. Kalayeh, E. Basaran, M. Gökmen, M. E. Kamasak, and M. Shah. Human semantic parsing for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1062–1071, 2018.
  • [12] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
  • [13] D. Krioukov, F. Papadopoulos, M. Kitsak, A. Vahdat, and M. Boguná. Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106, 2010.
  • [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
  • [15] B. M. Lake, R. R. Salakhutdinov, and J. Tenenbaum. One-shot learning by inverting a compositional causal process. In Advances in Neural Information Processing Systems, pages 2526–2534, 2013.
  • [16] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [17] Y. LeCun et al. Generalization and network design strategies. In Connectionism in perspective, volume 19. Citeseer, 1989.
  • [18] N. Linial, A. Magen, and M. E. Saks. Low distortion euclidean embeddings of trees. Israel Journal of Mathematics, 106(1):339–348, 1998.
  • [19] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.
  • [20] M. Nickel and D. Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6338–6347, 2017.
  • [21] M. Nickel and D. Kiela. Learning continuous hierarchies in the Lorentz model of Hyperbolic geometry. In Proc. ICML, pages 3776–3785, 2018.
  • [22] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4004–4012, 2016.
  • [23] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.
  • [24] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. 2016.
  • [25] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision workshop on Benchmarking Multi-Target Tracking, 2016.
  • [26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [27] F. Sala, C. De Sa, A. Gu, and C. Ré. Representation tradeoffs for hyperbolic embeddings. In International Conference on Machine Learning, pages 4457–4466, 2018.
  • [28] F. Sala, C. D. Sa, A. Gu, and C. Ré. Representation tradeoffs for hyperbolic embeddings. In Proc. ICML, volume 80 of JMLR Workshop and Conference Proceedings, pages 4457–4466. JMLR.org, 2018.
  • [29] R. Sarkar. Low distortion delaunay embedding of trees in hyperbolic plane. In International Symposium on Graph Drawing, pages 355–366. Springer, 2011.
  • [30] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
  • [31] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
  • [32] K. Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pages 1857–1865, 2016.
  • [33] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian. Pose-driven deep convolutional model for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 3960–3969, 2017.
  • [34] Y. Suh, J. Wang, S. Tang, T. Mei, and K. Mu Lee. Part-aligned bilinear representations for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 402–419, 2018.
  • [35] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018.
  • [36] A. Tifrea, G. Bécigneul, and O.-E. Ganea. Poincaré glove: Hyperbolic word embeddings. arXiv preprint arXiv:1810.06546, 2018.
  • [37] E. Triantafillou, R. Zemel, and R. Urtasun. Few-shot learning through an information retrieval lens. In Advances in Neural Information Processing Systems, pages 2255–2265, 2017.
  • [38] E. Ustinova and V. Lempitsky. Learning deep embeddings with histogram loss. In Advances in Neural Information Processing Systems, pages 4170–4178, 2016.
  • [39] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
  • [40] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
  • [41] Y. Wang, Z. Chen, F. Wu, and G. Wang. Person re-identification with cascaded pairwise convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1470–1478, 2018.
  • [42] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
  • [43] C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl. Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2840–2848, 2017.
  • [44] D. Yi, Z. Lei, and S. Z. Li. Deep metric learning for practical person re-identification. arXiv prepzrint arXiv:1407.4979, 2014.
  • [45] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang. Spindle net: Person re-identification with human body region guided feature decomposition and fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1077–1085, 2017.
  • [46] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In Computer Vision, IEEE International Conference on, 2015.
  • [47] Z. Zheng, L. Zheng, and Y. Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision, 2017.