Abstract
Computer vision tasks such as image classification, image retrieval andfewshot learning are currently dominated by Euclidean and sphericalembeddings, so that the final decisions about class belongings or the degree ofsimilarity are made using linear hyperplanes, Euclidean distances, or sphericalgeodesic distances (cosine similarity). In this work, we demonstrate that inmany practical scenarios hyperbolic embeddings provide a better alternative.
Quick Read (beta)
Hyperbolic Image Embeddings
Abstract
Computer vision tasks such as image classification, image retrieval and fewshot learning are currently dominated by Euclidean and spherical embeddings, so that the final decisions about class belongings or the degree of similarity are made using linear hyperplanes, Euclidean distances, or spherical geodesic distances (cosine similarity). In this work, we demonstrate that in many practical scenarios hyperbolic embeddings provide a better alternative.
\ul
1 Introduction
Highdimensional embeddings are ubiquitous in modern computer vision. Many, perhaps most, modern computer vision systems learn nonlinear mappings (in the form of deep convolutional networks) from the space of images or image fragments into highdimensional spaces. The operations at the end of deep networks imply a certain type of geometry of the embedding spaces. For example, image classification networks [14, 17] use linear operators (matrix multiplication) to map embeddings in the penultimate layer to class logits. The class boundaries in the embedding space are thus piecewiselinear, and pairs of classes are separated by Euclidean hyperplanes. The embeddings learned by the model in the penultimate layer, therefore, live in the Euclidean space. The same can be said about systems where Euclidean distances are used to perform image retrieval [22, 32, 43], face recognition [23, 42] or oneshot learning [31].
Alternatively, some fewshot learning [39], face recognition [30] and person reidentification methods [38, 44] learn spherical embeddings, so that sphere projection operator is applied at the end of a network that computes the embeddings. Cosine similarity (closely associated with sphere geodesic distance) is then used by such architectures to match images.
Euclidean spaces with their zero curvature and spherical spaces with their positive curvature have certain profound implications on the nature of embeddings that existing computer vision systems can learn. In this work, we argue that hyperbolic spaces with negative curvature might often be more appropriate for learning embedding of images. Towards this end, we add the recentlyproposed hyperbolic network layers [7] to the end of several computer vision networks, and present a number of experiments corresponding to image classification, image retrieval, oneshot, and fewshot learning and person reidentification. We show that in many cases, the use of hyperbolic geometry improves the performance over Euclidean or spherical embeddings.
Our work is inspired by the recent body of works that demonstrate the advantage of learning hyperbolic embeddings for language entities such as taxonomy entries [20], common words [36], phrases [5] and for other NLP tasks, such as neural machine translation [8]. Our results imply that hyperbolic spaces may be as valuable for improving the performance of computer vision systems.
Motivation for hyperbolic image embeddings.
The use of hyperbolic spaces in natural language processing is motivated by their natural ability to embed hierarchies (e.g., tree graphs) with low distortion [29]. Hierarchies are ubiquitous in natural language processing. First, there are natural hierarchies corresponding to, e.g., biological taxonomies and linguistic ontologies. Likewise, a more generic short phrase can have many plausible continuations and is therefore semanticallyrelated to a multitude of long phrases that are not necessarily closely related to each other (in the semantic sense). The innate suitability of hyperbolic spaces to embedding hierarchies [27, 29] explains the success of such spaces in natural language processing [20].
Here, we argue that similar hierarchical relations between images are common in computer vision tasks (Figure 2). One can observe the following example cases:

•
In image retrieval, an overview photograph is related to many images that correspond to the closeups of different distinct details. Likewise, for classification tasks inthewild, an image containing the representatives of multiple classes is related to images that contain representatives of the classes in isolation. Embedding a dataset that contains composite images into continuous space is therefore similar to embedding a hierarchy.

•
In some tasks, more generic images may correspond to images that contain less information and are therefore more ambiguous. E.g., in face recognition, a blurry and/or lowresolution face image taken from afar can be related to many highresolution images of faces that clearly belong to distinct people. Again natural embeddings for image datasets that have widely varying image quality/ambiguity calls for retaining such hierarchical structure.

•
Many of the natural hierarchies investigated in natural language processing transcend to visual domain. E.g., the visual concepts of different animal species have the same natural hierarchical groupings stemming from the similarity of their genotypes (most felines share visual similarity while being visually distinct from pinnipeds). Hyperbolic spaces thus remain a natural choice for embedding biological taxonomies irrespective of whether we deal with text documents or images.
In order to build deep learning models which operate on the embeddings to hyperbolic spaces, we capitalize on recent developments [7], which construct the analogues of familiar layers (such as a feed–forward layer, or a multinomial regression layer) in hyperbolic spaces. We show that many standard architectures used for tasks of image classification, and in particular in the few–shot learning setting can be easily modified to operate on hyperbolic embeddings, which in many cases also leads to their improvement.
2 Poincaré ball model
Formally, $n$dimensional hyperbolic space denoted as ${\mathbb{H}}^{n}$ is defined as the homogeneous, simply connected $n$dimensional Riemannian manifold of constant negative sectional curvature. The property of constant negative curvature makes it analogous to the ordinary Euclidean sphere (which has constant positive curvature), however, the geometrical properties of the hyperbolic space are very different. It is known that hyperbolic space cannot be isometrically embedded into Euclidean space [13, 18], but there exist several well–studied models of hyperbolic geometry. In every model a certain subset of Euclidean space is endowed with a hyperbolic metric, however, all these models are isomorphic to each other and we may easily move from one to another base on where the formulas of interest are easier. We follow the majority of NLP works and use the Poincaré ball model (see Figure 3). Investigating the alternative models that might provide better numerical stability remain future work (though already started in the NLP community [21, 28]). Here, we provide a very short summary of the model.
The Poincaré ball model $({\mathbb{D}}^{n},{g}^{\mathbb{D}})$ is defined by the manifold $$ endowed with the Riemannian metric ${g}^{\mathbb{D}}(\mathbf{x})={\lambda}_{\mathbf{x}}^{2}{g}^{E}$, where ${\lambda}_{\mathbf{x}}=\frac{2}{1{\parallel \mathbf{x}\parallel}^{2}}$ is the conformal factor and ${g}^{E}$ is the Euclidean metric tensor ${g}^{E}={\mathbf{I}}^{n}$. In this model the geodesic distance between two points is given by the following expression:
$${d}_{\mathbb{D}}(\mathbf{x},\mathbf{y})=\mathrm{arccosh}\left(1+2\frac{{\parallel \mathbf{x}\mathbf{y}\parallel}^{2}}{(1{\parallel \mathbf{x}\parallel}^{2})(1{\parallel \mathbf{y}\parallel}^{2})}\right).$$  (1) 
In order to define the hyperbolic average, we will make use of the Klein model of hyperbolic space. Similarly to the Poincaré model, it is defined on the set $$, however, with a different metric, not relevant for further discussion. In Klein coordinates, the hyperbolic average (generalizing the usual Euclidean mean) takes the most simple form, and we present the necessary formulas in Section 4.
From the viewpoint of hyperbolic geometry, all points of Poincaré ball are equivalent. The models that we consider below are, however, hybrid in the sense that most layers use Euclidean operators, such as standard generalized convolutions, while only the final layers operate within the hyperbolic geometry framework. The hybrid nature of our setups makes the origin a special point, since from the Euclidean viewpoint the local volumes in Poincare ball expand exponentially from the origin to the boundary. This leads to the useful tendency of the learned embeddings to place more generic/ambiguous objects closer to the origin, while moving more specific objects towards the boundary. The distance to the origin in our models therefore provides a natural estimate of uncertainty, that can be used in several ways, as we show below.
3 Related work
Hyperbolic language embeddings
Hyperbolic embeddings in the natural language processing field have recently been very successful [20, 21]. They are motivated by the innate ability of hyperbolic spaces to embed hierarchies (e.g., tree graphs) with low distortion [28, 29]. The main result in this area states that any tree can be embedded into (two dimensional) hyperbolic space with arbitrarily low distortion. Another direction of research, more relevant to the present work is based on imposing hyperbolic structure on activations of neural networks [7, 8].
Natural language naturally gives rises to hierarchies, as, for instance, some words are more specific while others are more generic. We argue that inthewild computer vision datasets also have more or less specific images that have a varying amount of information (e.g., middle images within triplets in Figure 2 are more generic/less specific).
Few–shot learning
The task of few–shot learning, which has recently attracted a lot of attention, is concerned with the overall ability of the model to generalize to unseen data during training. A body of papers devoted to few–shot classification that focuses on metric learning methods includes Siamese Networks [12], Matching Networks [39], Prototypical Networks [31], Relation Networks [35]. In contrast, other models apply metalearning to fewshot learning: e.g., MAML by [6], MetaLearner LSTM by [24], SNAIL by [19]. While these methods employ either Euclidean or spherical geometries (like in [39]), there is no model extension to hyperbolic space.
Person reidentification
The task of person reidentification is to match pedestrian images captured by possibly nonoverlapping surveillance cameras. Papers [1, 9, 41] adopt the pairwise models that accept pairs of images and output their similarity scores. The resulting similarity scores are used to classify the input pairs as being matching or nonmatching. Another popular direction of work includes approaches that aim at learning a mapping of the pedestrian images to the Euclidean descriptor space. Several papers, e.g., [34, 44] use verification loss functions based on the Euclidean distance or cosine similarity. A number of methods utilize a simple classification approach for training [3, 33, 11, 45], and Euclidean distance is used in test time. Some methods use the combination of the aforementioned approaches: [4] adopts verification losses that are applied to the similarities predicted by the special subnetwork instead of the simple Euclidean similarities.
4 Hyperbolic neural networks
In this section, we remind on recent work introducing hyperbolic neural networks [7]. Hyperbolic networks are extensions of conventional neural networks in a sense that they generalize typical neural network operations to those in hyperbolic space using the formalism of Möbius gyrovector spaces. In this paper, the authors present the hyperbolic versions of feedforward networks, multinomial logistic regression, and recurrent neural networks.
Further in this section, we briefly discuss the hyperbolic functions and layers crucial for the understanding of hyperbolic neural networks used in the remainder of the paper. Similarly to the paper [7], we use an additional hyperparameter $c$ corresponding to the radius of the Poincaré ball, which is then defined in the following manner: $$. The corresponding conformal factor is then modified as ${\lambda}_{\mathbf{x}}^{c}=\frac{2}{1c{\parallel \mathbf{x}\parallel}^{2}}$. In practice, the choice of $c$ allows one to balance between hyperbolic and Euclidean geometries, which is made precise by noting that with $c\to 0$ all the formulas discussed below take their usual Euclidean form.
Möbius addition
For a pair $\mathbf{x},\mathbf{y}\in {\mathbb{D}}_{c}^{n}$, the Möbius addition is defined as follows:
$$\mathbf{x}{\oplus}_{c}\mathbf{y}\u2254\frac{(1+2c\u27e8\mathbf{x},\mathbf{y}\u27e9+c{\parallel \mathbf{y}\parallel}^{2})\mathbf{x}+(1c{\parallel \mathbf{x}\parallel}^{2})\mathbf{y}}{1+2c\u27e8\mathbf{x},\mathbf{y}\u27e9+{c}^{2}{\parallel \mathbf{x}\parallel}^{2}{\parallel \mathbf{y}\parallel}^{2}}.$$  (2) 
Distance
The induced distance function is defined as
$${d}_{c}(\mathbf{x},\mathbf{y})\u2254\frac{2}{\sqrt{c}}\mathrm{arctanh}(\sqrt{c}\parallel \mathbf{x}{\oplus}_{c}\mathbf{y}\parallel ).$$  (3) 
Note that with $c=1$ one recovers the geodesic distance (1), while with $c\to 0$ we obtain the Euclidean distance ${lim}_{c\to 0}{d}_{c}(\mathbf{x},\mathbf{y})=2\parallel \mathbf{x}\mathbf{y}\parallel .$
Exponential and logarithmic maps
To perform operations in the hyperbolic space, one first needs to define a bijective map from ${\mathbb{R}}^{n}$ to ${\mathbb{D}}_{c}^{n}$ in order to map Euclidean vectors to the hyperbolic space, and vice versa. The so–called exponential and (inverse to it) logarithmic map serve as such a bijection.
The exponential map ${\mathrm{exp}}_{\mathbf{x}}^{c}$ is a function from ${T}_{\mathbf{x}}{\mathbb{D}}_{c}^{n}\cong {\mathbb{R}}^{n}$ to ${\mathbb{D}}_{c}^{n}$, which is given by
$${\mathrm{exp}}_{\mathbf{x}}^{c}(\mathbf{v})\u2254\mathbf{x}{\oplus}_{c}\left(\mathrm{tanh}\left(\sqrt{c}\frac{{\lambda}_{\mathbf{x}}^{c}\parallel \mathbf{v}\parallel}{2}\right)\frac{\mathbf{v}}{\sqrt{c}\parallel \mathbf{v}\parallel}\right).$$  (4) 
The inverse logarithmic map is defined as
$${\mathrm{log}}_{\mathbf{x}}^{c}(\mathbf{y})\u2254\frac{2}{\sqrt{c}{\lambda}_{\mathbf{x}}^{c}}\mathrm{arctanh}(\sqrt{c}\parallel \mathbf{x}{\oplus}_{c}\mathbf{y}\parallel )\frac{\mathbf{x}{\oplus}_{c}\mathbf{y}}{\parallel \mathbf{x}{\oplus}_{c}\mathbf{y}\parallel}.$$  (5) 
In practice, we use the maps ${\mathrm{exp}}_{\mathrm{\U0001d7ce}}^{c}$ and ${\mathrm{log}}_{\mathrm{\U0001d7ce}}^{c}$ for transition between the Euclidean and Poincaré ball representations of a vector.
Linear layer
Assume we have a standard (Euclidean) linear layer $\mathbf{x}\to \mathrm{M}\mathbf{x}+\mathbf{b}$. In order to generalize it, one needs to define the Möbius matrix by vector product:
$${\mathrm{M}}^{{\otimes}_{c}}(\mathbf{x})\u2254\frac{1}{\sqrt{c}}\mathrm{tanh}\left(\frac{\parallel \mathrm{M}\mathbf{x}\parallel}{\parallel \mathbf{x}\parallel}\mathrm{arctanh}(\sqrt{c}\parallel \mathbf{x}\parallel )\right)\frac{\mathrm{M}\mathbf{x}}{\parallel \mathrm{M}\mathbf{x}\parallel},$$  (6) 
if $\mathrm{M}\mathbf{x}\ne \mathrm{\U0001d7ce}$, and ${\mathrm{M}}^{{\otimes}_{c}}(\mathbf{x})\u2254\mathrm{\U0001d7ce}$ otherwise. Finally, for a bias vector $\mathbf{b}\in {\mathbb{D}}_{c}^{n}$ the operation underlying the hyperbolic linear layer is then given by ${\mathrm{M}}^{{\otimes}_{c}}(\mathbf{x}){\oplus}_{c}\mathbf{b}$.
Concatenation of input vectors
In several architectures (e.g., in siamese networks), it is needed to concatenate two vectors; such operation is obvious in Euclidean space. However, straightforward concatenation of two vectors from hyperbolic space does not necessarily remain in hyperbolic space. Thus, we have to use a generalized version of the concatenation operation, which is then defined in the following manner. For $\mathbf{x}\in {\mathbb{D}}_{c}^{{n}_{1}}$, $\mathbf{y}\in {\mathbb{D}}_{c}^{{n}_{2}}$ we define the mapping $\mathrm{Concat}:{\mathbb{D}}_{c}^{{n}_{1}}\times {\mathbb{D}}_{c}^{{n}_{2}}\to {\mathbb{D}}_{c}^{{n}_{3}}$ as follows.
$$\mathrm{Concat}(\mathbf{x},\mathbf{y})={\mathrm{M}}_{1}^{{\otimes}_{c}}\mathbf{x}{\oplus}_{c}{\mathrm{M}}_{2}^{{\otimes}_{c}}\mathbf{y},$$  (7) 
where ${\mathrm{M}}_{1}$ and ${\mathrm{M}}_{2}$ are trainable matrices of sizes ${n}_{3}\times {n}_{1}$ and ${n}_{3}\times {n}_{2}$ correspondingly. The motivation for this definition is simple: usually, the Euclidean concatenation layer is followed by a linear map, which when written explicitly takes the (Euclidean) form of Equation (7).
Hyperbolic averaging
Another important operation common in image processing is averaging of feature vectors, used, e.g., in prototypical networks for few–shot learning [31]. In the Euclidean setting this operation takes the form $({\mathbf{x}}_{1},\mathrm{\dots},{\mathbf{x}}_{N})\to \frac{1}{N}{\sum}_{i}{\mathbf{x}}_{i}$. Extension of this operation to hyperbolic spaces is called the Einstein midpoint and takes the most simple form in Klein coordinates:
$$\mathrm{HypAve}({\mathbf{x}}_{1},\mathrm{\dots},{\mathbf{x}}_{N})=\sum _{i=1}^{N}{\gamma}_{i}{\mathbf{x}}_{i}/\sum _{i=1}^{N}{\gamma}_{i},$$  (8) 
where ${\gamma}_{i}=\frac{1}{\sqrt{1c{\parallel {\mathbf{x}}_{i}\parallel}^{2}}}$ are the Lorentz factors. Recall from the discussion in Section 2 that the Klein model is supported on the same space as the Poincaré ball, however the same point has different coordinate representations in these models. Let ${\mathbf{x}}_{\mathbb{D}}$ and ${\mathbf{x}}_{\mathbb{K}}$ denote the coordinates of the same point in the Poincaré and Klein models correspondingly. Then the following transition formulas hold.
${\mathbf{x}}_{\mathbb{D}}$  $={\displaystyle \frac{{\mathbf{x}}_{\mathbb{K}}}{1+\sqrt{1c{\parallel {\mathbf{x}}_{\mathbb{K}}\parallel}^{2}}}},$  (9)  
${\mathbf{x}}_{\mathbb{K}}$  $={\displaystyle \frac{2{\mathbf{x}}_{\mathbb{D}}}{1+c{\parallel {\mathbf{x}}_{\mathbb{D}}\parallel}^{2}}}.$  (10) 
Thus, given points in the Poincaré ball we can first map them to the Klein model, compute the average using Equation (8), and then move it back to the Poincaré model.
Multiclass logistic regression (MLR)
In our experiments, to perform the multiclass classification, we take advantage of the generalization of multiclass logistic regression to hyperbolic spaces. The idea of this generalization is based on the observation that in Euclidean space logits can be represented as the distances to certain hyperplanes, where each hyperplane can be specified with a point of origin and a normal vector. The same construction can be used in the Poincaré ball after a suitable analogue for hyperplanes is introduced. Given $\mathbf{p}\in {\mathbb{D}}_{c}^{n}$ and $\mathbf{a}\in {T}_{\mathbf{p}}{\mathbb{D}}_{c}^{n}\setminus \{\mathrm{\U0001d7ce}\}$, such an analogue would be the union of all geodesics passing through $\mathbf{p}$ and orthogonal to $\mathbf{a}$.
The resulting formula for hyperbolic MLR for $K$ classes is written below; here ${\mathbf{p}}_{k}\in {\mathbb{D}}_{c}^{n}$ and ${\mathbf{a}}_{k}\in {T}_{{\mathbf{p}}_{k}}{\mathbb{D}}_{c}^{n}\setminus \{\mathrm{\U0001d7ce}\}$ are learnable parameters.
$$\begin{array}{cc}\hfill p(y& =k\mathbf{x})\propto \hfill \\ & \mathrm{exp}\left(\frac{{\lambda}_{{\mathbf{p}}_{k}}^{c}\parallel {\mathbf{a}}_{k}\parallel}{\sqrt{c}}\mathrm{arcsinh}\left(\frac{2\sqrt{c}\u27e8{\mathbf{p}}_{k}{\oplus}_{c}\mathbf{x},{\mathbf{a}}_{k}\u27e9}{(1c{\parallel {\mathbf{p}}_{k}{\oplus}_{c}\mathbf{x}\parallel}^{2})\parallel {\mathbf{a}}_{k}\parallel}\right)\right).\hfill \end{array}$$ 
For a more thorough discussion of hyperbolic neural networks, we refer the reader to the paper [7].
Practical aspects of implementation
While implementing most of the formulas described above is straightforward, we employ some tricks to make the training more stable.

•
To ensure numerical stability we perform clipping by norm after applying the exponential map, which constrains the norm to not exceed $\frac{1}{\sqrt{c}}(1{10}^{3})$.

•
Some of the parameters in the aforementioned layers are naturally elements of ${\mathbb{D}}_{n}^{c}$. While in principle it is possible to apply Riemannian optimization techniques to them (e.g., previously proposed Riemannian Adam optimizer [2]), we did not observe any significant improvement. Instead, we parametrized them via ordinary Euclidean parameters which were mapped to their hyperbolic counterparts with the exponential map and used the standard Adam optimizer.

•
We found that the value of $c$ may affect the performance, especially for large dimensions of the Poincaré ball. In our experiments, the value $c=0.05$ showed good results for broad range of dimensions.
5 Experiments
Experimental setup
We start with a toy experiment supporting our abovementioned hypothesis that the distance to the center in Poincaré ball indicates a model uncertainty. To do so, we first train the MLR classifier in hyperbolic space on the MNIST dataset [16] and evaluate it on the Omniglot dataset [15]. We then investigate and compare the obtained distributions of distances to the origin of hyperbolic embeddings of the MNIST and Omniglot test sets.
In our further experiments, we concentrate on the fewshot classification and person reidentification tasks. The experiments on the Omniglot dataset serve as a starting point, and then we move towards more complex datasets. Afterwards, we consider two datasets, namely: MiniImageNet [24] and CaltechUCSD Birds2002011 (CUB) [40]. Here, for each dataset, we train four models: for oneshot fiveway and fiveshot fiveway classification tasks both in the Euclidean and hyperbolic spaces. Finally, we provide the reidentification results for the two popular datasets: Market1501 [46] and DukeMTMD [25, 47]. Further in this section, we provide a thorough description of each experiment.
Our code is available at github^{1}^{1} 1 https://github.com/KhrulkovV/hyperbolicimageembeddings.
5.1 Distance to the origin as the measure of uncertainty
In this subsection, we validate our hypothesis which claims that if one trains a hyperbolic classifier, then a distance of the Poincaré ball embedding of an image can serve as a good measure of confidence of a model. We start by training a simple hyperbolic convolutional neural network (ConvNet) on the MNIST dataset, consisting of three convolutional layers of size $5\times 5$ with $64$ filters, followed by three linear layers with hidden size $256$, except for the last linear layer, where hidden size corresponded to the embedding dimensionality and was varied. Parametric ReLU nonlinearity was used between all the layers. The output of the last hidden layer was mapped to the Poincaré ball using the exponential map (4) and was followed by the hyperbolic MLR layer. After training the model to $\sim 99\%$ test accuracy, we evaluate it on the Omniglot dataset (by resizing images to $28\times 28$ and normalizing them to have the same background color as MNIST). We then evaluate the hyperbolic distance to the origin of embeddings produced by the network on both datasets. The closest Euclidean analogue to this approach would be comparing distributions of ${p}_{\mathrm{max}}$, maximum class probability predicted by the network. For the same range of dimensions we train ordinary Euclidean classifiers on MNIST, and compare these distributions for the same sets. Our findings are summarized in Figure 4 and Table 1. We observe that distances to the origin present a more statistically significant indicator of the dataset dissimilarity in $3$ cases.
We have visualized the learned MNIST and Omniglot embeddings on Figure 1. We observe that more ‘unclear’ images are located near the center, while the images that are easy to classify are located closer to the boundary.
$n=2$  $n=8$  $n=16$  $n=32$  
${d}_{\mathbb{D}}(\mathbf{x},\mathrm{\U0001d7ce})$  $\mathbf{0.868}$  $0.832$  $\mathbf{0.853}$  $\mathbf{0.859}$ 
${p}_{\mathrm{max}}(\mathbf{x})$  $0.834$  $\mathbf{0.835}$  $0.840$  $0.846$ 

5.2 Omniglot fewshot classification
We hypothesize that a certain class of problems – namely the fewshot classification task can benefit greatly from hyperbolic embeddings.
The starting point for our analysis is the experiments on the Omniglot dataset for fewshot classification. This dataset consists of the images of $1623$ characters sampled from $50$ different alphabets; each character is supported by $20$ examples. We test several fewshot learning algorithms to see how hyperbolic embeddings affect them. As a baseline approach, we chose a siamese net with the backbone consisting of $4$ convolutional blocks followed by two fully connected layers, producing embeddings of size $n$. In order to produce the similarity score $s(\mathbf{x},\mathbf{y})$ for two input images (which is then fed into sigmoid, predicting whether these two images are of the same class), we consider three approaches. First one is Euclidean, for which we follow [12], with $s(\mathbf{x},\mathbf{y})={\sum}_{i}{\alpha}_{i}{f}^{(i)}(\mathbf{x}){f}^{(i)}(\mathbf{y})$. The other two use hyperbolic layers, for which we as before extend the backbone net with the exponential map. We try the following two approaches to compute the similarity between two hyperbolic embeddings. The first one is straightforward and we define similarity as $s(\mathbf{x},\mathbf{y})={d}_{c}(f(\mathbf{x}),f(\mathbf{y})$. For the second one, we first concatenate the embeddings using the hyperbolic $\mathrm{Concat}$ layer (7), mapping these two embeddings into one hyperbolic point of size $n$. This point is then moved to ${\mathbb{R}}^{n}$ via the logarithmic map (5), and mapped to a number with a fully connected layer of size $n\times 1$. We test these approaches for different values of $n$ in the $1$shot $20$way setting and report the results in Table 2. We observe that in both cases we get an accuracy boost, but slightly smaller with the $\mathrm{Concat}$ architecture. Value $c=1$ was used for these experiments.
In order to validate if hyperbolic embeddings can further improve models performing on the stateoftheart level, for the next architecture, we choose the prototype network (ProtoNet) introduced in the paper [31] with four convolutional blocks in a backbone. Each convolutional block consists of $3\times 3$ convolutional layer followed by batch normalization, ReLU nonlinearity and $2\times 2$ maxpooling layer. The number of filters in the last convolutional layer corresponds to the value of the embedding dimension, for which we choose $64$. The hyperbolic model differs from the baseline in the following aspects. First, the output of the last convolutional block is embedded into the Poincaré ball of dimension $64$ using the exponential map. In ProtoNet, one uses a socalled prototype representation of a class, which is defined as a mean of the embedded support set of a class. Generalizing this concept to hyperbolic space, we substitute the Euclidean mean operation by $\mathrm{HypAve}$, defined earlier in the Equation (8). We selected $c=0.05$ and trained the models for various fewshot scenarios. The initial value of learning rate equals to ${10}^{3}$ and is multiplied by $0.5$ every $20$ epochs out of total $60$ epochs. Results are presented in Table 3. We can see that in some scenarios, in particular for one–shot learning, hyperbolic embeddings are more beneficial, while in other cases results are slightly worse.
dim  $128$  $256$  $512$ 

Euclidean similarity 
$90.08$  $91.03$  $91.32$ 
Hyperbolic $\mathrm{Concat}$  $93.63$  $94.23$  $\mathbf{94.62}$ 
Hyperbolic distance  $\mathbf{95.54}$  $\mathbf{95.00}$  $19.25$ 
ProtoNet  Hyperbolic ProtoNet  

$1$shot $5$way  $98.2$  $\mathbf{99.0}$ 
$5$shot $5$way  $99.4$  $99.4$ 
$1$shot $20$way  $95.8$  $\mathbf{95.9}$ 
$5$shot $20$way  $\mathbf{98.6}$  $98.15$ 
Dataset  Model  1shot 5way  5shot 5way 

MiniImageNet  MatchNet [39]  $43.56\pm 0.84$  $55.31\pm 0.73$ 
ProtoNet  $48.29\pm 0.19$  $66.11\pm 0.16$  
RelationNet [35]  $50.44\pm 0.82$  $65.32\pm 0.70$  
Hyperbolic ProtoNet  $\mathbf{51.57}\pm \mathbf{0.2}$  $\mathbf{66.27}\pm \mathbf{0.17}$  
CUB  ProtoNet  $54.58\pm 0.24$  $68.04\pm 0.19$ 
Hyperbolic ProtoNet  $\mathbf{60.52}\pm \mathbf{0.25}$  $\mathbf{72.22}\pm \mathbf{0.19}$ 
5.3 MiniImageNet fewshot classification
MiniImageNet dataset is the subset of ImageNet dataset [26], which contains of $100$ classes represented by $600$ examples per class. We use the following split provided in the paper [24]: training dataset consists of $64$ classes, validation dataset is represented by $16$ classes, and the remaining $20$ classes serve as a test dataset.
As a baseline model, we once again consider the ProtoNet model, mentioned earlier in the experiments on Omniglot dataset in Subsection 5.2. In these experiments, we set the value of the embedding dimension to $1024$. We test the models on tasks for oneshot and fiveshot classifications; the number of query points in each batch always equals to $15$. Similarly, for the hyperbolic version of ProtoNet, we have the following experimental setup. We put the value of Poincaré ball radius to $c=0.05$, and consider the following learning rate decay scheme: the initial learning rate equals to ${10}^{3}$ and is further multiplied by $0.2$ every $10$ epochs (out of total $200$ epochs).
Table 4 illustrates the obtained results on MiniImageNet dataset. For MiniImageNet dataset, the results of the other models are available for the same classification tasks (i.e., for oneshot and fiveshot learning). Therefore, we can compare our obtained results to those that were reported in the original papers. From these experimental results, we may observe a slight gain in model accuracy.
5.4 CaltechUCSD Birds fewshot classification
The CUB dataset consists of $11,788$ images of $200$ bird species and was designed for finegrained classification. We use the split introduced in [37]: $100$ classes out of $200$ were used for training, $50$ for validation and $50$ for testing. Also, following [37], we make the same preprocessing step by resizing each image to the size of $64\times 64$.
Likewise, we use ProtoNet mentioned above with the following modifications. Here, we fix the embedding dimension to $512$ and use a slightly different setup for learning rate scheduler: the initial learning rate of value ${10}^{3}$ is multiplied by $0.7$ every $20$ epochs out of total $100$ epochs. Remaining architecture and parameters both in baseline and hyperbolic models are identical to those in the experiments on the MiniImageNet dataset.
Our findings on the experiments on the CUB dataset are summarized in Table 4. Interestingly, for this dataset, the hyperbolic version significantly outperforms its Euclidean counterpart.
5.5 Person reidentification
.5in.5in

Market1501  DukeMTMCreID  

bs  hyp  bs  hyp  
dim 
lr schedule  r1  mAP  r1  mAP  r1  mAP  r1  mAP 
32  sch#1  \ul71.4  \ul49.7  69.8  45.9  56.1  35.6  56.5  34.9 
sch#2  68.0  43.4  75.9  51.9  \ul57.2  \ul35.7  62.2  39.1  
64  sch#1  80.3  60.3  \ul83.1  \ul60.1  69.9  48.5  70.8  48.6 
sch#2  80.5  57.8  84.4  62.7  68.3  45.5  \ul70.7  \ul48.6  
128  sch#1  86.0  67.3  87.8  68.4  \ul74.1  \ul53.3  76.5  55.4 
sch#2  \ul86.5  \ul68.5  86.4  66.2  71.5  51.5  74.0  52.2 
The DukeMTMCreID dataset contains $16,522$ training images of $702$ identities, $2228$ query images of $702$ identities and $17,661$ gallery images. Market1501 contains $12936$ training images of $751$ identities, $3368$ queries of $750$ identities and $15913$ gallery images respectively. We report Rank1 of the Cumulative matching Charcteric Curve and Mean Average Precision for both datasets. We use ResNet50 [10] architecture with one fully connected embedding layer following the global average pooling. Three embedding dimensionalities are used in our experiments: $32$, $64$ and $128$. For the baseline experiments, we add the additional classification linear layer, followed by the crossentropy loss. For the hyperbolic version of the experiments, we map the descriptors to the Poincaré ball and apply multiclass logistic regression as described in Section 4. We found that in both cases the results are very sensitive to the learning rate schedules. We tried four schedules for learning $32$dimensional descriptors for both baseline and hyperbolic versions. Two best performing schedules were applied for the $64$ and $128$dimensional descriptors. In these experiments, we also found that smaller $c$ values give better results. We finally set $c$ to ${10}^{5}$. Therefore, based on the discussion in 4, our hyperbolic setting is quite close to Euclidean. The results are compiled in Table 5. We set starting learning rates to $3\cdot {10}^{4}$ and $6\cdot {10}^{4}$ for $sch\mathrm{\#}1$ and $sch\mathrm{\#}2$ correspondingly and multiply them by $0.1$ after each of the epochs $200$ and $270$. The results are reported after the $300$ training epochs. As we can see in the Table 5, hyperbolic version generally performs better than the baseline, while the gap between the baseline and hyperbolic versions’ results is decreasing for larger dimensionalities.
6 Discussion and conclusion
We have investigated the use of hyperbolic spaces for image embeddings. The models that we have considered use Euclidean operations in most layers, and use the exponential map to move from the Euclidean to hyperbolic spaces at the end of the network (akin to the normalization layers that are used to map from the Euclidean space to Euclidean spheres). The approach that we investigate here is thus compatible with existing backbone networks trained in Euclidean geometry.
At the same time, we have shown that across a number of tasks, in particular in the fewshot image classification, learning hyperbolic embeddings can result in a substantial boost in accuracy. We speculate that the negative curvature of the hyperbolic spaces allows for embeddings that are better conforming to the intrinsic geometry of at least some image manifolds with their hierarchical structure.
Future work may include several potential modifications of the approach. We have observed that the use of hyperbolic embeddings improves performance for some problems and datasets, while not helping others. A better understanding of when and why the use of hyperbolic geometry is justified is therefore needed. Also, we note that while all hyperbolic geometry models are equivalent in the continuous setting, fixedprecision arithmetic used in real computers breaks this equivalence. In practice, we observed that care should be taken about numeric precision effects (following [7], we clip the embeddings to minimize numerical errors during learning). Using other models of hyperbolic geometry may result in more favourable floating point performance.
References
 [1] E. Ahmed, M. J. Jones, and T. K. Marks. An improved deep learning architecture for person reidentification. In Conf. Computer Vision and Pattern Recognition, CVPR, pages 3908–3916, 2015.
 [2] G. Becigneul and O.E. Ganea. Riemannian Adaptive Optimization Methods. In International Conference on Learning Representations, 2019.
 [3] X. Chang, T. M. Hospedales, and T. Xiang. Multilevel factorisation net for person reidentification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2109–2118, 2018.
 [4] W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: a deep quadruplet network for person reidentification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 403–412, 2017.
 [5] B. Dhingra, C. J. Shallue, M. Norouzi, A. M. Dai, and G. E. Dahl. Embedding text in hyperbolic spaces. arXiv preprint arXiv:1806.04313, 2018.
 [6] C. Finn, P. Abbeel, and S. Levine. Modelagnostic metalearning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 1126–1135. JMLR. org, 2017.
 [7] O. Ganea, G. Bécigneul, and T. Hofmann. Hyperbolic neural networks. In Advances in Neural Information Processing Systems, pages 5350–5360, 2018.
 [8] C. Gulcehre, M. Denil, M. Malinowski, A. Razavi, R. Pascanu, K. M. Hermann, P. Battaglia, V. Bapst, D. Raposo, A. Santoro, and N. de Freitas. Hyperbolic attention networks. In International Conference on Learning Representations, 2019.
 [9] Y. Guo and N.M. Cheung. Efficient and deep person reidentification using multilevel similarity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2335–2344, 2018.
 [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [11] M. M. Kalayeh, E. Basaran, M. Gökmen, M. E. Kamasak, and M. Shah. Human semantic parsing for person reidentification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1062–1071, 2018.
 [12] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for oneshot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
 [13] D. Krioukov, F. Papadopoulos, M. Kitsak, A. Vahdat, and M. Boguná. Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106, 2010.
 [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 [15] B. M. Lake, R. R. Salakhutdinov, and J. Tenenbaum. Oneshot learning by inverting a compositional causal process. In Advances in Neural Information Processing Systems, pages 2526–2534, 2013.
 [16] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [17] Y. LeCun et al. Generalization and network design strategies. In Connectionism in perspective, volume 19. Citeseer, 1989.
 [18] N. Linial, A. Magen, and M. E. Saks. Low distortion euclidean embeddings of trees. Israel Journal of Mathematics, 106(1):339–348, 1998.
 [19] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive metalearner. In International Conference on Learning Representations, 2018.
 [20] M. Nickel and D. Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6338–6347, 2017.
 [21] M. Nickel and D. Kiela. Learning continuous hierarchies in the Lorentz model of Hyperbolic geometry. In Proc. ICML, pages 3776–3785, 2018.
 [22] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4004–4012, 2016.
 [23] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.
 [24] S. Ravi and H. Larochelle. Optimization as a model for fewshot learning. 2016.
 [25] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multitarget, multicamera tracking. In European Conference on Computer Vision workshop on Benchmarking MultiTarget Tracking, 2016.
 [26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [27] F. Sala, C. De Sa, A. Gu, and C. Ré. Representation tradeoffs for hyperbolic embeddings. In International Conference on Machine Learning, pages 4457–4466, 2018.
 [28] F. Sala, C. D. Sa, A. Gu, and C. Ré. Representation tradeoffs for hyperbolic embeddings. In Proc. ICML, volume 80 of JMLR Workshop and Conference Proceedings, pages 4457–4466. JMLR.org, 2018.
 [29] R. Sarkar. Low distortion delaunay embedding of trees in hyperbolic plane. In International Symposium on Graph Drawing, pages 355–366. Springer, 2011.
 [30] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
 [31] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for fewshot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
 [32] K. Sohn. Improved deep metric learning with multiclass npair loss objective. In Advances in Neural Information Processing Systems, pages 1857–1865, 2016.
 [33] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian. Posedriven deep convolutional model for person reidentification. In Proceedings of the IEEE International Conference on Computer Vision, pages 3960–3969, 2017.
 [34] Y. Suh, J. Wang, S. Tang, T. Mei, and K. Mu Lee. Partaligned bilinear representations for person reidentification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 402–419, 2018.
 [35] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales. Learning to compare: Relation network for fewshot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018.
 [36] A. Tifrea, G. Bécigneul, and O.E. Ganea. Poincaré glove: Hyperbolic word embeddings. arXiv preprint arXiv:1810.06546, 2018.
 [37] E. Triantafillou, R. Zemel, and R. Urtasun. Fewshot learning through an information retrieval lens. In Advances in Neural Information Processing Systems, pages 2255–2265, 2017.
 [38] E. Ustinova and V. Lempitsky. Learning deep embeddings with histogram loss. In Advances in Neural Information Processing Systems, pages 4170–4178, 2016.
 [39] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
 [40] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The CaltechUCSD Birds2002011 dataset. 2011.
 [41] Y. Wang, Z. Chen, F. Wu, and G. Wang. Person reidentification with cascaded pairwise convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1470–1478, 2018.
 [42] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
 [43] C.Y. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl. Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2840–2848, 2017.
 [44] D. Yi, Z. Lei, and S. Z. Li. Deep metric learning for practical person reidentification. arXiv prepzrint arXiv:1407.4979, 2014.
 [45] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang. Spindle net: Person reidentification with human body region guided feature decomposition and fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1077–1085, 2017.
 [46] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person reidentification: A benchmark. In Computer Vision, IEEE International Conference on, 2015.
 [47] Z. Zheng, L. Zheng, and Y. Yang. Unlabeled samples generated by gan improve the person reidentification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision, 2017.