Abstract
Recent literature has shown several benefits of hyperbolic embedding ofgraphstructured data (GSD) in representing their structures and latentrelations. While several studies have explored the ability of hyperbolicembedding to represent data (for example, by quantifying their mean averageprecision) and their ability to produce better visualisations of clusters, onlyfew works exploited the effectiveness of hyperbolic embedding to performlearning on the initial GSD. Motivated by innovative ideas from the fields of Brain computer interfacesand Radar processing, this paper presents a new scheme for learning GSD basedon hyperbolic embedding, Riemannian barycentre (i.e. Fr\'echet or geometricmean) and $K$means algorithms as a significant tool that derives from it. Themain idea is as follows. Relying on the Riemannian barycentre, we define anotion of minimal variance which allows us to choose an embedding betweendifferent ones. This embedding is used thereafter together with $K$meansalgorithms to perform unsupervised clustering and in combination with thenearest neighbour rule to perform supervised learning. We demonstrate theperformance of the proposed framework through several experiments on realworldsocial networks and hierarchical GSD. The obtained results outperform theircounterparts in highdimensional Euclidean spaces and recent proposed geometricapproaches.
Quick Read (beta)
Learning graphstructured data using Poincaré embeddings and Riemannian Kmeans algorithms
Abstract
Recent literature has shown several benefits of hyperbolic embedding of graphstructured data (GSD) in representing their structures and latent relations. While several studies have explored the ability of hyperbolic embedding to represent data (for example, by quantifying their mean average precision) and their ability to produce better visualisations of clusters, only few works exploited the effectiveness of hyperbolic embedding to perform learning on the initial GSD.
Motivated by innovative ideas from the fields of Brain computer interfaces and Radar processing, this paper presents a new scheme for learning GSD based on hyperbolic embedding, Riemannian barycentre (i.e. Fréchet or geometric mean) and $K$means algorithms as a significant tool that derives from it. The main idea is as follows. Relying on the Riemannian barycentre, we define a notion of minimal variance which allows us to choose an embedding between different ones. This embedding is used thereafter together with $K$means algorithms to perform unsupervised clustering and in combination with the nearest neighbour rule to perform supervised learning. We demonstrate the performance of the proposed framework through several experiments on realworld social networks and hierarchical GSD. The obtained results outperform their counterparts in highdimensional Euclidean spaces and recent proposed geometric approaches.
Keywords: Poincaré embeddings, Riemannian barycentre, Hyperbolic $K$means clustering.
1 Introduction
In recent years, the idea of embedding data in new spaces has proven effective in many applications. Indeed, several techniques have become very popular because of their great ability to represent data while reducing the complexity and dimensionality of the space. For instance, Word2vec [25] and Glove [28] are widely used tools in natural language processing, Nod2vec [19], Graph2vec [26] and DeepWalk [29] are commonly used for community detection, link prediction and node classification in social networks [15].
A major achievement in recent years has been the discovery of hyperbolic embeddings [27, 13, 31]. Although it has been speculated since several years that hyperbolic spaces would better represent GSD than Euclidean spaces [18, 22, 11, 2], it is only recently that these speculations have been proven effective through concrete studies and applications [27, 13, 31]. As outlined by [27], Euclidean embeddings require large dimensions to capture certain complex relations such as the Wordnet noun hierarchy [16]. On the other hand, this complexity can be captured by a simple model of hyperbolic geometry such as the Poincaré disk of two dimensions [31]. Hyperbolic embeddings also provide better visualisation of clusters on graphs than Euclidean embeddings [13].
The present paper is concerned with learning GSD represented by an adjacency matrix. Examples of these GSD include social networks, hierarchical lexical databases such as Wordnet and Lexical entailments datasets such as Hyperlex [27, 13, 31]. In the state of the art, one can distinguish two different approaches for clustering this type of data. The first one applies pure clustering techniques on graphs such as spectral clustering algorithms [32], power iteration clustering [24] and label propagation [38]. The second one is twostep and may be called Euclidean clustering after (Euclidean) embedding. First it embedds data in Euclidean spaces using techniques such as Nod2vec, Graph2vec and DeepWalk and then applies traditional clustering techniques such as $K$means algorithms. This approach appeared notably in [37, 33, 35]. Our main objective throughout the paper is to present the hyperbolic counterpart of this approach.
The motivation for this paper comes from recent works in Brain computer interfaces [10] and Radar processing [8]. In particular [10] used the distance to the Riemannian barycentre on the space of covariance matrices to classify brain computer signals. [8] used the Riemannian median on the Poincaré disk to detect outliers in Radar data based on a similar idea. In this paper, we apply for the first time (to the best of our knowledge) these techniques in link with hyperbolic embeddings of GSD. Our main contributions can be summarised as follows:
Unsupervised learning on GSD: We consider the task of clustering nodes on a GSD with a known number of classes denoted $K$. For this, we first generate several hyperbolic embeddings on the Poincaré disk $$ following [27]. For each embedding, we run a Riemannian $K$means algorithm. Finally we keep the embedding with minimal total variance, a notion which we introduce. This procedure is evaluated on $5$ realdata social networks and compared with its analog on the Euclidean space with $10$ dimensions and with the recent clustering method proposed in [1]. Experiments show that our method outperforms these two approaches.
As another application, we considered the task of clustering a typical example of hierarchical GSD which is a subtree of Wordnet. We focused on the representation of clusters and the interpretation of their contents.
Supervised learning on GSD: We adapted the previous idea to the context of supervised learning by keeping the embedding which has minimal total variance on the training dataset. Non labelled nodes are then classified according to the nearest barycentre rule. This approach is compared with the recent hyperbolic supervised approach [14] based on support vector machines. Experiments show the advantage of our method over that of [14].
The rest of the paper is organised as follows. Section 2 reviews necessary optimisation tools on the Poincaré disk as a Riemannian manifold of negative curvature. In particular, we discuss existence, uniqueness and numerical computations of the Riemannian barycentre and then deduce a Riemannian $K$means clustering algorithm on this space. Using barycentres to unroll $K$means algorithms have had several applications including object detection and tracking, shape classification, and image segmentation [34, 23, 20, 17, 9]. In these works the reference space is the manifold of covariance matrices. Here, we adapt this idea to the Poincaré space. Section 3 reviews the Poincaré embedding as introduced in [27] and presents our approach to learn GSD. Finally Section 4 provides experimental results and a comparison with the state of the art^{1}^{1} 1 The package generating these results will be made public in the near future..
2 Statistical learning on the Poincaré disk
This section reviews the Riemannian geometry of $\mathbb{D}$, discusses the existence and uniqueness of the Riemannian barycentre, focuses on its numerical computations and as a consequence derives a Riemannian $K$means algorithm on this space.
The Poincaré disk is commonly equipped with the Riemannian metric known as the Poincaré metric and expressed as
$${g}_{p}(X,Y)=\frac{\u27e8X,Y\u27e9}{{(1{p}^{2})}^{2}}$$  (1) 
where $p\in \mathbb{D},X,Y\in {\mathbb{R}}^{2}$ and $\u27e8X,Y\u27e9$ is the scalar product on ${\mathbb{R}}^{2}$. The Riemannian metric (1) induces a Riemannian distance between ${z}_{0}$ and ${z}_{1}$ given as follows:
$$d({z}_{0},{z}_{1})=\frac{1}{2}\mathrm{log}\left(\left(1+\left\frac{{z}_{1}{z}_{0}}{1\overline{{z}_{0}}{z}_{1}}\right\right){\left(1\left\frac{{z}_{1}{z}_{0}}{1\overline{{z}_{0}}{z}_{1}}\right\right)}^{1}\right)$$ 
This distance can also be expressed as $\frac{1}{2}\text{arcosh}\left(1+\frac{{{z}_{1}{z}_{0}}^{2}}{(1{{z}_{0}}^{2})(1{{z}_{1}}^{2})}\right)$ which is half the distance used in [27].
The Poincaré metric (1) turns $\mathbb{D}$ into a Riemannian manifold of negative sectional curvature [21, 36]. As a result, $\mathbb{D}$ enjoys the property of existence and uniqueness of the Riemannian barycentre [3]. More precisely, for every set of points $\{{z}_{i},1\le i\le n\}$ in $\mathbb{D}$, the empirical Riemmanian barycentre ${\widehat{z}}_{n}$
$${\widehat{z}}_{n}={\text{argmin}}_{z\in \mathbb{D}}\left(\sum _{i=1}^{n}{d}^{2}(z,{z}_{i})\right)$$  (2) 
exists, is unique and belongs to $\mathbb{D}$. Several stochastic gradient algorithms can be applied to numerically approximate ${\widehat{z}}_{n}$ [12, 5, 7, 6]. In this paper, we will use the algorithm of [8] which has proven effective for Radar applications. For this, we first recall the definitions of the Riemannian exponential and logarithmic maps.
For given $z\in \mathbb{D}$ and $v\in {\mathbb{R}}^{2}$ the exponential map is
$${\text{Exp}}_{z}(v)=\frac{(z+{e}^{i\theta}){e}^{2v}+(z{e}^{i\theta})}{(1+\overline{z}{e}^{i\theta}){e}^{2v}+(1\overline{z}{e}^{i\theta})}$$ 
with
$$\theta =\mathrm{arg}(v),v=\frac{v}{(1{z}^{2})}$$ 
The exponential map ${\text{Exp}}_{z}$ is a diffeomorphism from ${\mathbb{R}}^{2}$ to $\mathbb{D}$. Its inverse, called the logarithmic map and denoted by ${\text{log}}_{z}$ is given by
$${\text{Log}}_{z}(y)=(1{z}^{2})\text{atanh}\mathit{\hspace{1em}}\left(\left\frac{yz}{1\overline{z}y}\right\right){e}^{i\theta}$$ 
with $\theta =\mathrm{arg}\left(\frac{yz}{1\overline{z}y}\right)$ and atanh is the inverse of the hyperbolic tangent. For more details on the previous formulas, we refer to [4, 36]. For numerical computation of the barycentre, we will use Algorithm 2 below from [8].
[!ht]
Inputs
$Z=\{{z}_{i},1\le i\le n\}$: a subset of $\mathbb{D}$ (complex), ${z}_{init}$: barycentre initialisation (complex), ${\tau}_{b}$: step size (strictly positive float)
Output $\widehat{z}$: numerical approximation of the Riemannian barycentre (complex)
[1] \State$\widehat{z}\leftarrow {z}_{init}$ \Commentrandom or a trickier initialisation (e.g. , average mean)
$d=1{e}^{6}$ \Commenta significantly large number
\Repeat\State $\mu \leftarrow \frac{2}{n}{\sum}_{i=1}^{n}{\text{Log}}_{\widehat{z}}({z}_{i})$, $\widehat{z}\leftarrow {\text{Exp}}_{\widehat{z}}({\tau}_{b}\mu )$, $d\leftarrow \sqrt{\frac{{\mu }^{2}}{{(1{\widehat{z}}^{2})}^{2}}}$
\Until$d$ is small
\Return$\widehat{z}$
A direct consequence of existence and uniqueness of the Riemannian barycentre is $K$means algorithm illustrated in Algorithm 2. In the description we use the word centroid to denote Riemannian barycentre.
[!ht]
Inputs $K$: number of clusters (integer), $Z$: set of $n$ complex numbers that are a subset of $\mathbb{D}$ (complex),${\tau}_{b}$: barycentre approximation step size (strictly positive float)
Output
$N$: set of $K$ centroids (complex), $labels$: $n$ labels of the input data (table of integers)
\[email protected]@algorithmic[1]
\StateInitialize $K$ centroids, $N=\{{\nu}_{1},{\nu}_{2},\mathrm{\dots},{\nu}_{K}\}$ randomly in $\mathbb{D}$
\Repeat\For${z}_{i}\in Z$
\State ${c}_{i}\leftarrow argmi{n}_{j\in \{1,\mathrm{\dots},K\}}d({z}_{i},{\nu}_{j})$ \Comment$d$ is the Riemannian distance
\EndFor\For$j\in \{1,\mathrm{\dots},K\}$
\State ${\nu}_{j}\leftarrow \mathrm{\U0001d681\U0001d692\U0001d68e\U0001d696\U0001d696\U0001d68a\U0001d697\U0001d692\U0001d68a\U0001d697}\mathrm{\_}\mathrm{\U0001d671\U0001d68a\U0001d69b\U0001d6a2\U0001d68c\U0001d68e\U0001d697\U0001d69d\U0001d68e\U0001d69b}(\{{z}_{i}{c}_{i}=j\},\mathrm{\U0001d681\U0001d68a\U0001d697\U0001d68d\U0001d698\U0001d696},{\tau}_{b}))$
\EndFor\Untilconvergence \Return$N$, $labels=\{{c}_{i}\}$
3 Learning GSD using Poincaré embeddings and Riemannian optimisation
This section starts by reviewing the approach of [27] to embed GSD in the Poincaré disk. Based on this embedding and $K$means algorithm introduced in the previous section, we present our algorithm to perform supervised and unsupervised clustering tasks.
Given a GSD $(\mathcal{V},\mathcal{E})$, where $\mathcal{V}$ and and
$\mathcal{E}$ are the nodes and edges datasets, the probability of an edge $e=\{u,v\}$ is modelled in [27] using the FermiDirac distribution
$$P(uv,\mathrm{\Theta})=\frac{1}{{e}^{d(u,v)}+1}$$ 
To embed $\mathcal{V}$ in $\mathbb{D}$, [27] learns a map $u\in \mathcal{V}\mapsto {\theta}_{u}\in \mathbb{D}$ by minimising
$$\sum _{u\in \mathcal{V}}\sum _{v\in \mathcal{N}(u)}\mathrm{log}(P(uv,\mathrm{\Theta})),\mathrm{\Theta}={\{{\theta}_{u}\}}_{u\in \mathcal{V}}$$  (3) 
where $\mathcal{N}(u)$ is the set of neighbours of $u$. Following [25], (3) is optimised by selecting small number of negative samples according to a priori distribution ${P}_{\mathcal{N}}$. Taking this into account, the objective function (3) can be written as
$$\mathcal{L}(\mathrm{\Theta})=\mathrm{log}(\sigma (d(u,v)))+\sum _{i=1}^{M}{\mathbb{E}}_{{P}_{\mathcal{N}}}[\mathrm{log}(\sigma (d({u}_{i},v)))]$$  (4) 
with $\sigma (x)={(1+{e}^{x})}^{1}$ the softmax function ^{2}^{2} 2 Since the Riemannian gradient of ${d}^{p}$ is given by $${\nabla}_{\theta}{d}^{p}(\theta ,{\theta}^{\prime})=p*{d}^{p1}(\theta ,{\theta}^{\prime})*\frac{lo{g}_{\theta}({\theta}^{\prime})}{d(\theta ,{\theta}^{\prime})}$$ (see [8]), we actually used ${d}^{2}$ instead of $d$ to avoid division by $d(\theta ,{\theta}^{\prime})$ and get better stability. We also notice that [27] proposes to maximise (4) for social networks GSD and to maximise $\mathcal{L}(\mathrm{\Theta})=\mathrm{log}\left(\frac{{e}^{d(u,v)}}{{\sum}_{{v}^{\prime}\in \mathcal{N}(u)}{e}^{d(u,{v}^{\prime})}}\right)$ for hierarchical GSD. We found that with the latter loss function, the embedded nodes quickly approach the boundary of the disk and that using (4) also for hierarchical GSD gives us more stable results. . In practice, $\mathcal{L}(\mathrm{\Theta})$ is optimised by generating nodes on the graph using DeepWalk [29] and then sampling from these nodes using the unigram distribution raised to $3/4$ [25].
Given a cluster $\mathcal{C}=\{{z}_{i},1\le i\le n\}$ in $\mathbb{D}$, with barycentre $\widehat{z}$, we define its variance as
$$\text{Var}(\mathcal{C})=\frac{1}{n}\sum _{i=1}^{n}{d}^{2}(\widehat{z},{z}_{i})$$ 
This definition is in accordance with the use of ${d}^{2}$ mentioned in the footnote. The variance will be used to choose one embedding between several ones. In fact, it is reasonable to give more confidence on the embedding for which the variance is minimal, that is points are more concentrated around barycentres. This idea will be justified empirically in the next section.
Finally, based on the Riemannian barycentre, Algorithm 3 below presents our scheme to perform unsupervised clustering. The main idea is to embed GSD in the Poincaré disk, form clusters using the ground truth data and lastly associate points to clusters according to the nearest barycentre rule.
In what follows, the Poincare_Embedding function, given an adjacency matrix of an input GSD, minimises (4) and outputs the embedding of every node on $\mathbb{D}$.
[!ht]
Inputs $\mathcal{A}$: adjacency matrix of a GSD with $0,1$ entries, ${P}_{embedding}$: an object containing the embedding parameters, $K$: number of classes,
${P}_{K\mathrm{\_}Means}$: an object containing the $K$means algorithm parameters, $NE$: number of experiments
Outputs $Best\mathrm{\_}Embedding$: embedding of input graph with minimal total variance, $N$: barycentres of each cluster, $labels$: cluster labels for each node
[1] \Repeat\CommentExperimentally each execution is run in parallel \State $Embeddin{g}_{i}\leftarrow \mathrm{\U0001d67f\U0001d698\U0001d692\U0001d697\U0001d68c\U0001d68a\U0001d69b\U0001d68e}\mathrm{\_}\mathrm{\U0001d674\U0001d696\U0001d68b\U0001d68e\U0001d68d\U0001d68d\U0001d692\U0001d697\U0001d690}(\mathcal{A},{P}_{embedding})$ \State ${N}_{i},label{s}_{i}\leftarrow \U0001d67a\mathrm{\_}\mathrm{\U0001d67c\U0001d68e\U0001d68a\U0001d697\U0001d69c}(K,Embeddin{g}_{i},{P}_{K\mathrm{\_}Means})$ \State $Clusters\leftarrow \{{c}_{1},\mathrm{\dots},{c}_{K}{c}_{j}=\{{z}_{q}\in Embeddin{g}_{i}label{s}_{i}[q]=j\}\}$ \State $va{r}_{i}=ma{x}_{j\in \{1,\mathrm{\dots},K\}}(Var({c}_{j}))\text{333 Experimentally we chose the second largest variance, because the cluster with maximum variance is often dispersed and does not reflect the quality of the embedding. This may be relaxed by taking higher dimensional hyperbolic balls.}$ \Until$NE$ embeddings have been computed \State
$$Best\mathrm{\_}Embedding\leftarrow Embeddin{g}_{best}best=argmi{n}_{i\in \{1,\mathrm{\dots},NE\}}(va{r}_{i})$$ 
$Best\mathrm{\_}Embedding,{N}_{best},label{s}_{best}$
Algorithm 3 below presents our scheme to perform supervised clustering. Details regarding the implementation aspects together with experimental results of Algorithms 3 and 3 will be provided and discussed in the next section.
Inputs $\mathcal{A}$: adjacency matrix of a GSD with $0,1$ entries, ${P}_{embedding}$: an object containing the embedding parameters (object), $Ground\mathrm{\_}Truth$: The ground truth of each node in the GSD (table of integers), ${\tau}_{b}$: barycentre approximation step size (strictly positive float)
Output $Class$: computed cluster of each node used for training (integer)
\[email protected]@algorithmic[1]
\State $Embedding\leftarrow \mathrm{\U0001d67f\U0001d698\U0001d692\U0001d697\U0001d68c\U0001d68a\U0001d69b\U0001d68e}\mathrm{\_}\mathrm{\U0001d674\U0001d696\U0001d68b\U0001d68e\U0001d68d\U0001d68d\U0001d692\U0001d697\U0001d690}(\mathcal{A},{P}_{embedding})$
\State $Train\mathrm{\_}Data,Test\mathrm{\_}Data\leftarrow \mathrm{\U0001d69c\U0001d699\U0001d695\U0001d692\U0001d69d}(Embedding)$
\State\CommentEmbedded nodes are divided into two parts
\State $\{{c}_{1},\mathrm{\dots},{c}_{K}\}\leftarrow $ Compute_Clusters$\mathrm{(}\mathrm{T}\mathtt{}\mathrm{r}\mathtt{}\mathrm{a}\mathtt{}\mathrm{i}\mathtt{}\mathrm{n}\mathtt{}\mathrm{\_}\mathtt{}\mathrm{D}\mathtt{}\mathrm{a}\mathtt{}\mathrm{t}\mathtt{}\mathrm{a}\mathrm{,}\mathrm{G}\mathtt{}\mathrm{r}\mathtt{}\mathrm{o}\mathtt{}\mathrm{u}\mathtt{}\mathrm{n}\mathtt{}\mathrm{d}\mathtt{}\mathrm{\_}\mathtt{}\mathrm{T}\mathtt{}\mathrm{r}\mathtt{}\mathrm{u}\mathtt{}\mathrm{t}\mathtt{}\mathrm{h}\mathrm{)}$
\For$q\in \{1,\mathrm{\dots},K\}$
\State ${\nu}_{q}\leftarrow \mathrm{\U0001d681\U0001d692\U0001d68e\U0001d696\U0001d68a\U0001d697\U0001d697\U0001d692\U0001d68a\U0001d697}\mathrm{\_}\mathrm{\U0001d671\U0001d68a\U0001d69b\U0001d6a2\U0001d68c\U0001d68e\U0001d697\U0001d69d\U0001d69b\U0001d68e}({c}_{q},\text{\U0001d681\U0001d68a\U0001d697\U0001d68d\U0001d698\U0001d696},{\tau}_{b})$
\EndFor\For$u\in Test\mathrm{\_}Data$
\State $Class(u)\leftarrow argmi{n}_{p\in \{1,\mathrm{\dots},K\}}(d({\nu}_{p},u))$ \Comment$d$ is the Riemannian distance
\EndFor\State\Return$Class$
\State\CommentPerformances are then obtained by comparing the ground truth with the computed centroids of the $Test\mathrm{\_}Data$.
4 Experimental results
The algorithms from the previous sections are implemented as a package that does the following: given the adjacency matrix of a binary graph as input, it performs embedding over the Poincaré disk, applies Riemannian $K$means clustering and provides visualisation of the computed clusters.
The package is implemented in Python and makes use of multiprocessing to run a number of experiments in parallel. All computations are performed on a machine using four cores equipped with an Intel Core i5 running at a 2.71 GHz frequency. The threshold ${\tau}_{b}$ is set to $0.005$ for all experiments. The datasets used in the paper are given in Table 1 with their number of nodes (Nodes) and their number of edges (Edges) [30].
Dataset  Nodes  Edges  $K$ 

Karate  34  77  2 
Polblogs  1224  16781  2 
Polbooks  105  441  3 
Football  115  613  12 
Adjnoun  112  425  2 
Mammals subtree  1179  6541  NA 
4.1 Unsupervised clustering
Social Networks. In this part, we are interested in applying the previous algorithms to the datasets presented above.
Comparison criterion.
For each dataset 10 experiments are performed. Each experiment is conducted in two steps. In the first step, we generate a Poincaré and an Euclidean embedding which uses DeepWalk [29]. An intermediate step of the latter algorithm is the generation of random walks that captures the structure of the graph. The same random walks are used for both embeddings. In the second step, we apply Riemannian (resp. Euclidean) $K$means algorithm over the embedding. For the Euclidean embedding, we set the space dimension to 10 and use the Euclidean distance and barycentre. For each dataset, we choose the embedding having the second greatest variance as given in Algorithm 3 as the best one (in the Riemannian and Euclidean case). Finally, for each dataset, we computed the mean average performance of the $10$ embeddings (Riemannian and Euclidean). The results are presented in Table 2 with the following abbreviations:
PBPE: Performance of the best Poincaré embedding
PBEE: Performance of the best Euclidean embedding
APPE: Average performance of the $10$ Poincaré embeddings
APEE: Average performance of the $10$ Euclidean embeddings
Dataset  PBPE  PBEE  APPE  APEE  [1] 

Karate  91.2%  70.6%  91.4%  65.8%   
Polblogs  92.8%  51.9%  92.5%  53.5%   
Polbooks  84.8%  77.1%  80%  62%  75% 
Football  87%  67.8%  69.4%  56.8%  77% 
Adjnoun  51.8%  51.8 %  52.5%  51.6%  51% 
Performance increase with the number of experiments.
In order to justify our choice of the best embedding as having the minimal total variance, we plot for the Football dataset the evolution of the PBPE with respect to the number of experiments NE. The obtained plot (Figure 2), shows indeed that the PBPE increases as NE grows from 1 to 10.
Hierarchical GSD. In this part, we are interested in applications over hierarchical GSD. We consider an example from Wordnet which is the mammals subtree. Figure 3 illustrates the obtained clusters with $K=6$. The barycentres are represented with squares. Figure 4 then shows explicitly some of the nodes labels, chosen randomly. A focus is given for nodes near barycentres on one hand and at boundaries between distinct clusters on the other hand.
Notice that the obtained clusters discern between different types of mammals. For example the blue cluster contains mostly canine mammals while the orange one contains mostly larger mammals (lion, tigress and so on…).
4.2 Supervised learning
In this section, we exploit Riemannian barycentres to perform supervised clustering. Each dataset from the previous list is divided into five parts with (almost) equal sizes. The ground truth of $4/5$ the data will be used to define the clusters and the remaining $1/5$ will serve for testing in a crossvalidation fashion. Following Algorithm 3, each element from the test dataset is assigned to one cluster according to the nearest neighbor barycentre rule. This experiment is performed 10 times. Finally the performance of this method is evaluated using ground truth of the test data. Table 3 presents the obtained results with the following abbreviation:
CVMAP: Average performance obtained using crossvalidation over the number of performed experiments.
Dataset  CVMAP  [14] 

Karate  93.9%  86% 
Polblogs  92.4%  93% 
Polbooks  83.3%  73% 
Football  77.9 %  24% 
Adjnoun  57.8 %   
4.3 Discussion and comparison with the state of the art
First we point out that using the notion of minimal variance it is always possible to generate more embeddings than what we did previously (10 embeddings) and it is also possible to increase the dimension of the Poincaré ball as in [27]. This reasoning takes advantage of the nice properties of hyperbolic manifolds and is completely unsupervised in the sense that it does not require any ground truth. Thus, improvements in the results given above remain possible in both supervised and unsupervised settings.
Unsupervised clustering. In this paragraph, we comment on results of Table 2. First the advantages of Poincaré clustering over Euclidean clustering are straightforward and confirm that hyperbolic spaces represent GSD more suitably than Euclidean spaces. Regarding Poincaré clustering, notice that the best embedding as defined before is not necessarily the one with the highest clustering performance since in two situations PBPE$$APPE. However in both cases, the gaps are slight and do not exceed $0.7\%$. The advantage of PBPE is more visible for the Football dataset where it largely exceeds APPE (with more than $17\%$) and for the Polbooks dataset where it exceeds APPE with $4.8\%$. Our results outperform that of [1] whose authors proposed an embedding on generalised surfaces and considered the datasets Polbooks, Football, Adjnoun for testing. They obtained approximate successes rates of $75,77,51\%$ respectively. The improvements for these datasets using our proposed scheme are significant: $\mathbf{\text{9.8}},\text{\U0001d7cf\U0001d7ce}$ and $\mathbf{\text{0.8}}\%$.
Supervised clustering. In this paragraph, we compare our results with that of [14] in which the authors used a generalisation of the SVM method to the Poincaré disk. [14] considered the datasets Karate, Polbooks, Football, Polblogs and reported mean approximate successes of $86,73,24,93\%$ respectively over $5$ crossvalidation trials over $5$ different embeddings using the embedding of [13]. We obtained significant improvements for the datasets Karate, Polbooks, Football of 7.9, 10.3, and $\mathbf{\text{53.9}}\%$ with a slight gap of $0.6\%$ for the Polblogs dataset.
4.4 Multidimensional embedding and clustering
In a multidimensional setting, we considered $K$means clustering after Poincaré embeddings in the hyperbolic space given by the product $\mathcal{P}=\mathbb{D}\times \mathrm{\cdots}\times \mathbb{D}$ of $n$ Poincaré disks. This space is equipped with the product metric. The current publicly available Python implementation provides this multidimensional setting. However, experimentally, we did not observe a significant increase in performances for the values of $n$ ranging till $10$ for the datasets used in this paper. This may be explained by the fact that these datasets are not very large. In future work, we aim to experiment with larger datasets while increasing the dimension of the hyperbolic space in order to better study the scalability of our approach.
5 Conclusion
Several recent studies [27, 13, 14, 31] have concluded that hyperbolic spaces, even in small dimensions, are more suitable embedding spaces than Euclidean spaces for representing a GSD. In this paper, we used Poincaré embeddings [25] to propose a new method for clustering GSD data. This method is based on Riemannian $K$means algorithms and a notion of minimal variance that allowed us to choose one embedding among several ones.
The proposed method has been tested on several datasets and has shown improvements in the stateoftheart for both supervised and unsupervised clustering tasks. In particular, these performances were achieved at minimal cost:

•
We used the Poincaré disk of only $2$ dimensions.

•
We got visualisation with high level representation of clusters on graphs.
Our results on clustering with the DeepWalk Euclidean embedding, suggest that getting good performances with this approach or other varieties of it such as Graph2vec [26], Nod2vec [19] would need very high dimensional Euclidean representation spaces.
Acknowledgement. The first author is grateful to Jeanine Harb for drawing his attention to the paper [27] during the machine learning seminar at IRT SystemX.
References
 [1] Aalto, M., Verma, N.: Metric learning on manifolds. CoRR https://arxiv.org/abs/1902.01738 (2018)
 [2] Adcock, A.B., Sullivan, B.D., Mahoney, M.W.: Treelike structure in large social and information networks. In: 2013 IEEE 13th International Conference on Data Mining. pp. 1–10 (Dec 2013)
 [3] Afsari, B.: Riemannian ${L}^{p}$ center of mass: existence, uniqueness and convexity. Proc. Amer. Math. Soc. 139(2), 655–6673
 [4] Anderson, J.W.: Hyperbolic geometry; 2nd ed. Springer undergraduate mathematics series, Springer, Berlin (2005), https://cds.cern.ch/record/1164418
 [5] Arnaudon, M., Dombry, C., Phan, A., Yang, L.: Stochastic algorithms for computing means of probability measures. Stoch. Proc. Appl. 58(9), 1473–1455 (2012)
 [6] Arnaudon, M., Miclo, L.: Means in complete manifolds : uniqueness and approximation. ESAIM Probability and statistics 18, 185–206 (2014)
 [7] Arnaudon, M., Yang, L., Barbaresco, F.: Stochastic algorithms for computing pmeans of probability measures, geometry of Radar Toeplitz covariance matrices and applications to HR Doppler processing. In: International Radar Symposium (IRS). pp. 651–656 (2011)
 [8] Arnaudon, M., Barbaresco, F., Yang, L.: Riemannian medians and means with applications to radar signal processing. J. Sel. Topics Signal Processing 7(4), 595–604 (2013)
 [9] Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with bregman divergences. J. Mach. Learn. Res. 6, 1705–1749 (Dec 2005), http://dl.acm.org/citation.cfm?id=1046920.1194902
 [10] Barachant, A., Bonnet, S., Congedo, M., Jutten, C.: Multiclass braincomputer interface classification by riemannian geometry. IEEE Trans. Biomed. Engineering 59(4), 920–928 (2012)
 [11] Boguñá, M., Papadopoulos, F., Krioukov, D.: Sustaining the Internet with Hyperbolic Mapping. Nature Communications 1(62) (Oct 2010)
 [12] Bonnabel, S.: Stochastic gradient descent on Riemannian manifolds. IEEE Trans. Autom. Control. 122(4), 2217–2229 (2013)
 [13] Chamberlain, B.P., Clough, J., Deisenroth, M.P.: Neural embeddings of graphs in hyperbolic space. 13th international workshop on mining and learning with graphs. Available at https://arxiv.org/abs/1705.10359 (2017)
 [14] Cho, H., Demeo, B., Peng, J., Berger, B.: Largemargin classification in hyperbolic space. CoRR abs/1806.00437 (2018)
 [15] Cui, P., Wang, X., Pei, J., Zhu, W.: A survey on network embedding. CoRR abs/1711.08752 (2017)
 [16] Fellbaum, C. (ed.): WordNet: an electronic lexical database. MIT Press (1998)
 [17] Goh, A., Vidal, R.: Clustering and dimensionality reduction on riemannian manifolds. In: CVPR. IEEE Computer Society (2008)
 [18] Gromov, M.: Hyperbolic Groups, pp. 75–263. Springer New York, New York, NY (1987)
 [19] Grover, A., Leskovec, J.: node2vec: Scalable feature learning for networks. KDD : proceedings. International Conference on Knowledge Discovery & Data Mining 2016, 855–864 (2016)
 [20] Gu, X., Deng, J.D., Purvis, M.K.: Improving superpixelbased image segmentation by incorporating color covariance matrix manifolds. In: ICIP. pp. 4403–4406. IEEE (2014)
 [21] Helgason, S.: Differential geometry, Lie groups, and symmetric spaces. American Mathematical Society (2001)
 [22] Krioukov, D., Papadopoulos, F., Kitsak, M., Vahdat, A., Boguñá, M.: Hyperbolic geometry of complex networks. Phys. Rev. E 82, 036106 (Sep 2010)
 [23] Kurtek, S., Klassen, E., Ding, Z., Jacobson, S., Jacobson, J.B., Avison, M., Srivastava, A.: Parameterizationinvariant shape comparisons of anatomical surfaces. IEEE Trans. Med. Imaging 30(3), 849–858 (2011)
 [24] Lin, F., Cohen, W.W.: Power iteration clustering. In: ICML (2010)
 [25] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26, pp. 3111–3119. Curran Associates, Inc. (2013), http://papers.nips.cc/paper/5021distributedrepresentationsofwordsandphrasesandtheircompositionality.pdf
 [26] Narayanan, A., Chandramohan, M., Venkatesan, R., Chen, L., Liu, Y., Jaiswal, S.: graph2vec: Learning distributed representations of graphs. CoRR abs/1707.05005 (2017)
 [27] Nickel, M., Kiela, D.: Poincaré embeddings for learning hierarchical representations. In: Advances in Neural Information Processing Systems 30, pp. 6338–6347. Curran Associates, Inc. (2017)
 [28] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: In EMNLP (2014)
 [29] Perozzi, B., AlRfou, R., Skiena, S.: Deepwalk: Online learning of social representations. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 701–710. KDD ’14, ACM, New York, NY, USA (2014). https://doi.org/10.1145/2623330.2623732, http://doi.acm.org/10.1145/2623330.2623732
 [30] Rossi, R.A., Ahmed, N.K.: The network data repository with interactive graph analytics and visualization. In: Proceedings of the TwentyNinth AAAI Conference on Artificial Intelligence (2015), http://networkrepository.com
 [31] Sala, F., Sa, C.D., Gu, A., Ré, C.: Representation tradeoffs for hyperbolic embeddings. In: Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 1015, 2018. pp. 4457–4466 (2018), http://proceedings.mlr.press/v80/sala18a.html
 [32] Spielmat, D.A.: Spectral partitioning works: Planar graphs and finite element meshes. In: Proceedings of the 37th Annual Symposium on Foundations of Computer Science. pp. 96–. FOCS ’96, IEEE Computer Society, Washington, DC, USA (1996), http://dl.acm.org/citation.cfm?id=874062.875505
 [33] Tu, C., Wang, H., Zeng, X., Liu, Z., Sun, M.: Communityenhanced network representation learning for network analysis. CoRR abs/1611.06645 (2016), http://arxiv.org/abs/1611.06645
 [34] Tuzel, O., Porikli, F., Meer, P.: Pedestrian detection via classification on riemannian manifolds. IEEE Trans. Pattern Anal. Mach. Intell. 30(10), 1713–1727 (2008), http://dblp.unitrier.de/db/journals/pami/pami30.html#TuzelPM08
 [35] Wang, D., Cui, P., Zhu, W.: Structural deep network embedding. In: Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1225–1234. KDD ’16, ACM, New York, NY, USA (2016). https://doi.org/10.1145/2939672.2939753, http://doi.acm.org/10.1145/2939672.2939753
 [36] Yang, L.: Médianes de mesures de probabilité dans les variétés riemanniennes et applications à la détection de cibles radar. PhD thesis (2011), http://www.theses.fr/2011POIT2311
 [37] Zheng, V.W., Cavallari, S., Cai, H., Chang, K.C., Cambria, E.: From node embedding to community embedding. CoRR abs/1610.09950 (2016), http://arxiv.org/abs/1610.09950
 [38] Zhu, X., Ghahramani, Z.: Learning from labeled and unlabeled data with label propagation. Tech. rep. (2002)