Abstract
Learning useful representations is a key ingredient to the success of modernmachine learning. Currently, representation learning mostly relies on embeddingdata into Euclidean space. However, recent work has shown that data in somedomains is better modeled by noneuclidean metric spaces, and inappropriategeometry can result in inferior performance. In this paper, we aim to eliminatethe inductive bias imposed by the embedding space geometry. Namely, we proposeto map data into more general nonvector metric spaces: a weighted graph with ashortest path distance. By design, such graphs can model arbitrary geometrywith a proper configuration of edges and weights. Our main contribution isPRODIGE: a method that learns a weighted graph representation of dataendtoend by gradient descent. Greater generality and fewer model assumptionsmake PRODIGE more powerful than existing embeddingbased approaches. We confirmthe superiority of our method via extensive experiments on a wide range oftasks, including classification, compression, and collaborative filtering.
Quick Read (beta)
Beyond Vector Spaces: Compact Data Representation as Differentiable Weighted Graphs
Abstract
Learning useful representations is a key ingredient to the success of modern machine learning. Currently, representation learning mostly relies on embedding data into Euclidean space. However, recent work has shown that data in some domains is better modeled by noneuclidean metric spaces, and inappropriate geometry can result in inferior performance. In this paper, we aim to eliminate the inductive bias imposed by the embedding space geometry. Namely, we propose to map data into more general nonvector metric spaces: a weighted graph with a shortest path distance. By design, such graphs can model arbitrary geometry with a proper configuration of edges and weights. Our main contribution is PRODIGE: a method that learns a weighted graph representation of data endtoend by gradient descent. Greater generality and fewer model assumptions make PRODIGE more powerful than existing embeddingbased approaches. We confirm the superiority of our method via extensive experiments on a wide range of tasks, including classification, compression, and collaborative filtering.
Beyond Vector Spaces: Compact Data Representation as Differentiable Weighted Graphs
Denis Mazur Yandex denismazur[email protected] Vage Egiazarian Skoltech [email protected] Stanislav Morozov Yandex Lomonosov Moscow State University [email protected] Artem Babenko Yandex National Research University Higher School of Economics [email protected]
noticebox[b]33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\[email protected]
1 Introduction
Nowadays, representation learning is a major component of most data analysis systems; this component aims to capture the essential information in the data and represent it in a form that is useful for the task at hand. Typical examples include word embeddings [23, 26, 1], image embeddings [5, 2], user/item representations in recommender systems [10] and others. To be useful in practice, representation learning methods should meet two main requirements. Firstly, they should be effective, i.e., they should not lose the information needed to achieve high performance in the specific task. Secondly, the constructed representations should be efficient, e.g., have small dimensionality, sparsity, or other structural constraints, imposed by the particular machine learning pipeline. In this paper, we focus on compact data representations, i.e., we aim to achieve the highest performance with the smallest memory consumption.
Most existing methods represent data items as points in some vector space, with the Euclidean space ${\mathbb{R}}^{n}$ being a "default" choice. However, several recent works [24, 9, 4] have demonstrated that the quality of representation is heavily influenced by the geometry of the embedding space. In particular, different space curvature can be more appropriate for different types of data [9]. While some prior works aim to learn the geometrical properties of the embedding space from data[9], all of them assume a vectorial representation of the data, which may be an unnecessary inductive bias.
In contrast, we investigate a more general paradigm by proposing to represent finite datasets as weighted graphs equipped with a shortest path distance. It can be shown that such graphs can represent any geometry for a finite dataset[11] and can be naturally treated as finite metric spaces. Specifically, we introduce Probabilistic Differentiable Graph Embeddings (PRODIGE), a method that learns a weighted graph from data, minimizing the problemspecific objective function via gradient descent. Unlike existing methods, PRODIGE does not learn vectorial embeddings of the data points, instead information from the data is effectively stored in the graph structure. Via extensive experiments on several different tasks, we confirm that, in terms of memory consumption, PRODIGE is more efficient than its vectorial counterparts.
To sum up, the contributions of this paper are as follows:

1.
We propose a new paradigm for constructing representations of finite datasets. Instead of constructing pointwise vector representations, our proposed PRODIGE method represents data as a weighted graph equipped with a shortest path distance.

2.
Applied to several tasks, PRODIGE is shown to outperform vectorial embeddings when given the same memory budget; this confirms our claim that PRODIGE can capture information from the data more effectively.

3.
The PyTorch source code of PRODIGE is available online^{1}^{1} 1 https://github.com/stanismorozov/prodige.
The rest of the paper is organized as follows. We discuss the relevant prior work in Section 2 and describe the general design of PRODIGE in Section 3. In Section 4 we consider several practical tasks and demonstrate how they can benefit from the usage of PRODIGE as a dropin replacement for existing vectorial representation methods. Section 5 concludes the paper.
2 Related work
In this section, we briefly review relevant prior work and describe how the proposed approach relates to the existing machine learning concepts.
Embeddings. Vectorial representations of data have been used in machine learning systems for decades. The first representations were mostly handengineered and did not involve any learning, e.g. SIFT[20] and GIST[25] representations for images, and ngram frequencies for texts. The recent success of machine learning is largely due to the transition from the handcrafted to learnable data representations in domains, such as NLP[23, 26, 1], vision[5], speech[3]. Most applications now use learnable representations, which are typically vectorial embeddings in Euclidean space.
Embedding space geometry. It has recently been shown that Euclidean space is a suboptimal model for several data domains [24, 9, 4]. In particular, spaces with a hyperbolic geometry are more appropriate to represent data with a latent hierarchical structure, and several papers investigate hyperbolic embeddings for different applications [27, 28, 16]. The current consensus appears to be that different geometries are appropriate for solving different problems and there have been attempts to learn these geometrical properties (e.g. curvature) from data [9]. Instead, our method aims to represent data as a weighted graph with a shortest path distance; this by design can express an arbitrary finite metric space.
Connections to graph embeddings. Representing the vertices of a given graph as vectors is a longstanding problem in machine learning and complex networks communities. Modern approaches to this problem rely on graph theory and/or graph neural networks [29, 13], which are both areas of active research. In some sense, we aim to solve the inverse problem; given data entities, our goal is to learn a graph approximating the distances that satisfy the requirements of the task at hand.
Existing works on graph learning. We are not aware of prior work that proposes a general method to represent data as a weighted graph for any differentiable objective function. Probably, the closest work to ours is [6]; they solve the problem of distancepreserving compression via a dual optimization problem. Their proposed approach lacks endtoend learning and does not generalize to arbitrary loss functions. There are also several approaches that learn graphs from data for specific problems. Some studies[15, 14] learn specialized graphs that perform clustering or semisupervised learning. Others[30, 32] focus specifically on learning probabilistic graphical model structure. Most of the proposed approaches are highly problemspecific and do not scale well to large graphs.
3 Method
In this section we describe the general design of PRODIGE and its training procedure.
3.1 Learnable Graph Metric Spaces
Consider an undirected weighted graph $G(V,E,w)$, where $V=\{{v}_{0},{v}_{1},\mathrm{\dots},{v}_{n}\}$ is a set of vertices, corresponding to data items, and $E=\{{e}_{0},{e}_{1},\mathrm{\dots},{e}_{m}\}$ is a set of edges, ${e}_{i}=e({v}_{i}^{source},{v}_{i}^{target})$. We use ${w}_{\theta}({e}_{i})$ to denote nonnegative edge weights, which are learnable parameters of our model.
Our model includes another set of learnable parameters that correspond to the probabilities of edges in the graph. Specifically, for each edge ${e}_{i}$, we define a Bernoulli random variable ${b}_{i}\sim {p}_{\theta}({b}_{i})$ that indicates whether an edge is present in the graph $G$. For simplicity, we assume that all random variables ${b}_{i}$ in our model are independent. In this case the joint probability of all edges in the graph can be written as $p(G)={\prod}_{i=0}^{m}{p}_{\theta}({b}_{i})$.
The expected distance between any two vertices ${v}_{i}$ and ${v}_{j}$ can now be expressed as a sum of edge weights along the shortest path:
$$\underset{G\sim p(G)}{E}{d}_{G}({v}_{i},{v}_{j})=\underset{G\sim p(G)}{E}\mathit{\hspace{1em}}\underset{\pi \in {\mathrm{\Pi}}_{G}({v}_{i},{v}_{j})}{\mathrm{arg}\mathrm{min}}\sum _{{e}_{i}\in \pi}{w}_{\theta}({e}_{i})$$  (1) 
Here ${\mathrm{\Pi}}_{G}({v}_{i},{v}_{j})$ denotes the set of all paths from ${v}_{i}$ to ${v}_{j}$ over the edges of $G$, or more formally, ${\mathrm{\Pi}}_{G}({v}_{i},{v}_{j})=\{\pi :(e({v}_{i},{v}_{\mathrm{\dots}}),\mathrm{\dots},e({v}_{\mathrm{\dots}},{v}_{j}))\}$. For a given $G$, the shortest path can be found exactly using Dijkstra’s algorithm. Generally speaking, a shortest path is not guaranteed to exist, e.g., if $G$ is disconnected; in this case we define ${d}_{G}({v}_{i},{v}_{j})$ to be equal to a sufficiently large constant.
The parameters of our model must satisfy two constraints: ${w}_{\theta}({e}_{i})\ge 0$ for weights and $0\le {p}_{\theta}(e)\le 1$ for probabilities. We avoid constrained optimization by defining ${w}_{\theta}({e}_{i})$ as $softplus({\theta}_{w,i})$[31] and ${p}_{\theta}({b}_{i})$ as $\sigma ({\theta}_{b,i})$.
This model can be trained by minimizing an arbitrary differentiable objective function with respect to parameters $\theta =\{{\theta}_{w},{\theta}_{b}\}$ directly by gradient descent. We explore several problemspecific objective functions in Section 4.
3.2 Sparsity
In order to learn a compact representation, we encourage the algorithm to learn sparse graphs by imposing additional regularization on $p(G)$. Namely, we employ a recent sparsification technique proposed in [19]. Denoting the taskspecific loss function as $L(G,\theta )$, the regularized training objective can be written as:
$$\mathcal{R}(\theta )=\underset{G\sim p(G)}{E}\left[L(G,\theta )\right]+\lambda \cdot \frac{1}{E}\sum _{i=1}^{E}{p}_{\theta}({b}_{i}=1)$$  (2) 
Intuitively, the term $\frac{1}{E}{\sum}_{i=1}^{E}{p}_{\theta}({b}_{i}=1)$ penalizes the number of edges being used, on average. It effectively encourages sparsity by forcing the edge probabilities to decay over time. For instance, if a certain edge has no effect on the main objective $L(G,\theta )$ (e.g., the edge never occurs on any shortest path), the optimal probability of that edge being present is exactly zero. The regularization coefficient $\lambda $ affects the “zeal” of edge pruning, with larger values of $\lambda $ corresponding to greater sparsity of the learned graph.
We minimize this regularized objective function (2) by stochastic gradient descent. The gradients ${\nabla}_{{\theta}_{w}}\mathcal{R}$ are straightforward to compute using existing autograd packages, such as TensorFlow or PyTorch. The gradients ${\nabla}_{{\theta}_{b}}\mathcal{R}$ are, however, more tricky and require the logderivative trick[8]:
$${\nabla}_{{\theta}_{b}}\mathcal{R}=\underset{G\sim p(G)}{E}[L(G,\theta )\cdot {\nabla}_{{\theta}_{b}}\mathrm{log}p(G)]+\lambda \cdot \frac{1}{E}\sum _{i=1}^{E}{\nabla}_{{\theta}_{b}}{p}_{\theta}({b}_{i}=1)$$  (3) 
In practice, we can use a MonteCarlo estimate of gradient (3). However, the variance of this estimate can be too large to be used in practical optimization. To reduce the variance, we use the fact that the optimal path usually contains only a tiny subset of all possible edges. More formally, if the objective function only depends on a subgraph $\widehat{G}\in G:L(\widehat{G},\theta )=L(G,\theta )$, then we can integrate out all edges from $G\setminus \widehat{G}$:
$$\underset{G\sim p(G)}{E}L(G,\theta )\cdot {\nabla}_{{\theta}_{b}}\mathrm{log}p(G)=\underset{\widehat{G}\sim p(\widehat{G})}{E}L(\widehat{G},\theta )\cdot {\nabla}_{{\theta}_{b}}\mathrm{log}p(\widehat{G})$$  (4) 
The expression (4) allows for an efficient training procedure that only samples edges that are required by Dijkstra’s algorithm. More specifically, on every iteration the pathfinding algorithm selects a vertex ${v}_{i}$ and explores its neighbors by sampling ${b}_{i}\sim {p}_{\theta}({b}_{i})$ corresponding to the edges that are potentially connected to ${v}_{i}$. In this case, the size of $\widehat{G}$ is proportional to the number of iterations of Dijkstra’s algorithm, which is typically lower than the number of vertices in the original graph $G$.
Finally, once the training procedure converges, most edges in the obtained graph are nearly deterministic: $$. We make this graph exactly deterministic by keeping only the edges with probability greater or equal to $0.5$.
3.3 Scalability
As the total number of edges $E$ in a complete graph grows quadratically with the number of vertices $V$, memory consumption during PRODIGE training on large datasets can be infeasible. To reduce memory requirements, we explicitly restrict a subset of possible edges and learn probabilities only for edges from this subset. The subset of possible edges is constructed by a simple heuristic described below.
First, we add an edge from each data item to $k$ most similar items in terms of problemspecific similarity in the original data space. Second, we also add $r$ random edges between uniformly chosen source and destination vertices. Overall, the size of the constructed subset scales linearly with the number of vertices $V$, which makes training feasible for largescale datasets. In our experiments, we observe that increasing the number of possible edges improves the model performance until some saturation point, typically with $32100$ edges per vertex.
Finally, we observed that the choice of an optimization algorithm is crucial for the convergence speed of our method. In particular, we use SparseAdam[17] as it significantly outperformed other sparse SGD alternatives in our experiments.
4 Applications
In this section, we experimentally evaluate the proposed PRODIGE model on several practical tasks and compare it to taskspecific baselines.
4.1 Distancepreserving compression
Distancepreserving compression learns a compact representation of highdimensional data that preserves the pairwise distances from the original highdimensional space. The learned representations can then be used in practical applications, e.g., data visualization.
Objective. We optimize the squared error between pairwise distances in the original and compressed spaces. Since PRODIGE graphs are stochastic, we minimize this objective in expectation over edge probabilities:
$$\underset{G\sim p(G)}{E}L(G,\theta )=\underset{G\sim p(G)}{E}\frac{1}{{N}^{2}}\sum _{i=0}^{N}\sum _{j=0}^{N}(\parallel {x}_{i}{x}_{j}{{\parallel}_{2}{d}_{G}({v}_{i},{v}_{j}))}^{2}$$  (5) 
In the formula above, ${x}_{i},{x}_{j}\in X$ are objects in the original highdimensional Euclidean space and ${v}_{i},{v}_{j}\in V$ are the corresponding graph vertices. Note that the objective (5) resembles the wellknown Multidimensional Scaling algorithm[18].
However, the objective (5) does not account for the graph size in the learned model. This can likely lead to a trivial solution since the PRODIGE graph can simply memorize ${w}_{\theta}(e({v}_{i},{v}_{j}))=\parallel {x}_{i}{x}_{j}{\parallel}_{2}$. Therefore, we use the sparsification technique described earlier in Section 3.2. We also employ the initialization heuristic from Section 3.3 to speed up training. Namely, we start with 64 edges per vertex, half of which are links to the nearest neighbors and the other half are random edges.
Experimental setup. We experiment with three publicly available datasets:

•
MNIST10k: $N={10}^{4}$ images from the test set of the MNIST dataset, represented as $784$dimensional vectors;

•
GLOVE10k: top$N={10}^{4}$ most frequent words, represented as $300$dimensional pretrained^{1}^{1} 1 We used a pretrained gensim model "glovewikigigaword300"; see https://radimrehurek.com/gensim/downloader.html for details GloVe[26] vectors;

•
CelebA10K: $128$dimensional embeddings of $N={10}^{4}$ random face photographs from the CelebA dataset, produced by deep CNN^{2}^{2} 2 The dataset was obtained by running face_recognition package on CelebA images and uniformly sampling ${10}^{4}$ face embeddings, see https://github.com/ageitgey/face_recognition.
In these experiments, we aim to preserve the Euclidean distances (5) for all datasets. Note, however, that any distance function can be used in PRODIGE.
Method  #parameters  #parameters  MNIST10k  GLOVE10k  CelebA10k 

per instance  total  
$\le $ 4 parameters per instance  
PRODIGE  $3.92\pm 0.02$  39.2k  0.00938  0.03289  0.00374 
MDS  4  40k  0.05414  0.13142  0.01678 
Poincare MDS  4  40k  0.04683  0.11386  0.01649 
PCA  4  40k  0.30418  0.84912  0.09078 
$\le $ 8 parameters per instance  
PRODIGE  $7.65\pm 0.14$  76.5k  0.00886  0.02856  0.00367 
MDS  8  80k  0.01857  0.05584  0.00621 
Poincare MDS  8  80k  0.01503  0.04839  0.00619 
PCA  8  80k  0.16237  0.62424  0.05298 
We compare our method with three baselines, which construct vectorial embeddings:
 •

•
Poincare MDS is a version of MDS that maps data into Poincare Ball. This method approximates the original distance with a hyperbolic distance between learned vector embeddings: ${d}_{h}({x}_{i},{x}_{j})=arccosh\left(1+2\frac{\parallel {x}_{i}{x}_{j}{\parallel}_{2}^{2}}{(1\parallel {x}_{i}{\parallel}_{2}^{2})\cdot (1\parallel {x}_{j}{\parallel}_{2}^{2})}\right)$

•
PCA. Principal Component Analysis is the most popular techinique for data compression. We include this method for sanity check.
For all the methods, we compare the performance given the same memory budget. Namely, we investigate two operating points, corresponding to four and eight 32bit numbers per data item. For embeddingbased baselines, this corresponds to 4dimensional and 8dimensional embeddings, respectively. As for PRODIGE, it requires a total of $N+2E$ parameters where $N=V$ is a number of objects and $E$ is the number of edges with ${p}_{\theta}({b}_{i})\ge 0.5$. The learned graph is represented as a sequence of edges ordered by the source vertex, and each edge is stored as a tuple of target vertex (int32) and weight (float32). We tune the regularization coefficient $\lambda $ to achieve the overall memory consumption close to the considered operating points. The distance approximation errors for all methods are reported in Table 1, which illustrates that PRODIGE outperforms the embeddingbased counterparts by a large margin. These results confirm that the proposed graph representations are more efficient in capturing the underlying data geometry compared to vectorial embeddings. We also verify the robustness of our training procedure by running several experiments with different random initializations and different initial numbers of neighbors. Figure 4.1 shows the learning curves of PRODIGE under various conditions for GLOVE10K dataset and four numbers/vertex budget. While these results exhibit some variability due to optimization stochasticity, the overall training procedure is stable and robust.
Qualitative results. To illustrate the graphs obtained by learning the PRODIGE model, we train it on a toy dataset containing 100 randomly sampled MNIST images of five classes. We start the training from a full graph, which contains $4950$ edges, and increase $\lambda $ till the moment when only $5\%$ of edges are preserved. The tSNE[22] plot of the obtained graph, based on ${d}_{G}(\cdot ,\cdot )$ distances, is shown on Figure 1.
Figure 1 reveals several properties of the PRODIGE graphs. First, the number of edges per vertex is very uneven, with a large fraction of edges belonging to a few high degree vertices. We assume that these "popular" vertices play the role of "traffic hubs" for pathfinding in the graph. The shortest path between distant vertices is likely to begin by reaching the nearest "hub", then travel over the "long" edges to the hub that "contains" the target vertex, after which it would follow the short edges to reach the target itself.
Another important observation is that nonhub vertices tend to have only a few edges. We conjecture that this is the key to the memoryefficiency of our method. Effectively, the PRODIGE graph represents most of the data items by their relation to one or several "landmark" vertices, corresponding to graph hubs. Interestingly, this representation resembles the topology of humanmade transport networks with few dedicated hubs and a large number of peripheral nodes. We plot the distribution of vertex degrees on Figure 4.1, which resembles powerlaw distribution, demonstrating "scalefree" property of PRODIGE graphs.
Sanity check. In this experiment we also verify that PRODIGE is able to reconstruct graphs from datapoints, which mutual distances are obtained as shortest paths in some graph. Namely, we generate connected Erdős–Rényi graphs with 1025 vertices with edge probability $p=0.25$ and edge weights sampled from uniform $U(0,1)$ distribution. We then train PRODIGE to reconstruct these graphs given pairwise distances between vertices. Out of $100$ random graphs, in $91$ cases PRODIGE was able to reconstruct all edges that affected shortest paths and all 100 runs the resulting graph had distances approximation error below ${10}^{3}$.
4.2 Collaborative Filtering
In the next series of experiments, we investigate the usage of PRODIGE in the collaborative filtering task, which is a key component of modern recommendation systems.
Consider a sparse binary useritem matrix $F\in {\{0,1\}}^{m\times n}$ that represents the preferences of $m$ users over $n$ items. ${F}_{ij}=1$ means that the $j$th item is relevant for the $i$th user. ${F}_{ij}=0$ means that the $j$th item is not relevant for the $i$th user or the relevance information is absent. The recommendation task is to extract the most relevant items for the particular user. A common performance measure for this task is [email protected] This metric measures how frequently there is at least one relevant item among topk recommendations suggested by the algorithm. Note that in these experiments we consider the simplest setting where user preferences are binary. All experiments are performed on the Pinterest dataset[7].
Objective. Intuitively, we want our model to rank the relevant items above the nonrelevant ones. We achieve this via maximizing the following objective:
$$\underset{G\sim p(G)}{E}L(G,\theta )=\underset{G\sim p(G)}{E}\mathit{\hspace{1em}}\underset{{u}_{i},{x}^{+},{x}^{}}{E}\mathrm{log}\frac{{e}^{{d}_{G}({u}_{i},{x}^{+})}}{{e}^{{d}_{G}({u}_{i},{x}^{+})}+{\sum}_{{x}^{}}{e}^{{d}_{G}({u}_{i},{x}^{})}}$$  (6) 
For each user ${u}_{i}$, this objective enforces the positive items ${x}^{+}$ to be closer in terms of ${d}_{G}(\cdot ,\cdot )$ than the negative items ${x}^{}$. We sample positive items ${x}^{+}$ uniformly from the training items that are relevant to ${u}_{i}$. In turn, ${x}^{}$ are sampled uniformly from all items. In practice, we only need to run Dijkstra’s algorithm once per each user ${u}_{i}$ to calculate the distances to all positive and negative items.
Similarly to the previous section, we speed up the training stage by considering only a small subset of edges. Namely, we build an initial graph using $F$ as adjacency matrix and add extra edges that connect the nearest users (useruser edges) and the nearest items (itemitem edges) in terms of cosine distance between the corresponding rows/columns of $F$.
For this task, we compare the following methods:

•
PRODIGEnormal: PRODIGE method as described above; we restrict a set of possible edges to include 16 useruser and itemitem edges and all relevant useritem edges available in the training data;

•
PRODIGEbipartite: a version of PRODIGE that is allowed to learn only edges that connect users to items, counting approximately 30 edges per item. The useruser and the itemitem edges are prohibited;

•
PRODIGErandom: a version of our model that is initialized with completely random edges. All useruser, itemitem, and useritem edges are sampled uniformly. In total, we sample 50 edges per each user and item;

•
SVD: truncated Singular Value Decomposition of useritem matrix^{3}^{3} 3 we use the Implicit package for both ALS and SVD baselines https://github.com/benfred/implicit;

•
ALS: Alternating Least Squares method for implicit feedback[12];

•
Metric Learning: metric learning approach that learns the user/item embeddings in the Euclidean space and optimizes the same objective function (6) with Euclidean distance between user and item embeddings; all other training parameters are shared with PRODIGE.
The comparison results for two memory budgets are presented in Table 2. It illustrates that our PRODIGE model achieves better overall recommendation quality compared to embeddingbased methods. Interestingly, the bipartite graph performs nearly as well as the taskspecific heuristic with nearest edges. In contrast, starting from a random set of edges results in poor performance, which indicates the importance of initial edges choice.
Method  PRODIGE  SVD  ALS  Metric Learning  
normal  bipartite  random  
$\le $ 4 parameters per user/item  
[email protected]  0.50213  0.48533  0.35587  0.33365  0.37005  0.45898 
[email protected]  0.66192  0.64250  0.57492  0.49619  0.51815  0.60079 
$\le $ 8 parameters per user/item  
[email protected]  0.59921  0.57659  0.50113  0.5107  0.48617  0.54728 
[email protected]  0.79021  0.77980  0.73485  0.70489  0.70075  0.74891 
4.3 Sentiment classification
As another application, we consider a simple text classification problem: the algorithm takes a sequence of tokens (words) $({x}_{0},{x}_{1},\mathrm{\dots},{x}_{T})$ as input and predicts a single class label $y$. This problem arises in a wide range of tasks, such as sentiment classification or spam detection
In this experiment, we explore the potential applications of learnable weighted graphs as an intermediate data representation within a multilayer model. Our goal here is to learn graph edges endtoend using the gradients from the subsequent layers. To make the whole pipeline fully differentiable, we design a projection of data items, represented as graph vertices, into feature vectors digestible by subsequent convolutional or dense layers.
Namely, a data item ${v}_{i}$ is represented as a vector of distances to $K$ predefined "anchor" vertices:
$$em{b}_{G}({v}_{i})=\u27e8{d}_{G}({v}_{i},{v}_{0}^{a}),{d}_{G}({v}_{i},{v}_{1}^{a}),\mathrm{\dots},{d}_{G}({v}_{i},{v}_{K}^{a})\u27e9$$  (7) 
In practice, we add the "anchor" vertices $\{{v}_{0}^{a},\mathrm{\dots},{v}_{K}^{a}\}$ to a graph before training and connect them to random vertices. Note that the "anchor" vertices do not correspond to any data object. The architecture used in this experiment is schematically presented on Figure 2.
Intuitively, the usage of PRODIGE in architecture from Figure 2 can be treated as a generalization of vectorial embedding layer. For instance, if a graph contains only the edges between vertices and anchors, this is equivalent to embedding $emb({v}_{i})=\u27e8{w}_{\theta}(e({v}_{i},{v}_{0}^{a})),{w}_{\theta}(e({v}_{i},{v}_{1}^{a})),\mathrm{\dots},{w}_{\theta}(e({v}_{i},{v}_{K}^{a}))\u27e9$ with $O(n\cdot K)$ trainable parameters. However, in practice, our regularized PRODIGE model learns a more compact graph by using vertexvertex edges to encode words via their relation to other words.
Model and objective. We train a simple sentiment classification model with four consecutive layers: embedding layer, onedimensional convolutional layer with 32 output filters, followed by a global max pooling layer, a ReLU nonlinearity and a final dense layer that predicts class logits. Indeed, this model is smaller than the stateoftheart models for sentiment classification and should be considered only as a proof of concept. We minimize the crossentropy objective by backpropagation, computing gradients w.r.t all trainable parameters including graph edges $\{{\theta}_{w},{\theta}_{b}\}$.
Experimental setup. We evaluate our model on the IMDB benchmark [21], a popular dataset for text sentiment binary classification. The data is split into training and test sets, each containing $N=25,000$ text instances. For simplicity, we only consider $M=32,000$ most frequent tokens.
We compare our model with embeddingbased baselines, which follow the same architecture from Figure 2, but with the standard embedding layer instead of PRODIGE. All embeddings are initialized with pretrained word representations and then finetuned with the subsequent layers by backpropagation. As pretrained embeddings, we use GloVe vectors^{4}^{4} 4 We used https://pypi.org/project/gensim/ to train GloVe model with recommended parameters trained on $25,000$ texts from the training set and select vectors corresponding to the $M$ most frequent tokens. In PRODIGE, the model graph is pretrained by distancepreserving compression of the GloVe embeddings, as described in Section 4.1. In order to encode the "anchor" objects, we explicitly add $K$ synthetic objects to the data by running Kmeans clustering and compress the resulting $N+K$ objects by minimizing the objective (5).
Representation  PRODIGE  GloVe  
100d  100d  18d  
Accuracy  0.8443  0.8483  0.8028 
Model size  2.16 MB  12.24 MB  2.20 MB 
For each model, we report test accuracy and the total model size. The results in Table 3 illustrate that PRODIGE learns a model with nearly the same quality as full $100$dimensional embeddings with much smaller size. On the other hand, PRODIGE significantly outperforms its vectorial counterpart of the same model size.
5 Discussion
In this work, we introduce PRODIGE, a new method constructing representations of finite datasets. The method represents the dataset as a weighted graph with a shortest distance path metric, which is able to capture geometry of any finite metric space. Due to minimal inductive bias, PRODIGE captures the essential information from data more effectively compared to embeddingbased representation learning methods. The graphs are learned endtoend via minimizing any differentiable objective function, defined by the target task. We empirically confirm the advantage of PRODIGE in several machine learning problems and publish its source code online.
Acknowledgements: Authors would like to thank David Talbot for his numerous suggestions on paper readability. We would also like to thank anonymous metareviewer for his insightful feedback. Vage Egiazaryan was supported by the Russian Science Foundation under Grant 194104109.
References
 [1] (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics. Cited by: §1, §2.
 [2] (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR (1), Cited by: §1.
 [3] (2018) VoxCeleb2: deep speaker recognition. arXiv preprint arXiv:1806.05622. Cited by: §2.
 [4] (2018) Representation tradeoffs for hyperbolic embeddings. arXiv preprint arXiv:1804.03329. Cited by: §1, §2.
 [5] (2014) Decaf: a deep convolutional activation feature for generic visual recognition. In International conference on machine learning, Cited by: §1, §2.
 [6] (2011) From points to nodes: inverse graph embedding through a lagrangian formulation. In CAIP, Cited by: §2.
 [7] (2015) Learning image and user features for recommendation in social networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4274–4282. Cited by: §4.2.
 [8] (1990) Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM. Cited by: §3.2.
 [9] (2018) Learning mixedcurvature representations in product spaces. Cited by: §1, §2.
 [10] (2017) Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, Cited by: §1.
 [11] (2015) Finite metric spaces and their embedding into lebesgue spaces. Cited by: §1.
 [12] (2008) Collaborative filtering for implicit feedback datasets. 2008 Eighth IEEE International Conference on Data Mining, pp. 263–272. Cited by: 5th item.
 [13] (2018) Anonymous walk embeddings. ArXiv abs/1805.11921. Cited by: §2.
 [14] (2019) Robust graph learning from noisy data. IEEE transactions on cybernetics. Cited by: §2.
 [15] (2016) Adaptive edge weighting for graphbased learning algorithms. Machine Learning 106, pp. 307–335. Cited by: §2.
 [16] (2019) Hyperbolic image embeddings. arXiv preprint arXiv:1904.02239. Cited by: §2.
 [17] (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 79, 2015, Conference Track Proceedings, Cited by: §3.3.
 [18] (1964) Nonmetric multidimensional scaling: a numerical method. Psychometrika 29 (2), pp. 115–129. Cited by: 1st item, §4.1.
 [19] (2018) Learning sparse neural networks through l_0 regularization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30  May 3, 2018, Conference Track Proceedings, Cited by: §3.2.
 [20] (2004) Distinctive image features from scaleinvariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §2.
 [21] (201106) Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Cited by: §4.3.
 [22] (2008) Visualizing data using tsne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: Figure 1, §4.1.
 [23] (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, Cited by: §1, §2.
 [24] (2017) Poincaré embeddings for learning hierarchical representations. In Advances in neural information processing systems, Cited by: §1, §2.
 [25] (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. International journal of computer vision 42 (3), pp. 145–175. Cited by: §2.
 [26] (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Cited by: §1, §2, 2nd item.
 [27] (2018) Poincar$\backslash $’e glove: hyperbolic word embeddings. arXiv preprint arXiv:1810.06546. Cited by: §2.
 [28] (2018) Hyperbolic recommender systems. arXiv preprint arXiv:1809.01703. Cited by: §2.
 [29] (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596. Cited by: §2.
 [30] (2019) DAGgnn: dag structure learning with graph neural networks. CoRR abs/1904.10098. Cited by: §2.
 [31] (2015) Improving deep neural networks using softplus units. In 2015 International Joint Conference on Neural Networks, IJCNN 2015, Killarney, Ireland, July 1217, 2015, pp. 1–4. Cited by: §3.1.
 [32] (2018) DAGs with no tears: continuous optimization for structure learning. In NeurIPS, Cited by: §2.