Abstract
Similarity graphs are an active research direction for the nearest neighborsearch (NNS) problem. New algorithms for similarity graph construction arecontinuously being proposed and analyzed by both theoreticians andpractitioners. However, existing construction algorithms are mostly based onheuristics and do not explicitly maximize the target performance measure, i.e.,search recall. Therefore, at the moment it is not clear whether the performanceof similarity graphs has plateaued or more effective graphs can be constructedwith more theoretically grounded methods. In this paper, we introduce a newprincipled algorithm, based on adjacency matrix optimization, which explicitlymaximizes search efficiency. Namely, we propose a probabilistic model of asimilarity graph defined in terms of its edge probabilities and show how tolearn these probabilities from data as a reinforcement learning task. Asconfirmed by experiments, the proposed construction method can be used torefine the stateoftheart similarity graphs, achieving higher recall ratesfor the same number of distance computations. Furthermore, we analyze thelearned graphs and reveal the structural properties that are responsible formore efficient search.
Quick Read (beta)
Towards Similarity Graphs Constructed
by Deep Reinforcement Learning
Abstract
Similarity graphs are an active research direction for the nearest neighbor search (NNS) problem. New algorithms for similarity graph construction are continuously being proposed and analyzed by both theoreticians and practitioners. However, existing construction algorithms are mostly based on heuristics and do not explicitly maximize the target performance measure, i.e., search recall. Therefore, at the moment it is not clear whether the performance of similarity graphs has plateaued or more effective graphs can be constructed with more theoretically grounded methods. In this paper, we introduce a new principled algorithm, based on adjacency matrix optimization, which explicitly maximizes search efficiency. Namely, we propose a probabilistic model of a similarity graph defined in terms of its edge probabilities and show how to learn these probabilities from data as a reinforcement learning task. As confirmed by experiments, the proposed construction method can be used to refine the stateoftheart similarity graphs, achieving higher recall rates for the same number of distance computations. Furthermore, we analyze the learned graphs and reveal the structural properties that are responsible for more efficient search.
1 Introduction
In this paper, we address nearest neighbor search (NNS), a longstanding problem, arising in a large number of machine learning applications, such as recommender services, information retrieval, and others. The NNS problem is formalized as follows. Given the database $D=\{{v}_{1},\mathrm{\dots},{v}_{N}\}\subset {\mathbb{R}}^{d}$ and a query $q\in {\mathbb{R}}^{d}$, one needs to find the datapoint $v\in D$ that is closest to the query in terms of some distance (e.g. Euclidean). As the sizes of databases $D$ in arising practical tasks are constantly increasing, the efficiency and the scalability of NNS become crucial.
Thus, the problem of efficient NNS receives much attention from the machine learning community. Wellknown established approaches, based on partition trees (Bentley, 1975; Sproull, 1991; McCartinLim et al., 2012; Dasgupta & Freund, 2008; Dasgupta & Sinha, 2013) and localitysensitive hashing (LSH) (Indyk & Motwani, 1998; Datar et al., 2004; Andoni & Indyk, 2008; Andoni et al., 2015) have been developed by ML researchers for decades and provide both decent practical performance and theoretical guarantees. Recently, similarity graph methods (Navarro, 2002; Malkov & Yashunin, 2016; Fu & Cai, 2016; Fu et al., 2017), were shown to outperform treebased and LSHbased techniques (Aumüller et al., 2017). These methods represent the database as a graph, and at the search stage, a query traverses the graph via beam search. While these methods do not have full theoretical support yet, their exceptional practical performance has shifted the research attention to the development of new approaches based on this paradigm.
Due to the great importance of the NNS problem, new algorithms for similarity graphs construction are being proposed and analyzed by both theoreticians (Laarhoven, 2018) and practitioners (Fu & Cai, 2016; Malkov & Yashunin, 2016; Fu et al., 2017; Iwasaki & Miyazaki, 2018). Most of these works, however, propose new heuristicsbased procedures, which do not explicitly optimize search efficiency. Moreover, different methods often achieve superior results only on a subset of datasets, which implies that the proposed heuristics are not universally applicable.
In this work, we introduce a new method for similarity graph construction that explicitly maximizes the search efficiency via optimization of the graph adjacency matrix. Specifically, we define a probabilistic model of a similarity graph in terms of its edge probabilities. Then we learn these probabilities from data, maximizing the search efficiency for a large set of training queries. It appears that this task could be naturally treated as a reinforcement learning problem. As a result, the proposed algorithm produces a graph that outperforms graphs constructed via heuristic approaches.
To sum up, the contributions of this paper are as follows:

1.
We develop a new algorithm for similarity graph construction that explicitly optimizes search efficiency. To the best of our knowledge, all existing methods are based on heuristics that can have limited niches of applicability.

2.
By experiments on common benchmarks, we show that the proposed algorithm can be used to refine stateoftheart similarity graphs, which allows to achieve higher recall rates under the same number of distance computations. We also analyse the learned graphs and investigate the properties that cause the gains.

3.
We demonstrate a novel practical largescale application of the reinforcement learning machinery that explicitly optimizes the quality of similarity graphs with millions of edges.
The rest of the paper is organized as follows. First, we discuss relevant prior works. Then we describe the proposed RLbased graph construction algorithm, empirically analyze it and confirm its advantage over heuristicbased methods. The source code of our algorithm and experiments are available online^{1}^{1} 1 https://github.com/dbaranchuk/nnsmeetsdeeprl
2 Related work
Here we briefly review the ideas from the prior works that are relevant to our approach.
Nearest neighbor search techniques. The existing NNS approaches mostly fall into three research directions. Methods from the first direction, based on partition trees (Bentley, 1975; Sproull, 1991; McCartinLim et al., 2012; Dasgupta & Freund, 2008; Dasgupta & Sinha, 2013), hierarchically split the search space into a large number of regions, corresponding to tree leaves, and the query visits only a limited number of promising regions when searching. Second, localitysensitive hashing methods (Indyk & Motwani, 1998; Datar et al., 2004; Andoni & Indyk, 2008; Andoni et al., 2015) map the database points into several buckets using several hash functions such that the probability of collision is much higher for nearby points than for points that are further apart. At the search stage, a query is also hashed, and distances to all the points from the corresponding buckets are evaluated. The third direction of similarity graphs (Navarro, 2002; Malkov & Yashunin, 2016; Fu & Cai, 2016; Fu et al., 2017; Iwasaki & Miyazaki, 2018) represents the database as a directed graph, and on the search stage, a query traverses the graph via beam search. The empirical performance of similarity graphs was shown to be much higher compared to LSHbased and treebased methods (Yu. A. Malkov, 2016). In more details, the typical search process in similarity graphs performs as follows. The database is organized in a graph, where each vertex corresponds to some datapoint, and the vertices, corresponding to the neighboring datapoints, are connected by edges. The search algorithm picks a start vertex (random or predefined) and iteratively explores the graph from it. On each iteration, the query tries to improve its position by moving to a vertex from a candidate pool that is closest to the query. The routing process stops when there are no closer vertices in the pool.
Similarity graphs construction procedures. Several recent works developing similarity graph methods typically differ in graph construction procedures, based on different heuristics. For instance, the recent HNSW algorithm (Yu. A. Malkov, 2016) performs consecutive insertions of database items into the graph structure. This procedure provides longrange edges for efficient graph navigation. Moreover, an additional structure of a nested hierarchy of layers is proposed for further speedup. Another recent graph, NSG (Fu et al., 2017), employs a knearest neighbor graph as an initial graph structure, then performs the search procedure with each node being a query, connects the node with vertices visited during the search and selects edges following the pruning strategy. The recently proposed graphbased method NGTonng (Iwasaki & Miyazaki, 2018) provides a set of heuristics for graph construction and finds optimal indegrees and outdegrees for a specific precision region.
Both (Fu et al., 2017; Iwasaki & Miyazaki, 2018) report that the advantage of different graphs is revealed on different datasets, which implies the limitations of the heuristics in use. Instead, our approach aims to learn the graph from data, explicitly optimizing the search efficiency.
Learning of data structures. The recent line of works (Kraska et al., 2018, 2019) proposes to use machine learning methods instead of the traditional database indices, such as Btrees and Bloom Filters. While being related, these methods are not directly applied to the construction of similarity graphs, which we address in this paper.
Reinforcement learning for discrete structures. Our approach is partially inspired by the recent RL success for structure learning in different machine learning pipelines. Probably, the most wellknown usecase is the learning of DNN structure (Zoph & Le, 2016). Another related recent work is DeepPath (Xiong et al., 2017) that employs RL to learn structures of the knowledge graphs. In this paper, we demonstrate that RL is also a natural fit for the problem of similarity graph construction for NNS.
3 Method
In this section, we describe our approach for similarity graph construction based on reinforcement learning.
3.1 Similarity graph construction as an optimization problem
First, we introduce a probabilistic model of a similarity graph. Our model defines a probability of a graph as a joint probability of individual edges. Each edge is modelled as an independent Bernoulli random variable ${b}_{i}\sim Bern({p}_{i})$ that determines whether this edge should exist in the graph. Therefore, the probability of the graph $G$ is a product of probabilities of all edges: $P(G)=P({b}_{1},{b}_{2},\mathrm{\dots},{b}_{n})={\prod}_{i}{p}_{i}^{{b}_{i}}{(1{p}_{i})}^{1{b}_{i}}$. Our goal then is to maximize the following objective:
$$\begin{array}{c}\hfill {P}^{*}(G)=\underset{P(G)}{\mathrm{arg}\mathrm{max}}{E}_{q\sim p(q)}{E}_{G\sim P(G)}\mathcal{R}(G,q)\hfill \\ \hfill \mathcal{R}(G,q)=\mathcal{F}(Accuracy(G,q),Complexity(G,q))\hfill \end{array}$$  (1) 
Here ${E}_{q\sim p(q)}$ denotes the expectation over the query distribution. $Accuracy(G,q)$ and $Complexity(G,q)$ are responsible for high search recall and high search efficiency respectively. $\mathcal{F}(\cdot ,\cdot )$ plays a role of an "acquisition" function that combines both $Accuracy(G,q)$ and $Complexity(G,q)$ into one scalar value. We elaborate on each of these terms in the next section.
By solving the optimization problem (1), we find the edge probabilities $\{{p}_{1},\mathrm{\dots},{p}_{n}\}$ that maximize the accuracy and minimize the search complexity in expectation over graphs $G\sim P(G)$.
Finally, we obtain a deterministic graph^{2}^{2} 2 Here we exploit the fact that our optimization problem (1) has a deterministic solution i.e. a graph where $p\in \{0,1\}$. This property holds because our problem is equivalent to a Markov Decision Process. It can be proven that all MDPs have a deterministic optimal policy (Puterman, 1994). as ${G}^{*}={\mathrm{arg}\mathrm{max}}_{G}{P}^{*}(G)$, which corresponds to keeping the edges with $p\ge 0.5$ and omitting the edges with $$. This graph then can be used for NNS with one of the standard search algorithms.
For largescale problems, optimizing over a quadratic number of edges is infeasible. In this case we take some initial similarity graph $\widehat{G}$ and refine it, pruning its edges via optimization (1) over edges presented in $\widehat{G}$. We obtain a subgraph ${G}^{*}\subseteq \widehat{G}$ that is more efficient in terms of nearest neighbor search performance. For smallscale datasets, we aim to optimize the complete graph since it is guaranteed to contain the optimal one.
3.2 Markov Decision Process
Now let us formulate the optimization problem (1) as a Markov Decision Process (MDP). We consider the initial graph $\widehat{G}$ and search algorithm as the environment $\mathcal{E}$. An MDP agent interacts with the environment using two available actions $a$: "remove" or "keep" an edge. The environment state $s=(q,{v}_{i},{v}_{adj},V,H)$ consists of a query $q$, current vertex ${v}_{i}$, its adjacent vertices ${v}_{adj}$, already visited vertices $V$ and a heap of candidates $H$. The transition function $\mathcal{T}$ represents the search algorithm. In our work we exploit the standard HNSW search algorithm (Yu. A. Malkov, 2016) and incorporate the RL agent in the loop, see Algorithm 1.
Sessions. We introduce a session $\tau $ as a search procedure for a single query $q$. On each step, the search procedure visits a vertex and updates the state $s$. The agent obtains $s$ and decides which edges are available from that vertex. In turn, the search algorithm processes the kept edges and picks the next vertex. After the search terminates, the agent obtains a reward $\mathcal{R}$ for the entire session.
Reward function. Our reward function $\mathcal{R}(\tau )$ combines two components: accuracy and complexity of the search process. The accuracy for one session is an indicator $I[\tau ]$ if the actual nearest neighbor is found. This term encourages the agent to maximize search recall. For instance, it may exclude edges that cause the search procedure to get stuck in poor local optima. The second component measures the search complexity as a number of distance computations $DCS$ during one session. This term effectively encourages the agent to prune irrelevant edges.
We define the reward function as:
$$\mathcal{R}(\tau )=I[\tau ]\cdot \mathrm{max}(DC{S}_{max}DCS,1)$$  (2) 
where $DC{S}_{max}$ is a distance computation budget, which is set to restrict the search complexity for each query. Intuitively, we want the agent to find the actual nearest neighbor and then to reduce the complexity without an accuracy drop. If the nearest neighbor is not found then $R(\tau )=0$ regardless of $DCS$, otherwise the agent obtains higher reward for more computationally efficient sessions. With lower $DC{S}_{max}$ values, the agent is more prone to sacrificing accuracy on some queries for more efficient search on others. We also observe that the value of $DC{S}_{max}$ affects the algorithm convergence by changing the "sharpness" of the objective function. In practice, we tune this parameter empirically based on average vertex degree and the desired recall region.
3.3 Policy Network Architecture
In our method, the agent is a policy network that predicts edge probabilities. For simplicity, we use a feedforward architecture that processes each edge individually: ${\pi}_{\theta}(bs)={\prod}_{i}^{n}{\pi}_{\theta}({b}_{i}{x}_{i}(s))$. The network receives an edge, represented as a concatenation of source and target vertices ${x}_{i}(s)=[{v}_{source},{v}_{target}]$, as input and predicts its probability. The network itself consists of two linear layers with ELU activations followed by another linear layer with sigmoid nonlinearity. While more powerful network architectures can be used (e.g., Graph Convolutional Networks (Kipf & Welling, 2016)), they are typically inapplicable in the largescale scenario due to GPU memory constraints and long training time.
3.4 Policy optimization
We can now apply policybased RL to directly optimize the expected reward (2). The overall scheme of our approach is presented in Figure 1.
Among policybased methods such as REINFORCE (Williams & Peng, 1991), PPO (Schulman et al., 2017), ACKTR (Wu et al., 2017), etc, we have found that TRPO (Schulman et al., 2015) provides the fastest convergence and the highest reward values. The main practical drawback of TRPO is that it requires a large number of sessions to perform an accurate natural gradient update. However, in our case, each session requires only a single run of the search algorithm, hence we can efficiently sample a large number of search trajectories in parallel.
We also adapt two common policy optimization tricks for our setting. First, we use reward baselines to speed up convergence by reducing gradient variance. Our algorithm maintains an individual baseline for each training query as a moving average of observed rewards for that query. Second, we facilitate exploration by adding policy entropy to the training objective. This longstanding technique (Williams & Peng, 1991) discourages the agent from premature convergence to a suboptimal deterministic policy.
3.5 Training on large databases
For largescale problems, our approach becomes limited by the number of edges it can consider. Namely, if the agent is allowed to draw edges between arbitrary vertices, the number of edges grows quadratically with the database size. Hence it is practically infeasible to train such an agent on the complete graph built upon large databases typical for NNS problems. To mitigate this issue, we limit the agent to a predefined subset of edges. Namely, we construct one of the existing heuristicsbased graphs and allow our agent to select edges from that graph. In all our experiments, the initial graph vertex degrees are equal or slightly larger than in baseline graphs which, by themselves, appear to have redundant edges.
To speedup training, we also employ the following heuristic. If an agent’s prediction for a particular edge is overconfident for a long period during training, we consider this edge deterministic and do not optimize over it. This heuristic reduces optimization problem complexity and allows the agent to concentrate on adjusting predictions for more uncertain edges. As a possible research direction, it is interesting to develop an effective method for expanding the search space, e.g. by interactively adding new edges during training.
4 Experiments
In this section, we evaluate and analyze graphs constructed by our approach. First, we visualize a toy graph, learned for a small dataset, and describe several interesting observations. Then, we provide an experimental comparison of the constructed graphs with stateoftheart graphbased methods and analyse the emerging properties of the learned graphs.
4.1 Toy example
We visualize graphs constructed by our method on a small subset of the MNIST8x8 (Dua & Graff, 2017) dataset. Namely, we sample $100$ $64$dimensional vectors for the base set and use the entire dataset as training queries.
In this experiment, we use greedy search as the search algorithm: we choose the next vertex as the closest one among neighbors of the current query position. The RL agent starts training from a complete graph, and we set $DC{S}_{max}=150$. After the training we manually remove edges that are never used by the search algorithm. Such edges affect neither recall nor $DCS$ and only bring noise to degree distribution.
At convergence, the constructed graph achieves $0.957$ recall. On average, the search algorithm requires $22$ $DCS$ and terminates after $2.85$ graph hops. The average outdegree is reduced from $99$ to $2.45$.
Finally, we project the base vectors onto 2D plane, using tSNE (Maaten & Hinton, 2008) and illustrate the graph structure on Figure 2 (left). The vertex colors correspond to the MNIST class labels. The $start$ vertex is the entry point for the search algorithm — a medoid of the base set.
In order to analyze the properties of the learned similarity graph, we run the search algorithm for all queries and aggregate the following statistics: (1) how often each node is visited and (2) for what number of queries each node is an actual nearest neighbor. Below we highlight several observations from Figure 2 and explain our intuition about graphs appropriate for the NNS problem.

•
We observe an appearance of few nodes, socalled hubs, that provide efficient navigation over the graph. Each MNIST class contains one or two hubs. The $start$ node is connected to hubs for fast navigation to a query region. At the first step, the search navigates to one of the hubs. Then, it either finds the answer or transits to another local hub, which is closer to an actual nearest neighbor. The existence of hubs allows the search algorithm to reach answers just in two or three hops. At the same time, the average node outdegree is low, as the number of hubs is small.

•
Most vertices do not participate in graph navigation. The search algorithm mostly visits such a vertex if it is the actual nearest neighbor for a given query. These vertices are usually terminal, hence their outdegrees are almost zeros.
SIFT100K  DEEP100K 
SIFT1M  DEEP1M  GloVe1M 
Additionally, we plot the outdegree histogram for the constructed graph on Figure 2 (right). Most vertices have zero outdegrees and only few have a degree greater than six. This roughly resembles the truncated powerlaw distribution over outdegrees. Interestingly, all highoutdegree nodes are hubs. A prior work(Malkov & Ponomarenko, 2016) investigates the properties of graphs with truncated powerlaw degree distribution for the NNS problem and shows that such degree distribution is likely to provide an efficient search. In our approach, such properties emerge naturally from search performance optimization over the complete graph.
SIFT100K  DEEP100K 
4.2 Datasets
We evaluate the proposed approach on three publicly available datasets described below:

1.
SIFT100K dataset (Jégou et al., 2011) is sampled from one million $128$dimensional SIFT descriptors. We consider 100,000 learn vectors and remained base vectors as train queries. Note, the original learn set contains test queries, therefore we manually remove them. We take 20,000 datapoints for validation. The holdout 10,000 query vectors are used for evaluation.

2.
SIFT1M dataset contains one million SIFT descriptors sampled from SIFT1B (Jégou et al., 2011). We sample one million train queries from the learn set. Again, we leave 20,000 queries for validation and evaluate on original 10,000 holdout queries.

3.
DEEP100K dataset (Babenko & Lempitsky, 2016) is a subset of one billion of $96$dimensional CNNproduced feature vectors of natural images from the Web. The base set contains 100,000 vectors. We sample 200,000 train and 20,000 validation queries from the learn set. For evaluation, we use the original 10,000 queries.

4.
DEEP1M dataset is the same as DEEP100K where the base and learn sets are extended to one million datapoints.

5.
GloVe1M dataset (Pennington et al., 2014) contains 2.2 millions of 300dimensional word embeddings trained on Common Crawl. We split them on one million base set, one million learn set, 20,000 queries for validation and 10,000 queries for evaluation.
4.3 Search performance evaluation
Here we compare the graphs constructed with our method to stateoftheart baselines on the SIFT100K and DEEP100K datasets. Namely, we evaluate:

•
HNSW: one of the current stateoftheart graphs proposed in (Yu. A. Malkov, 2016); this approach exploits the nested hierarchy of navigable smallworld graphs constructed on the database subsets to obtain a start vertex.

•
NSW: the bottom layer of HNSW graph. The search starts from the fixed vertex for all queries.

•
NSG: another stateoftheart similarity graph method (Fu et al., 2017); NSG does not use any additional indexing structure and starts the search from the database medoid.

•
NSW Ours: RL approach applied to the NSW graph.

•
NSG Ours: RL approach applied to the NSG graph.
We tune hyperparameters for all baseline graphs in each recall region. All parameters for the graphs listed above are reported in the supplementary materials. Note that the proposed RLbased approach can also be applied to graphs with additional indexing structures (e.g., HNSW, NGT). However, we leave it beyond the scope of our evaluation.
As a primary performance measure, we use $Recall\mathrm{@}1$, which is calculated as a rate of queries for which the search algorithm successfully finds the actual nearest neighbor.
Most millionscale experiments converge within $\sim 24$ hours on a single GPU GeForce 1080Ti. We rerun the RL approach at least five times for each graph and draw its mean and standard deviation. The plots for the SIFT100K and DEEP100K datasets are presented on Figure 3, and the plots for SIFT1M, DEEP1M, GloVe1M are presented on Figure 4.
For all datasets, we observe a consistent improvement over corresponding baseline graphs. We highlight several key observations below:

•
On SIFT100K, the optimized NSG consistently outperforms all other evaluated graphs. In particular, we observe up to $\sim 1\%$ improvement compared to the topperforming NSG baseline. On DEEP100K, the optimized NSW graph also outperforms HNSW/NSW graph by up to $\sim 1\%$. For $99+\%$ $Recall\mathrm{@}1$ region, the gains become insignificant. Note that NSG graphs are superior on SIFT data, while NSW/HNSW performs better on the DEEP100K dataset. This is a weakness of heuristicbased similarity graphs: different heuristics are more appropriate for different data. Our RLbased approach may significantly reduce the gap in performance. E.g., while NSG outperforms NSW by up to $\sim 2.5\%$ on SIFT100K, the maximum gap between optimized graphs reduces to $\sim 0.4\%$. On DEEP100K, NSW/HNSW outperforms NSG by up to $\sim 3.0\%$, while, for NSW Ours and NSG Ours, the maximum difference is $\sim 1.3\%$.

•
On all datasets, we observe more significant gains for lower $Recall\mathrm{@}1$ regions. While our hypothesis that the RL approach mainly influences the navigation properties of similarity graphs, this observation is consistent with the fact that navigation properties lose their value if the search algorithm’s heap size increases.

•
On all datasets and all $Recall\mathrm{@}1$ regions, the optimized NSW is superior or equal to HNSW, which exploits an additional indexing structure for better navigation. Therefore, the nested hierarchy of graphs is redundant and can be replaced by its bottom layer with improved navigation properties.

•
On the most challenging dataset, GloVe1M, NSW/HNSW graphs demonstrate much worse performance due to the high intrinsic dimensionality of the word embeddings. For this dataset, our approach mitigates the issues of NSW/HNSW graphs and outperforms baselines by $\sim 0.4\%$ at $88\%$ $Recall\mathrm{@}1$ point.
4.4 Graph properties analysis
In this section, we analyze the emerging properties of graphs learned by the proposed algorithm. Our primary hypothesis is that the advantage of our method in terms of search efficiency is attributed to its ability to learn more specialized roles for graph vertices, similarly to what we observed in the toy experiment.
In order to test this hypothesis, we study the statistical properties of frequently visited vertices. In both NSW and NSG graphs, there is a small subset of vertices that help the search procedure to navigate during the first few graph hops. Hence, an improvement in these vertices may have a substantial effect on the overall search efficiency.
We consider 40 vertices that are the most frequently visited by the search algorithm. For each vertex, we count its number of visits over ${10}^{5}$ training queries. The obtained numbers of visits for baseline graphs and graphs produced by our method are presented on Figure 5.
Figure 5 clearly indicates that graphs produced by our method have a more peaky distribution over vertex visit frequencies compared to both baselines. In other words, directly optimizing graph for nearest neighbor search produces more specialized navigation vertices.
Interestingly, our RL approach can also learn a new starting vertex, see NSW Ours on SIFT100K in Figure 5. The agent omits all edges in initial starting vertex except one. Hence, for every query the search procedure goes to the new starting node by performing only one distance computation. Note that “peakyness” of the distributions from Figure 5 correlates with relative performance of heuristicsbased graphs on different datasets. For instance, on SIFT100K NSG has more pronounced hubs and outperforms NSW on this dataset, see Figure 3. In contrast, on DEEP100K, NSW has more “peaky” distribution compared to NSG and provides superior search performance.
We conjecture that our algorithm is better able to learn the edges for the navigation vertices, achieving more accurate routing, compared to heuristicsbased counterparts.
4.5 Comparison to heuristic methods
In this experiment, we evaluate our approach against one of the heuristic methods, which can be used for similarity graph improvement.
Here, we consider magnitudebased pruning, where the weights for each edge are computed as follows:
$${w}_{ij}=\frac{n\mathrm{\_}visited\mathrm{\_}{e}_{ij}+\lambda}{n\mathrm{\_}visited\mathrm{\_}{v}_{i}+\lambda \cdot outdegree({v}_{i})}$$  (3) 
, where $n\mathrm{\_}visited\mathrm{\_}{e}_{ij}$ and $n\mathrm{\_}visited\mathrm{\_}{v}_{i}$ correspond to visitation frequencies for edge ${e}_{ij}$ and vertex ${v}_{i}$ respectively. We compute these frequencies by running search procedure on training queries. The only hyperparameter $\lambda $ plays a smoothing role, discouraging radical pruning of rarely visited vertices. In our experiments we always use $\lambda =0.1$. Then, we tune a weight threshold to maximize performance for validation queries. Finally, all edges whose weights are below the threshold are pruned.
We compare our RLbased approach and magnitude pruning applied to the NSW graph on DEEP1M, see Figure 6. Our method outperforms magnitude pruning across all distance computation budgets. Finally, we apply magnitude pruning to the graph constructed by RL and observe that it also slightly improves the performance.
5 Conclusion
In this paper, we introduce a new algorithm for similarity graph construction that explicitly optimizes an adjacency matrix, maximizing the search quality for a large set of training queries. The algorithm defines a probabilistic model of the graph in terms of its edge probabilities and then learns these probabilities in a reinforcement learning scenario. We show that the proposed approach allows to improve the performance of similarity graphs constructed by heuristics.
References
 Andoni & Indyk (2008) Andoni, A. and Indyk, P. Nearoptimal hashing algorithms for near neighbor problem in high dimension. Communications of the ACM, 51(1):117–122, 2008.
 Andoni et al. (2015) Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I. P., and Schmidt, L. Practical and optimal LSH for angular distance. In NIPS, 2015.
 Aumüller et al. (2017) Aumüller, M., Bernhardsson, E., and Faithfull, A. Annbenchmarks: A benchmarking tool for approximate nearest neighbor algorithms. In SISAP, 2017.
 Babenko & Lempitsky (2016) Babenko, A. and Lempitsky, V. S. Efficient indexing of billionscale datasets of deep descriptors. In CVPR, 2016.
 Bentley (1975) Bentley, J. L. Multidimensional binary search trees used for associative searching. Commun. ACM, 18, 1975.
 Dasgupta & Freund (2008) Dasgupta, S. and Freund, Y. Random projection trees and low dimensional manifolds. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, Victoria, British Columbia, Canada, May 1720, 2008, pp. 537–546, 2008.
 Dasgupta & Sinha (2013) Dasgupta, S. and Sinha, K. Randomized partition trees for exact nearest neighbor search. In Conference on Learning Theory, pp. 317–337, 2013.
 Datar et al. (2004) Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. Localitysensitive hashing scheme based on pstable distributions. In Proceedings of the 20th ACM Symposium on Computational Geometry, Brooklyn, New York, USA, June 811, 2004, pp. 253–262, 2004.
 Dua & Graff (2017) Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
 Fu & Cai (2016) Fu, C. and Cai, D. Efanna: An extremely fast approximate nearest neighbor search algorithm based on knn graph. arXiv preprint arXiv:1609.07228, 2016.
 Fu et al. (2017) Fu, C., Xiang, C., Wang, C., and Cai, D. Fast approximate nearest neighbor search with the navigating spreadingout graph. arXiv preprint arXiv:1707.00143, 2017.
 Indyk & Motwani (1998) Indyk, P. and Motwani, R. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, Dallas, Texas, USA, May 2326, 1998, pp. 604–613, 1998.
 Iwasaki & Miyazaki (2018) Iwasaki, M. and Miyazaki, D. Optimization of indexing based on knearest neighbor graph for proximity search in highdimensional data. arXiv preprint arXiv:1810.07355, 2018.
 Jégou et al. (2011) Jégou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. TPAMI, 33(1), 2011.
 Kipf & Welling (2016) Kipf, T. N. and Welling, M. Semisupervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 Kraska et al. (2018) Kraska, T., Beutel, A., Chi, E. H., Dean, J., and Polyzotis, N. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, 2018.
 Kraska et al. (2019) Kraska, T., Alizadeh, M., Beutel, A., Chi, E. H., Ding, J., Kristo, A., Leclerc, G., Madden, S., Mao, H., and Nathan, V. Sagedb: A learned database system. 2019.
 Laarhoven (2018) Laarhoven, T. Graphbased timespace tradeoffs for approximate near neighbors. In 34th International Symposium on Computational Geometry, SoCG 2018, June 1114, 2018, Budapest, Hungary, 2018.
 Maaten & Hinton (2008) Maaten, L. v. d. and Hinton, G. Visualizing data using tsne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
 Malkov & Ponomarenko (2016) Malkov, Y. A. and Ponomarenko, A. Growing homophilic networks are natural navigable small worlds. PloS one, 2016.
 Malkov & Yashunin (2016) Malkov, Y. A. and Yashunin, D. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. arXiv preprint arXiv:1603.09320, 2016.
 McCartinLim et al. (2012) McCartinLim, M., McGregor, A., and Wang, R. Approximate principal direction trees. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26  July 1, 2012, 2012.
 Navarro (2002) Navarro, G. Searching in metric spaces by spatial approximation. The VLDB Journal, 11(1):28–46, 2002.
 Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014. URL http://www.aclweb.org/anthology/D141162.
 Puterman (1994) Puterman, M. L. Markov decision processes: Discrete stochastic dynamic programming. In Wiley Series in Probability and Statistics, 1994.
 Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. Trust region policy optimization. In ICML, 2015.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
 Sproull (1991) Sproull, R. F. Refinements to nearestneighbor searching in kdimensional trees. Algorithmica, 6, 1991.
 Williams & Peng (1991) Williams, R. J. and Peng, J. Function optimization using connectionist reinforcement learning algorithms. 1991.
 Wu et al. (2017) Wu, Y., Mansimov, E., Liao, S., Grosse, R. B., and Ba, J. Scalable trustregion method for deep reinforcement learning using kroneckerfactored approximation. CoRR, abs/1708.05144, 2017.
 Xiong et al. (2017) Xiong, W., Hoang, T., and Wang, W. Y. Deeppath: A reinforcement learning method for knowledge graph reasoning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.
 Yu. A. Malkov (2016) Yu. A. Malkov, D. A. Y. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. arXiv preprint arXiv:1603.09320, 2016.
 Zoph & Le (2016) Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
A.1 Hyperparameters SIFT100K
NSW  HNSW  NSG  NSW Ours  NSG Ours  
$M$  12  12    12   
$e{f}_{construction}$  500  300    300   
$R$      24    24 
$K$      200    200 
$e{f}_{search}$        10  10 
$DC{S}_{max}$        1200  1500 
${C}_{entropy}$        0.01  0.001 
A.2 Hyperparameters SIFT1M
NSW  HNSW  NSW Ours  
$M$  14  14  14 
$e{f}_{construction}$  500  500  500 
$e{f}_{search}$      12 
$DC{S}_{max}$      1500 
${C}_{entropy}$      0.01 
A.3 Hyperparameters DEEP100K
NSW  HNSW  NSG  NSW Ours  NSG Ours  
$M$  12  12    12   
$e{f}_{construction}$  300  300    300   
$R$      24    24 
$K$      200    200 
$e{f}_{search}$        10  10 
$DC{S}_{max}$        1000  1500 
${C}_{entropy}$        0.01  0.001 
A.4 Hyperparameters DEEP1M
NSW  HNSW  NSW Ours  
$M$  14  14  14 
$e{f}_{construction}$  500  500  500 
$e{f}_{search}$      12 
$DC{S}_{max}$      1500 
${C}_{entropy}$      0.01 
A.5 Hyperparameters GloVe1M
NSW  HNSW  NSW Ours  
$M$  20  28  20 
$e{f}_{construction}$  2000  2000  2000 
$e{f}_{search}$      5 
$DC{S}_{max}$      1000 
${C}_{entropy}$      0.01 