Towards Similarity Graphs Constructed by Deep Reinforcement Learning

  • 2020-02-13 18:59:16
  • Dmitry Baranchuk, Artem Babenko
  • 0

Abstract

Similarity graphs are an active research direction for the nearest neighborsearch (NNS) problem. New algorithms for similarity graph construction arecontinuously being proposed and analyzed by both theoreticians andpractitioners. However, existing construction algorithms are mostly based onheuristics and do not explicitly maximize the target performance measure, i.e.,search recall. Therefore, at the moment it is not clear whether the performanceof similarity graphs has plateaued or more effective graphs can be constructedwith more theoretically grounded methods. In this paper, we introduce a newprincipled algorithm, based on adjacency matrix optimization, which explicitlymaximizes search efficiency. Namely, we propose a probabilistic model of asimilarity graph defined in terms of its edge probabilities and show how tolearn these probabilities from data as a reinforcement learning task. Asconfirmed by experiments, the proposed construction method can be used torefine the state-of-the-art similarity graphs, achieving higher recall ratesfor the same number of distance computations. Furthermore, we analyze thelearned graphs and reveal the structural properties that are responsible formore efficient search.

 

Quick Read (beta)

Towards Similarity Graphs Constructed
by Deep Reinforcement Learning

Dmitry Baranchuk    Artem Babenko
Abstract

Similarity graphs are an active research direction for the nearest neighbor search (NNS) problem. New algorithms for similarity graph construction are continuously being proposed and analyzed by both theoreticians and practitioners. However, existing construction algorithms are mostly based on heuristics and do not explicitly maximize the target performance measure, i.e., search recall. Therefore, at the moment it is not clear whether the performance of similarity graphs has plateaued or more effective graphs can be constructed with more theoretically grounded methods. In this paper, we introduce a new principled algorithm, based on adjacency matrix optimization, which explicitly maximizes search efficiency. Namely, we propose a probabilistic model of a similarity graph defined in terms of its edge probabilities and show how to learn these probabilities from data as a reinforcement learning task. As confirmed by experiments, the proposed construction method can be used to refine the state-of-the-art similarity graphs, achieving higher recall rates for the same number of distance computations. Furthermore, we analyze the learned graphs and reveal the structural properties that are responsible for more efficient search.


1 Introduction

In this paper, we address nearest neighbor search (NNS), a long-standing problem, arising in a large number of machine learning applications, such as recommender services, information retrieval, and others. The NNS problem is formalized as follows. Given the database D={v1,,vN}d and a query qd, one needs to find the datapoint vD that is closest to the query in terms of some distance (e.g. Euclidean). As the sizes of databases |D| in arising practical tasks are constantly increasing, the efficiency and the scalability of NNS become crucial.

Thus, the problem of efficient NNS receives much attention from the machine learning community. Well-known established approaches, based on partition trees (Bentley, 1975; Sproull, 1991; McCartin-Lim et al., 2012; Dasgupta & Freund, 2008; Dasgupta & Sinha, 2013) and locality-sensitive hashing (LSH) (Indyk & Motwani, 1998; Datar et al., 2004; Andoni & Indyk, 2008; Andoni et al., 2015) have been developed by ML researchers for decades and provide both decent practical performance and theoretical guarantees. Recently, similarity graph methods (Navarro, 2002; Malkov & Yashunin, 2016; Fu & Cai, 2016; Fu et al., 2017), were shown to outperform tree-based and LSH-based techniques (Aumüller et al., 2017). These methods represent the database as a graph, and at the search stage, a query traverses the graph via beam search. While these methods do not have full theoretical support yet, their exceptional practical performance has shifted the research attention to the development of new approaches based on this paradigm.

Due to the great importance of the NNS problem, new algorithms for similarity graphs construction are being proposed and analyzed by both theoreticians (Laarhoven, 2018) and practitioners (Fu & Cai, 2016; Malkov & Yashunin, 2016; Fu et al., 2017; Iwasaki & Miyazaki, 2018). Most of these works, however, propose new heuristics-based procedures, which do not explicitly optimize search efficiency. Moreover, different methods often achieve superior results only on a subset of datasets, which implies that the proposed heuristics are not universally applicable.

In this work, we introduce a new method for similarity graph construction that explicitly maximizes the search efficiency via optimization of the graph adjacency matrix. Specifically, we define a probabilistic model of a similarity graph in terms of its edge probabilities. Then we learn these probabilities from data, maximizing the search efficiency for a large set of training queries. It appears that this task could be naturally treated as a reinforcement learning problem. As a result, the proposed algorithm produces a graph that outperforms graphs constructed via heuristic approaches.

To sum up, the contributions of this paper are as follows:

  1. 1.

    We develop a new algorithm for similarity graph construction that explicitly optimizes search efficiency. To the best of our knowledge, all existing methods are based on heuristics that can have limited niches of applicability.

  2. 2.

    By experiments on common benchmarks, we show that the proposed algorithm can be used to refine state-of-the-art similarity graphs, which allows to achieve higher recall rates under the same number of distance computations. We also analyse the learned graphs and investigate the properties that cause the gains.

  3. 3.

    We demonstrate a novel practical large-scale application of the reinforcement learning machinery that explicitly optimizes the quality of similarity graphs with millions of edges.

The rest of the paper is organized as follows. First, we discuss relevant prior works. Then we describe the proposed RL-based graph construction algorithm, empirically analyze it and confirm its advantage over heuristic-based methods. The source code of our algorithm and experiments are available online11 1 https://github.com/dbaranchuk/nns-meets-deep-rl

2 Related work

Here we briefly review the ideas from the prior works that are relevant to our approach.

Nearest neighbor search techniques. The existing NNS approaches mostly fall into three research directions. Methods from the first direction, based on partition trees (Bentley, 1975; Sproull, 1991; McCartin-Lim et al., 2012; Dasgupta & Freund, 2008; Dasgupta & Sinha, 2013), hierarchically split the search space into a large number of regions, corresponding to tree leaves, and the query visits only a limited number of promising regions when searching. Second, locality-sensitive hashing methods (Indyk & Motwani, 1998; Datar et al., 2004; Andoni & Indyk, 2008; Andoni et al., 2015) map the database points into several buckets using several hash functions such that the probability of collision is much higher for nearby points than for points that are further apart. At the search stage, a query is also hashed, and distances to all the points from the corresponding buckets are evaluated. The third direction of similarity graphs (Navarro, 2002; Malkov & Yashunin, 2016; Fu & Cai, 2016; Fu et al., 2017; Iwasaki & Miyazaki, 2018) represents the database as a directed graph, and on the search stage, a query traverses the graph via beam search. The empirical performance of similarity graphs was shown to be much higher compared to LSH-based and tree-based methods (Yu. A. Malkov, 2016). In more details, the typical search process in similarity graphs performs as follows. The database is organized in a graph, where each vertex corresponds to some datapoint, and the vertices, corresponding to the neighboring datapoints, are connected by edges. The search algorithm picks a start vertex (random or predefined) and iteratively explores the graph from it. On each iteration, the query tries to improve its position by moving to a vertex from a candidate pool that is closest to the query. The routing process stops when there are no closer vertices in the pool.

Similarity graphs construction procedures. Several recent works developing similarity graph methods typically differ in graph construction procedures, based on different heuristics. For instance, the recent HNSW algorithm (Yu. A. Malkov, 2016) performs consecutive insertions of database items into the graph structure. This procedure provides long-range edges for efficient graph navigation. Moreover, an additional structure of a nested hierarchy of layers is proposed for further speedup. Another recent graph, NSG (Fu et al., 2017), employs a k-nearest neighbor graph as an initial graph structure, then performs the search procedure with each node being a query, connects the node with vertices visited during the search and selects edges following the pruning strategy. The recently proposed graph-based method NGT-onng (Iwasaki & Miyazaki, 2018) provides a set of heuristics for graph construction and finds optimal indegrees and outdegrees for a specific precision region.

Both (Fu et al., 2017; Iwasaki & Miyazaki, 2018) report that the advantage of different graphs is revealed on different datasets, which implies the limitations of the heuristics in use. Instead, our approach aims to learn the graph from data, explicitly optimizing the search efficiency.

Learning of data structures. The recent line of works (Kraska et al., 2018, 2019) proposes to use machine learning methods instead of the traditional database indices, such as B-trees and Bloom Filters. While being related, these methods are not directly applied to the construction of similarity graphs, which we address in this paper.

Reinforcement learning for discrete structures. Our approach is partially inspired by the recent RL success for structure learning in different machine learning pipelines. Probably, the most well-known use-case is the learning of DNN structure (Zoph & Le, 2016). Another related recent work is DeepPath (Xiong et al., 2017) that employs RL to learn structures of the knowledge graphs. In this paper, we demonstrate that RL is also a natural fit for the problem of similarity graph construction for NNS.

3 Method

In this section, we describe our approach for similarity graph construction based on reinforcement learning.

3.1 Similarity graph construction as an optimization problem

First, we introduce a probabilistic model of a similarity graph. Our model defines a probability of a graph as a joint probability of individual edges. Each edge is modelled as an independent Bernoulli random variable biBern(pi) that determines whether this edge should exist in the graph. Therefore, the probability of the graph G is a product of probabilities of all edges: P(G)=P(b1,b2,,bn)=ipibi(1-pi)1-bi. Our goal then is to maximize the following objective:

P*(G)=argmaxP(G)Eqp(q)EGP(G)(G,q)(G,q)=(Accuracy(G,q),Complexity(G,q)) (1)

Here Eqp(q) denotes the expectation over the query distribution. Accuracy(G,q) and Complexity(G,q) are responsible for high search recall and high search efficiency respectively. (,) plays a role of an "acquisition" function that combines both Accuracy(G,q) and Complexity(G,q) into one scalar value. We elaborate on each of these terms in the next section.

By solving the optimization problem (1), we find the edge probabilities {p1,,pn} that maximize the accuracy and minimize the search complexity in expectation over graphs GP(G).

Finally, we obtain a deterministic graph22 2 Here we exploit the fact that our optimization problem (1) has a deterministic solution i.e. a graph where p{0,1}. This property holds because our problem is equivalent to a Markov Decision Process. It can be proven that all MDPs have a deterministic optimal policy (Puterman, 1994). as G*=argmaxGP*(G), which corresponds to keeping the edges with p0.5 and omitting the edges with p<0.5. This graph then can be used for NNS with one of the standard search algorithms.

For large-scale problems, optimizing over a quadratic number of edges is infeasible. In this case we take some initial similarity graph G^ and refine it, pruning its edges via optimization (1) over edges presented in G^. We obtain a subgraph G*G^ that is more efficient in terms of nearest neighbor search performance. For small-scale datasets, we aim to optimize the complete graph since it is guaranteed to contain the optimal one.

3.2 Markov Decision Process

Now let us formulate the optimization problem (1) as a Markov Decision Process (MDP). We consider the initial graph G^ and search algorithm as the environment . An MDP agent interacts with the environment using two available actions a: "remove" or "keep" an edge. The environment state s=(q,vi,vadj,V,H) consists of a query q, current vertex vi, its adjacent vertices vadj, already visited vertices V and a heap of candidates H. The transition function 𝒯 represents the search algorithm. In our work we exploit the standard HNSW search algorithm (Yu. A. Malkov, 2016) and incorporate the RL agent in the loop, see Algorithm 1.

\SetAlgoNoLine
\algorithmcfname 1 The nearest neighbor search algorithm with incorporated RL agent.
\KwData

graph G^, query q, initial vertex v0, output size k

Initialization:

V{v0} // a set of visited vertices

H{v0:d(v0,q)} // a heap of candidates

TopK{v0:d(v0,q)} // a heap of top-k results

\While

not should_stop /* i-th search step */

vi extract nearest element from H to q

vadj get adjacent vertices of vi

s(q,vi,vadj,V,H) // collect environment state

v^adj Agent(s) // predict what connections to keep

\For

v^v^adjV

VAdd(V,v^)

HInsert(H,v^,d(v^,q))

TopKUpdate(TopK,k,v^,d(v^,q))

\Return

TopK

Sessions. We introduce a session τ as a search procedure for a single query q. On each step, the search procedure visits a vertex and updates the state s. The agent obtains s and decides which edges are available from that vertex. In turn, the search algorithm processes the kept edges and picks the next vertex. After the search terminates, the agent obtains a reward for the entire session.

Reward function. Our reward function (τ) combines two components: accuracy and complexity of the search process. The accuracy for one session is an indicator I[τ] if the actual nearest neighbor is found. This term encourages the agent to maximize search recall. For instance, it may exclude edges that cause the search procedure to get stuck in poor local optima. The second component measures the search complexity as a number of distance computations DCS during one session. This term effectively encourages the agent to prune irrelevant edges.

We define the reward function as:

(τ)=I[τ]max(DCSmax-DCS,1) (2)

where DCSmax is a distance computation budget, which is set to restrict the search complexity for each query. Intuitively, we want the agent to find the actual nearest neighbor and then to reduce the complexity without an accuracy drop. If the nearest neighbor is not found then R(τ)=0 regardless of DCS, otherwise the agent obtains higher reward for more computationally efficient sessions. With lower DCSmax values, the agent is more prone to sacrificing accuracy on some queries for more efficient search on others. We also observe that the value of DCSmax affects the algorithm convergence by changing the "sharpness" of the objective function. In practice, we tune this parameter empirically based on average vertex degree and the desired recall region.

3.3 Policy Network Architecture

In our method, the agent is a policy network that predicts edge probabilities. For simplicity, we use a feed-forward architecture that processes each edge individually: πθ(b|s)=inπθ(bi|xi(s)). The network receives an edge, represented as a concatenation of source and target vertices xi(s)=[vsource,vtarget], as input and predicts its probability. The network itself consists of two linear layers with ELU activations followed by another linear layer with sigmoid non-linearity. While more powerful network architectures can be used (e.g., Graph Convolutional Networks (Kipf & Welling, 2016)), they are typically inapplicable in the large-scale scenario due to GPU memory constraints and long training time.

3.4 Policy optimization

We can now apply policy-based RL to directly optimize the expected reward (2). The overall scheme of our approach is presented in Figure 1.

Among policy-based methods such as REINFORCE (Williams & Peng, 1991), PPO (Schulman et al., 2017), ACKTR (Wu et al., 2017), etc, we have found that TRPO (Schulman et al., 2015) provides the fastest convergence and the highest reward values. The main practical drawback of TRPO is that it requires a large number of sessions to perform an accurate natural gradient update. However, in our case, each session requires only a single run of the search algorithm, hence we can efficiently sample a large number of search trajectories in parallel.

We also adapt two common policy optimization tricks for our setting. First, we use reward baselines to speed up convergence by reducing gradient variance. Our algorithm maintains an individual baseline for each training query as a moving average of observed rewards for that query. Second, we facilitate exploration by adding policy entropy to the training objective. This long-standing technique (Williams & Peng, 1991) discourages the agent from premature convergence to a suboptimal deterministic policy.

Figure 1: Overview of the proposed RL scheme for graph construction. It is presented as a communication between the environment and agent. Left: the environment is a similarity graph equipped with a search algorithm. On each step, the search algorithm visits a node and updates the environment state. Right: the agent obtains the state and uses policy network to predict which outgoing edges to preserve. Then, the search procedure processes the kept edges and transits to the next node. When the search terminates, the agent obtains a total reward for the entire session.
Figure 2: Left: the constructed graph on 100 vectors from the MNIST8x8 dataset. The optimization is performed over a complete graph. Colors correspond to the MNIST class labels. The nodes providing efficient graph navigation (hubs) are denoted by large sizes. Each MNIST class contains up to two hubs. Right: the outdegree histogram for the obtained graph. Most vertices have zero outdegree and only few with degrees greater than six. All high outdegree nodes correspond to hubs.

3.5 Training on large databases

For large-scale problems, our approach becomes limited by the number of edges it can consider. Namely, if the agent is allowed to draw edges between arbitrary vertices, the number of edges grows quadratically with the database size. Hence it is practically infeasible to train such an agent on the complete graph built upon large databases typical for NNS problems. To mitigate this issue, we limit the agent to a predefined subset of edges. Namely, we construct one of the existing heuristics-based graphs and allow our agent to select edges from that graph. In all our experiments, the initial graph vertex degrees are equal or slightly larger than in baseline graphs which, by themselves, appear to have redundant edges.

To speed-up training, we also employ the following heuristic. If an agent’s prediction for a particular edge is overconfident for a long period during training, we consider this edge deterministic and do not optimize over it. This heuristic reduces optimization problem complexity and allows the agent to concentrate on adjusting predictions for more uncertain edges. As a possible research direction, it is interesting to develop an effective method for expanding the search space, e.g. by interactively adding new edges during training.

4 Experiments

In this section, we evaluate and analyze graphs constructed by our approach. First, we visualize a toy graph, learned for a small dataset, and describe several interesting observations. Then, we provide an experimental comparison of the constructed graphs with state-of-the-art graph-based methods and analyse the emerging properties of the learned graphs.

4.1 Toy example

We visualize graphs constructed by our method on a small subset of the MNIST8x8 (Dua & Graff, 2017) dataset. Namely, we sample 100 64-dimensional vectors for the base set and use the entire dataset as training queries.

In this experiment, we use greedy search as the search algorithm: we choose the next vertex as the closest one among neighbors of the current query position. The RL agent starts training from a complete graph, and we set DCSmax=150. After the training we manually remove edges that are never used by the search algorithm. Such edges affect neither recall nor DCS and only bring noise to degree distribution.

At convergence, the constructed graph achieves 0.957 recall. On average, the search algorithm requires 22 DCS and terminates after 2.85 graph hops. The average outdegree is reduced from 99 to 2.45.

Finally, we project the base vectors onto 2D plane, using tSNE (Maaten & Hinton, 2008) and illustrate the graph structure on Figure 2 (left). The vertex colors correspond to the MNIST class labels. The start vertex is the entry point for the search algorithm — a medoid of the base set.

In order to analyze the properties of the learned similarity graph, we run the search algorithm for all queries and aggregate the following statistics: (1) how often each node is visited and (2) for what number of queries each node is an actual nearest neighbor. Below we highlight several observations from Figure 2 and explain our intuition about graphs appropriate for the NNS problem.

  • We observe an appearance of few nodes, so-called hubs, that provide efficient navigation over the graph. Each MNIST class contains one or two hubs. The start node is connected to hubs for fast navigation to a query region. At the first step, the search navigates to one of the hubs. Then, it either finds the answer or transits to another local hub, which is closer to an actual nearest neighbor. The existence of hubs allows the search algorithm to reach answers just in two or three hops. At the same time, the average node outdegree is low, as the number of hubs is small.

  • Most vertices do not participate in graph navigation. The search algorithm mostly visits such a vertex if it is the actual nearest neighbor for a given query. These vertices are usually terminal, hence their outdegrees are almost zeros.

SIFT100K DEEP100K
Figure 3: Recall@1 values as functions of distance computations DCS on the SIFT100K and DEEP100K datasets.
SIFT1M DEEP1M GloVe1M
Figure 4: Recall@1 values as functions of distance computations DCS on the SIFT1M, DEEP1M and GloVe1M datasets.

Additionally, we plot the outdegree histogram for the constructed graph on Figure 2 (right). Most vertices have zero outdegrees and only few have a degree greater than six. This roughly resembles the truncated power-law distribution over outdegrees. Interestingly, all high-outdegree nodes are hubs. A prior work(Malkov & Ponomarenko, 2016) investigates the properties of graphs with truncated power-law degree distribution for the NNS problem and shows that such degree distribution is likely to provide an efficient search. In our approach, such properties emerge naturally from search performance optimization over the complete graph.

SIFT100K DEEP100K
Figure 5: Search visitation frequencies for 40 most visited vertices, sorted by frequency (except for start vertex). The top row represents the baseline graphs; the bottom row depicts their counterparts optimized by our method.

4.2 Datasets

We evaluate the proposed approach on three publicly available datasets described below:

  1. 1.

    SIFT100K dataset (Jégou et al., 2011) is sampled from one million 128-dimensional SIFT descriptors. We consider 100,000 learn vectors and remained base vectors as train queries. Note, the original learn set contains test queries, therefore we manually remove them. We take 20,000 datapoints for validation. The hold-out 10,000 query vectors are used for evaluation.

  2. 2.

    SIFT1M dataset contains one million SIFT descriptors sampled from SIFT1B (Jégou et al., 2011). We sample one million train queries from the learn set. Again, we leave 20,000 queries for validation and evaluate on original 10,000 hold-out queries.

  3. 3.

    DEEP100K dataset (Babenko & Lempitsky, 2016) is a subset of one billion of 96-dimensional CNN-produced feature vectors of natural images from the Web. The base set contains 100,000 vectors. We sample 200,000 train and 20,000 validation queries from the learn set. For evaluation, we use the original 10,000 queries.

  4. 4.

    DEEP1M dataset is the same as DEEP100K where the base and learn sets are extended to one million datapoints.

  5. 5.

    GloVe1M dataset (Pennington et al., 2014) contains 2.2 millions of 300-dimensional word embeddings trained on Common Crawl. We split them on one million base set, one million learn set, 20,000 queries for validation and 10,000 queries for evaluation.

4.3 Search performance evaluation

Here we compare the graphs constructed with our method to state-of-the-art baselines on the SIFT100K and DEEP100K datasets. Namely, we evaluate:

  • HNSW: one of the current state-of-the-art graphs proposed in (Yu. A. Malkov, 2016); this approach exploits the nested hierarchy of navigable small-world graphs constructed on the database subsets to obtain a start vertex.

  • NSW: the bottom layer of HNSW graph. The search starts from the fixed vertex for all queries.

  • NSG: another state-of-the-art similarity graph method (Fu et al., 2017); NSG does not use any additional indexing structure and starts the search from the database medoid.

  • NSW Ours: RL approach applied to the NSW graph.

  • NSG Ours: RL approach applied to the NSG graph.

We tune hyperparameters for all baseline graphs in each recall region. All parameters for the graphs listed above are reported in the supplementary materials. Note that the proposed RL-based approach can also be applied to graphs with additional indexing structures (e.g., HNSW, NGT). However, we leave it beyond the scope of our evaluation.

As a primary performance measure, we use Recall@1, which is calculated as a rate of queries for which the search algorithm successfully finds the actual nearest neighbor.

Most million-scale experiments converge within 24 hours on a single GPU GeForce 1080Ti. We rerun the RL approach at least five times for each graph and draw its mean and standard deviation. The plots for the SIFT100K and DEEP100K datasets are presented on Figure 3, and the plots for SIFT1M, DEEP1M, GloVe1M are presented on Figure 4.

For all datasets, we observe a consistent improvement over corresponding baseline graphs. We highlight several key observations below:

  • On SIFT100K, the optimized NSG consistently outperforms all other evaluated graphs. In particular, we observe up to 1% improvement compared to the top-performing NSG baseline. On DEEP100K, the optimized NSW graph also outperforms HNSW/NSW graph by up to 1%. For 99+% Recall@1 region, the gains become insignificant. Note that NSG graphs are superior on SIFT data, while NSW/HNSW performs better on the DEEP100K dataset. This is a weakness of heuristic-based similarity graphs: different heuristics are more appropriate for different data. Our RL-based approach may significantly reduce the gap in performance. E.g., while NSG outperforms NSW by up to 2.5% on SIFT100K, the maximum gap between optimized graphs reduces to 0.4%. On DEEP100K, NSW/HNSW outperforms NSG by up to 3.0%, while, for NSW Ours and NSG Ours, the maximum difference is 1.3%.

  • On all datasets, we observe more significant gains for lower Recall@1 regions. While our hypothesis that the RL approach mainly influences the navigation properties of similarity graphs, this observation is consistent with the fact that navigation properties lose their value if the search algorithm’s heap size increases.

  • On all datasets and all Recall@1 regions, the optimized NSW is superior or equal to HNSW, which exploits an additional indexing structure for better navigation. Therefore, the nested hierarchy of graphs is redundant and can be replaced by its bottom layer with improved navigation properties.

  • On the most challenging dataset, GloVe1M, NSW/HNSW graphs demonstrate much worse performance due to the high intrinsic dimensionality of the word embeddings. For this dataset, our approach mitigates the issues of NSW/HNSW graphs and outperforms baselines by 0.4% at 88% Recall@1 point.

4.4 Graph properties analysis

In this section, we analyze the emerging properties of graphs learned by the proposed algorithm. Our primary hypothesis is that the advantage of our method in terms of search efficiency is attributed to its ability to learn more specialized roles for graph vertices, similarly to what we observed in the toy experiment.

In order to test this hypothesis, we study the statistical properties of frequently visited vertices. In both NSW and NSG graphs, there is a small subset of vertices that help the search procedure to navigate during the first few graph hops. Hence, an improvement in these vertices may have a substantial effect on the overall search efficiency.

We consider 40 vertices that are the most frequently visited by the search algorithm. For each vertex, we count its number of visits over 105 training queries. The obtained numbers of visits for baseline graphs and graphs produced by our method are presented on Figure 5.

Figure 5 clearly indicates that graphs produced by our method have a more peaky distribution over vertex visit frequencies compared to both baselines. In other words, directly optimizing graph for nearest neighbor search produces more specialized navigation vertices.

Interestingly, our RL approach can also learn a new starting vertex, see NSW Ours on SIFT100K in Figure 5. The agent omits all edges in initial starting vertex except one. Hence, for every query the search procedure goes to the new starting node by performing only one distance computation. Note that “peakyness” of the distributions from Figure 5 correlates with relative performance of heuristics-based graphs on different datasets. For instance, on SIFT100K NSG has more pronounced hubs and outperforms NSW on this dataset, see Figure 3. In contrast, on DEEP100K, NSW has more “peaky” distribution compared to NSG and provides superior search performance.

We conjecture that our algorithm is better able to learn the edges for the navigation vertices, achieving more accurate routing, compared to heuristics-based counterparts.

4.5 Comparison to heuristic methods

In this experiment, we evaluate our approach against one of the heuristic methods, which can be used for similarity graph improvement.

Here, we consider magnitude-based pruning, where the weights for each edge are computed as follows:

wij=n_visited_eij+λn_visited_vi+λoutdegree(vi) (3)

, where n_visited_eij and n_visited_vi correspond to visitation frequencies for edge eij and vertex vi respectively. We compute these frequencies by running search procedure on training queries. The only hyperparameter λ plays a smoothing role, discouraging radical pruning of rarely visited vertices. In our experiments we always use λ=0.1. Then, we tune a weight threshold to maximize performance for validation queries. Finally, all edges whose weights are below the threshold are pruned.

We compare our RL-based approach and magnitude pruning applied to the NSW graph on DEEP1M, see Figure 6. Our method outperforms magnitude pruning across all distance computation budgets. Finally, we apply magnitude pruning to the graph constructed by RL and observe that it also slightly improves the performance.

Figure 6: Comparison to magnitude-based pruning (MP) for the NSW graph on DEEP1M dataset.

5 Conclusion

In this paper, we introduce a new algorithm for similarity graph construction that explicitly optimizes an adjacency matrix, maximizing the search quality for a large set of training queries. The algorithm defines a probabilistic model of the graph in terms of its edge probabilities and then learns these probabilities in a reinforcement learning scenario. We show that the proposed approach allows to improve the performance of similarity graphs constructed by heuristics.

References

  • Andoni & Indyk (2008) Andoni, A. and Indyk, P. Near-optimal hashing algorithms for near neighbor problem in high dimension. Communications of the ACM, 51(1):117–122, 2008.
  • Andoni et al. (2015) Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I. P., and Schmidt, L. Practical and optimal LSH for angular distance. In NIPS, 2015.
  • Aumüller et al. (2017) Aumüller, M., Bernhardsson, E., and Faithfull, A. Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. In SISAP, 2017.
  • Babenko & Lempitsky (2016) Babenko, A. and Lempitsky, V. S. Efficient indexing of billion-scale datasets of deep descriptors. In CVPR, 2016.
  • Bentley (1975) Bentley, J. L. Multidimensional binary search trees used for associative searching. Commun. ACM, 18, 1975.
  • Dasgupta & Freund (2008) Dasgupta, S. and Freund, Y. Random projection trees and low dimensional manifolds. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, Victoria, British Columbia, Canada, May 17-20, 2008, pp. 537–546, 2008.
  • Dasgupta & Sinha (2013) Dasgupta, S. and Sinha, K. Randomized partition trees for exact nearest neighbor search. In Conference on Learning Theory, pp. 317–337, 2013.
  • Datar et al. (2004) Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th ACM Symposium on Computational Geometry, Brooklyn, New York, USA, June 8-11, 2004, pp. 253–262, 2004.
  • Dua & Graff (2017) Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Fu & Cai (2016) Fu, C. and Cai, D. Efanna: An extremely fast approximate nearest neighbor search algorithm based on knn graph. arXiv preprint arXiv:1609.07228, 2016.
  • Fu et al. (2017) Fu, C., Xiang, C., Wang, C., and Cai, D. Fast approximate nearest neighbor search with the navigating spreading-out graph. arXiv preprint arXiv:1707.00143, 2017.
  • Indyk & Motwani (1998) Indyk, P. and Motwani, R. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, Dallas, Texas, USA, May 23-26, 1998, pp. 604–613, 1998.
  • Iwasaki & Miyazaki (2018) Iwasaki, M. and Miyazaki, D. Optimization of indexing based on k-nearest neighbor graph for proximity search in high-dimensional data. arXiv preprint arXiv:1810.07355, 2018.
  • Jégou et al. (2011) Jégou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. TPAMI, 33(1), 2011.
  • Kipf & Welling (2016) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • Kraska et al. (2018) Kraska, T., Beutel, A., Chi, E. H., Dean, J., and Polyzotis, N. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, 2018.
  • Kraska et al. (2019) Kraska, T., Alizadeh, M., Beutel, A., Chi, E. H., Ding, J., Kristo, A., Leclerc, G., Madden, S., Mao, H., and Nathan, V. Sagedb: A learned database system. 2019.
  • Laarhoven (2018) Laarhoven, T. Graph-based time-space trade-offs for approximate near neighbors. In 34th International Symposium on Computational Geometry, SoCG 2018, June 11-14, 2018, Budapest, Hungary, 2018.
  • Maaten & Hinton (2008) Maaten, L. v. d. and Hinton, G. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • Malkov & Ponomarenko (2016) Malkov, Y. A. and Ponomarenko, A. Growing homophilic networks are natural navigable small worlds. PloS one, 2016.
  • Malkov & Yashunin (2016) Malkov, Y. A. and Yashunin, D. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. arXiv preprint arXiv:1603.09320, 2016.
  • McCartin-Lim et al. (2012) McCartin-Lim, M., McGregor, A., and Wang, R. Approximate principal direction trees. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, 2012.
  • Navarro (2002) Navarro, G. Searching in metric spaces by spatial approximation. The VLDB Journal, 11(1):28–46, 2002.
  • Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014. URL http://www.aclweb.org/anthology/D14-1162.
  • Puterman (1994) Puterman, M. L. Markov decision processes: Discrete stochastic dynamic programming. In Wiley Series in Probability and Statistics, 1994.
  • Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. Trust region policy optimization. In ICML, 2015.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
  • Sproull (1991) Sproull, R. F. Refinements to nearest-neighbor searching in k-dimensional trees. Algorithmica, 6, 1991.
  • Williams & Peng (1991) Williams, R. J. and Peng, J. Function optimization using connectionist reinforcement learning algorithms. 1991.
  • Wu et al. (2017) Wu, Y., Mansimov, E., Liao, S., Grosse, R. B., and Ba, J. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. CoRR, abs/1708.05144, 2017.
  • Xiong et al. (2017) Xiong, W., Hoang, T., and Wang, W. Y. Deeppath: A reinforcement learning method for knowledge graph reasoning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.
  • Yu. A. Malkov (2016) Yu. A. Malkov, D. A. Y. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. arXiv preprint arXiv:1603.09320, 2016.
  • Zoph & Le (2016) Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

A.1 Hyperparameters SIFT100K

NSW HNSW NSG NSW Ours NSG Ours
M 12 12 - 12 -
efconstruction 500 300 - 300 -
R - - 24 - 24
K - - 200 - 200
efsearch - - - 10 10
DCSmax - - - 1200 1500
Centropy - - - 0.01 0.001

A.2 Hyperparameters SIFT1M

NSW HNSW NSW Ours
M 14 14 14
efconstruction 500 500 500
efsearch - - 12
DCSmax - - 1500
Centropy - - 0.01

A.3 Hyperparameters DEEP100K

NSW HNSW NSG NSW Ours NSG Ours
M 12 12 - 12 -
efconstruction 300 300 - 300 -
R - - 24 - 24
K - - 200 - 200
efsearch - - - 10 10
DCSmax - - - 1000 1500
Centropy - - - 0.01 0.001

A.4 Hyperparameters DEEP1M

NSW HNSW NSW Ours
M 14 14 14
efconstruction 500 500 500
efsearch - - 12
DCSmax - - 1500
Centropy - - 0.01

A.5 Hyperparameters GloVe1M

NSW HNSW NSW Ours
M 20 28 20
efconstruction 2000 2000 2000
efsearch - - 5
DCSmax - - 1000
Centropy - - 0.01