ReQA: An Evaluation for End-to-End Answer Retrieval Models

  • 2019-07-10 15:16:36
  • Amin Ahmad, Noah Constant, Yinfei Yang, Daniel Cer


Popular QA benchmarks like SQuAD have driven progress on the task of identifying answer spans within a specific passage, with models now surpassing human performance. However, retrieving relevant answers from a huge corpus of documents is still a challenging problem, and places different requirements on the model architecture. There is growing interest in developing scalable answer retrieval models trained end-to-end, bypassing the typical document retrieval step. In this paper, we introduce Retrieval Question-Answering (ReQA), a benchmark for evaluating large-scale sentence- and paragraph-level answer retrieval models. We establish baselines using both neural encoding models as well as classical information retrieval techniques. We release our evaluation code to encourage further work on this challenging task.



Amin Ahmad, Noah Constant, Yinfei Yang, Daniel Cer
Google Research, Mountain View, USA
{aahmad, nconstant, yinfeiy, cer}@google



Preprint. Work in progress.

1 Introduction

Popular QA benchmarks like SQuAD (Rajpurkar et al., 2016) have driven impressive progress on the task of identifying spans of text within a specific passage that answer a posed question. Recent models using BERT pretraining (Devlin et al., 2019) have already surpassed human performance on SQuAD 1.1 and 2.0.

While impressive, these systems are not yet sufficient for the end task of answering user questions at scale, since in general, we don’t know which documents are likely to contain an answer. On the one hand, typical document retrieval solutions fall short here, since they aren’t trained to directly model the connection between questions and answers in context. For example, in Figure 1, a relevant answer appears on the Wikipedia page for New York, but this document is unlikely to be retrieved, as the larger document is not highly relevant to the question. On the other hand, QA models with strong performance on reading comprehension can’t be used directly for large-scale retrieval. This is because competitive QA models use interactions between the question and candidate answer in the early stage of modeling (e.g. through cross-attention) making it infeasible to score a large set of candidates at inference time.

Question: Which US county has the densest population? Wikipedia Page: New York City Answer: Geographically co-extensive with New York County, the borough of Manhattan’s 2017 population density of 72,918 inhabitants per square mile (28,154/km2) makes it the highest of any county in the United States and higher than the density of any individual American city.

Figure 1: A hypothetical example of end-to-end answer retrieval, where the document containing the answer is not “on topic” for the question.

There is growing interest in training end-to-end answer retrieval systems that can efficiently surface relevant results without an intermediate document retrieval phase (Gillick et al., 2018; Cakaloglu et al., 2018; Henderson et al., 2019). We are excited by this direction, and hope to promote further research by offering the Retrieval Question-Answering (ReQA) benchmark, which tests a model’s ability to retrieve relevant answers efficiently from a large set of documents. Our code is available at

The remainder of the paper is organized as follows: In Section 2, we define our goals in developing large-scale answer retrieval models. Section 3 describes our method for transforming a within-document reading comprehension task like SQuAD into a Retrieval Question-Answering (ReQA) task, and details our evaluation procedure and metrics. Section 4 describes various neural and non-neural baseline models, and characterizes their performance on the ReQA SQuAD task. Finally, Section 5 discusses related work.

2 Objectives

What properties would we like a large-scale answer retrieval model to have? We discuss five characteristics below that motivate the design of our evaluation.

First, we would like an end-to-end solution. As illustrated in Figure 1, some answers are found in surprising places. Pipelined systems that first retrieve topically relevant documents and then search for answer spans within only those documents risk missing all good answers from documents that appear to have less overall relevance to the question.

Second, we need efficient retrieval, with the ability to scale to billions of answers. Here we impose a specific condition that guarantees scalability. We require the model to encode questions and answers independently as high-dimensional (e.g. 512d) vectors, such that the relevance of a QA pair can be computed by taking their dot-product, as in Henderson et al. (2017).[1] This technique enables retrieval of relevant answers using approximate nearest neighbor search, which is sub-linear in the number of documents, and in practice close to log(N). This condition rules out powerful models like BERT that perform best on reading comprehension metrics. Note that these approaches could be used to rerank a small set of retrieved candidate answers, but we leave the evaluation of such multi-stage systems to future work.

[1] Other distance metrics are possible. Another popular option for nearest neighbor search is cosine distance. Note that models using cosine distance can still compute relevance through a dot-product, provided the final encoding vectors are L2-normalized.
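The remark about cosine distance can be checked numerically. The following is a minimal sketch using random vectors in place of real encodings; the 512 dimensions follow the example in the text, while the helper name `l2_normalize` and the toy data are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row of x to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


rng = np.random.default_rng(0)
question = rng.normal(size=(1, 512))     # one toy question embedding
answers = rng.normal(size=(1000, 512))   # toy answer embeddings

# Cosine similarity computed explicitly from unnormalized vectors.
cosine = (question @ answers.T) / (
    np.linalg.norm(question, axis=-1, keepdims=True)
    * np.linalg.norm(answers, axis=-1)
)

# The same quantity as a plain dot product over L2-normalized vectors.
dot_normalized = l2_normalize(question) @ l2_normalize(answers).T

# Both scorings therefore select the same best answer.
best_by_cosine = int(np.argmax(cosine))
best_by_dot = int(np.argmax(dot_normalized))
```

Because the two score matrices agree, a model trained with cosine similarity can be served from an index that only supports dot-product search, as long as the stored vectors are normalized once at indexing time.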

Third, we focus on sentence-level retrieval. In practice, sentences are a good size to present a user with a “detailed” answer. Note, in cases where highlighting the relevant span within a sentence is important, a separate highlighting module could be learned that takes a retrieved sentence as input. Similarly, future work could improve on extractive solutions by adding a generative component to synthesize answers out of the extracted answer(s).

Fourth, a retrieval model should be context aware, in the sense that the context surrounding a sentence should affect its appropriateness as an answer. For example, an ideal QA system should be able to tell that the bolded sentence in Figure 2 is a good answer to the question, since the context makes it clear that “The official language” refers to the official language of Nigeria.

Question: What is Nigeria’s official language? Answer in Context: […] Nigeria has one of the largest populations of youth in the world. The country is viewed as a multinational state, as it is inhabited by over 500 ethnic groups, of which the three largest are the Hausa, Igbo and Yoruba; these ethnic groups speak over 500 different languages, and are identified with wide variety of cultures. The official language is English. […]

Figure 2: An example from SQuAD 1.1 where looking at the surrounding context is necessary to determine the relevance of the answer sentence.

Finally, we believe a strong model should be general purpose, with the ability to generalize to new domains and datasets gracefully. For this reason, we advocate using a retrieval evaluation drawn from a specific task/domain that is never used for model training. In the case of our SQuAD-based evaluation, we evaluate on retrieval over the entire SQuAD 1.1 training set, with the understanding that all SQuAD data is off-limits for model training. Additionally, we recommend not training on Wikipedia, as this is the source of the SQuAD passages. This increases our confidence that a model that evaluates well on our retrieval metrics can be applied to a wide range of open-domain QA tasks.[2]

[2] We strongly assert that when NLP models are used in applied systems, it is generally preferable to evaluate alternative models using data that is as distinct as reasonably possible from model training data. While this is common practice in some sub-fields of NLP such as machine translation, it is still unfortunately very common to assess other NLP models on dev and test data that is very similar to their training data (e.g., harvested from the same source using a common pipeline and a common pool of annotators). This makes it more difficult to interpret claims of models approaching “human-level” performance.

3 Method

In this section, we describe our method for constructing the Retrieval Question-Answering (ReQA) task from SQuAD. However, our approach is general and can be used to construct a ReQA task from many existing machine reading based QA tasks.

3.1 SQuAD 1.1

SQuAD 1.1 is a reading comprehension challenge that consists of over 100,000 questions composed to be answerable by text from Wikipedia articles. The data is organized into paragraphs, where each paragraph has multiple associated questions. Each question can have one or more answers in its paragraph.[3]

[3] Typically, multiple answers come from the same sentence. For example, the question “Where did Super Bowl 50 take place?” is associated with three answers found within the sentence “The game was played on February 7, 2016, at Levi’s Stadium in the San Francisco Bay Area at Santa Clara, California.” The answer spans are: [Santa Clara, California], [Levi’s Stadium] and [Levi’s Stadium in the San Francisco Bay Area at Santa Clara, California.].

We choose SQuAD 1.1 for our initial ReQA evaluation because it is a widely studied dataset, and covers many question types. However, it is equally possible to construct ReQA tasks from other QA datasets including the more recent SQuAD 2.0 (Rajpurkar et al., 2018).[4] Other datasets amenable to ReQA conversion include MS MARCO (Nguyen et al., 2016), TriviaQA (Joshi et al., 2017) and Natural Questions (Kwiatkowski et al., 2019). These tasks improve upon SQuAD by using real questions that weren’t “back-written” with advance knowledge of the answer. However, for these tasks, we note that the answers are derived from web documents retrieved by Bing or Google where the question is used as the search query.[5] This introduces a bias toward answers that are already retrievable through traditional search methods. Recall from Figure 1 that many answers in SQuAD 1.1 are found in “off-topic” documents, and we would like our evaluation to measure the ability to retrieve such answers.

[4] SQuAD 2.0 includes questions with no answer in the paragraph. While these questions are useful for testing machine reading over fixed passages, we can’t be sure that such questions aren’t answered by another sentence in the larger corpus.

[5] For MS MARCO and TriviaQA, the top 10 or 50 web documents returned by Bing are used. For Natural Questions, documents are limited to Wikipedia articles that appeared in the top five Google search results.

3.2 Retrieval SQuAD

To turn SQuAD into a retrieval task, we first split each paragraph into sentences using a custom sentence-breaking tool included in our public release. For the SQuAD 1.1 train set, splitting 18,896 paragraphs produces 91,707 sentences. Next, we construct an “answer index” containing each sentence as a candidate answer. The model being evaluated computes an answer embedding for each answer (using any encoding strategy), given only the sentence and its surrounding paragraph as input. Crucially, this computation must be done independently of any specific question. The answer index construction process is described more formally in the algorithm “Constructing the answer index” below.

Similarly, we embed each question using the model’s question encoder, with the restriction that only the question text be used. For the SQuAD 1.1 train set, this gives 87,599 questions.


Constructing the answer index
Require: c is a representation of a SQuAD dataset; S is a function that accepts a string of text, s, and returns a sequence of sentences, [s_0, s_1, …, s_n]; Ea is the embedding function, which takes answer text, a, into points in ℝ^n.
Ensure: A list of (sentence, encoding) tuples.

function EncodeIndex(c, S, Ea)
    I ← new list
    for x in c                          ▷ for every passage
        for p in x.paragraph
            for s in S(p.context)
                s_e ← Ea(s, p.context)
                append (s, s_e) to I
    return I
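The index construction above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the released evaluation code: `split_sentences` is a naive regex stand-in for the custom sentence breaker, `toy_answer_encoder` is a hashed bag-of-words stand-in for a real answer encoder, and the `data` / `paragraphs` / `context` keys follow the public SQuAD JSON layout.

```python
import re

import numpy as np


def split_sentences(text):
    """Toy stand-in for S: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def toy_answer_encoder(sentence, context, dim=8):
    """Toy stand-in for Ea: hashed bag of words, with the context mixed in at low weight."""
    vec = np.zeros(dim)
    for weight, text in ((1.0, sentence), (0.1, context)):
        for token in re.findall(r"\w+", text.lower()):
            # Deterministic bucket per token (Python's built-in hash() is salted per process).
            vec[sum(map(ord, token)) % dim] += weight
    return vec


def encode_index(dataset, split, encode_answer):
    """Return a list of (sentence, embedding) tuples, one per candidate answer."""
    index = []
    for article in dataset["data"]:              # for every passage
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for sentence in split(context):
                # The encoder sees the sentence and its paragraph, never a question.
                index.append((sentence, encode_answer(sentence, context)))
    return index


dataset = {
    "data": [
        {
            "title": "Nigeria",
            "paragraphs": [
                {
                    "context": "The country is viewed as a multinational state. "
                    "The official language is English."
                }
            ],
        }
    ]
}
index = encode_index(dataset, split_sentences, toy_answer_encoder)
```

Passing the splitter and encoder in as arguments mirrors the roles of S and Ea in the algorithm, so a real sentence breaker or neural encoder could be substituted without changing the loop.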

After all questions and answers are encoded, we compute a “relevance score” for each question-answer pair by taking the dot-product of the question and answer embeddings. These scores can be used to rank all 91K candidate answers for every question, and compute standard ranking metrics such as mean reciprocal rank (MRR) and recall (R@k).


Scoring questions and answers
Require: Q[q×n] is a matrix of question embeddings in ℝ^n, arranged so that the i-th row, Q[i], corresponds to the embedding of q_i; A[a×n] is a matrix of answer embeddings, also in ℝ^n, derived from the answer index, I, and arranged so that the i-th row, A[i], corresponds to the embedding of a_i.
Ensure: R[q×a], a matrix of ranking data that can be used to compute metrics such as MRR and R@k. It is arranged so that the i-th row is a vector of dot-product scores for q_i, that is, [q_i·a_0, q_i·a_1, …, q_i·a_a].

function Score(Q, A)
    S[q×a] ← Q Aᵀ                       ▷ compute dot-products
    R[q×a] ← new matrix
    for i ← 1 to q
        R[i] ← rankdata(S[i])
    return R
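The scoring step and the two metrics can be sketched as follows. This is a minimal illustration with toy matrices, not the released evaluation code. Instead of calling a `rankdata` routine, it counts how many candidates outscore the best correct answer, and it assumes a question counts as a hit at k if any of its acceptable answers appears in the top k. The function names and the boolean `correct` mask are assumptions made for the example.

```python
import numpy as np


def rank_of_correct(scores, correct):
    """1-based rank of the best-scoring correct answer for each question.

    scores:  [num_questions, num_answers] matrix of dot-product scores.
    correct: boolean mask of the same shape; True marks an acceptable answer.
    """
    best_correct = np.where(correct, scores, -np.inf).max(axis=1, keepdims=True)
    # Rank = 1 + number of candidates scoring strictly higher than the best correct one.
    return 1 + (scores > best_correct).sum(axis=1)


def mean_reciprocal_rank(ranks):
    return float(np.mean(1.0 / ranks))


def recall_at_k(ranks, k):
    return float(np.mean(ranks <= k))


# Two toy questions and three toy candidate answers in a 2-d embedding space.
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
A = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
correct = np.array([[True, False, False], [False, False, True]])

scores = Q @ A.T                      # all question-answer dot products at once
ranks = rank_of_correct(scores, correct)
mrr = mean_reciprocal_rank(ranks)
r_at_1 = recall_at_k(ranks, 1)
r_at_2 = recall_at_k(ranks, 2)
```

In this toy case the first question ranks its answer first and the second ranks it second, so MRR is 0.75, R@1 is 0.5 and R@2 is 1.0. A mask with several True entries per row accommodates questions that have more than one acceptable answer.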

3.3 Known Shortcomings

There are a few known shortcomings of the proposed construction process. First, a question may become ambiguous or underspecified when removed from the context of a specific document and paragraph. For example, SQuAD 1.1 contains the question “What instrument did he mostly compose for?” This question makes sense in the original context of the Wikipedia article on Frédéric Chopin, but is underspecified when asked in isolation, and could reasonably have other answers. One possible resolution would be to include the context title as part of the question context. However this is unrealistic from the point of view of end systems where the user doesn’t have a specific document in mind.

A related but rarer issue arises when exactly the same question is asked in different contexts, as in Figure 3. In this case, we consider both answers correct for the purposes of our evaluation metrics.

Despite these shortcomings, we show in Section 4 that the constructed dataset is still challenging for existing approaches, motivating the development of better answer retrieval models. We leave the improvement of the construction process for future work.

Question: How tall is Mount Olympus? Wikipedia Page: Greece Answer: Mount Olympus, the mythical abode of the Greek Gods, culminates at Mytikas peak 2,918 metres (9,573 ft), the highest in the country.   Question: How tall is Mount Olympus? Wikipedia Page: Cyprus Answer: The highest point on Cyprus is Mount Olympus at 1,952 m (6,404 ft), located in the centre of the Troodos range.

Figure 3: Two identical questions from SQuAD 1.1 with answers on different pages.

4 Models and Results

In this section we evaluate neural models and classic information retrieval techniques on the ReQA benchmark.

4.1 Neural Baselines

Dual encoder models are learned functions that collocate queries and results in a shared embedding space. This architecture has shown strong performance on sentence-level retrieval tasks, including conversational response retrieval (Henderson et al., 2017; Yang et al., 2018), translation pair retrieval (Guo et al., 2018; Yang et al., 2019b) and similar text retrieval (Gillick et al., 2018). A dual encoder for use with ReQA has the schematic shape illustrated in Figure 4.

Figure 4: A schematic dual encoder for question-answer retrieval.

As our primary neural baseline, we take the recently released universal sentence encoder QA (USE-QA) model from Yang et al. (2019c). This is a multilingual QA retrieval model that co-trains a question-answer dual encoder along with secondary tasks of translation ranking and natural language inference. The model uses sub-word tokenization, with a 128k “sentencepiece” vocabulary (Kudo and Richardson, 2018). Question and answer text are encoded independently using a 6-layer transformer encoder (Vaswani et al., 2017), and then reduced to a fixed-length vector through average pooling. The final encoding dimensionality is 512. The training corpus contains over a billion question-answer pairs from popular online forums and QA websites like Reddit and StackOverflow.
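The average-pooling step that turns per-token encoder outputs into one fixed-length vector can be illustrated with a short sketch. The text only states that average pooling is used; the padding mask and the function name `average_pool` are assumptions added here so that batches of unequal-length inputs are handled sensibly.

```python
import numpy as np


def average_pool(token_vectors, mask):
    """Mean of the token vectors at positions where mask is 1.

    token_vectors: [batch, seq_len, dim] encoder outputs.
    mask:          [batch, seq_len], 1 for real tokens and 0 for padding.
    """
    mask = mask[..., None].astype(token_vectors.dtype)   # [batch, seq_len, 1]
    summed = (token_vectors * mask).sum(axis=1)          # [batch, dim]
    counts = np.maximum(mask.sum(axis=1), 1.0)           # avoid division by zero
    return summed / counts


# One toy sequence of three positions; the last position is padding.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = average_pool(tokens, mask)
```

The pooled vector has the same dimensionality regardless of input length, which is what allows every question and answer to live in one shared 512-dimensional space.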

As a second neural baseline, we include an internal QA model (QALite) designed for use on mobile devices. Like USE-QA, this model is trained over online forum data, and uses a transformer-based text encoder. The core differences are reduction in width and depth of model layers, and reduction of sub-word vocabulary size.

Table 1 presents the ReQA SQuAD results for our baseline models. As expected, the larger USE-QA model outperforms the smaller QALite model. The R@1 score of 0.439 indicates that USE-QA is able to retrieve the correct answer from a pool of 91,707 candidates roughly 44% of the time. Table 2 illustrates the tradeoff between model accuracy and resource usage.

Model     MRR     R@1     R@5     R@10
USE-QA    0.539   0.439   0.656   0.727
QALite    0.412   0.325   0.507   0.576
Table 1: Mean reciprocal rank (MRR) and recall (R@k) performance of neural baselines on ReQA SQuAD.
Model     Size (MB)   Latency (ms) [7]   Memory (MB)
USE-QA    392.9       17.3               71.8
QALite    2.6         10.2               3.6
Table 2: Time and space tradeoffs of different models. Latency was measured on a PC with 12 CPUs running at 3492 MHz.

[7] This is the latency for encoding a single piece of text. However, by batching the encoding requests, it’s possible to significantly reduce the amortized encoding time. In practice, batch sizes of 200 provide an amortized speedup of up to 5x.
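To put the latency figures in context, the following back-of-the-envelope sketch estimates how long it would take to encode all 91,707 candidate sentences. It assumes a single machine with no parallelism beyond batching, uses the single-item latencies from Table 2, and treats the reported 5x batching speedup as a best case; the function and constant names are illustrative.

```python
NUM_SENTENCES = 91_707                                  # answer index size for SQuAD 1.1 train
SINGLE_ITEM_LATENCY_MS = {"USE-QA": 17.3, "QALite": 10.2}
MAX_BATCH_SPEEDUP = 5.0                                 # "up to 5x" with batches of 200


def index_encoding_seconds(latency_ms, speedup=1.0):
    """Estimated wall-clock seconds to encode every candidate sentence."""
    return NUM_SENTENCES * latency_ms / 1000.0 / speedup


for model, latency in SINGLE_ITEM_LATENCY_MS.items():
    one_by_one = index_encoding_seconds(latency)
    batched = index_encoding_seconds(latency, MAX_BATCH_SPEEDUP)
    print(
        f"{model}: {one_by_one / 60:.1f} min one at a time, "
        f"{batched / 60:.1f} min at best with batching"
    )
```

Under these assumptions, building the USE-QA index takes roughly 26 minutes when sentences are encoded one at a time and about 5 minutes at best with batching. This is a one-off cost, since answer embeddings are computed independently of any question.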

4.2 BM25 Baseline

While neural retrieval systems are gaining popularity, TF-IDF based methods remain the dominant method for document retrieval, with the BM25 family of ranking functions providing a strong baseline (Robertson and Zaragoza, 2009). Unlike the neural models described above that can directly retrieve content at the sentence level, such methods generally consist of two stages: document retrieval, followed by sentence highlighting (Mitra and Craswell, 2018). Previous work in open domain question answering has shown that BM25 is a difficult baseline to beat when questions were written with advance knowledge of the answer (Lee et al., 2019).
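For readers unfamiliar with BM25, the sketch below implements one common textbook variant of the ranking function. It is not the implementation used in the experiments reported here: the tokenizer, the IDF smoothing, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against the query with a common BM25 variant."""
    docs = [tokenize(d) for d in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher weight; the "1 +" keeps the IDF positive.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


documents = [
    "The official language is English.",
    "The highest point on Cyprus is Mount Olympus.",
]
scores = bm25_scores("What is the official language?", documents)
```

The score depends only on term overlap between the query and each document, which is why such methods struggle when a good answer shares few words with the question.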

To obtain our baseline using traditional IR methods, we constructed a paragraph-level retrieval task which allows a direct comparison between the neural systems in Table 1 and BM25. We evaluate BM25 by measuring its ability to recall the paragraph containing the answer to the question.[8] To get a paragraph retrieval score for our neural baselines, we run sentence retrieval as before, and use the retrieved sentence to select the enclosing paragraph. As shown in Table 3, the USE-QA neural baseline outperforms BM25 on paragraph retrieval.

[8] Our experiments make use of the implementation at with default hyperparameter settings.

Model      MRR     R@1     R@5     R@10
USE-QA     0.634   0.533   0.756   0.823
QALite     0.503   0.407   0.613   0.689
BM25 [9]   0.609   0.524   0.710   0.764
Table 3: Performance of various models on paragraph-level retrieval.

[9] Statistics were computed on a subset of 9,425 questions, due to slow scoring speed.
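The paragraph-level scoring of the neural baselines can be sketched as follows. The text states only that the retrieved sentence selects its enclosing paragraph; ranking paragraphs by their best-scoring sentence is an assumption made here, as are the function name and the toy inputs.

```python
import numpy as np


def paragraph_ranks(scores, sentence_to_paragraph, gold_paragraph):
    """1-based rank of the gold paragraph for each question.

    scores:                [num_questions, num_sentences] sentence-level scores.
    sentence_to_paragraph: paragraph id of each candidate sentence.
    gold_paragraph:        id of the paragraph containing each question's answer.
    """
    num_paragraphs = int(sentence_to_paragraph.max()) + 1
    ranks = []
    for question_scores, gold in zip(scores, gold_paragraph):
        # A paragraph inherits the score of its best sentence.
        best = np.full(num_paragraphs, -np.inf)
        np.maximum.at(best, sentence_to_paragraph, question_scores)
        ranks.append(1 + int((best > best[gold]).sum()))
    return np.array(ranks)


# Four candidate sentences: the first two in paragraph 0, the last two in paragraph 1.
sentence_to_paragraph = np.array([0, 0, 1, 1])
scores = np.array([[0.9, 0.2, 0.5, 0.1],
                   [0.1, 0.2, 0.7, 0.3]])
gold_paragraph = np.array([1, 1])
ranks = paragraph_ranks(scores, sentence_to_paragraph, gold_paragraph)
```

The resulting ranks can be passed to the same MRR and recall computations used at the sentence level, which makes the neural and BM25 numbers directly comparable.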

5 Related Work

Open domain question answering is the problem of answering a question from a large collection of documents (Voorhees and Tice, 2000). Successful systems usually follow a two-step approach to answer a given question: first retrieve relevant articles or blocks, and then scan the returned text to identify the answer using a reading comprehension model (Jurafsky and Martin, 2018; Kratzwald and Feuerriegel, 2018; Yang et al., 2019a; Lee et al., 2019). While the reading comprehension step has been widely studied with many existing datasets (Rajpurkar et al., 2016; Nguyen et al., 2016; Dunn et al., 2017; Kwiatkowski et al., 2019), machine reading at scale is still a challenging task for the community.

Chen et al. (2017) recently proposed DrQA, treating Wikipedia as a knowledge base over which to answer factoid questions from SQuAD (Rajpurkar et al., 2016), CuratedTREC (Baudiš and Šedivý, 2015) and other sources. The task measures how well a system can successfully extract the answer span given a question, but it still relies on a document retrieval step. The ReQA evaluation differs from the DrQA task by skipping the intermediate step and retrieving the answer sentence directly.

There is also a growing interest in answer selection at scale. Surdeanu et al. (2008) construct a dataset with 142,627 question-answer pairs from Yahoo! Answers, with the goal of retrieving the right answer from all answers given a question. However, the dataset is limited to “how to” questions, which simplifies the problem by restricting it to a specific domain. Additionally, the underlying data is not as broadly accessible as SQuAD and other more recent QA datasets, due to more restrictive terms of use.

WikiQA (Yang et al., 2015) is another task involving large-scale sentence-level answer selection. The candidate sentences are, however, limited to a small set of documents returned by Bing search, and the dataset is much smaller than ReQA over SQuAD. WikiQA consists of 3,047 questions and 29,258 candidate answers, while ReQA over SQuAD contains 87,599 questions and 91,707 candidate answers. Moreover, as discussed in Section 3.1, restricting the domain of answers to top search engine results limits the evaluation’s applicability for testing end-to-end retrieval.

Finally, Cakaloglu et al. (2018) made use of SQuAD for a retrieval task at the paragraph level. We extend this work by investigating sentence-level retrieval and by providing strong sentence-level and paragraph-level baselines over a replicable construction of a retrieval evaluation set from the SQuAD data. Further, while Cakaloglu et al. (2018) trained their model on data drawn from SQuAD, we would like to highlight that our own strong baselines do not make use of any training data from SQuAD. We advocate for future work to attempt a similar approach of using sources of model training and evaluation data that are as distinct as possible in order to provide a better picture of how well models generally perform a task.

6 Conclusion

In this paper, we introduce Retrieval Question-Answering (ReQA) as a new benchmark for evaluating end-to-end answer retrieval models. The task assesses how well models are able to retrieve relevant sentence-level answers to queries from a large corpus. We describe a general method for converting reading comprehension QA tasks into cross-document answer retrieval tasks. Using SQuAD 1.1 as an example, we construct the ReQA SQuAD task, and evaluate several models on sentence- and paragraph-level answer retrieval. We find that a freely available neural baseline, USE-QA, outperforms a strong information retrieval baseline, BM25, on paragraph retrieval, suggesting that end-to-end answer retrieval can offer improvements over pipelined systems that first retrieve documents and then select answers within them. We release our code for both evaluation and conversion of the SQuAD 1.1 dataset for the ReQA task.


Acknowledgments

We thank our teammates from Descartes and other Google groups for their feedback and suggestions. We would like to recognize Javier Snaider, Igor Krivokon and Ray Kurzweil. Special thanks go to Jonni Kanerva for developing the tailored sentence-breaking algorithm.


References

  • Baudiš and Šedivý (2015) Petr Baudiš and Jan Šedivý. 2015. Modeling of the question answering task in the YodaQA system. In Proceedings of the 6th International Conference on Experimental IR Meets Multilinguality, Multimodality, and Interaction - Volume 9283, CLEF’15, pages 222–228, Berlin, Heidelberg. Springer-Verlag.
  • Cakaloglu et al. (2018) Tolgahan Cakaloglu, Christian Szegedy, and Xiaowei Xu. 2018. Text embeddings for retrieval from a large knowledge base. CoRR, abs/1810.10176.
  • Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Dunn et al. (2017) Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Güney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. CoRR, abs/1704.05179.
  • Gillick et al. (2018) Daniel Gillick, Alessandro Presta, and Gaurav Singh Tomar. 2018. End-to-end retrieval in continuous space. CoRR, abs/1811.08008.
  • Guo et al. (2018) Mandy Guo, Qinlan Shen, Yinfei Yang, Heming Ge, Daniel Cer, Gustavo Hernandez Abrego, Keith Stevens, Noah Constant, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Effective parallel corpus mining using bilingual sentence embeddings. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 165–176, Brussels, Belgium. Association for Computational Linguistics.
  • Henderson et al. (2017) Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-Hsuan Sung, László Lukács, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. 2017. Efficient natural language response suggestion for smart reply. CoRR, abs/1705.00652.
  • Henderson et al. (2019) Matthew Henderson, Ivan Vulić, Daniela Gerz, Iñigo Casanueva, Paweł Budzianowski, Sam Coope, Georgios Spithourakis, Tsung-Hsien Wen, Nikola Mrkšić, and Pei-Hao Su. 2019. Training neural response selection for task-oriented dialogue systems. arXiv preprint arXiv:1906.01543.
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada. Association for Computational Linguistics.
  • Jurafsky and Martin (2018) Daniel Jurafsky and James H. Martin. 2018. Speech and Language Processing (3rd Edition, in draft). Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
  • Kratzwald and Feuerriegel (2018) Bernhard Kratzwald and Stefan Feuerriegel. 2018. Adaptive document retrieval for deep question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 576–581, Brussels, Belgium. Association for Computational Linguistics.
  • Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics.
  • Lee et al. (2019) Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. arXiv preprint arXiv:1906.00300.
  • Mitra and Craswell (2018) Bhaskar Mitra and Nick Craswell. 2018. An introduction to neural information retrieval. Foundations and Trends in Information Retrieval, 13(1):1–126.
  • Nguyen et al. (2016) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. CoRR, abs/1611.09268.
  • Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
  • Robertson and Zaragoza (2009) Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333–389.
  • Surdeanu et al. (2008) Mihai Surdeanu, Massimiliano Ciaramita, and Hugo Zaragoza. 2008. Learning to rank answers on large online QA collections. In Proceedings of ACL-08: HLT, pages 719–727, Columbus, Ohio. Association for Computational Linguistics.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS. Curran Associates, Inc.
  • Voorhees and Tice (2000) Ellen M. Voorhees and Dawn M. Tice. 2000. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’00, pages 200–207, New York, NY, USA. ACM.
  • Yang et al. (2019a) Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019a. End-to-end open-domain question answering with BERTserini. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 72–77, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Yang et al. (2015) Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2013–2018, Lisbon, Portugal. Association for Computational Linguistics.
  • Yang et al. (2019b) Yinfei Yang, Gustavo Hernández Ábrego, Steve Yuan, Mandy Guo, Qinlan Shen, Daniel Cer, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019b. Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. CoRR, abs/1902.08564.
  • Yang et al. (2019c) Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019c. Multilingual universal sentence encoder for semantic retrieval. arXiv preprint arXiv:1907.04307.
  • Yang et al. (2018) Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning semantic textual similarity from conversations. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 164–174, Melbourne, Australia. Association for Computational Linguistics.