Abstract
Popular QA benchmarks like SQuAD have driven progress on the task of identifying answer spans within a specific passage, with models now surpassing human performance. However, retrieving relevant answers from a huge corpus of documents is still a challenging problem and places different requirements on the model architecture. There is growing interest in developing scalable answer retrieval models trained end-to-end, bypassing the typical document retrieval step. In this paper, we introduce Retrieval Question Answering (ReQA), a benchmark for evaluating large-scale sentence- and paragraph-level answer retrieval models. We establish baselines using both neural encoding models and classical information retrieval techniques. We release our evaluation code to encourage further work on this challenging task.