Abstract
Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is increasingly the case as we shift from extractive to generative models. The recent success of large language models (LLMs) for QA aggravates lexical matching failures since candidate answers become longer, thereby making matching with the gold answers even more challenging. Without accurate evaluation, the true progress in open-domain QA remains unknown. In this paper, we conduct a thorough analysis of various open-domain QA models, including LLMs, by manually evaluating their answers on a subset of NQ-open, a popular benchmark. Our assessments reveal that while the true performance of all models is significantly underestimated, the performance of the InstructGPT (zero-shot) LLM increases by nearly +60%, making it on par with existing top models, and the InstructGPT (few-shot) model actually achieves a new state-of-the-art on NQ-open. We also find that more than 50% of lexical matching failures are attributed to semantically equivalent answers. We further demonstrate that regex matching ranks QA models consistently with human judgments, although it still suffers from unnecessary strictness. Finally, we demonstrate that automated evaluation models are a reasonable surrogate for lexical matching in some circumstances, but not for long-form answers generated by LLMs. The automated models struggle in detecting hallucinations in LLM answers and are thus unable to evaluate LLMs. At this time, there appears to be no substitute for human evaluation.
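To make the failure mode concrete, the following is a minimal sketch of lexical (exact) matching and regex matching against a list of gold answers. It is an illustration only, not the evaluation code used in this work: the normalization steps (lowercasing, stripping punctuation and English articles) are an assumed, commonly used convention, the function names `normalize`, `exact_match`, and `regex_match` are hypothetical, and the question, gold answer, and pattern are invented for the example rather than taken from NQ-open.

```python
import re
import string
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumed normalization; the source does not specify one.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(candidate: str, gold_answers: Sequence[str]) -> bool:
    """True only if the normalized candidate equals some normalized gold answer."""
    cand = normalize(candidate)
    return any(cand == normalize(gold) for gold in gold_answers)


def regex_match(candidate: str, patterns: Sequence[str]) -> bool:
    """True if any gold pattern occurs anywhere in the candidate."""
    return any(re.search(p, candidate, flags=re.IGNORECASE) for p in patterns)


# Hypothetical gold annotations for "Who wrote Hamlet?"
gold = ["William Shakespeare"]
gold_patterns = [r"\b(william\s+)?shakespeare\b"]

short_answer = "Shakespeare"
long_answer = "The play was written by William Shakespeare."

em_exact = exact_match("William Shakespeare", gold)   # verbatim answer
em_short = exact_match(short_answer, gold)            # equivalent, but shorter
em_long = exact_match(long_answer, gold)              # correct, but long-form
rx_short = regex_match(short_answer, gold_patterns)
rx_long = regex_match(long_answer, gold_patterns)
```

In this sketch both the shorter answer and the long-form answer are rejected by exact matching even though a human would accept them, while the hand-written pattern accepts both. The pattern still has to anticipate every acceptable surface form, which is one way the strictness noted above can arise.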