Large Language Models Struggle to Learn Long-Tail Knowledge

Abstract

The internet contains a wealth of knowledge -- from the birthdays of historical figures to tutorials on how to code -- all of which may be learned by language models. However, there is huge variability in the number of times a given piece of information appears on the web. In this paper, we study the relationship between the knowledge memorized by large language models and the information in their pre-training datasets. In particular, we show that a language model's ability to answer a fact-based question relates to how many documents associated with that question were seen during pre-training. We identify these relevant documents by entity linking pre-training datasets and counting documents that contain the same entities as a given question-answer pair. Our results demonstrate strong correlational and causal relationships between accuracy and relevant document count for numerous question answering datasets (e.g., TriviaQA), pre-training corpora (e.g., ROOTS), and model sizes (e.g., 176B parameters). Moreover, while larger models are better at learning long-tail knowledge, we estimate that today's models must be scaled by many orders of magnitude to reach competitive QA performance on questions with little support in the pre-training data. Finally, we show that retrieval-augmentation can reduce the dependence on relevant document count, presenting a promising approach for capturing the long-tail.
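
To make the counting step concrete, the following is a minimal sketch of how a relevant document count could be computed once entity linking has already been run. It is not the paper's implementation: the function name `count_relevant_documents`, the representation of each document as a set of entity IDs, and the toy entity IDs are all assumptions made for illustration.

```python
from typing import Iterable, Set


def count_relevant_documents(
    doc_entities: Iterable[Set[str]], qa_entities: Set[str]
) -> int:
    """Count documents whose linked entities include every entity in the QA pair.

    Assumes entity linking has already mapped each document (and the
    question-answer pair) to a set of entity IDs.
    """
    return sum(1 for entities in doc_entities if qa_entities <= entities)


# Toy corpus: each document is a set of hypothetical linked entity IDs.
docs = [
    {"Dante_Alighieri", "Florence", "Divine_Comedy"},
    {"Dante_Alighieri", "Ravenna"},
    {"Florence", "Medici"},
    {"Dante_Alighieri", "Florence"},
]

# Entities linked from a hypothetical question-answer pair.
qa_pair_entities = {"Dante_Alighieri", "Florence"}

relevant_count = count_relevant_documents(docs, qa_pair_entities)
print(relevant_count)
```

In this sketch a document counts as relevant only if it contains all entities of the question-answer pair; QA accuracy could then be grouped by this count to examine the relationship the abstract describes.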
