NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?

Abstract

In evaluating the long-context capabilities of large language models (LLMs), identifying content relevant to a user's query from original long documents is a crucial prerequisite for any LLM to answer questions based on long text. We present NeedleBench, a framework consisting of a series of progressively more challenging tasks for assessing bilingual long-context capabilities, spanning multiple length intervals (4k, 8k, 32k, 128k, 200k, 1000k, and beyond) and different depth ranges, allowing the strategic insertion of critical data points in different text depth zones to rigorously test the retrieval and reasoning capabilities of models in diverse contexts. We use the NeedleBench framework to assess how well the leading open-source models can identify key information relevant to the question and apply that information to reasoning in bilingual long texts. Furthermore, we propose the Ancestral Trace Challenge (ATC) to mimic the complexity of logical reasoning challenges that are likely to be present in real-world long-context tasks, providing a simple method for evaluating LLMs in dealing with complex long-context situations. Our results suggest that current LLMs have significant room for improvement in practical long-context applications, as they struggle with the complexity of logical reasoning challenges that are likely to be present in real-world long-context tasks. All code and resources are available at OpenCompass: https://github.com/open-compass/opencompass.
