HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly

Abstract

Many benchmarks exist for evaluating long-context language models (LCLMs), yet developers often rely on synthetic tasks such as needle-in-a-haystack (NIAH) or an arbitrary subset of tasks. However, it remains unclear whether these benchmarks reflect the diverse downstream applications of LCLMs, and such inconsistencies further complicate model comparison. We investigate the underlying reasons behind these practices and find that existing benchmarks often provide noisy signals due to limited coverage of applications, insufficient context lengths, unreliable metrics, and incompatibility with base models. In this work, we introduce HELMET (How to Evaluate Long-context Models Effectively and Thoroughly), a comprehensive benchmark encompassing seven diverse, application-centric categories. We also address several issues in previous benchmarks by adding controllable lengths up to 128K tokens, model-based evaluation for reliable metrics, and few-shot prompting for robustly evaluating base models. Consequently, we demonstrate that HELMET offers more reliable and consistent rankings of frontier LCLMs. Through a comprehensive study of 59 LCLMs, we find that (1) synthetic tasks like NIAH do not reliably predict downstream performance; (2) the diverse categories in HELMET exhibit distinct trends and low correlations with each other; and (3) while most LCLMs achieve perfect NIAH scores, open-source models significantly lag behind closed ones when tasks require full-context reasoning or following complex instructions, and the gap widens as length increases. Finally, we recommend using our RAG tasks for fast model development, as they are easy to run and better predict other downstream performance; ultimately, we advocate for a holistic evaluation across diverse tasks.
