Abstract
Existing benchmarks for evaluating long-context language models (LCLMs) primarily focus on long-context recall, requiring models to produce short responses based on a few critical snippets while processing thousands of irrelevant tokens. We introduce LongProc (Long Procedural Generation), a new benchmark that requires both the integration of highly dispersed information and long-form generation. LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans. These tasks challenge LCLMs by testing their ability to follow detailed procedural instructions, synthesize and reason over dispersed information, and generate structured, long-form outputs (up to 8K tokens). Furthermore, as these tasks adhere to deterministic procedures and yield structured outputs, they enable reliable rule-based evaluation. We evaluate 17 LCLMs on LongProc across three difficulty levels, with maximum numbers of output tokens set at 500, 2K, and 8K. Notably, while all tested models claim a context window size above 32K tokens, open-weight models typically falter on 2K-token tasks, and closed-source models like GPT-4o show significant degradation on 8K-token tasks. Further analysis reveals that LCLMs struggle to maintain long-range coherence in long-form generations. These findings highlight critical limitations in current LCLMs and suggest substantial room for improvement. Data and code available at: https://princeton-pli.github.io/LongProc