Abstract
A recent trend in LLMs is developing recurrent sub-quadratic models that improve long-context processing efficiency. We investigate leading large long-context models, focusing on how their fixed-size recurrent memory affects their performance. Our experiments reveal that, even when these models are trained for extended contexts, long contexts remain underutilized. Specifically, we demonstrate that a chunk-based inference procedure, which identifies and processes only the most relevant portion of the input, can mitigate recurrent memory failures and be effective for many long-context tasks: On LongBench, our method improves the overall performance of Falcon3-Mamba-Inst-7B by 14%, Falcon-Mamba-Inst-7B by 28%, RecurrentGemma-IT-9B by 50%, and RWKV6-Finch-7B by 51%. Surprisingly, this simple approach also leads to state-of-the-art results on the challenging LongBench v2 benchmark, showing competitive performance with Transformers of equivalent size. Furthermore, our findings raise questions about whether recurrent models genuinely exploit long-range dependencies, as our single-chunk strategy delivers stronger performance, even on tasks that presumably require cross-context relations.
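
To make the idea of chunk-based inference concrete, the following is a minimal illustrative sketch, not the procedure evaluated in this work. It splits a long input into fixed-size chunks and keeps only the single chunk that a scoring function ranks as most relevant to the query. The chunk size, the function names, and the token-overlap score `overlap_score` are assumptions introduced here for illustration; the abstract does not specify how relevance is measured.

```python
from typing import Callable, List, Sequence, Tuple


def split_into_chunks(tokens: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split a token sequence into consecutive, non-overlapping chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def overlap_score(chunk: Sequence[str], query: Sequence[str]) -> float:
    """Hypothetical relevance score: number of chunk tokens that occur in the query."""
    query_tokens = set(query)
    return float(sum(1 for token in chunk if token in query_tokens))


def select_most_relevant_chunk(
    chunks: Sequence[Sequence[str]],
    query: Sequence[str],
    score: Callable[[Sequence[str], Sequence[str]], float] = overlap_score,
) -> Tuple[int, List[str]]:
    """Return the index and contents of the highest-scoring chunk.

    Ties are resolved in favor of the earliest chunk.
    """
    if not chunks:
        raise ValueError("chunks must be non-empty")
    best_index = max(range(len(chunks)), key=lambda i: score(chunks[i], query))
    return best_index, list(chunks[best_index])


if __name__ == "__main__":
    context = "a b c d e f g h".split()
    query = ["g", "h"]
    chunks = split_into_chunks(context, chunk_size=3)
    index, chunk = select_most_relevant_chunk(chunks, query)
    # Only the selected chunk (plus the query) would be passed to the recurrent model.
    print(index, chunk)
```

In this sketch the recurrent model would see only the selected chunk together with the query, so its fixed-size memory never has to absorb the full context. A real system would replace `overlap_score` with a model-based relevance signal.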