Abstract
Information retrieval in Large Language Models (LLMs) is increasingly recognized as intertwined with generation capabilities rather than mere lookup. While longer contexts are often assumed to improve retrieval, the effects of intra-context interference remain understudied. To address this, we adapt the proactive interference (PI) paradigm from cognitive science, where earlier information disrupts recall of newer updates. In humans, susceptibility to such interference is inversely linked to working memory capacity. We introduce PI-LLM, an evaluation that sequentially streams semantically related key-value updates and queries only the final values. Although these final values are clearly positioned just before the query, LLM retrieval accuracy declines log-linearly toward zero as interference accumulates; errors arise from retrieving previously overwritten values. Attempts to mitigate interference via prompt engineering (e.g., instructing models to ignore earlier input) yield limited success. These findings reveal a fundamental constraint on LLMs' ability to disentangle interference and flexibly manipulate information, suggesting a working memory bottleneck beyond mere context access. This calls for approaches that strengthen models' ability to suppress irrelevant content during retrieval.
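To make the evaluation design concrete, the following is a minimal sketch of how a trial in this style might be constructed and scored. It is an illustration under stated assumptions, not the actual PI-LLM implementation: the function names `build_trial` and `classify`, the `key: value` line format, the interleaved update order, and the wording of the final query are all hypothetical.

```python
import random


def build_trial(keys, values, n_updates, seed=0):
    """Build one hypothetical PI-style prompt.

    Each key receives `n_updates` values in an interleaved stream
    (an assumed ordering); only the last value per key is the target.
    Returns the prompt text and the full update history per key.
    """
    rng = random.Random(seed)
    history = {key: [] for key in keys}
    lines = []
    for _ in range(n_updates):
        for key in keys:
            value = rng.choice(values)
            history[key].append(value)  # later entries overwrite earlier ones
            lines.append(f"{key}: {value}")
    # The query comes last, so the final values sit just before it.
    lines.append("What is the current value of each key: " + ", ".join(keys) + "?")
    return "\n".join(lines), history


def classify(key, predicted, history):
    """Label a model answer as correct, an interference error, or other."""
    seen = history[key]
    if predicted == seen[-1]:
        return "correct"
    if predicted in seen[:-1]:
        # The model returned a value that was previously overwritten.
        return "interference"
    return "other"
```

Increasing `n_updates` while holding the keys fixed is one way to vary the amount of accumulated interference, and counting the `"interference"` label separately from `"other"` distinguishes retrieval of overwritten values from unrelated errors.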