Exposing Attention Glitches with Flip-Flop Language Modeling

Abstract

Why do large language models sometimes output factual inaccuracies and exhibit erroneous reasoning? The brittleness of these models, particularly when executing long chains of reasoning, currently seems to be an inevitable price to pay for their advanced capabilities of coherently synthesizing knowledge, pragmatics, and abstract thought. Towards making sense of this fundamentally unsolved problem, this work identifies and analyzes the phenomenon of attention glitches, in which the Transformer architecture's inductive biases intermittently fail to capture robust reasoning. To isolate the issue, we introduce flip-flop language modeling (FFLM), a parametric family of synthetic benchmarks designed to probe the extrapolative behavior of neural language models. This simple generative task requires a model to copy binary symbols over long-range dependencies, ignoring the tokens in between. We find that Transformer FFLMs suffer from a long tail of sporadic reasoning errors, some of which we can eliminate using various regularization techniques. Our preliminary mechanistic analyses show why the remaining errors may be very difficult to diagnose and resolve. We hypothesize that attention glitches account for (some of) the closed-domain hallucinations in natural LLMs.
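
To make the copying task concrete, the following is a minimal sketch of how such sequences might be generated and checked. It is an illustration under stated assumptions, not the benchmark's reference implementation: the instruction tokens `w` (write), `r` (read), and `i` (ignore), the function names, and the default probabilities are all hypothetical choices made here, since the abstract only specifies that binary symbols must be copied across long ranges while intervening tokens are ignored.

```python
import random


def make_flip_flop_sequence(num_pairs, p_ignore=0.8, p_read=0.1, seed=None):
    """Generate a list of tokens alternating (instruction, bit).

    Assumed token set (hypothetical, chosen for illustration):
      "w" b : store bit b in memory
      "i" b : distractor; b is random and must be ignored
      "r" b : b must equal the most recently written bit
    """
    rng = random.Random(seed)
    memory = rng.randint(0, 1)
    # Start with a write so that every later read is well-defined.
    tokens = ["w", str(memory)]
    for _ in range(num_pairs - 1):
        u = rng.random()
        if u < p_ignore:
            tokens += ["i", str(rng.randint(0, 1))]
        elif u < p_ignore + p_read:
            tokens += ["r", str(memory)]
        else:
            memory = rng.randint(0, 1)
            tokens += ["w", str(memory)]
    return tokens


def is_valid_flip_flop(tokens):
    """Check that every read reproduces the most recently written bit."""
    memory = None
    for instr, bit in zip(tokens[0::2], tokens[1::2]):
        if instr == "w":
            memory = str(bit)
        elif instr == "r" and str(bit) != memory:
            return False
    return True


if __name__ == "__main__":
    seq = make_flip_flop_sequence(num_pairs=16, seed=0)
    print(" ".join(seq))
    print("valid:", is_valid_flip_flop(seq))
```

Raising `p_ignore` in this sketch lengthens the typical gap between a write and the next read, which is one way a parametric family of such tasks could vary the range of the dependency a model has to track.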
