Abstract
The widespread success of large language models (LLMs) on NLP benchmarks has been accompanied by concerns that LLMs function primarily as stochastic parrots that reproduce texts similar to what they saw during pre-training, often erroneously. But what is the nature of their errors, and do these errors exhibit any regularities? In this work, we examine irrelevant context hallucinations, in which models integrate misleading contextual cues into their predictions. Through behavioral analysis, we show that these errors result from a structured yet flawed mechanism that we term class-based (mis)generalization, in which models combine abstract class cues with features extracted from the query or context to derive answers. Furthermore, mechanistic interpretability experiments on Llama-3, Mistral, and Pythia across 39 factual recall relation types reveal that this behavior is reflected in the model's internal computations: (i) abstract class representations are constructed in lower layers before being refined into specific answers in higher layers, (ii) feature selection is governed by two competing circuits -- one prioritizing direct query-based reasoning, the other incorporating contextual cues -- whose relative influences determine the final output. Our findings provide a more nuanced perspective on the stochastic parrot argument: through form-based training, LLMs can exhibit generalization leveraging abstractions, albeit in unreliable ways based on contextual cues -- what we term stochastic chameleons.