Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models

Abstract

Language confusion -- where large language models (LLMs) generate unintended languages against the user's need -- remains a critical challenge, especially for English-centric models. We present the first mechanistic interpretability (MI) study of language confusion, combining behavioral benchmarking with neuron-level analysis. Using the Language Confusion Benchmark (LCB), we show that confusion points (CPs) -- specific positions where language switches occur -- are central to this phenomenon. Through layer-wise analysis with TunedLens and targeted neuron attribution, we reveal that transition failures in the final layers drive confusion. We further demonstrate that editing a small set of critical neurons, identified via comparative analysis with multilingual-tuned models, substantially mitigates confusion without harming general competence or fluency. Our approach matches multilingual alignment in confusion reduction for most languages and yields cleaner, higher-quality outputs. These findings provide new insights into the internal dynamics of LLMs and highlight neuron-level interventions as a promising direction for robust, interpretable multilingual language modeling.
