Understanding and Mitigating Language Confusion in LLMs

Abstract

We investigate a surprising limitation of LLMs: their inability to consistently generate text in a user's desired language. We create the Language Confusion Benchmark (LCB) to evaluate such failures, covering 15 typologically diverse languages with existing and newly-created English and multilingual prompts. We evaluate a range of LLMs on monolingual and cross-lingual generation reflecting practical use cases, finding that Llama Instruct and Mistral models exhibit high degrees of language confusion and even the strongest models fail to consistently respond in the correct language. We observe that base and English-centric instruct models are more prone to language confusion, which is aggravated by complex prompts and high sampling temperatures. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning. We release our language confusion benchmark, which serves as a first layer of efficient, scalable multilingual evaluation at https://github.com/for-ai/language-confusion.
