Faux Polyglot: A Study on Information Disparity in Multilingual Large Language Models

Abstract

Although the multilingual capability of LLMs offers new opportunities to overcome the language barrier, do these capabilities translate into real-life scenarios where linguistic divide and knowledge conflicts between multilingual sources are known occurrences? In this paper, we studied LLMs' linguistic preference in a cross-language RAG-based information search setting. We found that LLMs displayed systemic bias towards information in the same language as the query language in both document retrieval and answer generation. Furthermore, in scenarios where no information is in the language of the query, LLMs prefer documents in high-resource languages during generation, potentially reinforcing the dominant views. Such bias exists for both factual and opinion-based queries. Our results highlight the linguistic divide within multilingual LLMs in information search systems. The seemingly beneficial multilingual capability of LLMs may backfire on information parity by reinforcing language-specific information cocoons or filter bubbles, further marginalizing low-resource views.
