Abstract
In the age of misinformation, hallucination -- the tendency of Large Language Models (LLMs) to generate non-factual or unfaithful responses -- represents the main risk for their global utility. Despite LLMs becoming increasingly multilingual, the vast majority of research on detecting and quantifying LLM hallucination is (a) English-centric and (b) focused on machine translation (MT) and summarization, tasks that are less common ``in the wild'' than open information seeking. In contrast, we aim to quantify the extent of LLM hallucination across languages in knowledge-intensive long-form question answering. To this end, we train a multilingual hallucination detection model and conduct a large-scale study across 30 languages and 6 open-source LLM families. We start from an English hallucination detection dataset and rely on MT to generate (noisy) training data in other languages. We also manually annotate gold data for five high-resource languages; we then demonstrate, for these languages, that the estimates of hallucination rates are similar between silver (LLM-generated) and gold test sets, validating the use of silver data for estimating hallucination rates for other languages. For the final rate estimation, we build a knowledge-intensive QA dataset for 30 languages with LLM-generated prompts and Wikipedia articles as references. We find that, while LLMs generate longer responses with more hallucinated tokens for higher-resource languages, there is no correlation between length-normalized hallucination rates of languages and their digital representation. Further, we find that smaller LLMs exhibit larger hallucination rates than larger models.