Truth Knows No Language: Evaluating Truthfulness Beyond English

Abstract

We introduce a professionally translated extension of the TruthfulQA benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish. Truthfulness evaluations of large language models (LLMs) have primarily been conducted in English. However, the ability of LLMs to maintain truthfulness across languages remains under-explored. Our study evaluates 12 state-of-the-art open LLMs, comparing base and instruction-tuned models using human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our findings reveal that, while LLMs perform best in English and worst in Basque (the lowest-resourced language), overall truthfulness discrepancies across languages are smaller than anticipated. Furthermore, we show that LLM-as-a-Judge correlates more closely with human judgments than multiple-choice metrics, and that informativeness plays a critical role in truthfulness assessment. Our results also indicate that machine translation provides a viable approach for extending truthfulness benchmarks to additional languages, offering a scalable alternative to professional translation. Finally, we observe that universal knowledge questions are better handled across languages than context- and time-dependent ones, highlighting the need for truthfulness evaluations that account for cultural and temporal variability. Dataset and code are publicly available under open licenses.
