MlingConf: A Comprehensive Study of Multilingual Confidence Estimation on Large Language Models

Abstract

The tendency of Large Language Models (LLMs) to generate hallucinations raises concerns regarding their reliability. Confidence estimations that indicate how trustworthy a generation is therefore become essential. However, LLM confidence estimation in languages other than English remains underexplored. This paper addresses this gap with a comprehensive investigation of Multilingual Confidence estimation (MlingConf) on LLMs, focusing on both language-agnostic (LA) and language-specific (LS) tasks to explore the performance and language dominance effects of multilingual confidence estimation across tasks. The benchmark comprises four meticulously checked and human-evaluated high-quality multilingual datasets for LA tasks and one for the LS task, tailored to the specific social, cultural, and geographical contexts of a language. Our experiments reveal that on LA tasks English exhibits notable linguistic dominance over other languages in confidence estimation, while on LS tasks prompting LLMs in the language related to the question demonstrates better linguistic dominance in multilingual confidence estimation. These phenomena inspire a simple yet effective native-tone prompting strategy that employs language-specific prompts for LS tasks, improving LLMs' reliability and accuracy on those tasks.
