Exploring the Multilingual NLG Evaluation Abilities of LLM-Based Evaluators

Abstract

Previous research has shown that LLMs have potential in multilingual NLG evaluation tasks. However, existing research has not fully explored the differences in the evaluation capabilities of LLMs across different languages. To this end, this study provides a comprehensive analysis of the multilingual evaluation performance of 10 recent LLMs, spanning high-resource and low-resource languages, through correlation analysis, perturbation attacks, and fine-tuning. We found that 1) excluding the reference answer from the prompt and using large-parameter LLM-based evaluators lead to better performance across various languages; 2) most LLM-based evaluators show a higher correlation with human judgments in high-resource languages than in low-resource languages; 3) in the languages where LLM-based evaluators are most sensitive to perturbation attacks, they also tend to exhibit the highest correlation with human judgments; and 4) fine-tuning with data from a particular language yields a broadly consistent enhancement in the model's evaluation performance across diverse languages. Our findings highlight the imbalance in LLMs' evaluation capabilities across different languages and suggest that low-resource language scenarios deserve more attention.
