Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

Abstract

Large Language Models (LLMs) have demonstrated impressive performance on Natural Language Processing (NLP) tasks, such as Question Answering, Summarization, and Classification. The use of LLMs as evaluators, that can rank or score the output of other models (usually LLMs) has become increasingly popular, due to the limitations of current evaluation techniques including the lack of appropriate benchmarks, metrics, cost, and access to human annotators. While LLMs are capable of handling approximately 100 languages, the majority of languages beyond the top 20 lack systematic evaluation across various tasks, metrics, and benchmarks. This creates an urgent need to scale up multilingual evaluation to ensure a precise understanding of LLM performance across diverse languages. LLM-based evaluators seem like the perfect solution to this problem, as they do not require human annotators, human-created references, or benchmarks and can theoretically be used to evaluate any language covered by the LLM. In this paper, we investigate whether LLM-based evaluators can help scale up multilingual evaluation. Specifically, we calibrate LLM-based evaluation against 20k human judgments of five metrics across three text-generation tasks in eight languages. Our findings indicate that LLM-based evaluators may exhibit bias towards higher scores and should be used with caution and should always be calibrated with a dataset of native speaker judgments, particularly in low-resource and non-Latin script languages.
