Abstract
Evaluating text generation capabilities of large language models (LLMs) is challenging, particularly for low-resource languages where methods for direct assessment are scarce. We propose MUG-Eval, a novel framework that evaluates LLMs' multilingual generation capabilities by transforming existing benchmarks into conversational tasks and measuring the LLMs' accuracies on those tasks. We specifically designed these conversational tasks to require effective communication in the target language. We then use task success rate as a proxy for successful conversation generation. Our approach offers two key advantages: it is independent of language-specific NLP tools or annotated datasets, which are limited for most languages, and it does not rely on LLMs-as-judges, whose evaluation quality degrades outside a few high-resource languages. We evaluate 8 LLMs across 30 languages spanning high-, mid-, and low-resource categories, and we find that MUG-Eval correlates strongly with established benchmarks ($r$ > 0.75) while enabling standardized comparisons across languages and models. Our framework provides a robust and resource-efficient solution for evaluating multilingual generation that can be extended to thousands of languages.