Abstract
Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, predominantly focus on English, revealing a significant gap in multilingual evaluation frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. Our test set features 500 human-annotated instructions spanning various task capabilities, along with human judgment scores across six languages. This enables benchmarking of general-purpose multilingual LLMs and facilitates meta-evaluation of evaluator LLMs. The proposed model, Hercule, is a cross-lingual evaluation model that addresses the scarcity of reference answers in the target language by learning to assign scores to responses based on easily available reference answers in English. Our experiments show that Hercule aligns more closely with human judgments than proprietary models do, demonstrating the effectiveness of such cross-lingual evaluation in low-resource scenarios. Further, it is also effective in zero-shot evaluation on unseen languages. This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment. All code, datasets, and models will be publicly available to enable further research in this important area.