ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding

Abstract

With the rapid development of evaluation datasets to assess LLMs' understanding across a wide range of subjects and domains, identifying a suitable language understanding benchmark has become increasingly challenging. In this work, we explore LLM evaluation challenges for low-resource language understanding and introduce ProverbEval, an LLM evaluation benchmark for low-resource languages that is based on proverbs and focuses on language understanding in culture-specific scenarios. We benchmark various LLMs and explore factors that create variability in the benchmarking process. We observed performance variances of up to 50%, depending on the order in which answer choices were presented in multiple-choice tasks. Native-language proverb descriptions significantly improve performance on tasks such as proverb generation. Additionally, monolingual evaluations consistently outperformed their cross-lingual counterparts. We argue that special attention must be given to the order of choices, the choice of prompt language, task variability, and generation tasks when creating LLM evaluation benchmarks.
