Abstract
With the rapid development of evaluation datasets that assess LLMs' understanding across a wide range of subjects and domains, identifying a suitable language understanding benchmark has become increasingly challenging. In this work, we explore LLM evaluation challenges for low-resource language understanding and introduce ProverbEval, an LLM evaluation benchmark for low-resource languages that uses proverbs to probe language understanding in culture-specific scenarios. We benchmark various LLMs and explore factors that create variability in the benchmarking process. We observed performance variances of up to 50%, depending on the order in which answer choices were presented in multiple-choice tasks. Native language proverb descriptions significantly improve performance on tasks such as proverb generation. Additionally, monolingual evaluations consistently outperformed their cross-lingual counterparts. We argue that special attention must be given to the order of choices, choice of prompt language, task variability, and generation tasks when creating LLM evaluation benchmarks.