ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian

Abstract

As the use of large language models for problems beyond simple text understanding or generation increases, assessing their abilities and limitations becomes crucial. While significant progress has been made in this area over the last few years, most research has focused on benchmarking English, leaving other languages underexplored. This makes evaluating the reasoning and robustness of language models in Ukrainian particularly challenging. The purpose of this work is to establish a comprehensive benchmark for evaluating the reasoning capabilities of large language models in the Ukrainian language. This paper presents the ZNO-Eval benchmark, based on real exam tasks from Ukraine's standardized educational testing system: the External Independent Evaluation and the National Multi-subject Test. With single-answer, multiple-choice, matching, and open-ended questions from diverse subjects, including Ukrainian language, mathematics, history, and geography, this dataset paves the way toward a thorough analysis of reasoning capabilities across different domains and complexities. Evaluation of several well-known language models, such as GPT-3.5-Turbo, GPT-4o, GPT-4-Turbo, Mistral Large, Claude 3 Opus, and Gemini-1.5 Pro, on this benchmark demonstrated the superiority of GPT-4o in both common knowledge reasoning and intricate language tasks. At the same time, Gemini Pro and GPT-4 Turbo excelled in the arithmetic domain, leading in single-answer and open-ended math problems. While all models were close to maximum performance in text-only common knowledge tasks like history and geography, a gap remains for Ukrainian language and math, highlighting the importance of developing specialized language benchmarks for more accurate assessments of model capabilities and limitations across different languages and contexts.
