MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs

Abstract

Although recent Large Language Models (LLMs) have shown rapid improvement on reasoning benchmarks in English, the evaluation of such LLMs' multilingual reasoning capability across diverse languages and cultural contexts remains limited. Existing multilingual reasoning benchmarks are typically constructed by translating existing English reasoning benchmarks, which biases them toward reasoning problems grounded in English language and culture. In this work, we introduce the Multilingual Native Reasoning Challenge (MultiNRC), a benchmark designed to assess LLMs on more than 1,000 native, linguistically and culturally grounded reasoning questions written by native speakers in French, Spanish, and Chinese. MultiNRC covers four core reasoning categories: language-specific linguistic reasoning, wordplay & riddles, cultural/tradition reasoning, and math reasoning with cultural relevance. For cultural/tradition reasoning and math reasoning with cultural relevance, we also provide English equivalent translations of the multilingual questions, produced manually by native speakers fluent in English. This set of English equivalents allows a direct comparison of LLM reasoning capacity in other languages vs. English on the same reasoning questions. We systematically evaluate 14 current leading LLMs covering most LLM families on MultiNRC and its English equivalent set. The results show that (1) current LLMs are still not good at native multilingual reasoning, with none scoring above 50% on MultiNRC; (2) LLMs exhibit distinct strengths and weaknesses in handling linguistic, cultural, and logical reasoning tasks; (3) most models perform substantially better on math reasoning in English than in the original languages (+10%), indicating persistent challenges with culturally grounded knowledge.
