MultiZebraLogic: A Multilingual Logical Reasoning Benchmark

Abstract

Measuring the full abilities of large language models (LLMs) requires benchmarks representing multiple tasks. We aim to create large, high-quality datasets for comparison of logical reasoning skills across several languages and of suitable difficulty for LLMs of varying reasoning ability. We explore multiple ways of increasing difficulty. We generate zebra puzzles in multiple languages, themes, and sizes, with 14 different clue types and 8 red herring types (uninformative clues). We find that puzzle sizes 2x3 and 4x5 are sufficiently challenging for GPT-4o mini (a non-reasoning model) and o3-mini (a reasoning model), respectively. Including 5 red herrings decreases o3-mini puzzle-level accuracy on 4x5 puzzles by 15$\pm$7 %. Scores of o3-mini on 4x5 puzzles are not significantly affected by the use of English vs. Danish or the common houses theme vs. the country-specific smoerrebroed theme. We find no correlation between difficulty and the selected clue types. Datasets of 128+1024 puzzles are published as MultiZebraLogic in each of nine Germanic languages for sizes 2x3 and 4x5. We publish code for puzzle generation, designed for adaptability into more languages and themes.
