Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures

Abstract

To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (up to a 37% accuracy gap, despite random chance at 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded.
