CaLMQA: Exploring culturally specific long-form question answering across 23 languages

Abstract

Despite rising global usage of large language models (LLMs), their ability to generate long-form answers to culturally specific questions remains unexplored in many languages. To fill this gap, we perform the first study of textual multilingual long-form QA by creating CaLMQA, a dataset of 51.7K culturally specific questions across 23 different languages. We define culturally specific questions as those that refer to concepts unique to one or a few cultures, or have different answers depending on the cultural or regional context. We obtain these questions by crawling naturally-occurring questions from community web forums in high-resource languages, and by hiring native speakers to write questions in under-resourced, rarely-studied languages such as Fijian and Kirundi. Our data collection methodologies are translation-free, enabling the collection of culturally unique questions like "Kuber iki umwami wa mbere w'uburundi yitwa Ntare?" (Kirundi; English translation: "Why was the first king of Burundi called Ntare (Lion)?"). We evaluate factuality, relevance and surface-level quality of LLM-generated long-form answers, finding that (1) for many languages, even the best models make critical surface-level errors (e.g., answering in the wrong language, repetition), especially for low-resource languages; and (2) answers to culturally specific questions contain more factual errors than answers to culturally agnostic questions -- questions that have consistent meaning and answer across many cultures. We release CaLMQA to facilitate future research in cultural and multilingual long-form QA.
