CaLMQA: Exploring culturally specific long-form question answering across 23 languages

Abstract

Large language models (LLMs) are used for long-form question answering (LFQA), which requires them to generate paragraph-length answers to complex questions. While LFQA has been well-studied in English, this research has not been extended to other languages. To bridge this gap, we introduce CaLMQA, a collection of 1.5K complex culturally specific questions spanning 23 languages and 51 culturally agnostic questions translated from English into 22 other languages. We define culturally specific questions as those uniquely or more likely to be asked by people from cultures associated with the question's language. We collect naturally-occurring questions from community web forums and hire native speakers to write questions to cover under-resourced, rarely-studied languages such as Fijian and Kirundi. Our dataset contains diverse, complex questions that reflect cultural topics (e.g. traditions, laws, news) and the language usage of native speakers. We automatically evaluate a suite of open- and closed-source models on CaLMQA by detecting incorrect language and token repetitions in answers, and observe that the quality of LLM-generated answers degrades significantly for some low-resource languages. Lastly, we perform human evaluation on a subset of models and languages. Manual evaluation reveals that model performance is significantly worse for culturally specific questions than for culturally agnostic questions. Our findings highlight the need for further research in non-English LFQA and provide an evaluation framework.
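
As a rough illustration of the token-repetition check mentioned above, the following minimal sketch flags an answer when any single n-gram recurs too often. It is not the paper's implementation: the function names, the whitespace tokenization, and the defaults `n=4` and `max_repeats=3` are all assumptions chosen for illustration, and whitespace splitting would not suit languages written without spaces between words.

```python
from collections import Counter


def max_ngram_repeats(tokens, n):
    """Return the highest count of any single n-gram in tokens (0 if too short)."""
    if len(tokens) < n:
        return 0
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return max(counts.values())


def has_token_repetition(answer, n=4, max_repeats=3):
    """Flag an answer whose most frequent n-gram occurs more than max_repeats times.

    Assumption: tokens are whitespace-separated; n and max_repeats are
    illustrative thresholds, not values taken from the paper.
    """
    tokens = answer.split()
    return max_ngram_repeats(tokens, n) > max_repeats


if __name__ == "__main__":
    looping = "the cat sat on the mat " * 10
    normal = "Kirundi is spoken in Burundi by most of the population."
    print(has_token_repetition(looping))
    print(has_token_repetition(normal))
```

A check of this kind only catches degenerate looping output; it says nothing about whether the answer is correct or written in the intended language.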
