Abstract
Large language models (LLMs) often lack culture-specific knowledge of daily life, especially across diverse regions and non-English languages. Existing benchmarks for evaluating LLMs' cultural sensitivities are limited to a single language or collected from online sources such as Wikipedia, which do not reflect the mundane everyday lifestyles of diverse regions. That is, information about the food people eat for their birthday celebrations, spices they typically use, musical instruments youngsters play, or the sports they practice in school is common cultural knowledge but uncommon in easily collected online sources, especially for underrepresented cultures. To address this issue, we introduce BLEnD, a hand-crafted benchmark designed to evaluate LLMs' everyday knowledge across diverse cultures and languages. BLEnD comprises 52.6k question-answer pairs from 16 countries/regions, in 13 different languages, including low-resource ones such as Amharic, Assamese, Azerbaijani, Hausa, and Sundanese. We construct the benchmark to include two formats of questions: short-answer and multiple-choice. We show that LLMs perform better for cultures that are highly represented online, with a maximum 57.34% difference in GPT-4, the best-performing model, in the short-answer format. For cultures represented by mid-to-high-resource languages, LLMs perform better in their local languages, but for cultures represented by low-resource languages, LLMs perform better in English than the local languages. We make our dataset publicly available at: https://github.com/nlee0212/BLEnD.