AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs

Abstract

Arabic, with its rich diversity of dialects, remains significantly underrepresented in Large Language Models, particularly in dialectal variations. We address this gap by introducing seven synthetic datasets in dialects alongside Modern Standard Arabic (MSA), created using Machine Translation (MT) combined with human post-editing. We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation. We evaluate LLMs on dialect comprehension and generation, focusing specifically on low-resource Arabic dialects. Additionally, we introduce the first-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions, providing a novel dimension to LLM evaluation. Our findings demonstrate that while Arabic-specific models like Jais and AceGPT outperform multilingual models on dialectal tasks, significant challenges persist in dialect identification, generation, and translation. This work contributes ~45K post-edited samples, a cultural benchmark, and highlights the importance of tailored training to improve LLM performance in capturing the nuances of diverse Arabic dialects and cultural contexts. We will release the dialectal translation models and benchmarks curated in this study.
