One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks

Abstract

Language is not monolithic. While benchmarks, including those designed for multiple languages, are often used as proxies to evaluate the performance of Large Language Models (LLMs), they tend to overlook the nuances of within-language variation, and thus fail to model the experience of speakers of non-standard dialects. Focusing on African American Vernacular English (AAVE), we present the first study aimed at objectively assessing the fairness and robustness of LLMs in handling dialects in canonical reasoning tasks, including algorithm, math, logic, and integrated reasoning. We introduce \textbf{ReDial} (\textbf{Re}asoning with \textbf{Dial}ect Queries), a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We hire AAVE speakers, including experts with computer science backgrounds, to rewrite seven popular benchmarks, such as HumanEval and GSM8K. With ReDial, we evaluate widely used LLMs, including GPT, Claude, Llama, Mistral, and the Phi model families. Our findings reveal that \textbf{almost all of these widely used models show significant brittleness and unfairness to queries in AAVE}. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries. Moreover, it highlights how mainstream LLMs provide unfair service to dialect speakers in reasoning tasks, laying a critical foundation for relevant future research. Code and data can be accessed at https://github.com/fangru-lin/redial_dialect_robustness_fairness.
