Abstract
Language technologies should be judged on their usefulness in real-world use cases. An often overlooked aspect in natural language processing (NLP) research and evaluation is language variation in the form of non-standard dialects or language varieties (hereafter, varieties). Most NLP benchmarks are limited to standard language varieties. To fill this gap, we propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties, which aggregates an extensive set of task-varied variety datasets (10 text-level tasks covering 281 varieties). This allows for a comprehensive evaluation of NLP system performance on different language varieties. We provide substantial evidence of performance disparities between standard and non-standard language varieties, and we also identify language clusters with large performance divergence across tasks. We believe DIALECTBENCH provides a comprehensive view of the current state of NLP for language varieties and one step towards advancing it further. Code/data: https://github.com/ffaisal93/DialectBench