Natural Language Processing for Dialects of a Language: A Survey

Abstract

State-of-the-art natural language processing (NLP) models are trained on massive corpora and report superlative performance on evaluation datasets. This survey examines an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches. We describe a wide range of NLP tasks in two categories: natural language understanding (NLU) (for tasks such as dialect classification, sentiment analysis, parsing, and NLU benchmarks) and natural language generation (NLG) (for summarisation, machine translation, and dialogue systems). The survey is also broad in its coverage of languages, which include English, Arabic, and German, among others. We observe that past work in NLP concerning dialects goes deeper than mere dialect classification. It ranges from early approaches that used sentence transduction to recent approaches that integrate hypernetworks into LoRA. We expect that this survey will be useful to NLP researchers interested in building equitable language technologies by rethinking LLM benchmarks and model architectures.
