DialUp! Modeling the Language Continuum by Adapting Models to Dialects and Dialects to Models

  • 2025-01-27 23:53:04
  • Niyati Bafna, Emily Chang, Nathaniel R. Robinson, David R. Mortensen, Kenton Murray, David Yarowsky, Hale Sirin


Most of the world's languages and dialects are low-resource, and lack support in mainstream machine translation (MT) models. However, many of them have a closely-related high-resource language (HRL) neighbor, and differ in linguistically regular ways from it. This underscores the importance of model robustness to dialectical variation and cross-lingual generalization to the HRL dialect continuum. We present DialUp, consisting of a training-time technique for adapting a pretrained model to dialectical data (M->D), and an inference-time intervention adapting dialectical data to the model expertise (D->M). M->D induces model robustness to potentially unseen and unknown dialects by exposure to synthetic data exemplifying linguistic mechanisms of dialectical variation, whereas D->M treats dialectical divergence for known target dialects. These methods show considerable performance gains for several dialects from four language families, and modest gains for two other language families. We also conduct feature and error analyses, which show that language varieties with low baseline MT performance are more likely to benefit from these approaches.

