Exploiting Domain-Specific Parallel Data on Multilingual Language Models for Low-resource Language Translation

  • 2024-12-27 08:25:52
  • Surangika Ranathunga, Shravan Nayak, Shih-Ting Cindy Huang, Yanke Mao, Tong Su, Yun-Hsiang Ray Chan, Songchen Yuan, Anthony Rinaldi, Annie En-Shiun Lee

Abstract

Neural Machine Translation (NMT) systems built on multilingual sequence-to-sequence Language Models (msLMs) fail to deliver expected results when the amount of parallel data for a language, as well as the language's representation in the model, are limited. This restricts the capabilities of domain-specific NMT systems for low-resource languages (LRLs). As a solution, parallel data from auxiliary domains can be used either to fine-tune or to further pre-train the msLM. We present an evaluation of the effectiveness of these two techniques in the context of domain-specific LRL-NMT. We also explore the impact of domain divergence on NMT model performance. We recommend several strategies for utilizing auxiliary parallel data in building domain-specific NMT models for LRLs.

 
