Exploiting Domain-Specific Parallel Data on Multilingual Language Models for Low-resource Language Translation

Abstract

Neural Machine Translation (NMT) systems built on multilingual sequence-to-sequence Language Models (msLMs) fail to deliver expected results when the amount of parallel data for a language, as well as the language's representation in the model, are limited. This restricts the capabilities of domain-specific NMT systems for low-resource languages (LRLs). As a solution, parallel data from auxiliary domains can be used either to fine-tune or to further pre-train the msLM. We present an evaluation of the effectiveness of these two techniques in the context of domain-specific LRL-NMT. We also explore the impact of domain divergence on NMT model performance. We recommend several strategies for utilizing auxiliary parallel data in building domain-specific NMT models for LRLs.
