Domain-Specific Text Generation for Machine Translation

Abstract

Preservation of domain knowledge from the source to target is crucial in any translation workflow. It is common in the translation industry to receive highly specialized projects, where there is hardly any parallel in-domain data. In such scenarios where there is insufficient in-domain data to fine-tune Machine Translation (MT) models, producing translations that are consistent with the relevant context is challenging. In this work, we propose a novel approach to domain adaptation leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation for MT, simulating the domain characteristics of either (a) a small bilingual dataset, or (b) the monolingual source text to be translated. Combining this idea with back-translation, we can generate huge amounts of synthetic bilingual in-domain data for both use cases. For our investigation, we use the state-of-the-art Transformer architecture. We employ mixed fine-tuning to train models that significantly improve translation of in-domain texts. More specifically, in both scenarios, our proposed methods achieve improvements of approximately 5-6 BLEU and 2-3 BLEU, respectively, on the Arabic-to-English and English-to-Arabic language pairs. Furthermore, the outcome of human evaluation corroborates the automatic evaluation results.
