Family of Origin and Family of Choice: Massively Parallel Lexiconized Iterative Pretraining for Severely Low Resource Machine Translation

  • 2021-04-12 22:32:58
  • Zhong Zhou, Alex Waibel

Abstract

We translate a closed text that is known in advance into a severely low resource language by leveraging massive source parallelism. Our contribution is four-fold. Firstly, we rank 124 source languages empirically to determine their closeness to the low resource language and select the top few. We call the linguistic definition of language family Family of Origin (FAMO), and we call the empirical definition of higher-ranked languages using our metrics Family of Choice (FAMC). Secondly, we build an Iteratively Pretrained Multilingual Order-preserving Lexiconized Transformer (IPML) to train on ~1,000 lines (~3.5%) of low resource data from the Bible dataset and the medical EMEA dataset. Using English as a hypothetical low resource language to translate from Spanish, we obtain a +24.7 BLEU increase over a multilingual baseline, and a +10.2 BLEU increase over our asymmetric baseline. Thirdly, we also use a real severely low resource Mayan language, Eastern Pokomchi. Finally, we add an order-preserving lexiconized component to translate named entities accurately. We build a massive lexicon table for 2,939 Bible named entities in 124 source languages; it includes many entities that occur only once and covers more than 66 severely low resource languages. Training on 1,093 randomly sampled lines of low resource data, we reach a 30.3 BLEU score for Spanish-English translation when testing on 30,022 lines of the Bible, and a 42.8 BLEU score for Portuguese-English translation on the medical EMEA dataset.
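
The abstract does not spell out how the order-preserving lexiconized component works. As a rough illustration only, the sketch below shows one common way a lexicon table can be used for named entities: each known entity is replaced by a placeholder numbered in order of appearance before translation, and the placeholders are mapped back through the lexicon afterwards. The placeholder format `__NE0__`, the two-entry `LEXICON`, and both function names are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical two-entry lexicon (source-language entity -> target-language form).
# The table described in the abstract covers 2,939 entities in 124 languages.
LEXICON = {"Jesús": "Jesus", "Galilea": "Galilee"}


def tag_entities(sentence, lexicon):
    """Replace known entities with placeholders numbered in order of appearance."""
    # Longest names first so a longer entity wins over one that is its prefix.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    found = []

    def to_placeholder(match):
        found.append(match.group(0))
        return f"__NE{len(found) - 1}__"

    return pattern.sub(to_placeholder, sentence), found


def restore_entities(translated, found, lexicon):
    """Swap each placeholder back for the lexicon's target-language entity."""
    for index, source_name in enumerate(found):
        translated = translated.replace(f"__NE{index}__", lexicon[source_name])
    return translated


tagged, found = tag_entities("Jesús fue a Galilea", LEXICON)
# A translation model (not shown) would translate `tagged` and carry the
# placeholders through; the string below stands in for its output.
restored = restore_entities("__NE0__ went to __NE1__", found, LEXICON)
```

Because the placeholders are indexed by position in the source sentence, an entity that occurs only once in the training data can still be rendered correctly, provided the model copies the placeholder through unchanged.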

 
