Family of Origin and Family of Choice: Massively Parallel Lexiconized Iterative Pretraining for Severely Low Resource Machine Translation

  • 2021-04-14 19:54:42
  • Zhong Zhou, Alex Waibel

Abstract

We translate a closed text that is known in advance into a severely low resource language by leveraging massive source parallelism. Our contribution is four-fold. Firstly, we rank 124 source languages empirically to determine their closeness to the low resource language and select the top few. We call the linguistic definition of language family Family of Origin (FAMO), and we call the empirical definition of higher-ranked languages using our metrics Family of Choice (FAMC). Secondly, we build an Iteratively Pretrained Multilingual Order-preserving Lexiconized Transformer (IPML) to train on ~1,000 lines (~3.5%) of low resource data. Using English as a hypothetical low resource language to translate from Spanish, we obtain a +24.7 BLEU increase over a multilingual baseline, and a +10.2 BLEU increase over our asymmetric baseline in the Bible dataset. Thirdly, we also use a real severely low resource Mayan language, Eastern Pokomchi. Finally, we add an order-preserving lexiconized component to translate named entities accurately. We build a massive lexicon table for 2,939 Bible named entities in 124 source languages; the table includes many entities that occur only once and covers more than 66 severely low resource languages. Training on 1,093 randomly sampled lines of low resource data, we reach a 30.3 BLEU score for Spanish-English translation tested on 30,022 lines of the Bible, and a 42.8 BLEU score for Portuguese-English translation on the medical EMEA dataset.
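
The abstract does not spell out how the order-preserving lexiconized component works. As a rough illustration only, one possible reading is that named entities found in a lexicon table are swapped for placeholders numbered in order of appearance before translation, then restored from the same table afterwards. The sketch below assumes that reading; the toy `LEXICON`, the `__NE<k>__` placeholder format, and the function names `tag_entities` and `restore_entities` are all hypothetical and are not taken from the paper.

```python
import re

# Hypothetical toy lexicon table: entity id -> surface form per language code.
# The table described in the abstract covers 2,939 entities in 124 languages.
LEXICON = {
    "E1": {"es": "Moisés", "en": "Moses"},
    "E2": {"es": "Egipto", "en": "Egypt"},
}


def tag_entities(sentence, lang, lexicon):
    """Replace known entities with placeholders numbered in order of appearance.

    Returns the tagged sentence and the list of entity ids in source order.
    """
    surface_to_id = {
        forms[lang]: eid for eid, forms in lexicon.items() if lang in forms
    }
    if not surface_to_id:
        return sentence, []
    # Try longer surface forms first so a longer name wins over its prefix.
    surfaces = sorted(surface_to_id, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(s) for s in surfaces) + r")\b")
    order = []

    def to_placeholder(match):
        order.append(surface_to_id[match.group(1)])
        return f"__NE{len(order) - 1}__"

    return pattern.sub(to_placeholder, sentence), order


def restore_entities(sentence, order, lang, lexicon):
    """Replace each placeholder with the target-language form of its entity."""

    def to_surface(match):
        return lexicon[order[int(match.group(1))]][lang]

    return re.sub(r"__NE(\d+)__", to_surface, sentence)


# Usage: tag the Spanish source, then restore entities in an English output.
tagged, order = tag_entities("Moisés salió de Egipto", "es", LEXICON)
restored = restore_entities("__NE0__ left __NE1__", order, "en", LEXICON)
```

Under this assumption, the translation model only has to copy each placeholder to the right position, so entities that occur once in the training data can still be rendered correctly from the table.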

 
