ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization

  • 2020-10-09 20:28:06
  • Shiyue Zhang, Benjamin Frey, Mohit Bansal

Abstract

Cherokee is a highly endangered Native American language spoken by the Cherokee people. The Cherokee culture is deeply embedded in its language. However, only approximately 2,000 fluent first-language Cherokee speakers remain in the world, and the number is declining every year. To help save this endangered language, we introduce ChrEn, a Cherokee-English parallel dataset, to facilitate machine translation research between Cherokee and English. Compared to some popular machine translation language pairs, ChrEn is extremely low-resource, containing only 14k sentence pairs in total. We split our parallel data in ways that facilitate both in-domain and out-of-domain evaluation. We also collect 5k Cherokee monolingual data to enable semi-supervised learning. Besides these datasets, we propose several Cherokee-English and English-Cherokee machine translation systems. We compare SMT (phrase-based) versus NMT (RNN-based and Transformer-based) systems; supervised versus semi-supervised (via language model, back-translation, and BERT/Multilingual-BERT) methods; as well as transfer learning versus multilingual joint training with 4 other languages. Our best results are 15.8/12.7 BLEU for in-domain and 6.5/5.0 BLEU for out-of-domain Chr-En/En-Chr translations, respectively, and we hope that our dataset and systems will encourage future work by the community for Cherokee language revitalization. Our data, code, and demo will be publicly available at https://github.com/ZhangShiyue/ChrEn.

 
