Machine Translation by Projecting Text into the Same Phonetic-Orthographic Space Using a Common Encoding

Abstract

The use of subword embedding has proved to be a major innovation in Neural Machine Translation (NMT). It helps NMT learn better context vectors for Low Resource Languages (LRLs), so as to predict the target words by better modelling the morphologies of the two languages and also the morphosyntax transfer. Even so, its performance when translating from one Indian language to another is still not as good as for resource-rich languages. One reason for this is the relative morphological richness of Indian languages, while another is that most of them fall into the extremely low-resource or zero-shot categories. Since most major Indian languages use Indic or Brahmi-origin scripts, the text written in them is highly phonetic in nature and phonetically similar in terms of abstract letters and their arrangements. We use these characteristics of Indian languages and their scripts to propose an approach based on a common multilingual Latin-based encoding (WX notation) that takes advantage of language similarity while addressing the morphological complexity issue in NMT. This multilingual Latin-based encoding, together with Byte Pair Encoding (BPE), allows us to better exploit phonetic, orthographic, and lexical similarities to improve translation quality by projecting different but similar languages onto the same orthographic-phonetic character space. We verify the proposed approach through experiments on similar language pairs (Gujarati-Hindi, Marathi-Hindi, Nepali-Hindi, Maithili-Hindi, Punjabi-Hindi, and Urdu-Hindi) under low-resource conditions. The proposed approach shows an improvement in a majority of cases, in one case as much as ~10 BLEU points compared to baseline techniques for similar language pairs. We also get up to ~1 BLEU point of improvement on distant and zero-shot language pairs.
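
To make the idea of a shared orthographic-phonetic character space concrete, here is a minimal Python sketch. It is not the authors' converter. It assumes a small hand-picked subset of the WX mapping for Devanagari, and it relies on the fact that the Unicode Gujarati block (U+0A80 to U+0AFF) parallels the Devanagari block (U+0900 to U+097F) for the core letters used here, so a fixed code-point shift aligns the two scripts. The names `to_common`, `CONSONANTS`, `MATRAS`, and `GUJARATI_SHIFT` are hypothetical, and a full converter would need the complete table plus script-specific exceptions.

```python
# Toy projection of Devanagari and Gujarati text into one Latin encoding.
# ASSUMPTION: only a subset of WX is covered; this is an illustration,
# not the converter used in the paper.

# Partial WX table for Devanagari consonants (subset).
CONSONANTS = {
    "क": "k", "ख": "K", "ग": "g", "घ": "G", "च": "c", "ज": "j",
    "त": "w", "द": "x", "न": "n", "प": "p", "म": "m", "र": "r",
    "ल": "l", "स": "s", "ह": "h",
}

# Partial WX table for dependent vowel signs (matras).
MATRAS = {
    "\u093e": "A", "\u093f": "i", "\u0940": "I", "\u0941": "u",
    "\u0942": "U", "\u0947": "e", "\u094b": "o",
}

VIRAMA = "\u094d"        # suppresses the inherent vowel
GUJARATI_SHIFT = 0x0180  # U+0A80 (Gujarati) minus U+0900 (Devanagari)


def to_common(text: str) -> str:
    """Map Devanagari or Gujarati text to the shared Latin encoding."""
    out = []
    pending = False  # True when a consonant still awaits its vowel
    for ch in text:
        # Align Gujarati code points with their Devanagari counterparts.
        if 0x0A80 <= ord(ch) <= 0x0AFF:
            ch = chr(ord(ch) - GUJARATI_SHIFT)
        if ch in CONSONANTS:
            if pending:
                out.append("a")  # previous consonant keeps inherent 'a'
            out.append(CONSONANTS[ch])
            pending = True
        elif ch in MATRAS:
            out.append(MATRAS[ch])  # explicit vowel replaces inherent 'a'
            pending = False
        elif ch == VIRAMA:
            pending = False  # consonant cluster: no vowel emitted
        else:
            if pending:
                out.append("a")
            pending = False
            out.append(ch)  # pass through anything outside the subset
    if pending:
        out.append("a")
    return "".join(out)


hindi_word = "\u0928\u093e\u092e"     # Devanagari spelling of "name"
gujarati_word = "\u0aa8\u0abe\u0aae"  # Gujarati spelling of the same word
print(to_common(hindi_word), to_common(gujarati_word))
```

Under these assumptions both spellings map to the same Latin string, so a subword learner such as BPE trained on the encoded corpora sees one shared token sequence instead of two unrelated ones.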
