Morphological Word Segmentation on Agglutinative Languages for Neural Machine Translation

Abstract

Neural machine translation (NMT) has achieved impressive performance on machine translation tasks in recent years. However, for efficiency, a limited-size vocabulary containing only the top-N highest-frequency words is employed for model training, which leaves many rare and unknown words. This is particularly problematic when translating from low-resource, morphologically rich agglutinative languages, which have complex morphology and large vocabularies. In this paper, we propose a morphological word segmentation method on the source side for NMT that incorporates morphological knowledge to preserve the linguistic and semantic information in the word structure while reducing the vocabulary size at training time. It can also be used as a preprocessing tool to segment words in agglutinative languages for other natural language processing (NLP) tasks. Experimental results show that our morphologically motivated word segmentation method is better suited to the NMT model, achieving significant improvements on Turkish-English and Uyghur-Chinese machine translation tasks by reducing data sparseness and language complexity.
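
As a rough illustration of why segmenting agglutinative words shrinks the vocabulary, the toy sketch below splits a few Turkish word forms into a stem plus suffix tokens and compares the number of distinct types before and after. It is not the segmentation method proposed in this paper: the stem list, the suffix list, the `+` suffix marker, and the greedy `segment` function are all hypothetical, chosen only to make the effect visible on a handful of words.

```python
# Toy illustration only; the stems, suffixes, and greedy strategy are
# assumptions for this sketch, not the paper's segmentation method.
STEMS = ("ev", "kitap", "okul")          # house, book, school
SUFFIXES = ("ler", "lar", "de", "da")    # plural and locative variants


def segment(word):
    """Split a word into a known stem followed by known suffixes.

    Returns the word unchanged (as a one-element list) if it cannot be
    fully analysed with the toy stem and suffix lists.
    """
    for stem in sorted(STEMS, key=len, reverse=True):
        if not word.startswith(stem):
            continue
        pieces = [stem]
        rest = word[len(stem):]
        while rest:
            for suffix in SUFFIXES:
                if rest.startswith(suffix):
                    pieces.append("+" + suffix)  # mark suffix tokens
                    rest = rest[len(suffix):]
                    break
            else:
                return [word]  # unanalysable remainder: keep word whole
        return pieces
    return [word]


words = ("ev evler evlerde evde kitap kitaplar kitaplarda "
         "okul okullar okullarda okulda").split()

word_vocab = set(words)
morpheme_vocab = {piece for w in words for piece in segment(w)}

print(segment("evlerde"))
print(len(word_vocab), len(morpheme_vocab))
```

On this small word list, the eleven distinct surface forms reduce to seven distinct stem and suffix tokens, and each stem token is seen more often than any single inflected form, which is the sparsity reduction the abstract refers to.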
