A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

Abstract

We propose a language-independent approach for improving statistical machine translation for morphologically rich languages using a hybrid morpheme-word representation where the basic unit of translation is the morpheme, but word boundaries are respected at all stages of the translation process. Our model extends the classic phrase-based model by means of (1) word boundary-aware morpheme-level phrase extraction, (2) minimum error-rate training for a morpheme-level translation model using word-level BLEU, and (3) joint scoring with morpheme- and word-level language models. Further improvements are achieved by combining our model with the classic one. The evaluation on English to Finnish using Europarl (714K sentence pairs; 15.5M English words) shows statistically significant improvements over the classic model based on BLEU and human judgments.
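
As a rough illustration of what a hybrid morpheme-word representation might look like, the sketch below keeps morphemes as the tokens while recording word boundaries, so that output can be rejoined into words before a word-level metric such as BLEU is computed. The trailing "+" continuation marker, the function name join_morphemes, and the Finnish segmentation are assumptions made for this example only; the abstract does not specify the notation or implementation actually used.

```python
def join_morphemes(tokens, marker="+"):
    """Rebuild words from morpheme tokens.

    Assumed (hypothetical) convention: a token ending in `marker`
    continues into the next token; any other token closes a word.
    """
    words = []
    current = ""
    for tok in tokens:
        if tok.endswith(marker):
            # Word-internal morpheme: strip the marker and keep accumulating.
            current += tok[: -len(marker)]
        else:
            # Word-final morpheme: emit the completed word.
            words.append(current + tok)
            current = ""
    if current:
        # Dangling continuation at the end of the sequence: emit what we have.
        words.append(current)
    return words


# Illustrative segmentation of "talossani on iso" ("my house is big" is not
# implied; the segmentation is only an example of the marker convention).
morphemes = ["talo+", "ssa+", "ni", "on", "iso"]
words = join_morphemes(morphemes)
print(words)
```

Because the marker makes word boundaries recoverable from the morpheme sequence, a system of this kind could translate over morphemes and still be tuned and scored on whole words.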
