Utilizing Language Relatedness to improve Machine Translation: A Case Study on Languages of the Indian Subcontinent

Abstract

In this work, we present an extensive study of statistical machine translation involving languages of the Indian subcontinent. These languages are related by genetic and contact relationships. We describe the similarities between Indic languages arising from these relationships. We explore how lexical and orthographic similarity among these languages can be utilized to improve translation quality between Indic languages when only limited parallel corpora are available. We also explore how the structural correspondence between Indic languages can be utilized to re-use linguistic resources for English to Indic language translation. Our observations span 90 language pairs from 9 Indic languages and English. To the best of our knowledge, this is the first large-scale study specifically devoted to utilizing language relatedness to improve translation between related languages.
