Phonology-Augmented Statistical Framework for Machine Transliteration using Limited Linguistic Resources

Abstract

Transliteration converts words in a source language (e.g., English) into words in a target language (e.g., Vietnamese). This conversion considers the phonological structure of the target language, as the transliterated output needs to be pronounceable in the target language. For example, a word in Vietnamese that begins with a consonant cluster is phonologically invalid and thus would be an incorrect output of a transliteration system. Most statistical transliteration approaches, albeit being widely adopted, do not explicitly model the target language's phonology, which often results in invalid outputs. The problem is compounded by the limited linguistic resources available when converting foreign words to transliterated words in the target language. In this work, we present a phonology-augmented statistical framework suitable for transliteration, especially when only limited linguistic resources are available. We propose the concept of pseudo-syllables as structures representing how segments of a foreign word are organized according to the syllables of the target language's phonology. We performed transliteration experiments on Vietnamese and Cantonese. We show that the proposed framework outperforms the statistical baseline by up to 44.68% relative, when there are limited training examples (587 entries).
