Anchor-based Bilingual Word Embeddings for Low-Resource Languages

Abstract

Good-quality monolingual word embeddings (MWEs) can be built for languages which have large amounts of unlabeled text. MWEs can be aligned to bilingual spaces using only a few thousand word translation pairs. For low-resource languages, training MWEs monolingually results in MWEs of poor quality, and thus poor bilingual word embeddings (BWEs) as well. This paper proposes a new approach for building BWEs in which the vector space of the high-resource source language is used as a starting point for training an embedding space for the low-resource target language. By using the source vectors as anchors, the vector spaces are automatically aligned during training. We experiment on English-German, English-Hiligaynon and English-Macedonian. We show that our approach results not only in improved BWEs and bilingual lexicon induction performance, but also in improved target language MWE quality as measured using monolingual word similarity.
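
The anchoring idea can be illustrated with a small sketch. It is not the paper's implementation: the function names `init_target_with_anchors` and `apply_update`, the toy vocabularies, and the choice to keep anchor rows frozen during updates are all assumptions made for illustration. The abstract only states that source vectors serve as anchors and a starting point for the target space.

```python
import numpy as np

def init_target_with_anchors(src_vecs, src_vocab, tgt_vocab, seed_dict, rng):
    """Build a target embedding matrix in the source space (hypothetical helper).

    Target words with a known translation start at the source vector of that
    translation (the anchors); all other target words start at small random values.
    """
    dim = src_vecs.shape[1]
    tgt_vecs = rng.normal(scale=0.1, size=(len(tgt_vocab), dim))
    is_anchor = np.zeros(len(tgt_vocab), dtype=bool)
    src_index = {w: i for i, w in enumerate(src_vocab)}
    tgt_index = {w: i for i, w in enumerate(tgt_vocab)}
    for tgt_word, src_word in seed_dict.items():
        if tgt_word in tgt_index and src_word in src_index:
            tgt_vecs[tgt_index[tgt_word]] = src_vecs[src_index[src_word]]
            is_anchor[tgt_index[tgt_word]] = True
    return tgt_vecs, is_anchor

def apply_update(tgt_vecs, grads, is_anchor, lr=0.05):
    """One SGD step that leaves anchor rows untouched (an assumed design choice)."""
    tgt_vecs[~is_anchor] -= lr * grads[~is_anchor]
    return tgt_vecs

# Toy data: a 3-word "source" space and a 4-word target vocabulary.
rng = np.random.default_rng(0)
src_vocab = ["dog", "house", "water"]
src_vecs = rng.normal(size=(3, 4))
tgt_vocab = ["Hund", "Haus", "Wasser", "Baum"]
seed_dict = {"Hund": "dog", "Haus": "house"}  # target word -> source translation

tgt_vecs, is_anchor = init_target_with_anchors(
    src_vecs, src_vocab, tgt_vocab, seed_dict, rng
)
before = tgt_vecs.copy()
grads = np.ones_like(tgt_vecs)  # stand-in for gradients from an embedding objective
apply_update(tgt_vecs, grads, is_anchor)
```

In this sketch the non-anchor target words are trained directly in the source space, so no separate mapping step is needed after training. A real system would replace the stand-in gradients with those of an actual embedding objective such as skip-gram.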
