Cross-Lingual Word Embeddings for Turkic Languages

Abstract

There has been an increasing interest in learning cross-lingual word embeddings to transfer knowledge obtained from a resource-rich language, such as English, to lower-resource languages for which annotated data is scarce, such as Turkish, Russian, and many others. In this paper, we present the first viability study of established techniques to align monolingual embedding spaces for Turkish, Uzbek, Azeri, Kazakh and Kyrgyz, members of the Turkic family which is heavily affected by the low-resource constraint. Those techniques are known to require little explicit supervision, mainly in the form of bilingual dictionaries, hence being easily adaptable to different domains, including low-resource ones. We obtain new bilingual dictionaries and new word embeddings for these languages and show the steps for obtaining cross-lingual word embeddings using state-of-the-art techniques. Then, we evaluate the results using the bilingual dictionary induction task. Our experiments confirm that the obtained bilingual dictionaries outperform previously-available ones, and that word embeddings from a low-resource language can benefit from resource-rich closely-related languages when they are aligned together. Furthermore, evaluation on an extrinsic task (sentiment analysis on Uzbek) proves that monolingual word embeddings can, although slightly, benefit from cross-lingual alignments.
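
As a rough illustration of what dictionary-supervised alignment and bilingual dictionary induction involve, the sketch below uses orthogonal Procrustes, one common way to map one embedding space onto another from a seed dictionary. The abstract does not name the specific alignment method, so this choice is an assumption; the function names `procrustes_align` and `precision_at_1` are hypothetical, and the data is synthetic rather than real Turkic embeddings.

```python
import numpy as np

def procrustes_align(X_src, Y_tgt):
    """Return the orthogonal matrix W minimizing ||X_src @ W - Y_tgt||_F.

    Row i of X_src and row i of Y_tgt are the embeddings of a
    seed-dictionary pair (source word, target translation).
    """
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt

def precision_at_1(X_src, Y_tgt, W, gold):
    """Bilingual dictionary induction scored by precision@1.

    Each mapped source vector is matched to its nearest target vector
    by cosine similarity; gold[i] is the correct target row index.
    """
    mapped = X_src @ W
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    Y_norm = Y_tgt / np.linalg.norm(Y_tgt, axis=1, keepdims=True)
    pred = (mapped @ Y_norm.T).argmax(axis=1)
    return float((pred == gold).mean())

# Synthetic data (assumption): the source space is an exact rotation
# of the target space, which real embedding spaces only approximate.
rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 8))                  # "target language" vectors
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # hidden rotation
X = Y @ Q.T                                   # "source language" vectors

W = procrustes_align(X[:30], Y[:30])          # fit on 30 seed pairs
p1 = precision_at_1(X[30:], Y, W, np.arange(30, 50))  # test on held-out 20
print(p1)
```

On real embeddings the two spaces are only approximately related by a rotation, so held-out precision would be well below what this idealized setup produces, and the quality of the seed dictionary directly affects the learned mapping.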
