Best Practices for Learning Domain-Specific Cross-Lingual Embeddings

Abstract

Cross-lingual embeddings aim to represent words in multiple languages in a shared vector space by capturing semantic similarities across languages. They are a crucial component for scaling tasks to multiple languages by transferring knowledge from languages with rich resources to low-resource languages. A common approach to learning cross-lingual embeddings is to train monolingual embeddings separately for each language and learn a linear projection from the monolingual spaces into a shared space, where the mapping relies on a small seed dictionary. While there are high-quality generic seed dictionaries and pre-trained cross-lingual embeddings available for many language pairs, there is little research on how they perform on specialised tasks. In this paper, we investigate the best practices for constructing the seed dictionary for a specific domain. We evaluate the embeddings on the sequence labelling task of Curriculum Vitae parsing and show that the size of a bilingual dictionary, the frequency of the dictionary words in the domain corpora and the source of data (task-specific vs. generic) influence the performance. We also show that the less training data is available in the low-resource language, the more the construction of the bilingual dictionary matters, and demonstrate that some of the choices are crucial in the zero-shot transfer learning case.
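
To make the projection step concrete, here is a minimal sketch of learning a linear map from seed-dictionary pairs. The abstract does not say how the projection is solved; this sketch assumes the orthogonal Procrustes solution, a common choice for this kind of mapping, and uses random toy vectors in place of real monolingual embeddings. The function name and all variable names are hypothetical.

```python
import numpy as np

def learn_orthogonal_mapping(src_vecs, tgt_vecs):
    """Return the orthogonal W minimising ||src_vecs @ W - tgt_vecs||_F.

    Row i of src_vecs and row i of tgt_vecs are the embeddings of the
    i-th seed-dictionary pair (source word, target translation).
    Assumption: the orthogonal Procrustes solution is used here; the
    abstract only says that a linear projection is learned.
    """
    # SVD of the cross-covariance between the paired embeddings.
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt

# Toy data standing in for real embeddings: the "target" space is an
# exact rotation of the "source" space, so the map is recoverable.
rng = np.random.default_rng(0)
dim, n_pairs = 5, 50
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src = rng.normal(size=(n_pairs, dim))   # source-language seed words
tgt = src @ true_rotation               # their target-language translations

W = learn_orthogonal_mapping(src, tgt)
mapped = src @ W                        # source words in the shared space
```

In this framing, the dictionary choices studied in the paper (size, word frequency in the domain corpora, task-specific vs. generic source) determine which rows end up in `src_vecs` and `tgt_vecs`.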
