Bridging the domain gap in cross-lingual document classification

Abstract

The scarcity of labeled training data often prohibits the internationalization of NLP models to multiple languages. Recent developments in cross-lingual understanding (XLU) have made progress in this area, trying to bridge the language barrier using language-universal representations. However, even if the language problem were resolved, models trained in one language would not transfer to another language perfectly due to the natural domain drift across languages and cultures. We consider the setting of semi-supervised cross-lingual understanding, where labeled data is available in a source language (English), but only unlabeled data is available in the target language. We combine state-of-the-art cross-lingual methods with recently proposed methods for weakly supervised learning, such as unsupervised pre-training and unsupervised data augmentation, to simultaneously close both the language gap and the domain gap in XLU. We show that addressing the domain gap is crucial. We improve over strong baselines and achieve a new state of the art for cross-lingual document classification.
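
The abstract does not spell out the training objective, so the following is only a minimal sketch of a generic objective in the style of unsupervised data augmentation, not the authors' implementation. It assumes a supervised cross-entropy term on labeled source-language documents plus a consistency term (KL divergence) between predictions on an unlabeled target-language document and on an augmented copy of it. The function names, the weight `lam`, and the use of raw logits as inputs are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Supervised loss for one labeled source-language document."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions; eps avoids log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def semi_supervised_loss(labeled, unlabeled, lam=1.0):
    """Hypothetical combined objective (an assumption, not from the paper).

    labeled:   list of (logits, label) pairs from the source language.
    unlabeled: list of (logits_original, logits_augmented) pairs from the
               target language, where the second element is the model output
               on an augmented version of the same document.
    lam:       assumed weight on the consistency term.
    """
    supervised = sum(cross_entropy(l, y) for l, y in labeled) / len(labeled)
    consistency = sum(
        kl_divergence(softmax(orig), softmax(aug)) for orig, aug in unlabeled
    ) / len(unlabeled)
    return supervised + lam * consistency
```

In this sketch the consistency term is zero when the model predicts the same distribution for a target-language document and its augmented copy, so only disagreement on unlabeled target-domain data is penalized.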
