MultiMix: A Robust Data Augmentation Framework for Cross-Lingual NLP

Abstract

Transfer learning has yielded state-of-the-art (SoTA) results in many supervised natural language processing tasks. However, annotated data for every target task in every target language is rare, especially for low-resource languages. We propose MultiMix, a novel data augmentation framework for self-supervised learning in zero-resource transfer learning scenarios. In particular, MultiMix aims to solve cross-lingual adaptation problems from a source language distribution to an unknown target language distribution, assuming no training labels are available for the target language task. At its core, MultiMix performs simultaneous self-training with data augmentation and unsupervised sample selection. To show its effectiveness, we conduct extensive experiments on zero-resource cross-lingual transfer tasks for Named Entity Recognition and Natural Language Inference. MultiMix achieves SoTA results in both tasks, outperforming the baselines by a good margin. With an in-depth model dissection, we demonstrate the cumulative contributions of different components to MultiMix's success.
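
As a rough illustration of the sample-selection step in a self-training loop, the sketch below keeps only the unlabeled target-language examples whose predicted class probability clears a threshold and pairs them with their pseudo-labels. The abstract does not state the selection criterion MultiMix uses; confidence thresholding is assumed here purely for illustration, and the function name `select_confident` and the `threshold` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_confident(
    probs: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[Tuple[int, int]]:
    """Return (example_index, pseudo_label) pairs for confident predictions.

    `probs[i]` holds the model's class probabilities for unlabeled example i.
    ASSUMPTION: confidence thresholding stands in for the paper's
    unsupervised sample selection, which the abstract does not specify.
    """
    selected = []
    for i, p in enumerate(probs):
        # The pseudo-label is the class with the highest probability.
        label = max(range(len(p)), key=lambda k: p[k])
        # Keep the example only if the model is confident enough.
        if p[label] >= threshold:
            selected.append((i, label))
    return selected


# Hypothetical predictions on three unlabeled target-language sentences.
predictions = [[0.95, 0.05], [0.60, 0.40], [0.10, 0.90]]
pseudo_labeled = select_confident(predictions, threshold=0.9)
```

In a full self-training round, the selected pairs would be added (possibly after augmentation) to the training set before the model is retrained; that part is omitted because the abstract gives no detail on it.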
