Abstract
Models that perform well on a training domain often fail to generalize to out-of-domain (OOD) examples. Data augmentation is a common method used to prevent overfitting and improve OOD generalization. However, in natural language, it is difficult to generate new examples that stay on the underlying data manifold. We introduce SSMBA, a data augmentation method for generating synthetic training examples by using a pair of corruption and reconstruction functions to move randomly on a data manifold. We investigate the use of SSMBA in the natural language domain, leveraging the manifold assumption to reconstruct corrupted text with masked language models. In experiments on robustness benchmarks across 3 tasks and 9 datasets, SSMBA consistently outperforms existing data augmentation methods and baseline models on both in-domain and OOD data, achieving gains of 0.8% accuracy on OOD Amazon reviews, 1.8% accuracy on OOD MNLI, and 1.4 BLEU on in-domain IWSLT14 German-English.
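As a rough illustration of the corruption and reconstruction idea, the following minimal sketch masks a random subset of tokens and then fills the masked positions back in. It is not the authors' implementation: the names `corrupt`, `toy_reconstruct`, and `augment`, the `mask_rate` parameter, and the uniform sampling from a small vocabulary are all assumptions made for illustration. In particular, `toy_reconstruct` is only a stand-in for a masked language model, which would instead sample each replacement from its predicted distribution given the surrounding context.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for this sketch


def corrupt(tokens, mask_rate, rng):
    """Corruption step: replace each token with MASK with probability mask_rate."""
    return [MASK if rng.random() < mask_rate else tok for tok in tokens]


def toy_reconstruct(tokens, vocab, rng):
    """Reconstruction step (stand-in for a masked language model).

    Each MASK is filled with a token drawn uniformly from `vocab`.
    This is an assumption for illustration only; a real reconstruction
    function would condition on the unmasked context.
    """
    return [rng.choice(vocab) if tok == MASK else tok for tok in tokens]


def augment(sentence, vocab, mask_rate=0.15, seed=0):
    """Produce one synthetic example by corrupting, then reconstructing."""
    rng = random.Random(seed)
    tokens = sentence.split()
    corrupted = corrupt(tokens, mask_rate, rng)
    return " ".join(toy_reconstruct(corrupted, vocab, rng))


if __name__ == "__main__":
    vocab = ["film", "great", "bad", "the"]
    print(augment("the movie was surprisingly good", vocab, mask_rate=0.5, seed=1))
```

With `mask_rate=0.0` the input is returned unchanged, and larger values move the output further from the original example. In this sketch, the label of the original example would have to be carried over or reassigned separately.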