COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

Abstract

We present COCO-LM, a new self-supervised learning framework that pretrains Language Models by COrrecting challenging errors and COntrasting text sequences. COCO-LM employs an auxiliary language model to mask-and-predict tokens in original text sequences. It creates more challenging pretraining inputs, where noises are sampled based on their likelihood in the auxiliary language model. COCO-LM then pretrains with two tasks: The first task, corrective language modeling, learns to correct the auxiliary model's corruptions by recovering the original tokens. The second task, sequence contrastive learning, ensures that the language model generates sequence representations that are invariant to noises and transformations. In our experiments on the GLUE and SQuAD benchmarks, COCO-LM outperforms recent pretraining approaches in various pretraining settings and few-shot evaluations, with higher pretraining efficiency. Our analyses reveal that COCO-LM's advantages come from its challenging training signals, more contextualized token representations, and regularized sequence representations.
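
The two pretraining signals described above can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the helper names (`toy_aux`, `corrupt`, `info_nce`), the 15% masking rate, the cosine similarity, and the temperature are all assumptions made for illustration, and the "auxiliary model" here is a fixed toy distribution rather than a trained language model. The first function replaces masked positions with tokens sampled from the auxiliary distribution (the original tokens then serve as the correction targets); the second computes a standard InfoNCE-style contrastive loss that is low when two views of the same sequence have similar representations.

```python
import math
import random


def toy_aux(tokens, pos):
    """Hypothetical stand-in for the auxiliary language model.

    Returns a (vocabulary, probabilities) pair for the masked position.
    A real auxiliary model would condition on the surrounding context.
    """
    vocab = ["the", "a", "cat", "dog", "sat", "ran"]
    probs = [0.3, 0.2, 0.15, 0.15, 0.1, 0.1]
    return vocab, probs


def corrupt(tokens, aux_probs, mask_rate=0.15, rng=None):
    """Replace a random subset of positions with sampled tokens.

    mask_rate=0.15 is an assumed value, not taken from the abstract.
    Returns the corrupted sequence and the sorted masked positions;
    the original `tokens` are the targets for the correction task.
    """
    rng = rng or random.Random(0)
    corrupted = list(tokens)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_mask)
    for pos in positions:
        vocab, probs = aux_probs(tokens, pos)
        # Sample by likelihood, so plausible (harder) replacements dominate.
        corrupted[pos] = rng.choices(vocab, weights=probs, k=1)[0]
    return corrupted, sorted(positions)


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=1.0):
    """InfoNCE-style loss: -log softmax score of the positive pair.

    `anchor` and `positive` are representations of two views of the same
    sequence; `negatives` are representations of other sequences.
    """
    logits = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [x / temperature for x in logits]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[0]


if __name__ == "__main__":
    original = ["the", "cat", "sat", "on", "the", "mat"]
    noisy, masked = corrupt(original, toy_aux, mask_rate=0.5)
    print("corrupted input:", noisy)
    print("masked positions:", masked)
    print("correction targets:", original)
    print("contrastive loss:", info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]]))
```

In this sketch the contrastive loss shrinks as the positive pair becomes more similar than the negatives, which is the sense in which sequence representations are pushed to be invariant to the applied noise.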
