BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Abstract

We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.
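
The two best-performing noising operations named in the abstract, sentence shuffling and in-filling, can be illustrated with a minimal sketch. This is not the authors' implementation: the `<mask>` string, the function names, and the choice to pass spans in explicitly as `(start, length)` pairs are all assumptions made for illustration, and the abstract does not say how spans are chosen.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, assumed for this sketch


def shuffle_sentences(sentences, rng):
    """Return a copy of the sentences in random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def infill_spans(tokens, spans):
    """Replace each (start, length) span of tokens with a single MASK.

    Spans must not overlap. How spans are sampled is left to the
    caller; this sketch only shows the replacement step.
    """
    out = []
    pos = 0
    for start, length in sorted(spans):
        if start < pos:
            raise ValueError("spans overlap")
        out.extend(tokens[pos:start])  # copy untouched tokens
        out.append(MASK)               # whole span becomes one mask token
        pos = start + length
    out.extend(tokens[pos:])
    return out


tokens = "the cat sat on the mat".split()
corrupted = infill_spans(tokens, [(1, 2)])
print(corrupted)  # ['the', '<mask>', 'on', 'the', 'mat']

sentences = ["A first sentence.", "A second one.", "A third."]
print(shuffle_sentences(sentences, random.Random(0)))
```

Because a span of any length collapses to one mask token, the corrupted sequence does not reveal how many tokens are missing; a model trained to reconstruct the original would receive `corrupted` as encoder input and `tokens` as the decoder target.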
