Abstract
BERT adopts masked language modeling (MLM) for pre-training and is one of the most successful pre-training models. Since BERT neglects the dependency among predicted tokens, XLNet introduces permuted language modeling (PLM) for pre-training to address this problem. We argue that XLNet does not leverage the full position information of a sentence and thus suffers from a position discrepancy between pre-training and fine-tuning. In this paper, we propose MPNet, a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations. MPNet leverages the dependency among predicted tokens through permuted language modeling (vs. MLM in BERT), and takes auxiliary position information as input so that the model sees a full sentence, thus reducing the position discrepancy (vs. PLM in XLNet). We pre-train MPNet on a large-scale dataset (over 160GB of text corpora) and fine-tune it on a variety of downstream tasks (GLUE, SQuAD, etc.). Experimental results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks than previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa) under the same model setting. We release the code and pre-trained model on GitHub\footnote{\url{https://github.com/microsoft/MPNet}}.