XLNet: Generalized Autoregressive Pretraining for Language Understanding

Abstract

With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
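
To make the phrase "all permutations of the factorization order" concrete, here is a minimal toy sketch. It is not the XLNet implementation, and the function names are hypothetical, chosen only for illustration. It assumes that under a factorization order z, the token at position z[t] is predicted from the tokens at positions z[0], ..., z[t-1]. Enumerating every order for a short sequence shows that each position ends up conditioning on every other position in at least one order, which is how an autoregressive objective can still see context on both sides.

```python
from itertools import permutations


def contexts_per_order(seq_len):
    """Map each factorization order to {target position: set of context positions}.

    Illustrative assumption: in order z, position z[t] is predicted
    from the positions that precede it in z, i.e. z[0..t-1].
    """
    result = {}
    for order in permutations(range(seq_len)):
        result[order] = {order[t]: set(order[:t]) for t in range(seq_len)}
    return result


def visible_context_union(seq_len):
    """Union, over all orders, of the context positions each target can see."""
    seen = {i: set() for i in range(seq_len)}
    for ctx in contexts_per_order(seq_len).values():
        for target, context in ctx.items():
            seen[target] |= context
    return seen


if __name__ == "__main__":
    # One sampled order for a 3-token sequence: predict position 2 first,
    # then position 0 given 2, then position 1 given both 2 and 0.
    print(contexts_per_order(3)[(2, 0, 1)])
    # Across all 3! = 6 orders, every position sees both other positions.
    print(visible_context_union(3))
```

Enumerating all orders is only feasible for toy lengths, since the number of permutations grows factorially. The abstract's "expected likelihood" is an expectation over orders, which in practice would be estimated by sampling orders rather than enumerating them.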
