Abstract
Recently, the pre-trained language model BERT (and its robustly optimized version, RoBERTa) has attracted a lot of attention in natural language understanding (NLU) and achieved state-of-the-art accuracy in various NLU tasks, such as sentiment classification, natural language inference, semantic textual similarity, and question answering. Inspired by the linearization exploration work of Elman [8], we extend BERT to a new model, StructBERT, by incorporating language structures into pre-training. Specifically, we pre-train StructBERT with two auxiliary tasks to make the most of the sequential order of words and sentences, which leverage language structures at the word and sentence levels, respectively. As a result, the new model is adapted to the different levels of language understanding required by downstream tasks. StructBERT with structural pre-training gives surprisingly good empirical results on a variety of downstream tasks, including pushing the state of the art on the GLUE benchmark to 89.0 (outperforming all published models), the F1 score on SQuAD v1.1 question answering to 93.0, and the accuracy on SNLI to 91.7.