BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Abstract

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
