Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7 point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and...
Learning widely applicable representations of words has been an active area of research for decades, including non-neura...
As with the feature-based approaches, the first works in this direction only pre-trained word embedding parameters from ...
There has also been work showing effective transfer from supervised tasks with large datasets, such as natural language ...
We introduce BERT and its detailed implementation in this section. There are two steps in our framework: pre-training an...
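As a present-day illustration only (not the paper's released code), the sketch below uses the Hugging Face transformers library to show the two-step framework: the same pre-trained checkpoint is loaded into different task-specific models, each of which adds only a small output layer before fine-tuning.

from transformers import BertForSequenceClassification, BertForQuestionAnswering

# The same pre-trained weights initialise every downstream model; only the
# one-layer output head differs between tasks (illustrative sketch, not the paper's code).
classifier = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
qa_model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")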
To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a s...
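A minimal sketch of how a single sentence or a sentence pair is packed into one input sequence: a [CLS] token is prepended, sentences are separated by [SEP], and segment ids distinguish sentence A from sentence B. Function and variable names are illustrative, not taken from the released code.

def build_input(tokens_a, tokens_b=None, max_len=128):
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)                     # sentence A (plus [CLS] and first [SEP])
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)        # sentence B and its trailing [SEP]
    return tokens[:max_len], segment_ids[:max_len]      # truncation/padding details elided

tokens, segments = build_input(["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"])
# tokens   -> [CLS] my dog is cute [SEP] he likes play ##ing [SEP]
# segments -> 0     0  0   0  0    0     1  1     1    1     1

In the model itself, each position's input representation is the sum of its token, segment, and position embeddings.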
Unlike Peters et al. (2018a) and Radford et al. (2018), we do not use traditional left-to-right or right-to-left languag...
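A minimal sketch of the masked LM data preparation, assuming WordPiece tokens are already available: 15% of positions are chosen at random as prediction targets and replaced with [MASK] (the mixed replacement strategy is detailed in the appendix and sketched there).

import random

def mask_tokens(tokens, mask_prob=0.15, special=("[CLS]", "[SEP]")):
    masked = list(tokens)
    labels = [None] * len(tokens)          # None = not a prediction target
    for i, tok in enumerate(tokens):
        if tok in special:
            continue
        if random.random() < mask_prob:
            labels[i] = tok                # the model must recover the original token here
            masked[i] = "[MASK]"
    return masked, labels

Only the masked positions contribute to the cross-entropy loss, so the objective trains a deep bidirectional representation without letting the target word "see itself".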
Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on unders...
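A hedged sketch of the next sentence prediction head: the final hidden vector of [CLS] (C) is fed to a binary classifier over IsNext/NotNext. All tensor names below are illustrative stand-ins rather than the paper's code.

import torch
import torch.nn as nn

hidden_size = 768                          # BERT-base hidden dimension
nsp_head = nn.Linear(hidden_size, 2)       # two classes: IsNext, NotNext

C = torch.randn(8, hidden_size)            # stand-in for a batch of [CLS] representations
logits = nsp_head(C)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))  # dummy labels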
Fine-tuning is straightforward since the self-attention mechanism in the Transformer allows BERT to model many downstrea...
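A hypothetical fine-tuning skeleton (not the reference implementation): a single task-specific output layer is added on top of the pre-trained encoder and all parameters are updated end-to-end. Here `encoder` stands in for a pre-trained BERT that returns the final [CLS] hidden state.

import torch

def fine_tune(encoder, classifier, batches, lr=2e-5, epochs=3):
    params = list(encoder.parameters()) + list(classifier.parameters())
    opt = torch.optim.Adam(params, lr=lr)                # all weights are fine-tuned jointly
    for _ in range(epochs):
        for input_ids, segment_ids, labels in batches:
            C = encoder(input_ids, segment_ids)          # [batch, hidden] [CLS] states
            loss = torch.nn.functional.cross_entropy(classifier(C), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()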
The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018a) is a collection of diverse natural l...
The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowdsourced question/answer pairs (Rajpurk...
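A hedged sketch of the span-prediction head: a start vector S and an end vector E are dotted with every final token representation T_i, and the predicted answer is the span (i, j) with j >= i maximizing S·T_i + E·T_j. Shapes and names are illustrative.

import torch

hidden_size, seq_len = 768, 384
T = torch.randn(seq_len, hidden_size)      # final hidden states of the packed question+paragraph sequence ([CLS] at position 0)
S = torch.randn(hidden_size)               # learned start vector
E = torch.randn(hidden_size)               # learned end vector

start_scores = T @ S                       # [seq_len]
end_scores = T @ E                         # [seq_len]

span_scores = start_scores[:, None] + end_scores[None, :]              # score of every (start, end) pair
invalid = torch.tril(torch.ones_like(span_scores), diagonal=-1).bool() # spans where end < start
span_scores[invalid] = float("-inf")
best = span_scores.argmax().item()
start_idx, end_idx = divmod(best, seq_len)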
The SQuAD 2.0 task extends the SQuAD 1.1 problem definition by allowing for the possibility that no short answer exists ...
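Continuing the sketch above (with position 0 taken as [CLS]), the no-answer case is scored as a span that starts and ends at [CLS]; a non-null answer is predicted only when its best span score exceeds the null score by a threshold tau selected on the dev set. The value of tau below is purely illustrative.

s_null = start_scores[0] + end_scores[0]     # score of the [CLS] "no answer" span
best_non_null = span_scores[1:, 1:].max()    # best genuine answer span
tau = 1.0                                    # illustrative threshold; tuned on dev F1 in practice
predict_answer = bool(best_non_null > s_null + tau)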
The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate...
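An illustrative sketch of the SWAG setup: each example is expanded into four input sequences (the given sentence concatenated with each candidate continuation), and a learned vector is dotted with each sequence's [CLS] representation to produce a softmax over the four choices. The names below are stand-ins.

import torch

hidden_size = 768
choice_vector = torch.randn(hidden_size)                 # learned scoring vector

def score_choices(cls_states):                           # cls_states: [4, hidden], one row per candidate ending
    logits = cls_states @ choice_vector
    return torch.softmax(logits, dim=0)

probs = score_choices(torch.randn(4, hidden_size))       # stand-in [CLS] states for the four sequences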
We demonstrate the importance of the deep bidirectionality of BERT by evaluating two pre-training objectives using exact...
In this section, we explore the effect of model size on fine-tuning task accuracy. We trained a number of BERT models wi...
All of the BERT results presented so far have used the fine-tuning approach, where a simple classification layer is adde...
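A hedged sketch of the feature-based variant: contextual embeddings are extracted from the frozen encoder, and the best-performing choice reported in the paper concatenates the token representations of the top four hidden layers before feeding them to a small task network (a BiLSTM for NER). The layer activations below are random stand-ins.

import torch

hidden_size, num_layers, seq_len = 768, 12, 128
all_layers = [torch.randn(seq_len, hidden_size) for _ in range(num_layers)]   # stand-in per-layer activations

features = torch.cat(all_layers[-4:], dim=-1)            # concatenate the last four layers: [seq_len, 4 * hidden]
ner_bilstm = torch.nn.LSTM(4 * hidden_size, 384, bidirectional=True)
outputs, _ = ner_bilstm(features.unsqueeze(1))           # [seq_len, 1, 768] token features for the tagger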
Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pr...
We provide examples of the pre-training tasks in the following. Masked LM and the Masking Procedure: Assuming the unlabe...
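A minimal sketch of the mixed replacement rule described here: of the 15% of positions chosen for prediction, 80% become [MASK], 10% become a random WordPiece, and 10% are left unchanged. `vocab` is an illustrative word list.

import random

def corrupt(token, vocab):
    r = random.random()
    if r < 0.8:
        return "[MASK]"                    # 80%: replace with the [MASK] token
    elif r < 0.9:
        return random.choice(vocab)        # 10%: replace with a random token
    else:
        return token                       # 10%: keep the original token

Because the model never knows which input tokens have been replaced or kept, it is forced to maintain a contextual representation of every input token.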
To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as "sentences" ...
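An illustrative sketch of example generation for next sentence prediction: the second span is the actual following sentence 50% of the time (IsNext) and a random sentence from the corpus otherwise (NotNext). `corpus` and the helper below are hypothetical.

import random

def make_nsp_example(doc, idx, corpus):
    sent_a = doc[idx]
    if random.random() < 0.5 and idx + 1 < len(doc):
        return sent_a, doc[idx + 1], "IsNext"             # genuine next sentence
    random_doc = random.choice(corpus)
    return sent_a, random.choice(random_doc), "NotNext"   # random sentence from the corpus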
For fine-tuning, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learn...
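A small sketch of the dev-set sweep described here, using the hyperparameter ranges the paper reports as working well across tasks; the grid-search helper itself is illustrative.

from itertools import product

search_space = {
    "batch_size": [16, 32],
    "learning_rate": [5e-5, 3e-5, 2e-5],
    "num_epochs": [2, 3, 4],
}

configs = [dict(zip(search_space, values)) for values in product(*search_space.values())]
# fine-tune once per configuration and keep the one with the best dev-set score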
The most comparable existing pre-training method to BERT is OpenAI GPT, which trains a left-to-right Transformer LM on a...
The illustration of fine-tuning BERT on different tasks can be seen in Figure 4. Our task-specific models are formed by ...
Figure 5 presents MNLI Dev accuracy after fine-tuning from a checkpoint that has been pre-trained for $k$ steps. This al...
In Section 3.1, we mention that BERT uses a mixed strategy for masking the target tokens when pre-training with the mask...