Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7 point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and...
Learning widely applicable representations of words has been an active area of research for decades, including non-neura...
As with the feature-based approaches, the first works in this direction only pre-trained word embedding parameters from ...
There has also been work showing effective transfer from supervised tasks with large datasets, such as natural language ...
We introduce BERT and its detailed implementation in this section. There are two steps in our framework: pre-training an...
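As a present-day illustration only (not the paper's released code), the sketch below uses the Hugging Face transformers library to show the two-step framework: the same pre-trained checkpoint is loaded into different task-specific models, each of which adds only a small output layer before fine-tuning.

from transformers import BertForSequenceClassification, BertForQuestionAnswering

# The same pre-trained weights initialise every downstream model; only the
# one-layer output head differs between tasks (illustrative sketch, not the paper's code).
classifier = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
qa_model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")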
To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a s...
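A minimal sketch of how a single sentence or a sentence pair is packed into one input sequence: a [CLS] token is prepended, sentences are separated by [SEP], and segment ids distinguish sentence A from sentence B. Function and variable names are illustrative, not taken from the released code.

def build_input(tokens_a, tokens_b=None, max_len=128):
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)                     # sentence A (plus [CLS] and first [SEP])
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)        # sentence B and its trailing [SEP]
    return tokens[:max_len], segment_ids[:max_len]      # truncation/padding details elided

tokens, segments = build_input(["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"])
# tokens   -> [CLS] my dog is cute [SEP] he likes play ##ing [SEP]
# segments -> 0     0  0   0  0    0     1  1     1    1     1

In the model itself, each position's input representation is the sum of its token, segment, and position embeddings.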
Unlike Peters et al. (2018a) and Radford et al. (2018), we do not use traditional left-to-right or right-to-left languag...
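A minimal sketch of the masked LM data preparation, assuming WordPiece tokens are already available: 15% of positions are chosen at random as prediction targets and replaced with [MASK] (the mixed replacement strategy is detailed in the appendix and sketched there).

import random

def mask_tokens(tokens, mask_prob=0.15, special=("[CLS]", "[SEP]")):
    masked = list(tokens)
    labels = [None] * len(tokens)          # None = not a prediction target
    for i, tok in enumerate(tokens):
        if tok in special:
            continue
        if random.random() < mask_prob:
            labels[i] = tok                # the model must recover the original token here
            masked[i] = "[MASK]"
    return masked, labels

Only the masked positions contribute to the cross-entropy loss, so the objective trains a deep bidirectional representation without letting the target word "see itself".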
Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on unders...
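A hedged sketch of the next sentence prediction head: the final hidden vector of [CLS] (C) is fed to a binary classifier over IsNext/NotNext. All tensor names below are illustrative stand-ins rather than the paper's code.

import torch
import torch.nn as nn

hidden_size = 768                          # BERT-base hidden dimension
nsp_head = nn.Linear(hidden_size, 2)       # two classes: IsNext, NotNext

C = torch.randn(8, hidden_size)            # stand-in for a batch of [CLS] representations
logits = nsp_head(C)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))  # dummy labels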
Fine-tuning is straightforward since the self-attention mechanism in the Transformer allows BERT to model many downstrea...
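A hypothetical fine-tuning skeleton (not the reference implementation): a single task-specific output layer is added on top of the pre-trained encoder and all parameters are updated end-to-end. Here `encoder` stands in for a pre-trained BERT that returns the final [CLS] hidden state.

import torch

def fine_tune(encoder, classifier, batches, lr=2e-5, epochs=3):
    params = list(encoder.parameters()) + list(classifier.parameters())
    opt = torch.optim.Adam(params, lr=lr)                # all weights are fine-tuned jointly
    for _ in range(epochs):
        for input_ids, segment_ids, labels in batches:
            C = encoder(input_ids, segment_ids)          # [batch, hidden] [CLS] states
            loss = torch.nn.functional.cross_entropy(classifier(C), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()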
The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018a) is a collection of diverse natural l...
The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowdsourced question/answer pairs (Rajpurk...
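A hedged sketch of the span-prediction head: a start vector S and an end vector E are dotted with every final token representation T_i, and the predicted answer is the span (i, j) with j >= i maximizing S·T_i + E·T_j. Shapes and names are illustrative.

import torch

hidden_size, seq_len = 768, 384
T = torch.randn(seq_len, hidden_size)      # final hidden states of the packed question+paragraph sequence ([CLS] at position 0)
S = torch.randn(hidden_size)               # learned start vector
E = torch.randn(hidden_size)               # learned end vector

start_scores = T @ S                       # [seq_len]
end_scores = T @ E                         # [seq_len]

span_scores = start_scores[:, None] + end_scores[None, :]              # score of every (start, end) pair
invalid = torch.tril(torch.ones_like(span_scores), diagonal=-1).bool() # spans where end < start
span_scores[invalid] = float("-inf")
best = span_scores.argmax().item()
start_idx, end_idx = divmod(best, seq_len)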
The SQuAD 2.0 task extends the SQuAD 1.1 problem definition by allowing for the possibility that no short answer exists ...
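Continuing the sketch above (with position 0 taken as [CLS]), the no-answer case is scored as a span that starts and ends at [CLS]; a non-null answer is predicted only when its best span score exceeds the null score by a threshold tau selected on the dev set. The value of tau below is purely illustrative.

s_null = start_scores[0] + end_scores[0]     # score of the [CLS] "no answer" span
best_non_null = span_scores[1:, 1:].max()    # best genuine answer span
tau = 1.0                                    # illustrative threshold; tuned on dev F1 in practice
predict_answer = bool(best_non_null > s_null + tau)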
The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate...
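An illustrative sketch of the SWAG setup: each example is expanded into four input sequences (the given sentence concatenated with each candidate continuation), and a learned vector is dotted with each sequence's [CLS] representation to produce a softmax over the four choices. The names below are stand-ins.

import torch

hidden_size = 768
choice_vector = torch.randn(hidden_size)                 # learned scoring vector

def score_choices(cls_states):                           # cls_states: [4, hidden], one row per candidate ending
    logits = cls_states @ choice_vector
    return torch.softmax(logits, dim=0)

probs = score_choices(torch.randn(4, hidden_size))       # stand-in [CLS] states for the four sequences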
We demonstrate the importance of the deep bidirectionality of BERT by evaluating two pre-training objectives using exact...
In this section, we explore the effect of model size on fine-tuning task accuracy. We trained a number of BERT models wi...
All of the BERT results presented so far have used the fine-tuning approach, where a simple classification layer is adde...
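A hedged sketch of the feature-based variant: contextual embeddings are extracted from the frozen encoder, and the best-performing choice reported in the paper concatenates the token representations of the top four hidden layers before feeding them to a small task network (a BiLSTM for NER). The layer activations below are random stand-ins.

import torch

hidden_size, num_layers, seq_len = 768, 12, 128
all_layers = [torch.randn(seq_len, hidden_size) for _ in range(num_layers)]   # stand-in per-layer activations

features = torch.cat(all_layers[-4:], dim=-1)            # concatenate the last four layers: [seq_len, 4 * hidden]
ner_bilstm = torch.nn.LSTM(4 * hidden_size, 384, bidirectional=True)
outputs, _ = ner_bilstm(features.unsqueeze(1))           # [seq_len, 1, 768] token features for the tagger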
Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pr...
We provide examples of the pre-training tasks in the following. Masked LM and the Masking Procedure: Assuming the unlabe...
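A minimal sketch of the mixed replacement rule described here: of the 15% of positions chosen for prediction, 80% become [MASK], 10% become a random WordPiece, and 10% are left unchanged. `vocab` is an illustrative word list.

import random

def corrupt(token, vocab):
    r = random.random()
    if r < 0.8:
        return "[MASK]"                    # 80%: replace with the [MASK] token
    elif r < 0.9:
        return random.choice(vocab)        # 10%: replace with a random token
    else:
        return token                       # 10%: keep the original token

Because the model never knows which input tokens have been replaced or kept, it is forced to maintain a contextual representation of every input token.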
To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as "sentences" ...
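An illustrative sketch of example generation for next sentence prediction: the second span is the actual following sentence 50% of the time (IsNext) and a random sentence from the corpus otherwise (NotNext). `corpus` and the helper below are hypothetical.

import random

def make_nsp_example(doc, idx, corpus):
    sent_a = doc[idx]
    if random.random() < 0.5 and idx + 1 < len(doc):
        return sent_a, doc[idx + 1], "IsNext"             # genuine next sentence
    random_doc = random.choice(corpus)
    return sent_a, random.choice(random_doc), "NotNext"   # random sentence from the corpus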
For fine-tuning, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learn...
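A small sketch of the dev-set sweep described here, using the hyperparameter ranges the paper reports as working well across tasks; the grid-search helper itself is illustrative.

from itertools import product

search_space = {
    "batch_size": [16, 32],
    "learning_rate": [5e-5, 3e-5, 2e-5],
    "num_epochs": [2, 3, 4],
}

configs = [dict(zip(search_space, values)) for values in product(*search_space.values())]
# fine-tune once per configuration and keep the one with the best dev-set score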
The most comparable existing pre-training method to BERT is OpenAI GPT, which trains a left-to-right Transformer LM on a...
The illustration of fine-tuning BERT on different tasks can be seen in Figure 4. Our task-specific models are formed by ...
Figure 5 presents MNLI Dev accuracy after fine-tuning from a checkpoint that has been pre-trained for $k$ steps. This al...
In Section 3.1, we mention that BERT uses a mixed strategy for masking the target tokens when pre-training with the mask...