Abstract: We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements.
–
Paper Summary
Paper Contributions
- We present a set of important BERT design choices and training strategies and introduce alternatives that lead to better downstream task performance
- We use a novel dataset, CC-NEWS, and confirm that using more data for pretraining further improves performance on downstream tasks
- Our training improvements show that masked language model pretraining, under the right design choices, is competitive with all other recently published methods
Brief Overview of BERT
Setup
- BERT takes as input a concatenation of two segments (sequences of tokens). Segments usually consist of more than one natural sentence.
- The two segments are presented as a single input sequence to BERT with special tokens delimiting them:
[CLS], x_1, ..., x_N, [SEP], y_1, ..., y_M, [EOS]
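For concreteness, here is a minimal sketch of how two segments are packed into this input format; `build_input` and the toy segments are hypothetical, and a real tokenizer would then map the tokens to vocabulary ids.

```python
# Minimal sketch of packing two segments into one BERT-style input sequence.
# `build_input` and the toy segments are hypothetical; a real tokenizer maps tokens to ids.

def build_input(segment_a, segment_b, max_len=512):
    """segment_a / segment_b are lists of tokens; each segment may span several sentences."""
    tokens = ["[CLS]"] + segment_a + ["[SEP]"] + segment_b + ["[EOS]"]
    # Segment (token-type) ids tell the model which tokens belong to which segment.
    segment_ids = [0] * (len(segment_a) + 2) + [1] * (len(segment_b) + 1)
    assert len(tokens) <= max_len, "combined length must stay within the 512-token limit"
    return tokens, segment_ids

tokens, segment_ids = build_input(["the", "cat", "sat"], ["it", "was", "tired"])
print(tokens)       # ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'it', 'was', 'tired', '[EOS]']
print(segment_ids)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```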
Training Objectives
- Masked Language Modeling (MLM): A random sample of the tokens in the input sequence is selected and replaced with the special token [MASK]. The MLM objective is a cross-entropy loss on predicting the masked tokens.
- Next Sentence Prediction (NSP): a binary classification loss for predicting whether two segments follow each other in the original text. Positive and negative examples are sampled with equal probability.
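A minimal PyTorch sketch of the MLM objective, assuming the corruption scheme described in the original BERT paper (15% of tokens selected; of those, 80% replaced with [MASK], 10% replaced with a random token, 10% left unchanged). The `mask_tokens` helper and the toy tensors are hypothetical; only the selected positions contribute to the cross-entropy loss.

```python
import torch
import torch.nn.functional as F

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Select ~15% of tokens; of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged.
    Returns the corrupted inputs and labels (-100 marks positions that are not predicted)."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100                          # cross_entropy ignores these positions

    corrupted = input_ids.clone()
    masked = selected & (torch.rand(input_ids.shape) < 0.8)
    corrupted[masked] = mask_token_id
    randomized = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)
    corrupted[randomized] = torch.randint(vocab_size, (int(randomized.sum()),))
    return corrupted, labels

# Toy example: the MLM loss is a cross-entropy over the vocabulary at the masked positions only.
vocab_size, ids = 100, torch.randint(100, (8, 128))
corrupted, labels = mask_tokens(ids, mask_token_id=4, vocab_size=vocab_size)
logits = torch.randn(8, 128, vocab_size)              # stand-in for the model's output
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
```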
Optimization: Adam
Data: 16GB of uncompressed text, a combination of BOOKCORPUS (Zhu et al., 2015) and English WIKIPEDIA
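As a reference point, here is a minimal PyTorch sketch of this optimizer setup, using the values the paper reports for the original BERT (peak learning rate 1e-4 warmed up over the first 10K steps and then linearly decayed, β = (0.9, 0.999), ε = 1e-6, weight decay 0.01); AdamW stands in here for Adam with L2 weight decay, and the linear layer is only a placeholder for the encoder.

```python
import torch

model = torch.nn.Linear(768, 768)   # placeholder for the Transformer encoder

# Hedged values: BERT's reported Adam settings; RoBERTa retunes the peak lr and warmup per setting.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), eps=1e-6, weight_decay=0.01)

total_steps, warmup_steps = 1_000_000, 10_000   # BERT's original schedule length

def lr_lambda(step):
    # Linear warmup to the peak learning rate, then linear decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```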
Experimental Setup
Implementation
- Optimization hyperparameters are kept the same as in the original BERT, except for the peak learning rate and the number of warmup steps, which are tuned separately for each setting.
- Unlike Devlin et al. (2019), we do not randomly inject short sequences, and we do not train with a reduced sequence length for the first 90% of updates; we train only with full-length sequences.
- We train with mixed precision floating point arithmetic on DGX-1 machines, each with 8 × 32GB Nvidia V100 GPUs interconnected by Infiniband (Micikevicius et al., 2018)
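A minimal sketch of mixed-precision training with PyTorch's torch.cuda.amp, shown only to illustrate the technique mentioned above rather than the authors' actual training loop; the tiny model and random data are placeholders, and a CUDA-capable GPU is assumed.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(768, 2).cuda()                 # placeholder for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                   # scales the loss to avoid fp16 underflow

# Dummy data standing in for a real dataloader of (features, labels) batches on the GPU.
loader = [(torch.randn(8, 768).cuda(), torch.randint(2, (8,)).cuda()) for _ in range(10)]

for x, y in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                    # forward pass runs in mixed precision
        loss = F.cross_entropy(model(x), y)
    scaler.scale(loss).backward()                      # backward on the scaled loss
    scaler.step(optimizer)                             # unscales gradients, then steps
    scaler.update()                                    # adjusts the scale factor
```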
Data: Five English-language corpora of varying sizes and domains, totaling over 160GB of uncompressed text
- BOOKCORPUS (Zhu et al., 2015) plus English WIKIPEDIA. This is the original data used to train BERT. (16GB)
- CC-NEWS, which we collected from the English portion of the CommonCrawl News dataset (Nagel, 2016). (76GB after filtering)
- OPENWEBTEXT (Gokaslan and Cohen, 2019) (38GB)
- STORIES (31GB)
Evaluation: GLUE, SQuAD, RACE
Training Procedure Analysis
Static vs. Dynamic Masking
Static Masking: masking is performed once during data preprocessing. Training data was duplicated 10 times so that each sequence is masked in 10 different ways over the 40 epochs of training. Thus, each training sequence was seen with the same mask four times during training.
Dynamic Masking: the masking pattern is generated every time a sequence is fed to the model, which is important for larger datasets to ensure the model does not see the same mask over the course of training.
Result: dynamic masking is comparable or slightly better than static masking
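A minimal sketch contrasting the two masking strategies; it reuses the hypothetical `mask_tokens` helper from the MLM sketch earlier, and the toy corpus is a placeholder.

```python
import random
import torch

# Toy corpus of already-tokenized, fixed-length sequences.
# `mask_tokens` is the hypothetical helper sketched in the MLM section above.
dataset = [torch.randint(100, (128,)) for _ in range(1000)]

# Static masking (original BERT): masks are fixed at preprocessing time. Duplicating the
# data 10x gives 10 mask patterns per sequence; over 40 epochs each pattern is seen 4 times.
static_data = [mask_tokens(seq, mask_token_id=4, vocab_size=100)
               for seq in dataset for _ in range(10)]

# Dynamic masking (RoBERTa): a fresh mask is drawn every time a sequence is fed to the
# model, so the same mask is essentially never seen twice during training.
def dynamic_batch(dataset, batch_size=32):
    batch = random.sample(dataset, batch_size)
    return [mask_tokens(seq, mask_token_id=4, vocab_size=100) for seq in batch]
```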
Model Input Format and NSP
- SEGMENT-PAIR+NSP (original BERT format): Each input has a pair of segments, which can each contain multiple natural sentences, but the total combined length must be less than 512 tokens.
- SENTENCE-PAIR+NSP: Each input contains a pair of natural sentences, either sampled from a contiguous portion of one document or from separate documents. Since these inputs are significantly shorter than 512 tokens, the batch size is increased so that the total number of tokens remains similar to SEGMENT-PAIR+NSP.
- FULL-SENTENCES (no NSP loss): Each input is packed with full sentences sampled contiguously from one or more documents, such that the total length is at most 512 tokens.
- DOC-SENTENCES (no NSP loss): Inputs are constructed similarly to FULL-SENTENCES, except that they may not cross document boundaries.
- Result: removing the NSP loss matches or slightly improves downstream performance; DOC-SENTENCES performs slightly better than FULL-SENTENCES, but FULL-SENTENCES is used because DOC-SENTENCES results in variable batch sizes.
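A minimal sketch of the two NSP-free packing schemes above, assuming each document is given as a list of already-tokenized sentences; the only difference is whether an input may cross a document boundary.

```python
def pack_full_sentences(documents, max_len=512):
    """FULL-SENTENCES: pack sentences contiguously, allowed to cross document boundaries."""
    inputs, current = [], []
    for doc in documents:
        for sent in doc:
            if current and len(current) + len(sent) > max_len:
                inputs.append(current)   # start a new input when the next sentence won't fit
                current = []
            current += sent
    if current:
        inputs.append(current)
    return inputs

def pack_doc_sentences(documents, max_len=512):
    """DOC-SENTENCES: same packing, but an input never crosses a document boundary."""
    inputs = []
    for doc in documents:
        current = []
        for sent in doc:
            if current and len(current) + len(sent) > max_len:
                inputs.append(current)
                current = []
            current += sent
        if current:
            inputs.append(current)       # flush at the end of every document
    return inputs
```

Because inputs packed near the end of a document can be much shorter than 512 tokens, DOC-SENTENCES needs a varying batch size to keep the total token count per batch comparable, which is why FULL-SENTENCES is used in the rest of the experiments.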
Training with Larger Batches
Devlin et al. (2019) originally trained BERT-BASE for 1M steps with a batch size of 256 sequences. This is equivalent in computational cost, via gradient accumulation, to training for 125K steps with a batch size of 2K sequences, or for 31K steps with a batch size of 8K.
Result: training with large batches improves both masked-LM perplexity and end-task accuracy; RoBERTa trains with batches of 8K sequences
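A minimal sketch of the gradient-accumulation equivalence described above: accumulating gradients over several micro-batches before each optimizer step approximates a single update with the correspondingly larger batch. The toy model and random data are placeholders; here 8 micro-batches of 256 sequences stand in for one 2K-sequence update.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(768, 2)                  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 8                                  # 8 micro-batches of 256 ~= one 2K-sequence batch

# Dummy micro-batches standing in for a real dataloader.
micro_batches = [(torch.randn(256, 768), torch.randint(2, (256,))) for _ in range(32)]

optimizer.zero_grad()
for i, (x, y) in enumerate(micro_batches):
    loss = F.cross_entropy(model(x), y)
    (loss / accum_steps).backward()              # average gradients across the micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()                         # one "large-batch" update every accum_steps
        optimizer.zero_grad()
```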
Text Encoding
- Original BERT: character-level BPE vocab of 30K
- RoBERTa: byte-level BPE vocab of 50K
- Results: byte-level BPE achieves slightly worse end-task performance on some tasks, but it is still used because the advantages of a universal encoding scheme outweigh the minor degradation in performance; this encoding is used in the remainder of the experiments.
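To make the contrast concrete, a short sketch using the HuggingFace transformers checkpoints as stand-ins: `bert-base-uncased` ships a ~30K WordPiece vocabulary built over characters, while `roberta-base` ships the 50K byte-level BPE vocabulary, which can encode any input without unknown tokens.

```python
from transformers import BertTokenizer, RobertaTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")    # ~30K vocab, character-based subwords
roberta_tok = RobertaTokenizer.from_pretrained("roberta-base")   # 50K byte-level BPE vocab

text = "RoBERTa uses byte-level BPE 🙂"
print(bert_tok.tokenize(text))     # symbols outside the vocab can fall back to [UNK]
print(roberta_tok.tokenize(text))  # every byte is representable, so nothing maps to an unknown token
```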
RoBERTa
Results I:
- The chart above shows the effects of training over more data and of training for more steps: both further improve downstream performance
Results II (GLUE):
- Ensembles on the test set: finetuning starts from the single-task MNLI model
- Outperforms both BERT-LARGE and XLNet-LARGE even though it uses essentially the same architecture as BERT -> implications about the relative importance of pre-training objectives and data?
- Outperforms or matches the performance of all other models, even though RoBERTa relies only on single-task finetuning while the other models rely on multi-task finetuning
Results III (SQuAD):
- RoBERTa is finetuned only on the SQuAD training data, while other models are finetuned on additional external training data
My Thoughts
- I had read this paper before, and it is a somewhat old paper by the standards of this field, but since it is used as the base model for recent models such as Longformer and Big Bird, I read it once more and wrote up this summary.
- Reading it again, I felt that the paper is well written and the model is very simple and elegant. It is impressive that such simple changes produced large performance gains. I suspect the authors achieved such strong results because they thought hard about the model's design.
- It was also striking that they gave up some performance, however little, for the sake of model stability and universal applicability. It makes me wonder: if they had used DOC-SENTENCES inputs and character-level BPE tokenization, would performance have improved much more?
- We also tend to assume that more training data will always translate into better-performing models. However, the authors of RoBERTa show that changes to the training design have a far greater impact on performance: RoBERTa pretrained on the same data as BERT achieved improvements of roughly 5 points on some tasks.
- While reading the paper, I could feel the authors' pride: rather than mindlessly collecting ever more data to pretrain on, we took the model's training recipe apart and rebuilt it! Is this what they call swag?