RoBERTa: A Robustly Optimized BERT Pretraining Approach (2019)
August 27, 2023

RoBERTa (Robustly optimized BERT approach) is an improved recipe for training BERT models. BERT uses two objectives during pretraining, masked language modeling and next sentence prediction (NSP); RoBERTa uses masked language modeling only. The authors also increased the batch size and the Byte-Pair Encoding (BPE) vocabulary size. RoBERTa is trained with byte-level BPE, which uses bytes instead of Unicode characters as the base subword units.
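To make the byte-level distinction concrete, here is a quick illustration in plain Python (not the actual tokenizer): a character-level vocabulary treats "é" as one base symbol, while a byte-level vocabulary sees its two UTF-8 bytes, so any string can be represented from a fixed set of 256 base units.

```python
text = "café"
print(list(text))                 # character-level base units: ['c', 'a', 'f', 'é']
print(list(text.encode("utf-8"))) # byte-level base units: [99, 97, 102, 195, 169]
```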
BERT takes as input a concatenation of two segments, \(x_1, \dots, x_N\) and \(y_1, \dots, y_M\). The concatenation is presented as a single sequence with special delimiters: \([CLS], x_1, \dots, x_N, [SEP], y_1, \dots, y_M, [EOS]\). \(M\) and \(N\) are constrained so that \(M + N < T\), where \(T\) is a parameter that controls the maximum sequence length during training.
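A minimal sketch of this input packing, assuming string placeholders for the special tokens and a naive truncation strategy (both are illustrative choices, not the reference implementation):

```python
def pack_segments(segment_a, segment_b, max_len):
    """Concatenate two token lists with special delimiters so the result fits in max_len."""
    budget = max_len - 3                 # reserve positions for [CLS], [SEP], [EOS]
    a, b = list(segment_a), list(segment_b)
    # Naive truncation: trim the longer segment until the pair fits the budget.
    while len(a) + len(b) > budget:
        if len(a) >= len(b):
            a.pop()
        else:
            b.pop()
    return ["[CLS]", *a, "[SEP]", *b, "[EOS]"]

print(pack_segments(["the", "cat", "sat"], ["on", "the", "mat"], max_len=10))
# ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'on', 'the', 'mat', '[EOS]']
```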
The masked language model randomly masks some of the tokens in the input, and the objective is to predict the original tokens based only on their context. The original BERT implementation duplicates the training data 10 times so that each sequence is masked in 10 different ways over the 40 epochs of training; as a result, each training sequence is seen with the same mask four times. RoBERTa instead uses dynamic masking: a new masking pattern is generated every time a sequence is fed to the model.
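A minimal sketch of dynamic masking, assuming BERT's usual 15% masking rate and 80/10/10 replacement split (these rates come from the original BERT recipe; the function and toy vocabulary are illustrative):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def dynamic_mask(tokens, mask_prob=0.15):
    """Return (inputs, labels); a fresh masking pattern is sampled on every call."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                       # predict the original token at this position
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK                  # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(VOCAB)  # 10%: replace with a random token
            # remaining 10%: keep the token unchanged
    return inputs, labels

# Every call (i.e., every time the sequence is fed to the model) yields a different mask.
seq = ["the", "cat", "sat", "on", "the", "mat"]
print(dynamic_mask(seq))
print(dynamic_mask(seq))
```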
NSP is a binary classification loss for predicting whether two segments follow each other in the original text. RoBERTa drops the NSP loss entirely. Instead, each input is packed with full sentences sampled contiguously from one or more documents, such that the total length is at most \(T\) tokens (the FULL-SENTENCES input format in the paper).
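A sketch of packing whole sentences contiguously into inputs of at most \(T\) tokens; the greedy packing and the omission of separator tokens are simplifying assumptions, not the reference implementation:

```python
def pack_full_sentences(sentences, T):
    """Greedily pack whole sentences (lists of tokens) into inputs of length <= T."""
    inputs, current = [], []
    for sent in sentences:
        if current and len(current) + len(sent) > T:
            inputs.append(current)   # emit the current input and start a new one
            current = []
        current = current + sent
    if current:
        inputs.append(current)
    return inputs

sentences = [["a"] * 3, ["b"] * 4, ["c"] * 2, ["d"] * 5]
print(pack_full_sentences(sentences, T=8))
# [['a', 'a', 'a', 'b', 'b', 'b', 'b'], ['c', 'c', 'd', 'd', 'd', 'd', 'd']]
```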
BPE is a simple data compression technique that iteratively replaces the most frequent pair of symbols in a sequence with a single, unused symbol. The original BERT implementation uses a character-level BPE vocabulary of size 30K. RoBERTa instead uses a byte-level BPE vocabulary containing 50K subword units.
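A minimal sketch of the BPE merge loop on a toy character-level corpus; the word-frequency representation and the number of merges are illustrative and not RoBERTa's actual tokenizer:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words split into characters, with frequencies.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(4):                       # learn 4 merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", words)
```

Each iteration picks the most frequent adjacent pair and fuses it into a new symbol; the learned merge rules, applied in order, define the subword vocabulary.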