Setup

With BERT, the input sequence is a concatenation of two sequences of tokens, $x_1, \dots, x_N$ and $y_1, \dots, y_M$. This input sequence is presented as follows:

$[\mathrm{CLS}],\ x_1, \dots, x_N,\ [\mathrm{SEP}],\ y_1, \dots, y_M,\ [\mathrm{EOS}]$

$N$ and $M$ are subject to the constraint $N + M < T$, where $T$ is the maximal context/sequence length during training. The sequences are delimited by the special tokens shown above. The two sequences are sampled contiguously from the same document with probability 0.5, or from distinct documents otherwise.
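
As a rough sketch of this packing step (assuming plain Python lists of token strings; the special-token constants and the `pack_segments` helper are illustrative, not taken from any particular library):

```python
# Minimal sketch: pack two token segments into one BERT-style input.
CLS, SEP, EOS = "[CLS]", "[SEP]", "[EOS]"
T = 512  # maximal context/sequence length during training


def pack_segments(x, y, max_len=T):
    """Concatenate segments x and y with delimiting special tokens."""
    # Enforce N + M < T, leaving room for the three special tokens.
    assert len(x) + len(y) < max_len - 3, "segments must satisfy N + M < T"
    return [CLS] + x + [SEP] + y + [EOS]


x = ["the", "cat", "sat"]   # first segment (N tokens)
y = ["on", "the", "mat"]    # second segment (M tokens), same document w.p. 0.5
print(pack_segments(x, y))
# ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'on', 'the', 'mat', '[EOS]']
```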

Training Objective

  1. Pick a random sample of the tokens in the input sequence and replace them with a special token [MASK].
  2. The MLM objective is a cross-entropy loss on predicting the masked tokens.
  3. For BERT, the masking works as follows (see the sketch after this list):
    1. Select 15% of the input token positions uniformly at random for possible replacement.
    2. Out of the selected tokens, replace 80% with [MASK], leave 10% unchanged, and replace the remaining 10% with a randomly selected token from the vocabulary.
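
A minimal sketch of this 80/10/10 corruption, assuming token sequences as lists of strings and a stand-in `vocab` list (the function name and signature are hypothetical; for simplicity, selection here is done per token with probability 0.15 rather than picking an exact 15% subset):

```python
import random


def mask_tokens(tokens, vocab, select_prob=0.15, mask_token="[MASK]"):
    """Corrupt a token sequence BERT-style; labels are None where no loss is taken."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() >= select_prob:   # ~85% of positions are left alone
            continue
        labels[i] = tok                      # MLM cross-entropy is computed here only
        r = random.random()
        if r < 0.8:                          # 80% of selected: replace with [MASK]
            corrupted[i] = mask_token
        elif r < 0.9:                        # 10% of selected: keep the original token
            pass
        else:                                # 10% of selected: random vocabulary token
            corrupted[i] = random.choice(vocab)
    return corrupted, labels
```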

Static vs. Dynamic Masking

In the RoBERTa paper, a dynamic masking strategy is used.

  • Static Masking: In the original BERT implementation, masking is performed once during data preprocessing, i.e., it is fixed for the whole training run. This would result in the same mask being applied to a given sequence/training instance in every epoch, so they instead duplicate the training set 10 times and mask it in 10 different ways. Over 40 epochs, each training sequence is therefore seen with the same mask four times (a sketch contrasting both strategies follows this list).
  • Dynamic Masking: Generate a masking pattern every time a sequence is fed to the model.
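
The contrast can be sketched on top of the `mask_tokens` helper above (again illustrative, not the paper's implementation): static masking fixes the corrupted copies once at preprocessing time, while dynamic masking re-runs the corruption every time a batch is assembled, so each pass over a sequence can see a different pattern.

```python
# Static masking: corrupt once during preprocessing and reuse the same
# masked copies for every epoch (duplicated 10x, as described above).
def build_static_dataset(sequences, vocab, num_duplicates=10):
    return [mask_tokens(seq, vocab) for seq in sequences for _ in range(num_duplicates)]


# Dynamic masking: corrupt at batch-construction time, so a fresh masking
# pattern is generated every time a sequence is fed to the model.
def make_dynamic_batch(sequences, vocab):
    return [mask_tokens(seq, vocab) for seq in sequences]
```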

Dynamic masking becomes crucial when pre-training for more steps or with larger datasets. It performs comparably to static masking while being more efficient.