BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension (2019)
October 7, 2023

BART is a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained on corrupted text and learns to reconstruct the original text. The authors experimented with several noising functions that corrupt text, including token masking, token deletion, text infilling, sentence permutation, and document rotation. BART with text infilling, where text spans are sampled with span lengths drawn from a Poisson distribution (λ = 3) and each span is replaced with a single mask token, demonstrated the most consistently strong performance.
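As a rough illustration, the snippet below sketches text infilling in plain Python/NumPy: spans are sampled, their lengths drawn from a Poisson distribution, and each span is replaced by a single mask token (a length-0 span just inserts a mask). The whitespace tokenization, the `mask_ratio` bookkeeping, and the `<mask>` string are illustrative simplifications, not the paper's actual BPE-based implementation.

```python
import numpy as np

def text_infilling(tokens, mask_ratio=0.3, poisson_lam=3.0,
                   mask_token="<mask>", seed=0):
    """Corrupt a token sequence BART-style: sample span lengths from
    Poisson(poisson_lam) and replace each sampled span with one mask token."""
    rng = np.random.default_rng(seed)
    budget = int(round(mask_ratio * len(tokens)))  # rough cap on masked tokens
    out, i, masked = [], 0, 0
    while i < len(tokens):
        # Randomly decide whether a corruption span starts at this position.
        if masked < budget and rng.random() < mask_ratio:
            span_len = rng.poisson(poisson_lam)
            out.append(mask_token)   # the whole span becomes a single mask token
            i += span_len            # skip the original span (0-length spans insert a mask)
            masked += span_len
        else:
            out.append(tokens[i])
            i += 1
    return out

print(text_infilling("the quick brown fox jumps over the lazy dog".split()))
```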
BART is a sequence-to-sequence model with a bidirectional encoder and a left-to-right autoregressive decoder. BART replaces the ReLU activation functions of the Transformer with GeLUs. The authors view the model as a standard Transformer-based neural machine translation architecture that generalizes BERT and GPT. Compared with BERT, BART differs in two ways: each decoder layer performs cross-attention over the final hidden layer of the encoder, and BART does not use the additional feed-forward network that BERT applies before word prediction.
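The layout described above maps fairly directly onto PyTorch's built-in `nn.Transformer`. The sketch below is a minimal, hypothetical stand-in (tiny illustrative hyperparameters, no positional embeddings), meant only to show the bidirectional encoder, the causally masked autoregressive decoder with cross-attention, and the GeLU activations; it is not BART's actual implementation.

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Minimal sketch of the layout described above: a bidirectional
    Transformer encoder, a left-to-right (causally masked) decoder whose
    layers cross-attend to the encoder output, and GeLU activations.
    Hyperparameters are illustrative, not BART's; positional embeddings
    are omitted for brevity."""
    def __init__(self, vocab_size=1000, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            activation="gelu",   # BART swaps the Transformer's ReLUs for GeLUs
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # The causal mask keeps the decoder autoregressive; the encoder is unmasked.
        causal = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(self.embed(src_ids), self.embed(tgt_ids),
                                  tgt_mask=causal)
        return self.lm_head(hidden)  # logits for reconstructing the original text

model = TinySeq2Seq()
logits = model(torch.randint(0, 1000, (2, 16)), torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 1000])
```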
If the downstream task is a classification task, the same input is fed into both the encoder and the decoder. An additional token, analogous to BERT's CLS token, is appended to the end of the input, and the final hidden state of that token from the last decoder layer is fed into a linear classifier. Placing the token at the end lets its decoder representation attend to the complete input.
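In practice this fine-tuning setup is exposed by the Hugging Face `transformers` library; the sketch below assumes that library is installed and uses the public `facebook/bart-base` checkpoint with a hypothetical two-label task, so it is illustrative rather than part of the original paper.

```python
import torch
from transformers import BartTokenizer, BartForSequenceClassification

# Newly initialized classification head on top of pretrained BART (2 labels is arbitrary here).
tok = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForSequenceClassification.from_pretrained("facebook/bart-base", num_labels=2)

inputs = tok("BART can be fine-tuned for classification.", return_tensors="pt")
with torch.no_grad():
    # The same input is fed to the encoder and decoder; the head reads the
    # final decoder hidden state of the end-of-sequence token.
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```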