Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2020)
The authors of this paper performed experiments with the Text-to-Text Transfer Transformer (T5), a unified framework for NLP.
The basic idea underlying T5 is to treat various NLP problems as taking text as input and producing new text as output.
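For instance, translation, summarization, and even regression tasks like STS-B are all cast as text in, text out by prepending a task prefix to the input. The sketch below illustrates this with prefixes drawn from the paper's examples; the helper function itself is hypothetical, not the authors' code:

```python
def to_text_to_text(task, **fields):
    """Format a task instance as a T5-style prefixed input string.

    The prefixes mirror examples from the paper; this helper is an
    illustrative assumption, not part of the T5 codebase.
    """
    if task == "translate_en_de":
        return f"translate English to German: {fields['text']}"
    if task == "summarize":
        return f"summarize: {fields['text']}"
    if task == "stsb":
        # Even regression fits the framework: the target similarity
        # score is emitted as a text string such as "3.8".
        return (f"stsb sentence1: {fields['sentence1']} "
                f"sentence2: {fields['sentence2']}")
    raise ValueError(f"unknown task: {task}")

# Every task shares one input format and one textual output format:
print(to_text_to_text("translate_en_de", text="That is good."))
# → translate English to German: That is good.
```

Because inputs and outputs are always plain text, a single model, loss, and decoding procedure can serve every task.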
Their goal is to explore general language-learning abilities rather than to propose new methods.
They are interested in exploring the limits of transfer learning for NLP by scaling up models and data sets beyond what has previously been considered.
To perform experiments at scale, they created Colossal Clean Crawled Corpus (C4), a dataset consisting of hundreds of gigabytes of clean English text scraped from the web.
They leveraged Common Crawl to create C4.
They used heuristics to clean up the Common Crawl web-extracted text, and used langdetect to filter out any pages not classified as English.
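A rough sketch of this style of heuristic filtering is below. The specific rules and thresholds are illustrative approximations of those described in the paper, and the `is_english` callback stands in for the langdetect-based check:

```python
def clean_page(text, is_english):
    """Apply C4-style heuristics to one web-extracted page.

    Returns the cleaned page text, or None if the page is dropped.
    The rules here are illustrative, not the authors' exact pipeline.
    """
    if not is_english(text):
        return None                  # drop pages not classified as English
    if "lorem ipsum" in text.lower():
        return None                  # drop placeholder/boilerplate pages
    if "{" in text:
        return None                  # drop pages with leftover source code
    lines = []
    for line in text.splitlines():
        line = line.strip()
        # keep only lines ending in terminal punctuation
        if line.endswith((".", "!", "?", '"')):
            lines.append(line)
    # drop very short pages (the paper uses a sentence-count threshold)
    if len(lines) < 3:
        return None
    return "\n".join(lines)

page = "Welcome!\nClick here\nThis is a sentence.\nAnother sentence here.\n"
print(clean_page(page, is_english=lambda t: True))
```

In the real pipeline, langdetect supplies the language classification and the surviving text is additionally deduplicated, which is what keeps C4 both large and clean.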