WebText is an internet dataset created by scraping URLs extracted from Reddit submissions with a minimum score of 3, used as a proxy for quality. It was collected to train the original GPT-2 and was never released to the public; however, researchers independently reproduced the pipeline and released the resulting dataset, called OpenWebTextCorpus (OWT).
OpenWebText2 is an enhanced version of the original OpenWebTextCorpus covering all Reddit submissions from 2005 up until April 2020, with further months becoming available after the corresponding PushShift dump files are released.
OpenWebText2 is now live!
Comes pre-cleaned and pre-processed:
- Deduplicated by URL
- Filtered by a minimum combined Reddit score of 3
- Deduplicated at the document level with MinHashLSH
- 17,103,059 documents
- 65.86 GB uncompressed text
- 28 GB compressed including text and metadata
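The document-level deduplication step can be illustrated in miniature. The sketch below shows the general MinHash + LSH technique: each document is reduced to a fixed-length signature of minimum hash values over its word n-grams, signatures are split into bands, and two documents sharing any band are treated as likely duplicates. All parameters here (128 permutations, 5-word shingles, the 32x4 banding layout) are illustrative assumptions, not the exact configuration used for OpenWebText2, which relied on a full MinHashLSH implementation rather than this toy one.

```python
import hashlib
from collections import defaultdict

# Illustrative parameters (assumptions, not the OpenWebText2 settings).
NUM_PERM = 128           # hash "permutations" per signature
BANDS = 32               # LSH bands; 32 bands x 4 rows targets a moderate
ROWS = NUM_PERM // BANDS # Jaccard-similarity threshold

def shingles(text, n=5):
    """Word n-grams used as the document's feature set."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash(features):
    """For each of NUM_PERM seeded hash functions, keep the minimum value."""
    sig = []
    for seed in range(NUM_PERM):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{f}".encode()).digest()[:8], "big")
            for f in features
        ))
    return tuple(sig)

def dedup(docs):
    """Keep the first document seen from each group of LSH candidates."""
    buckets = defaultdict(list)  # (band index, band values) -> kept doc ids
    kept = []
    for doc_id, text in docs:
        sig = minhash(shingles(text))
        bands = [(b, sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]
        if any(buckets[band] for band in bands):
            continue  # shares a band with a kept document: likely duplicate
        for band in bands:
            buckets[band].append(doc_id)
        kept.append(doc_id)
    return kept
```

An exact or near-exact copy of a kept document produces a matching band and is dropped, while an unrelated document almost never collides on all four rows of any band.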