CARP

GPT-Neo

GPT-Neo is an implementation of model- and data-parallel autoregressive language models that uses Mesh TensorFlow for distributed computation on TPUs. Even though it made contributors want to pull their hair out throughout its development and lifetime, GPT-Neo was used to train a family of models ranging from 125 million to 2.7 billion parameters on the TensorFlow Research Cloud. The flagship 1.3B and 2.7B models of this family were trained during December 2020 and January 2021.
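
To give a flavor of the approach, below is a minimal sketch of the Mesh TensorFlow style of parallelism: tensors are built from named dimensions, and a layout maps those names onto axes of a device mesh, splitting some dimensions for data parallelism and others for model parallelism. This is not GPT-Neo's actual training code; the dimension names, mesh shape, layout, and device list are illustrative assumptions.

```python
# Sketch of Mesh TensorFlow-style parallelism (not GPT-Neo's training code).
import mesh_tensorflow as mtf
import tensorflow.compat.v1 as tf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "example_mesh")

# Named dimensions: the layout rules below decide which of them are split
# across the device mesh (batch -> data parallelism, d_ff -> model parallelism).
batch = mtf.Dimension("batch", 8)
d_model = mtf.Dimension("d_model", 512)
d_ff = mtf.Dimension("d_ff", 2048)

x = mtf.import_tf_tensor(mesh, tf.zeros([8, 512]), [batch, d_model])
w = mtf.get_variable(mesh, "w", mtf.Shape([d_model, d_ff]))
h = mtf.relu(mtf.einsum([x, w], output_shape=[batch, d_ff]))

# Lower the abstract graph onto a concrete 2x4 mesh of devices.
mesh_shape = [("rows", 2), ("cols", 4)]
layout_rules = [("batch", "rows"), ("d_ff", "cols")]
devices = ["cpu:0"] * 8  # stand-in for TPU cores
mesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(
    mesh_shape, layout_rules, devices)
lowering = mtf.Lowering(graph, {mesh: mesh_impl})
```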

Mesh Transformer JAX

Mesh Transformer JAX is an implementation of model- and data-parallel autoregressive language models that uses Haiku and the xmap/pjit operators in JAX to distribute computation across TPUs. It is the designated successor to GPT-Neo. Mesh Transformer JAX was used to train a six-billion-parameter language model throughout May and the first week of June 2021. Upon its release on June 8, 2021, GPT-J-6B became the highest-performing autoregressive language model freely available to the public.
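
As a rough illustration of the same idea, the sketch below shards a single matrix multiply across a two-axis device mesh using the current jax.sharding API rather than the original xmap/pjit operators the project used; the axis names, mesh shape, and tensor sizes are illustrative assumptions.

```python
# Sketch of data + model parallelism over a 2D device mesh (not GPT-J's code).
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 2D mesh: one axis for data parallelism, one for model parallelism.
# On a single CPU this degenerates to a trivial mesh, but the same code
# scales to a TPU pod slice (e.g. 8 devices reshaped to 2x4).
devices = np.array(jax.devices())
mesh = Mesh(devices.reshape(devices.size, 1), axis_names=("data", "model"))

# Activations sharded along the batch ("data") axis,
# weights sharded along the output-feature ("model") axis.
x = jax.device_put(jnp.ones((8, 512)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((512, 2048)), NamedSharding(mesh, P(None, "model")))

@jax.jit
def layer(x, w):
    # The compiler propagates the input shardings, so the output ends up
    # partitioned across both mesh axes without manual communication.
    return jnp.dot(x, w)

y = layer(x, w)
print(y.shape, y.sharding)
```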

OpenWebText2

WebText is an internet dataset created by scraping the pages behind URLs extracted from Reddit submissions, using a minimum score of 3 as a proxy for quality. It was collected to train the original GPT-2 and was never released to the public; however, researchers independently reproduced the pipeline and released the resulting dataset as OpenWebTextCorpus (OWT). OpenWebText2 is an enhanced version of the original OpenWebTextCorpus covering all Reddit submissions from 2005 up to April 2020, with further months becoming available after the corresponding PushShift dump files are released.
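
The sketch below shows what the URL-extraction step of such a pipeline might look like. It is not the actual OpenWebText2 code: it assumes a PushShift submissions dump in zstd-compressed, newline-delimited JSON, and the field names (`score`, `url`, `is_self`) are assumptions.

```python
# Hypothetical sketch: pull outbound links from a PushShift submissions dump,
# keeping only submissions with score >= 3 as a quality proxy.
import io
import json
import zstandard as zstd

def extract_urls(dump_path, min_score=3):
    urls = set()
    with open(dump_path, "rb") as fh:
        # Large PushShift dumps use a big zstd window; allow it explicitly.
        reader = zstd.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            sub = json.loads(line)
            # Skip self-posts (no external link) and low-scoring submissions.
            if sub.get("is_self") or sub.get("score", 0) < min_score:
                continue
            url = sub.get("url")
            if url:
                urls.add(url)
    return urls

# Usage: the resulting URLs would then be deduplicated and scraped.
# urls = extract_urls("RS_2020-04.zst")
```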

The Pile

The Pile is a large, diverse, open-source language modelling dataset made up of many smaller datasets combined together. The objective is to obtain text from as many modalities as possible so that models trained on The Pile have much broader generalization abilities. The Pile is now live!
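
As a purely conceptual sketch of combining component datasets into one training mixture, the snippet below samples documents from each component in proportion to a weight; the component names and weights are illustrative and are not The Pile's actual composition or sampling code.

```python
# Conceptual sketch: weighted mixture over component datasets (illustrative only).
import random

def mixture(components, weights, seed=0):
    """Yield documents by repeatedly picking a component in proportion to its weight."""
    rng = random.Random(seed)
    names = list(components)
    iters = {name: iter(docs) for name, docs in components.items()}
    while iters:
        name = rng.choices(names, weights=[weights[n] for n in names])[0]
        try:
            yield next(iters[name])
        except StopIteration:
            # A component ran dry: drop it and keep sampling from the rest.
            del iters[name]
            names.remove(name)

components = {
    "web": ["web doc 1", "web doc 2"],
    "code": ["code doc 1"],
    "papers": ["paper doc 1"],
}
weights = {"web": 0.6, "code": 0.2, "papers": 0.2}

for doc in mixture(components, weights):
    print(doc)
```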