GPT-Neo is an implementation of model & data-parallel autoregressive language models, utilizing Mesh TensorFlow for distributed computation on TPUs.
Even though it made contributors want to pull their hair out throughout its development and lifetime, GPT-Neo was used to train a family of models ranging from 125 million to 2.7 billion parameters on the TensorFlow Research Cloud. The flagship 1.3B and 2.7B models of this family were released in March 2021.
The GPT-Neo codebase is deprecated and no longer maintained. We recommend that those looking for a TPU-centric codebase consider Mesh Transformer JAX, and that those looking to run the GPT-Neo models use the implementations available in Hugging Face Transformers.
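For reference, a minimal sketch of generating text from one of the released GPT-Neo checkpoints with the Transformers pipeline API (this assumes `transformers` and a backend such as PyTorch are installed; `EleutherAI/gpt-neo-1.3B` is the 1.3B checkpoint hosted on the Hugging Face Hub):

```python
from transformers import pipeline

# Load the 1.3B GPT-Neo checkpoint from the Hugging Face Hub.
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

# Sample a short continuation of a prompt.
output = generator(
    "EleutherAI is",
    max_new_tokens=50,  # generate up to 50 new tokens
    do_sample=True,     # sample instead of greedy decoding
    temperature=0.9,
)
print(output[0]["generated_text"])
```

The same snippet works for the other released sizes by swapping the model ID (e.g. `EleutherAI/gpt-neo-125m` or `EleutherAI/gpt-neo-2.7B`).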