How did this all start?

One day, Connor Leahy posted in the TPU Podcast Discord:

To which Leo Gao replied:

@Daj this but unironically

And so it began.

Where did the name come from?

In Ancient Greek, eleutheria is a word for “liberty”, and was used as a proper noun as a personification of the concept. This same personage became Libertas to the Romans and Lady Liberty to Americans.

So . . . what’s the deal with your logo?

Keeping with the theme, our logotype and all images (other than the logo) on this website were generated with deep learning techniques. Our triangular logo was designed by Sid Black and later refined by Eric Hallahan.

What ways are there to support EleutherAI?

We are lucky to currently have most of the material support we could need, and we are primarily bottlenecked by man-hours. What we need are energetic contributors who can contribute to and lead projects of their own! We are happy to support promising AI research, in particular alignment-relevant work.

How can I get involved?

Join our Discord or check us out on GitHub! We’re an open community, so you are free to contribute as you wish. However, we expect newcomers either to be fairly knowledgeable or to sit on the sidelines until they understand the internal structure and culture of our operations.

Where can I go if I have more questions?

For general questions, Discord is the best place to ask. Our founding members appear in purple and our core contributors appear in sea green and blue. They will be able to provide helpful guidance or answer questions.

For more professional inquiry, drop us a line at

We politely ask that you do not expect us to be your tech support; those who contribute to EleutherAI do so in their free time and tend to prefer contributing to projects rather than debugging your problems. We recommend consulting the corresponding documentation before asking us for help. If you think you have found a bug, please consider opening an issue on GitHub.

I’m new to deep learning. How do I get into AI? What is a transformer? Tell me how everything works!

We are a research-focused Discord server, not an educational one. We welcome beginners to lurk and talk about topics they are knowledgeable about, but this is not the place to get intro-level resources or answers to basic questions. We have links to several excellent beginner-friendly Discord servers in the #communities channel.

Large Language Models

What are GPT-Neo and GPT-NeoX?

GPT-Neo and GPT-NeoX are our codebases for training massive language models, both of which are released under open licenses. The models themselves are referred to by their size (in millions or billions of parameters).

All active work is in GPT-NeoX, and GPT-Neo should be considered deprecated. For those looking for code that runs on TPUs, we recommend Mesh Transformer JAX instead.

What is the largest model you have trained?

On February 9, 2022, we released a 20 billion parameter model trained on the Pile, GPT-NeoX-20B.

Are you serious when you say you are going to train a model comparable to the biggest GPT-3 (175 billion parameters)?

Yes, that is the plan. We expect our final model to be somewhere between 150 and 200 billion parameters.

Have you considered the possible risks of creating models like these?

Yes, we have considered at length the risks of creating and releasing such models, and we have concluded that the benefits outweigh the risks.

When do you plan to have more models available?

As a collective of volunteer researchers and engineers who contribute in our free time, we are unable to commit to either a timeline or a roadmap for future models.

How are you training such large models?

For GPT-Neo and GPT-J, we utilized our access to preemptible TPUs through the TPU Research Cloud (TRC).

For GPT-NeoX, we have been graciously offered high-performance GPU compute by CoreWeave. CoreWeave is excited by the open nature of the project and has been instrumental in helping us scale our efforts to train larger autoregressive language models.

What differentiates GPT-Neo, GPT-NeoX, and Mesh Transformer JAX?
Why develop so many codebases?

Built from the ground up upon Mesh TensorFlow for training on TRC TPUs, GPT-Neo was our first attempt at building a codebase that could scale to many billions of parameters. The Hobson’s choice of Mesh TensorFlow did not come without consequences; the TPU-first design of the framework forever constrained GPT-Neo. This, in addition to the desire for a codebase less likely to make contributors pull their hair out, led us to pursue alternative means for our further endeavors in scaling. Our distaste for the codebase did not stop us from putting it to good use, though: it most notably resulted in GPT-Neo 1.3B and 2.7B, released March 21, 2021.

Apart from appending the 24th letter of the ISO basic Latin alphabet, GPT-NeoX is our GPU codebase. Built upon Megatron-LM and DeepSpeed, it was motivated by our access to compute resources: It is unrealistic for us to train models larger than a few tens of billions of parameters on TRC TPUs in reasonable time due to the preemption of instances. CoreWeave offered us a path to train models at the scales we wanted, but we needed to utilize their GPUs for training instead of TPUs. As such, it made sense to build a new codebase to take full advantage of their GPU hardware.

Mesh Transformer JAX came as a stopgap solution to the woes of an international chip shortage blocking much of GPT-NeoX development throughout the spring and summer of 2021. What started as work towards a DALL-E-like multimodal model was soon rescoped to train more traditional language models. Built upon JAX and Haiku with the then-new TPU VM in mind, it is the designated successor to the original GPT-Neo codebase for training models on TPUs. This project ultimately resulted in the release of GPT-J-6B on June 8, 2021.

What about volunteer-driven distributed computing, like BOINC, Folding@home, or hivemind?

We have considered the possibility of pooling volunteer resources for training models, but upon thorough review, we have concluded that such approaches are not a viable option today. There are numerous problems with current distributed approaches for us:

  • Backpropagation is dense and sensitive to precision, therefore requiring high-bandwidth communication. Consumer-grade internet connections are wholly insufficient.
  • Mixture-of-experts-based models tend to significantly underperform monolithic (regular) models for the same number of parameters.
  • Having enough contributors to outweigh the high overhead is infeasible.
  • Verifiability and resistance to outside attack are not currently possible without significant additional overhead.

In short, doing volunteer-driven distributed compute well for this use case is an unsolved problem.
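To make the bandwidth point concrete, here is a rough back-of-the-envelope calculation. The 20-billion-parameter model size and 100 Mbps link speed are illustrative assumptions, not measurements, and the sketch deliberately ignores compression, sharding, and communication overlap:

```python
# Illustrative estimate only: time to exchange one full set of fp32 gradients
# for a 20-billion-parameter model over a consumer internet connection,
# assuming naive synchronization with no compression, sharding, or overlap.

def sync_time_seconds(n_params: float, link_mbps: float, bytes_per_grad: int = 4) -> float:
    """Seconds to transfer n_params gradients over a link of link_mbps megabits/s."""
    bits = n_params * bytes_per_grad * 8
    return bits / (link_mbps * 1e6)

seconds = sync_time_seconds(20e9, 100)  # 20B parameters, 100 Mbps home connection
print(f"{seconds / 3600:.1f} hours per synchronization")  # roughly 1.8 hours
```

Even with optimistic assumptions, a single naive gradient exchange takes hours on a home connection, while datacenter interconnects complete the same exchange in well under a second.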

Have you considered more efficient architectures or methods?

Our intention is not to perfectly replicate the architecture used by GPT-3, but instead to build models comparable to what OpenAI has built. We are committed to exploring the entire space of architectures and methods, including various linear-scaling attention mechanisms, mixture-of-experts, and other designs. However, in our experience, these designs are not always well suited to language modeling: attention mechanisms that scale linearly with sequence length are often strictly incompatible with the autoregressive objective used for text generation, and the remaining methods have fared poorly in our testing. Engineering is full of trade-offs, and silver-bullet research breakthroughs are uncommon. If and when new methodologies surpass what we already have, we will integrate and use them.

Will I be able to run models on my computer locally, offline?

The answer is highly dependent on hardware and configuration.

No, you will not be able to run a model the size of full-scale GPT-3 on your first-generation MacBook Air. 175 billion parameters at single precision (binary32) take up 700 GB, and realistically the entire model needs to be loaded into memory for inference. It is unlikely that consumer hardware will be able to run anything of that scale for years to come, even on CPU. To run models beyond a few billion parameters, expect to need systems with large amounts of compute and memory.

Smaller models can be run on more pedestrian hardware: 125 million parameters take up only 500 MB and should run on a basic laptop without a hitch, while 1.3 billion parameters take up 5 GB and should run on capable personal computers without issue.
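The figures above follow from simple arithmetic: at single precision, each parameter occupies 4 bytes. A minimal sketch:

```python
# Back-of-the-envelope model memory footprint at single precision (binary32):
# parameters * 4 bytes, ignoring activations and framework overhead.

def model_size_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Approximate in-memory size in gigabytes (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9

print(model_size_gb(175e9))  # GPT-3 scale: 700.0 GB
print(model_size_gb(1.3e9))  # 5.2 GB
print(model_size_gb(125e6))  # 0.5 GB
```

Note that this is a lower bound on memory needs; inference also requires room for activations, and half-precision (binary16) weights would halve these figures.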

If you are interested in running inference with or fine-tuning our models, we recommend using the implementations of GPT-Neo and GPT-J found in Hugging Face Transformers, which are far easier to install and use than our research code. We do not support or maintain the Hugging Face implementations beyond our organization on the Model Hub, and issues with Transformers or its usage should be directed elsewhere (such as the Hugging Face community forums).
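As a sketch of what this looks like in practice, assuming `transformers` and a backend such as PyTorch are installed (the 125M-parameter checkpoint, roughly 500 MB, is downloaded on first use):

```python
# Minimal sketch: text generation with GPT-Neo 125M via the Hugging Face
# Transformers pipeline API. Weights are fetched from the Model Hub on first run.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M")
out = generator("EleutherAI is", max_length=30, do_sample=True)
print(out[0]["generated_text"])
```

Larger checkpoints such as GPT-Neo 1.3B, 2.7B, and GPT-J-6B are used the same way, subject to the memory requirements discussed above.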

Are the codebases free software?

GPT-Neo is MIT-licensed, while GPT-NeoX is licensed under Apache 2.0. These are the most permissive licenses we are able to offer for each codebase.

Mesh Transformer JAX is licensed under Apache 2.0.

Are the models free software?

EleutherAI is licensing models under Apache 2.0. If you use our models, we would highly appreciate you citing or displaying your usage of them.

How should I cite your models?

We ask that you cite both the codebase and the dataset together when citing models. For example, our recommended citation method for the GPT-Neo models is as follows.

In the document body:

X.XB GPT-Neo \citep{gpt-neo} model trained on the Pile \citep{pile}

BibTeX entries:

@article{pile,
  title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}

@software{gpt-neo,
  author = {Black, Sid and Gao, Leo and Wang, Phil and Leahy, Connor and Biderman, Stella},
  title = {{GPT-Neo}: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow},
  url = {},
  version = {1.0},
  year = {2021}
}

The Pile

What is the Pile? What is in it?

The Pile is an 825 GiB diverse, open-source language modeling dataset consisting of 22 smaller, high-quality datasets combined. For more information, please read the paper or the datasheet on arXiv.

Who can use the Pile?

The Pile was primarily designed for researchers training large-scale language models. It may also be of interest to researchers studying topics such as bias, online discourse, and text compression.

Where can I get the Pile?

The data can be downloaded here.

Can I add something to the Pile?

Pile v1 is finalized and is no longer accepting contributions.

Have you considered adding Discord logs?

Yes. We decided against it: Discord users have good privacy reasons not to expect or want their conversations added to a public dataset, and collecting one would most likely also violate Discord’s Terms of Service. In general, such logs are more trouble than they’re worth.

Can I make my own version of the Pile?

Of course! For just this reason, all of the components and the Pile creation process are reproducible. The code used to create the Pile can be found here. Links to the code for reproducing each component are also available at that repo.