Q: How did this all start?

A: On July 3rd, 2020, Connor Leahy (@Daj) posted in the TPU Podcast Discord:

Hey guys lets give OpenAI a run for their money like the good ol' days

To which Leo Gao (@bmk) replied:

this but unironically

And so it began.

Q: Where did the name come from?

A: In Ancient Greek, eleutheria is a word for “liberty”, and was used as a proper noun as a personification of the concept. This same personage became Libertas to the Romans and Lady Liberty to Americans.

Q: How can I get involved?

A: Join our Discord or check us out on GitHub! We’re an open community, so you are free to contribute as you wish. However, we expect newcomers either to be fairly knowledgeable or to sit on the sidelines until they understand the internal structure and culture of our operations.
If you are interested, check out our page on getting involved.

Q: Are there any other ways to support EleutherAI?

A: Yes. If you or someone you know has access to large quantities of CPU, GPU, or TPU resources, send a message to Sid Black (@Sid) on Discord with more details.

Q: Where do the images on this website come from?

A: Keeping with the theme, our logotype and all images on this website were generated with deep learning techniques.

Q: Where can I go if I have more questions?

A: Discord is the best place for that. Our founding members appear in purple and our core contributors appear in blue. They will be able to provide helpful guidance or answer questions.

However, we ask that you do not expect us to be your tech support; those who contribute to EleutherAI do so in their free time and tend to prefer contributing to projects rather than debugging your problems. We recommend consulting the corresponding documentation before asking us for help. If you think you have found a bug, please consider opening an issue on GitHub.

Q: I’m new to deep learning. How do I get into AI? What is a transformer? Tell me how everything works!

A: We are a research-focused Discord server and not an educational one. We welcome beginners to lurk and talk about topics they are knowledgeable about, but this is not the place to get intro-level resources or answers to basic questions. We have links to several excellent beginner-friendly servers on Discord in the #communities channel.

GPT⁠-⁠Neo and GPT⁠-⁠NeoX

Q: What are GPT⁠-⁠Neo and GPT⁠-⁠NeoX?

A: GPT⁠-⁠Neo and GPT⁠-⁠NeoX are our codebases for training massive language models, which we plan to release under open licenses. The models themselves are referred to by their size (in millions or billions of parameters).

Q: How big is the largest publicly available model you have trained?

A: On March 21st, 2021 we released a 2.7 billion parameter model trained upon the Pile.

Q: Are you serious when you say you are going to train a model comparable to the biggest GPT⁠-⁠3 (175 billion parameters)?

A: Yes, that is the plan. We expect our final model to be somewhere between 150 and 200 billion parameters.

Q: Have you considered the possible risks of creating a model like this?

A: Yes, we have considered the risks of creating such models at length. Although EleutherAI contributors have nuanced opinions, we have reached a consensus around the following arguments:

  • Given the continuing advancement of technology, it is impossible to prevent these kinds of models from becoming widespread. We cannot put the genie back in the bottle.
  • Any sufficiently well-funded actor (including but not limited to large corporations and foreign intelligence services) could already have built such models outside of the public eye. There is good reason to believe multiple already have, or are at least in the process of doing so. Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models estimates that such models could be completed within a year of the publication of Language Models are Few⁠-⁠Shot Learners.
  • Without open access to such models to study, performing critical safety research is difficult. We intend to make these models accessible to assist academics in such research.
  • It is difficult to entrust the risk assessments of new technologies to for-profit corporations, even those with the best of intentions. This is especially true when those corporations have a clear financial incentive to keep the aforementioned technologies exclusive.

Q: When do you plan to have a model of that scale trained? Wouldn’t that take a long time?

We asked some of our GPT⁠-⁠Neo and GPT⁠-⁠NeoX contributors about their predictions, and we got the following responses:

Leo Gao (@bmk)

Before we all become paperclips—if we’re lucky.
Connor Leahy (@Daj)

Before the next Millennium Prize Problem is solved.
Stella Biderman (@StellaAthena)

Exactly 1.21 gigaseconds after you read this.
Shivanshu Purohit (@triggerhappygandi)

In less time than it took Voyager I to reach interstellar space.
Eric Hallahan (@EricHallahan)

Before the heat death of the universe.
A: As a collective of volunteer researchers and engineers who contribute in our free time, we are unable to commit to a timeline for when larger models will become available. However, our best predictions agree that a model in the range of 150 to 200 billion parameters would be ready no earlier than August 2021. We would ideally like to be done by the end of 2021, and there is no hard deadline on completion except for the heat death of the universe.

Our estimates for how long a model of that magnitude will take to train lie somewhere in the four-to-five-month range, given the right hardware and optimizations.

Q: How are you training such large models?

A: For GPT⁠-⁠Neo, we utilize our limited access to preemptible TPUs through the TPU Research Cloud (TRC). For our future models to be trained with GPT⁠-⁠NeoX, we have been graciously offered high-performance GPU compute by CoreWeave. CoreWeave is excited by the open nature of the project and is keen to help us break the OpenAI-Microsoft monopoly on massive autoregressive language models.

Q: What differentiates GPT⁠-⁠NeoX from GPT⁠-⁠Neo?

A: GPT⁠-⁠Neo is a codebase built from the ground up on Mesh TensorFlow and designed for training on TPUs.
Apart from appending the 24th letter of the ISO basic Latin alphabet, GPT⁠-⁠NeoX is an entirely separate, in-development codebase based on Megatron⁠-⁠LM and DeepSpeed and designed for GPUs.

Q: Why do you need GPT⁠-⁠NeoX when you have GPT⁠-⁠Neo? Why maintain two codebases?

A: Our motivation for developing GPT⁠-⁠NeoX is our access to compute resources: It is not realistic for us to use TRC TPUs to train models larger than around 20 billion parameters. Although TRC can potentially provide enough TPUs to train such large models, the compute is unavailable for the time we would need due to the preemption of instances. Even with a v3⁠-⁠2048, a model between 150 and 175 billion parameters would require months to train. CoreWeave provides us with a path to train models at the scales we would like, but we need to use their GPUs for training instead of TPUs.

We therefore have two reasons to retire the GPT-Neo codebase in favor of developing GPT-NeoX:

  • Mesh TensorFlow handles TPUs and GPUs differently, and code designed for use with TPUs is not guaranteed to work well on GPUs.
  • It makes sense to build a new codebase to take full advantage of GPU hardware—even tiny performance improvements can add up to substantial time and resource savings.

Q: What about volunteer-driven distributed computing, like BOINC, Folding@Home, or hivemind?

A: We have considered the possibility of pooling volunteer resources for training models, but upon thorough review, we have concluded that such approaches are not a viable option today. There are numerous problems with current distributed approaches for us:

  • Backpropagation is dense and sensitive to precision, therefore requiring high-bandwidth communication.
  • Mixture-of-experts-based models tend to significantly underperform monolithic (regular) models for the same number of parameters.
  • Having enough contributors to outweigh the high overhead is infeasible.
  • Verifiability and resistance to outside attack are not currently possible without significant additional overhead.

In short, doing volunteer-driven distributed compute well for this use case is an unsolved problem. If you have expertise in this area, drop us a line and we will be happy to hear you out.

Q: Have you considered more efficient architectures or methods? Have you considered distillation?

A: Our intention is not to perfectly replicate the architecture used by GPT⁠-⁠3 but to build models comparable to what OpenAI has built. We are committed to exploring the entire space of architectures and methods, including various linear-scaling attention mechanisms, mixture-of-experts, and other designs. However, in our experience, these designs are not always well suited to language modeling: Attention mechanisms that scale with linear complexity with respect to sequence length are often strictly incompatible with the autoregressive objective used for text generation, and the remaining methods have fared poorly in our testing. Engineering is full of trade-offs, and silver-bullet research breakthroughs are uncommon occurrences. If and when new methodologies surpass what we already have, we will integrate and use them.

Our agreement with CoreWeave includes a stipulation that we attempt distillation on the final model to make it easier to deploy. It is unknown if distillation is advantageous at these scales, but we intend to find out.

Q: Will I be able to run models on my computer locally, offline?

A: The answer is highly dependent on hardware and configuration.

No, you will not be able to run a model the size of the full-scale GPT⁠-⁠3 on your first-generation MacBook Air. 175 billion parameters at single precision (binary32) take up 700 gigabytes, and realistically the entire model needs to be loaded into memory for inference. It is unlikely that consumer hardware will be able to run anything of that scale for years to come, even on CPU. Running models beyond a few billion parameters requires systems with large amounts of compute and memory.

Smaller models can be run on more pedestrian hardware: 125 million parameters take up only 500 Megabytes and should run on a basic laptop without a hitch, while 1.3 billion parameters take up 5 Gigabytes and should run on capable personal computers without issue.
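
The arithmetic above is easy to reproduce. A minimal Python sketch (a rough estimate of weight memory only; actual usage also includes activations and framework overhead):

```python
def model_memory_gb(n_params: int, bytes_per_param: int = 4) -> float:
    """Approximate memory footprint of the model weights alone.

    bytes_per_param defaults to 4, i.e. single precision (binary32).
    """
    return n_params * bytes_per_param / 1e9

# Full-scale GPT-3: 175 billion parameters at binary32
print(model_memory_gb(175_000_000_000))  # 700.0 (GB)

# Smaller models mentioned above
print(model_memory_gb(125_000_000))      # 0.5 (GB, i.e. 500 MB)
print(model_memory_gb(1_300_000_000))    # 5.2 (GB)
```

Note that half precision (binary16) halves these figures, which is one reason reduced-precision inference is a common way to squeeze larger models onto limited hardware.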

If you are interested in running inference with or fine-tuning our models, we highly recommend using the implementation in Hugging Face Transformers, which is far easier to both install and use than our research code. However, we do not support or maintain the Hugging Face implementation beyond our organization on the Model Hub, and issues with Transformers or its usage should be directed elsewhere.

Q: Are the codebases free software?

A: GPT⁠-⁠Neo is MIT-licensed, while GPT⁠-⁠NeoX is licensed under Apache 2.0. These are the most permissive licenses we can provide for each codebase respectively.

Q: Are the models free software?

A: EleutherAI is licensing models under Apache 2.0. If you use our models, we would greatly appreciate it if you cite them or otherwise acknowledge your usage.

The Pile

Q: What’s in the Pile?

A: The Pile is a 1.25 Terabyte dataset constructed from a curated conglomeration of diverse, high-quality text datasets. It covers a wide gamut, from academic writing to legal texts, to online literature, video subtitles, and more. This abundance means that saying precisely what is in this meta-dataset is difficult. If you are interested in exploring this, send a message in the #the-pile channel on Discord.

Q: What’s the format of the Pile?

A: We use a simple, compressed JSON format of our own design called lm_dataformat (LMD). It is designed to make writing, storing, and reading text simple and performant. Every logical document maps to a JSON object with text and meta fields, and batches of these objects are compressed using zstd or gzip. Any kind of corpus that goes into the Pile—whether HTML, ePUB, PDF extraction, etc.—will be converted into LMD.
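
As an illustration only (this is not the actual lm_dataformat API, which provides its own reader and writer classes), the layout can be sketched with the Python standard library using gzip compression:

```python
import gzip
import json

# Each logical document is a JSON object with "text" and "meta" fields.
docs = [
    {"text": "First document body.", "meta": {"source": "example"}},
    {"text": "Second document body.", "meta": {"source": "example"}},
]

# Write a batch of documents as gzip-compressed JSON lines.
with gzip.open("batch.jsonl.gz", "wt", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# Read the batch back, one document per line.
with gzip.open("batch.jsonl.gz", "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["text"])  # First document body.
```

One object per line keeps reading streamable: a consumer can decompress and parse documents one at a time without loading the whole batch into memory.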

Q: Who can use the Pile?

A: The Pile was primarily designed for researchers training large-scale language models. It also may be of interest to other researchers interested in topics such as bias, online discourse, and text compression.

Q: Is the Pile released yet?

A: Yes! Read the preprint on arXiv here.

Q: Where can I get the Pile?

A: We provide all of the code necessary to replicate the Pile yourself. Additionally, the community of data aficionados at The-Eye are distributing pre-built versions as well.

Q: Can I add something to the Pile?

A: Yes! All contributions should be sent to the version2 branch. Pile v1 is finalized and is no longer accepting contributions.

Q: Have you considered adding Discord logs?

A: Yes. We decided against it, as there are good privacy reasons Discord users may not expect or want their conversations unwittingly added to a public dataset like this. Collecting such a dataset would most likely also violate Discord’s ToS. In general, such logs are more trouble than they’re worth.

Q: Can I make my own version of the Pile?

A: Of course! For precisely this reason, all of the components and the Pile creation process are reproducible. If you want to reproduce a component, look for a repo labeled pile-[COMPONENT] or pile_[COMPONENT]. This repo is where you should go if you want to build your own Pile out of the same base datasets. We may also provide links to pre-processed components so you can mix, match, and re-sample to derive your own.