Language Modeling
The ability of a computer to understand, interpret, and generate human language is at the heart of what we do at EleutherAI.
Current Projects
Releases
A 55 billion token dataset of mathematical and scientific documents, created for training the Llemma models.
A 14.7B token dataset of high-quality English mathematical text.
A series of Korean autoregressive language models built by the EleutherAI Polyglot team. To date we have trained and released 1.3B, 3.8B, and 5.8B parameter models.
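The released checkpoints can be loaded with the Hugging Face transformers library. The sketch below assumes the 1.3B checkpoint is hosted on the Hub under the ID EleutherAI/polyglot-ko-1.3b; the other sizes follow the same pattern.

```python
# Minimal sketch: loading a Polyglot-Ko checkpoint from the Hugging Face Hub.
# The repository ID below is an assumption; substitute the model size you want.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/polyglot-ko-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate a short Korean continuation from a prompt.
inputs = tokenizer("안녕하세요, 저는", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```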
Papers
As frontier AI systems are pretrained on web-scale data, test set contamination has become a critical concern for accurately assessing their capabilities. While research has thoroughly investigated the impact of test set contamination on discriminative evaluations like multiple-choice question-answering, comparatively little research has studied the impact of test set contamination on generative evaluations. In this work, we quantitatively assess the effect of test set contamination on generative evaluations through the language model lifecycle. We pretrain language models on mixtures of web data and the MATH benchmark, sweeping model sizes and number of test set replicas contaminating the pretraining corpus; performance improves with contamination and model size. Using scaling laws, we make a surprising discovery: including even a single test set replica enables models to achieve lower loss than the irreducible error of training on the uncontaminated corpus. We then study further training: overtraining with fresh data reduces the effects of contamination, whereas supervised finetuning on the training set can either increase or decrease performance on test data, depending on the amount of pretraining contamination. Finally, at inference, we identify factors that modulate memorization: high sampling temperatures mitigate contamination effects, and longer solutions are exponentially more difficult to memorize than shorter ones, presenting a contrast with discriminative evaluations, where solutions are only a few tokens in length. By characterizing how generation and memorization interact, we highlight a new layer of complexity for trustworthy evaluation of AI systems.
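As a concrete illustration of the setup described above, the sketch below builds a pretraining mixture in which k verbatim replicas of a held-out test set are injected into a web corpus, the knob the abstract sweeps alongside model size. The helper function, document lists, and sweep values are hypothetical placeholders, not code from the paper.

```python
import random

def build_contaminated_corpus(web_docs, test_docs, num_replicas, seed=0):
    """Mix a web corpus with num_replicas verbatim copies of the test set,
    then shuffle so contaminated documents are interleaved with web text."""
    corpus = list(web_docs) + list(test_docs) * num_replicas
    random.Random(seed).shuffle(corpus)
    return corpus

# Toy stand-ins for a web-scale corpus and a MATH-style test set.
web_docs = [f"web document {i}" for i in range(1000)]
test_docs = ["Problem: ... Solution: ...", "Problem: ... Solution: ..."]

# Sweep the number of test-set replicas, as in the experiments above:
# 0 replicas is the uncontaminated baseline; per the abstract, even a single
# replica is enough to push loss below the clean irreducible error.
for k in [0, 1, 2, 4, 8]:
    corpus = build_contaminated_corpus(web_docs, test_docs, num_replicas=k)
    # ... pretrain a model of each size on `corpus`, then evaluate generation on the test set
```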
Laura Ruis, Akbir Khan, Stella Biderman, Sara Hooker, Tim Rocktäschel, and Edward Grefenstette. "Large language models are not zero-shot communicators." arXiv preprint arXiv:2210.14986 (2022).
Reinforcement learning from human feedback (RLHF) utilizes human feedback to better align large language models with human preferences via online optimization against a learned reward model. Current RLHF paradigms rely on Proximal Policy Optimization (PPO), which quickly becomes challenging to implement and scale to large architectures. To address this difficulty we present the trlX library, a feature-complete open-source framework for RLHF fine-tuning of models up to and exceeding 70 billion parameters. We implement support for multiple types of distributed training, including distributed data parallel and model-sharded training, as well as tensor, sequential, and pipeline parallelism.
To increase the accessibility of RLHF to researchers, we implement compute- and memory-saving features that give trlX the flexibility to support users with a wide range of compute resources. This includes offline RL methods like Implicit Language Q Learning (ILQL), low-rank adapters, and the Hydra architecture. We find that offline fine-tuning offers competitive performance relative to online algorithms while being easier to implement, train, and scale. To evaluate our framework we train RLHF models on two separate, well-known tasks using publicly available human preference data. Models trained with trlX achieve preference win-rates over baselines at rates comparable to the original works.
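A minimal usage sketch follows, based on the `trlx.train` entry point with a `reward_fn` callback shown in the library's public examples; exact keyword names and defaults vary between trlX versions, and the reward function and prompts below are toy placeholders, so treat this as an assumption-laden outline rather than canonical API documentation.

```python
# Sketch of online RLHF fine-tuning with trlX. The reward function is a toy
# heuristic standing in for a learned reward model; prompts are placeholders.
import trlx

def reward_fn(samples, **kwargs):
    # Score each generated sample; a real setup would call a reward model
    # trained on human preference data instead of this length heuristic.
    return [float(len(sample.split())) / 100.0 for sample in samples]

trainer = trlx.train(
    "gpt2",                                          # base model to fine-tune
    reward_fn=reward_fn,                             # called on batches of generations
    prompts=["Summarize: the quick brown fox ..."] * 64,
)
```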
Azerbayev, Piotrowski, Schoelkopf, Ayers, Radev, and Avigad. "ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics." arXiv preprint arXiv:2302.12433 (2023).
Allal, Li, Kocetkov, et al. "SantaCoder: don't reach for the stars!." arXiv preprint arXiv:2301.03988 (2023).
Yong, Schoelkopf, Muennighoff, et al. "BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting." arXiv preprint arXiv:2212.09535 (2022).
Jason Phang, Yi Mao, Pengcheng He, and Weizhu Chen. "HyperTuning: Toward Adapting Large Language Models without Back-propagation." arXiv preprint arXiv:2211.12485 (2022).
Le Scao, et al. (incl. Tow, Biderman, Ammanamanchi, Gao, Sutawika, Teehan). "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model." arXiv preprint arXiv:2211.05100 (2022).

