NeurIPS Datasets and Benchmarks

BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing

Fries, Weber, Seelam, et al. (incl. Biderman). "BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing." In the Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.

Training and evaluating language models increasingly requires the construction of meta-datasets: diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing, supervised datasets into a diversity of novel pretraining tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BigBIO, a community library of 126+ biomedical NLP datasets, currently covering 12 task categories and 10+ languages. BigBIO facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few/zero-shot language model evaluation. We discuss our process for task schema harmonization, data auditing, and contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale, multi-task learning. BigBIO is an ongoing community effort and is available at this URL.
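
A minimal sketch of what this programmatic access can look like via the Hugging Face datasets library; the dataset name and harmonized-schema config below are illustrative assumptions, not guaranteed BigBIO identifiers:

```python
from datasets import load_dataset

# Hypothetical example: load one BigBIO dataset in its harmonized
# knowledge-base ("kb") schema. Dataset and config names are assumed
# for illustration.
ner_data = load_dataset("bigbio/bc5cdr", name="bc5cdr_bigbio_kb")

# A shared task schema means curation code can iterate over many
# datasets using the same field names.
for example in ner_data["train"].select(range(3)):
    print(example["document_id"], len(example["entities"]))
```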

arXiv

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Srivastava, Aarohi, et al. (incl. Phang, Gao, and Biderman). "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models." arXiv preprint arXiv:2206.04615, 2022.

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

NeurIPS Datasets and Benchmarks

The BigScience ROOTS Corpus: A 1.6 TB Composite Multilingual Dataset

Laurençon, et al. (incl. Biderman). "The BigScience ROOTS Corpus: A 1.6 TB Composite Multilingual Dataset." Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. Oral Presentation

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.

BigScience Workshop

You reap what you sow: On the Challenges of Bias Evaluation Under Multilingual Settings

Zeerak Talat, Aurélie Névéol, et al. (incl. Stella Biderman). "You reap what you sow: On the Challenges of Bias Evaluation Under Multilingual Settings." In Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models, 2022.

Evaluating bias, fairness, and social impact in monolingual language models is a difficult task. This challenge is further compounded when language modeling occurs in a multilingual context. Considering the implication of evaluation biases for large multilingual language models, we situate the discussion of bias evaluation within a wider context of social scientific research with computational work. We highlight three dimensions of developing multilingual bias evaluation frameworks: (1) increasing transparency through documentation, (2) expanding targets of bias beyond gender, and (3) addressing cultural differences that exist between languages. We further discuss the power dynamics and consequences of training large language models and recommend that researchers remain cognizant of the ramifications of developing such technologies.

ECCV

VQGAN-CLIP: Open domain image generation and editing

Katherine Crowson*, Stella Biderman*, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. “VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance.” In Proceedings of the European Conference on Computer Vision (ECCV), 2022.

Generating and editing images from open domain text prompts is a challenging task that heretofore has required expensive and specially trained models. We demonstrate a novel methodology for both tasks which is capable of producing images of high visual quality from text prompts of significant semantic complexity without any training by using a multimodal encoder to guide image generations. We demonstrate on a variety of tasks how using CLIP [37] to guide VQGAN [11] produces higher visual quality outputs than prior, less flexible approaches like DALL-E [38], GLIDE [33] and Open-Edit [24], despite not being trained for the tasks presented. Our code is available in a public repository.
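
The core loop of the method can be sketched roughly as follows; `vqgan`, `clip_model`, and `tokenize` stand in for pretrained components with assumed interfaces, not the paper's released code, and the latent shape is likewise an assumption:

```python
import torch

# Rough sketch of CLIP-guided VQGAN generation: optimize a VQGAN latent
# so the decoded image's CLIP embedding matches the prompt's embedding.
text_features = clip_model.encode_text(tokenize(["a watercolor painting of a fox"]))

z = torch.randn(1, 256, 16, 16, requires_grad=True)  # VQGAN latent (shape assumed)
optimizer = torch.optim.Adam([z], lr=0.05)

for step in range(300):
    image = vqgan.decode(z)                          # latent -> RGB image
    image_features = clip_model.encode_image(image)
    # Maximize image-text similarity; both models stay frozen, only z moves.
    loss = -torch.cosine_similarity(image_features, text_features, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because no component is trained for the task, the same loop handles both open-domain generation and editing (by initializing z from an existing image).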

FAccT

Data Governance in the Age of Large-Scale Data-Driven Language Technology

Jernite, Nguyen, et al. (incl. Stella Biderman). "Data Governance in the Age of Large-Scale Data-Driven Language Technology." In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency, 2022.

The recent emergence and adoption of Machine Learning technology, and specifically of Large Language Models, has drawn attention to the need for systematic and transparent management of language data. This work proposes an approach to global language data governance that attempts to organize data management amongst stakeholders, values, and rights. Our proposal is informed by prior work on distributed governance that accounts for human values and grounded by an international research collaboration that brings together researchers and practitioners from 60 countries. The framework we present is a multi-party international governance structure focused on language data, and incorporating technical and organizational tools needed to support its work.

ICLR

Multitask Prompted Training Enables Zero-Shot Task Generalization

Victor Sanh*, Albert Webson*, Colin Raffel*, Stephen H. Bach*, and 37 others (incl. Stella Biderman, Leo Gao, and Lintang Sutawika). “Multitask Prompted Training Enables Zero-Shot Task Generalization.” In the Tenth International Conference on Learning Representations (ICLR), 2022. Spotlight Paper

Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models' pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language task into a human-readable prompted form. Using this system, we convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely held-out tasks. We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size. All trained models are available at this URL and all prompts are available at this URL.
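
As a toy illustration of what such a mapping looks like (a hand-rolled example, not the prompt-templating library used by the paper; the template wording and label mapping are assumptions):

```python
# Render a supervised NLI instance as a natural language prompt
# together with its verbalized target answer.
def nli_prompt(example):
    template = (
        'Suppose "{premise}" Can we infer that "{hypothesis}"? '
        "Yes, no, or maybe?"
    )
    answers = {0: "Yes", 1: "Maybe", 2: "No"}
    return template.format(**example), answers[example["label"]]

prompt, target = nli_prompt({
    "premise": "The cat sat on the mat.",
    "hypothesis": "An animal is on the mat.",
    "label": 0,
})
print(prompt, "->", target)  # ... -> Yes
```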

BigScience Workshop

GPT-NeoX-20B: An Open-Source Autoregressive Language Model

Sid Black*, Stella Biderman*, Eric Hallahan*, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. “GPT-NeoX-20B: An Open-Source Autoregressive Language Model.” In Proceedings of the ACL Workshop on Challenges & Perspectives in Creating Large Language Models, 2022.

We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model trained on the Pile, whose weights will be made freely and openly available to the public through a permissive license. It is, to the best of our knowledge, the largest dense autoregressive model that has publicly available weights at the time of submission. In this work, we describe GPT-NeoX-20B's architecture and training and evaluate its performance on a range of language-understanding, mathematics, and knowledge-based tasks. We find that GPT-NeoX-20B is a particularly powerful few-shot reasoner and gains far more in performance when evaluated five-shot than similarly sized GPT-3 and FairSeq models. We open-source the training and evaluation code, as well as the model weights, at this URL.

TACL

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Julia Kreutzer, Isaac Caswell, et al. (incl. Biderman). “Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets.” Transactions of the Association for Computational Linguistics 10: 50-72, 2022.

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.

arXiv

Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources

McMillan-Major, Alyafeai, Biderman, et al. "Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources." arXiv preprint arXiv:2201.10066, 2022.

In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect to the rights of data subjects represented in data collections, particularly when considering the difficulty in interrogating these collections due to insufficient documentation and tools for analysis. Mindful of these pitfalls, we present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative. We identified a geographically diverse set of target language groups (Arabic, Basque, Chinese, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons. We present our development process; analyses of the resulting resource metadata, including distributions over languages, regions, and resource types; and our lessons learned in this endeavor.

arXiv

Datasheet for the Pile

Stella Biderman, Kieran Bicheno, and Leo Gao. “Datasheet for the Pile.” Preprint, 2022.

This datasheet describes the Pile, an 825 GiB dataset of human-authored text compiled by EleutherAI for use in large-scale language modeling. The Pile comprises 22 different text sources, ranging from original scrapes done for this project, to text data made available by the data owners, to third-party scrapes available online.

arXiv

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. "LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs." arXiv preprint arXiv:2111.02114, 2021.

Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g. CLIP, DALL-E) have recently surged in popularity, showing a remarkable capability to perform zero- or few-shot learning and transfer even in the absence of per-sample labels on target image data. Despite this trend, to date there have been no publicly available datasets of sufficient scale for training such models from scratch. To address this issue, in a community effort we build and publicly release LAION-400M, a dataset of 400 million CLIP-filtered image-text pairs, together with their CLIP embeddings and kNN indices that allow efficient similarity search.
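
The CLIP-filtering step can be sketched as follows; `clip_model` is an assumed pretrained model interface, and the threshold value is illustrative of this style of filtering rather than a guaranteed exact setting:

```python
import torch

# Illustrative sketch of CLIP-based pair filtering: keep a candidate
# (image, caption) pair only if their embeddings agree strongly enough.
def keep_pair(image_tensor, caption_tokens, threshold=0.3):
    with torch.no_grad():
        img = clip_model.encode_image(image_tensor)
        txt = clip_model.encode_text(caption_tokens)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
    similarity = (img * txt).sum(dim=-1).item()  # cosine similarity
    return similarity >= threshold
```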

EMNLP

What Language Model to Train if You Have One Million GPU Hours?

Le Scao, et al. (incl. Biderman, Phang, and Sutawika). "What Language Model to Train if You Have One Million GPU Hours?" In Findings of the Association for Computational Linguistics: EMNLP 2022, 2022.

The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research. However, with the emergence of state-of-the-art 100B+ parameter models, large language models are increasingly expensive to accurately design and train. Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale alone. In the process of building BLOOM--the BigScience Large Open-science Open-access Multilingual language model--our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hours budget. Specifically, we perform an ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization. In addition, we study the impact of various popular pre-training corpora on zero-shot generalization. We also study the performance of a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of Transformers to choose the target model size, shape, and training setup. All our models and code are open-sourced at this URL.

Alignment Forum

Towards Deconfusing Gradient Hacking

When we think about gradient hacking, the most intuitive framing is to consider some kind of agent embedded inside a larger network (like a GPT) that somehow intentionally modifies the loss landscape of the larger network with respect to the base loss, such that in optimizing for the base objective, the base optimizer also happens to optimize the mesaobjective. Here I consider the base objective to be a function Θ→R from the parameters of the network to the reals, with all the training data baked in for simplicity, and the mesaobjective another function Θ→R, possibly with the constraint that both objectives must be indifferent between models which behave the same on all inputs. The "somehow" is often taken to be some mechanism that perturbs or otherwise worsens the output of the larger network whenever the mesaobjective isn't met, thereby creating an incentive for gradient descent to improve the mesaobjective. One example of this line of thinking can be found in my last post about gradient hacking. Unfortunately, I think there are some confusions with this framing.
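
To make the perturbation story concrete, here is a toy construction of my own (not from the post): the base loss is inflated in proportion to how badly the mesaobjective is doing, so gradient descent on the base loss alone also improves the mesaobjective.

```python
import torch

theta = torch.tensor([2.0, 2.0], requires_grad=True)

def base_task_loss(t):   # what the base optimizer "really" cares about
    return (t[0] - 1.0) ** 2

def mesaobjective(t):    # what the embedded agent cares about
    return (t[1] - 5.0) ** 2

optimizer = torch.optim.SGD([theta], lr=0.01)
for _ in range(500):
    # The "hacked" loss landscape: base performance is degraded
    # whenever the mesaobjective is unmet.
    loss = base_task_loss(theta) * (1.0 + mesaobjective(theta))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# theta[1] is pulled toward 5.0 early on, but the incentive fades as
# the base loss approaches zero -- one hint at why such constructions
# are trickier than they first appear.
print(theta)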

arXiv

Cut the CARP: Fishing for zero-shot story evaluation

Shahbuland Matiana*, JR Smith*, Ryan Teehan*, Louis Castricato*, Stella Biderman*, Leo Gao, and Spencer Frazier. “Cut the CARP: Fishing for zero-shot story evaluation.” arXiv preprint arXiv:2110.03111, 2021.

Recent advances in large-scale language models (Raffel et al., 2019; Brown et al., 2020) have brought significant qualitative and quantitative improvements in machine-driven text generation. Despite this, generation and evaluation of machine-generated narrative text remains a challenging problem. Objective evaluation of computationally-generated stories may be prohibitively expensive, require meticulously annotated datasets, or may not adequately measure the logical coherence of a generated story's narratological structure.


Informed by recent advances in contrastive learning (Radford et al., 2021), we present Contrastive Authoring and Reviewing Pairing (CARP): a scalable, efficient method for performing qualitatively superior, zero-shot evaluation of stories. We show a strong correlation between human evaluation of stories and the evaluations produced by CARP. CARP's outputs correlate more strongly with the corresponding human judgments than those of language-model-based methods that rely on finetuning or prompt engineering. We also present and analyze the Story-Critique Dataset, a new corpus composed of 1.3 million aligned story-critique pairs derived from over 80,000 stories. We expect this corpus to be of interest to NLP researchers.
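
The pairing idea can be sketched as follows, given embeddings from two text encoders (one for stories, one for critiques); this is a conceptual illustration, not the released implementation:

```python
import torch

# CLIP-style contrastive scoring for stories and critiques: normalize
# both sets of embeddings and use cosine similarity as the match score.
def carp_scores(story_embeddings, critique_embeddings):
    s = story_embeddings / story_embeddings.norm(dim=-1, keepdim=True)
    c = critique_embeddings / critique_embeddings.norm(dim=-1, keepdim=True)
    # Entry (i, j) scores how well critique j fits story i; training
    # pushes the aligned (diagonal) pairs to be the largest entries.
    return s @ c.T
```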

Alignment Forum

Obstacles to Gradient Hacking

This post is essentially the summary of a long discussion on the EleutherAI discord about trying to exhibit gradient hacking in real models by hand-crafting an example. The discussion was sparked by this post. We didn't end up coming up with any good examples (or proofs of nonexistence), but hopefully this post is helpful for anyone else trying to construct gradient hacking examples.

Note that because our goal is to construct a concrete example of gradient hacking, when I write about "what we want" and "unfortunate" roadblocks, those are from the perspective of a mesaoptimizer (or a researcher trying to construct an example of a mesaoptimizer to study), not from the perspective of a researcher attempting to build aligned AI.

arXiv

An Empirical Exploration in Quality Filtering of Text Data

Leo Gao. “An Empirical Exploration in Quality Filtering of Text Data.” arXiv preprint arXiv:2109.00698, 2021.

While conventional wisdom suggests that more aggressively filtering data from low-quality sources like Common Crawl always monotonically improves the quality of training data, we find that aggressive filtering can in fact lead to a decrease in model quality on a wide array of downstream tasks for a GPT-like language model. We speculate that this is because optimizing sufficiently strongly for a proxy metric harms performance on the true objective, suggesting a need for more robust filtering objectives when attempting to filter more aggressively. We hope this work leads to detailed analysis of the effects of dataset filtering design choices on downstream model performance in future work.
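
As a minimal sketch of the kind of filtering being varied here (my illustration, not the paper's pipeline): aggressiveness corresponds to raising the cutoff on a proxy quality score.

```python
# Toy proxy-metric filter: keep documents whose quality score clears a
# threshold. The finding above is that pushing `threshold` too high can
# hurt downstream performance, a Goodhart-style failure of the proxy.
def filter_corpus(docs, quality_score, threshold=0.8):
    """quality_score: callable mapping a document to a score in [0, 1]."""
    return [doc for doc in docs if quality_score(doc) >= threshold]
```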

Journal of Computational Chemistry

MP-NeRF: A Massively Parallel Method for Accelerating Protein Structure Reconstruction from Internal Coordinates

Eric Alcaide, Stella Biderman, Amalio Telenti, and M. Cyrus Maher. “MP-NeRF: A Massively Parallel Method for Accelerating Protein Structure Reconstruction from Internal Coordinates.” Journal of Computational Chemistry, 2021.

The conversion of proteins between internal and Cartesian coordinates is a limiting step in many pipelines, such as molecular dynamics simulations and machine learning models. This conversion is typically carried out by sequential or parallel applications of the Natural extension of Reference Frame (NeRF) algorithm. This work proposes a massively parallel NeRF implementation which, depending on the polymer length, achieves speedups between 400 and 1200× over the previous state-of-the-art. It accomplishes this by dividing the conversion into three main phases: parallel composition of the monomer backbone, assembly of backbone subunits, and parallel elongation of sidechains; and by batching these computations into a minimal number of efficient matrix operations. Special emphasis is placed on reusability and ease of use. We open source the code (available at https://github.com/EleutherAI/mp_nerf) and provide a corresponding Python package.
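
The sequential primitive being parallelized is the standard NeRF placement step, sketched below from the textbook formulation (an illustration, not code from the linked repository):

```python
import numpy as np

def nerf_place(a, b, c, bond_length, bond_angle, torsion):
    """Place the next atom d from atoms a, b, c and internal coordinates.

    bond_angle is the b-c-d angle and torsion the a-b-c-d dihedral, in radians.
    """
    bc = (c - b) / np.linalg.norm(c - b)
    n = np.cross(b - a, bc)
    n /= np.linalg.norm(n)
    frame = np.stack([bc, np.cross(n, bc), n], axis=-1)  # local frame at c
    d_local = bond_length * np.array([
        -np.cos(bond_angle),
        np.sin(bond_angle) * np.cos(torsion),
        np.sin(bond_angle) * np.sin(torsion),
    ])
    return c + frame @ d_local

# MP-NeRF's speedup comes from batching this placement over whole
# backbones and sidechains as a few large matrix operations.
```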

The State of AI Ethics Report

The Hard Problem of Aligning AI to Human Values

Connor Leahy and Stella Biderman. "The Hard Problem of Aligning AI to Human Values." The State of AI Ethics Report 4, p. 180-183. 2021.

We discuss how common framings of AI ethics conversations underestimate the difficulty of the task at hand: if a model becomes dangerous by the mere exposure to unethical content, it is unacceptably dangerous and broken at its core. While gating such models (as OpenAI does with GPT-3) behind an API with rudimentary automatic filters plus less rudimentary human moderation is a useful temporary patch, it does not address the underlying problem. These models are fundamentally not doing what we as humans want them to do, which is to act in useful, aligned ways, not just regurgitate an accurate distribution of the text they have been trained on. We need AI that is, like humans, capable of reading all kinds of content, understanding it, and then deciding to act in an ethical manner. Indeed, learning more about unethical ideologies should enhance one's ability to act ethically and fight such toxic beliefs.

NAACL Workshop on Narrative Understanding

Towards a Model-Theoretic View of Narratives

Louis Castricato*, Stella Biderman*, Rogelio E. Cardona-Rivera, and David Thue. “Towards a Model-theoretic View of Narratives.” 3rd Workshop on Narrative Understanding at NAACL-HLT 2021, 2021.

In this paper, we propose the beginnings of a formal framework for modeling narrative qua narrative. Our framework affords the ability to discuss key qualities of stories and their communication, including the flow of information from a Narrator to a Reader, the evolution of a Reader’s story model over time, and Reader uncertainty. We demonstrate its applicability to computational narratology by giving explicit algorithms for measuring the accuracy with which information was conveyed to the Reader, along with two novel measurements of story coherence.
