Stella Biderman 16/10/2023

Proof-Pile-2

A 55-billion-token dataset of mathematical and scientific documents, created for training the Llemma models.

Stella Biderman 10/10/2023

OpenWebMath

A 14.7B-token dataset of high-quality English mathematical text.

Stella Biderman 29/06/2022

Simulacra Aesthetic Captions

A dataset of prompts, AI-generated images, and aesthetic ratings of those images.

Simulacra Aesthetic Captions is a dataset of over 238,000 synthetic images generated with AI models such as CompVis latent GLIDE and Stable Diffusion from over forty thousand user-submitted prompts. Users rated the images on their aesthetic value from 1 to 10, producing caption, image, and rating triplets. In addition, each user agreed to release all of their work with the bot (prompts, outputs, and ratings) into the public domain under the CC0 1.0 Universal Public Domain Dedication. The result is a high-quality, royalty-free dataset with over 176,000 ratings.

Stella Biderman 31/12/2020

The Pile

A large-scale corpus for training language models, composed of 22 smaller sources. The Pile is publicly available and freely downloadable, and has been used by a number of organizations to train large language models.

The Pile is a curated collection of 22 diverse high-quality datasets for training large language models.

Stella Biderman 30/12/2020

OpenWebText2

OpenWebText2 is an enhanced version of the original OpenWebTextCorpus, covering all Reddit submissions from 2005 up until April 2020. It was developed primarily to be included in the Pile.