Dataset Stella Biderman

Proof-Pile-2

A 55-billion-token dataset of mathematical and scientific documents, created for training the Llemma models.

Dataset Stella Biderman

Simulacra Aesthetic Captions

A dataset of prompts, synthetic AI-generated images, and aesthetic ratings of those images.

Simulacra Aesthetic Captions is a dataset of over 238,000 synthetic images generated with AI models such as CompVis latent GLIDE and Stable Diffusion from over forty thousand user-submitted prompts. Users rated the images on their aesthetic value from 1 to 10, yielding caption, image, and rating triplets. In addition, each user agreed to release all of their work with the bot (prompts, outputs, and ratings) into the public domain under the CC0 1.0 Universal Public Domain Dedication. The result is a high-quality, royalty-free dataset with over 176,000 ratings.

Dataset Stella Biderman

The Pile

A large-scale corpus for training language models, composed of 22 smaller sources. The Pile is publicly available and freely downloadable, and has been used by a number of organizations to train large language models.

The Pile is a curated collection of 22 diverse high-quality datasets for training large language models.

Dataset Stella Biderman

OpenWebText2

OpenWebText2 is an enhanced version of the original OpenWebTextCorpus, covering all Reddit submissions from 2005 up until April 2020. It was developed primarily to be included in the Pile.
