Stella Biderman 16/10/2023

Proof-Pile-2

A 55-billion-token dataset of mathematical and scientific documents, created for training the Llemma models.

Stella Biderman 10/10/2023

OpenWebMath

A 14.7B-token dataset of high-quality English mathematical text.