Polyglot

The anglocentric nature of Western AI research means that the overwhelming majority of resources for training LLMs have gone into monolingual English models, with monolingual Chinese and generically “massively multilingual” models taking up most of the remaining effort.

The Polyglot Project focuses on extending the benefits of large language models to cultures and contexts not well served by the current state of affairs, as well as studying best practices for doing so. This work includes training LLMs in languages other than English and Chinese, improving tools for non-English data documentation, curation, and analysis, culturally-aware research on ethics and bias in non-English LLMs, and more.

This project originated in the BigScience Research Workshop, an international collaboration to train multilingual language models, and in the volunteer efforts of many Korean NLP researchers interested in promoting access to NLP technologies. Many EleutherAI members participated in BigScience and played key roles in designing, developing, and evaluating models such as BLOOM and mT0.

Most recently, we released the Polyglot-Ko model series. These are monolingual Korean language models with 1.3B, 3.8B, and 5.8B parameters, the largest of which is the world's most powerful publicly available Korean language model. We are excited to continue to train and publicly release non-English language models.

Releases

Publications
