## Overview
This visualization shows four metrics for each language, joined on ISO 639-3 codes. Both axes use a log scale. Only languages with nonzero values on both axes are shown in the scatter plot; hover over a point to see all four metrics.
## Data sources
| Metric | Source | Notes |
|---|---|---|
| Speakers | Glottolog (primary); Wikidata P1098 (fallback) | Glottolog's population field is used when available. For languages not in Glottolog, the maximum reported value from Wikidata's P1098 property is used as a fallback (queried via SPARQL). Glottolog tends to report L1 speakers while Wikidata often includes L2, so counts may not be strictly comparable across languages. |
| Wikipedia articles | MediaWiki Siteinfo API | Article counts fetched per-wiki via action=query&meta=siteinfo&siprop=statistics. Wiki language codes mapped to ISO 639-3 using the ISO 639 reference table (Part1 -> Id). |
| FineWeb words | FineWeb-2 language distribution (CSV) | Word counts summed across all subsets and splits per ISO 639-3 code. FineWeb-2 uses language-specific word tokenizers (NLTK, SpaCy, etc.) to count words. FineWeb-2 covers non-English languages; for English, an estimate of ~11.5T words is used, derived from FineWeb v1's 15T GPT-2 tokens divided by ~1.3 tokens/word. |
| HuggingFace models | HF Model Survey (by_language_stats.csv) | Number of models tagged with each ISO 639-3 language code on HuggingFace Hub. |
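The Wikipedia article counts above come from the standard MediaWiki Action API. A minimal sketch of the per-wiki request (the helper names are illustrative, not from the actual pipeline):

```python
# Sketch of fetching per-wiki article counts via the MediaWiki siteinfo API.
# The query parameters match action=query&meta=siteinfo&siprop=statistics;
# actually sending the request (e.g. with urllib) is omitted here.
from urllib.parse import urlencode

def siteinfo_url(wiki_code: str) -> str:
    """Build the statistics query URL for one Wikipedia edition."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }
    return f"https://{wiki_code}.wikipedia.org/w/api.php?{urlencode(params)}"

def extract_article_count(response_json: dict) -> int:
    """Pull the article count out of a decoded siteinfo JSON response."""
    return response_json["query"]["statistics"]["articles"]
```

The response nests the count under `query.statistics.articles`, which is what `extract_article_count` unwraps.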
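The Wikidata speaker-count fallback can be expressed as a SPARQL query of roughly this shape, where P220 is Wikidata's "ISO 639-3 code" property and P1098 is "number of speakers"; taking MAX collapses multiple reported values to the largest one, as described above. This is a sketch of the query text only; submitting it to the query service is left out.

```python
def speakers_query(iso3: str) -> str:
    """Build a SPARQL query for the maximum reported speaker count
    of the language with the given ISO 639-3 code."""
    return f"""
    SELECT (MAX(?pop) AS ?maxPop) WHERE {{
      ?lang wdt:P220 "{iso3}" ;
            wdt:P1098 ?pop .
    }}
    """
```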
## Macrolanguage aggregation
ISO 639-3 distinguishes macrolanguages (like "Arabic" ara or "Chinese" zho) from their individual member languages (like "Standard Arabic" arb, "Egyptian Arabic" arz, "Mandarin Chinese" cmn, "Yue Chinese" yue, etc.). Different data sources use different granularities:
- FineWeb-2 and HF Model Survey tend to use individual language codes (cmn, arb, arz, ...).
- Wikipedia uses macrolanguage-level codes: zh.wikipedia.org maps to zho, and ar.wikipedia.org maps to ara.
- Wikidata provides speaker counts at both levels.
To produce a meaningful macrolanguage entry, FineWeb word counts, HuggingFace model counts, and Wikipedia article counts for all member languages are summed into the macrolanguage code. The individual member codes remain in the dataset as separate points. Membership is determined by the official ISO 639-3 macrolanguage mappings. Speaker counts are taken directly from Wikidata's macrolanguage entry rather than summed, since Wikidata already reports aggregate figures for macrolanguages.
This aggregation is currently applied to Arabic (30 member codes) and Chinese (19 member codes).
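The roll-up described above can be sketched as follows. The member lists here are truncated for illustration; the real mapping comes from the official ISO 639-3 macrolanguage table, and the metric key names are hypothetical.

```python
# Truncated excerpt of the ISO 639-3 macrolanguage membership mapping.
MACRO_MEMBERS = {
    "ara": ["arb", "arz", "apc"],  # Arabic has 30 members in full
    "zho": ["cmn", "yue", "wuu"],  # Chinese has 19 members in full
}

def aggregate_macrolanguages(metrics: dict) -> None:
    """Sum the summable metrics from member codes into the macrolanguage
    code, leaving the member entries in place as separate data points."""
    summable = ("fineweb_words", "hf_models", "wikipedia_articles")
    for macro, members in MACRO_MEMBERS.items():
        entry = metrics.setdefault(macro, {})
        for key in summable:
            entry[key] = entry.get(key, 0) + sum(
                metrics.get(m, {}).get(key, 0) for m in members
            )
        # Speaker counts are deliberately NOT summed: Wikidata already
        # reports an aggregate figure on the macrolanguage entry itself.
```

Note that the member codes stay in `metrics` untouched, matching the behavior described above.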
## Join strategy
All datasets are joined on ISO 639-3 codes. Wikipedia language codes (approximately ISO 639-1) are mapped to ISO 639-3 via the ISO reference table (Part1 -> Id). A small number of Wikipedia editions with compound codes (e.g., be-tarask, nds-nl, simple) cannot be mapped and are excluded.
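A minimal sketch of this mapping step, assuming the Part1 -> Id columns have been loaded into a dict (the excerpt and article counts below are purely illustrative):

```python
# Tiny excerpt of the ISO 639-3 reference table's Part1 -> Id mapping.
PART1_TO_ID = {"en": "eng", "zh": "zho", "be": "bel"}

def to_iso3(wiki_code: str):
    """Return the ISO 639-3 code for a Wikipedia language code, or None
    for unmappable compound codes such as be-tarask, nds-nl, or simple."""
    return PART1_TO_ID.get(wiki_code)

# Illustrative per-wiki counts; unmappable editions are dropped by the join.
wikipedia_counts = {"en": 1000, "be-tarask": 50}
joined = {
    iso3: n
    for code, n in wikipedia_counts.items()
    if (iso3 := to_iso3(code)) is not None
}
```

Here "en" joins as "eng", while "be-tarask" is excluded because it has no entry in the reference table.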
## Caveats
- Speaker counts come from two sources with different methodologies. Glottolog generally reports L1 speakers; Wikidata entries may reflect L1, L2, or total speakers. For languages where only the Wikidata fallback is available, counts may be systematically higher.
- The HF Model Survey is a snapshot and may not reflect current HuggingFace Hub counts.
- FineWeb-2 word counts include both train and test splits.
- The English word count (~11.5T) is an estimate. FineWeb v1 reports 15T GPT-2 tokens; no official word count is published. The conversion uses a ratio of ~1.3 tokens per word, which is typical for English web text under GPT-2 tokenization.
- Languages only appear on the scatter plot when both the selected X and Y metrics are nonzero.
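The English word-count estimate in the caveat above is just this arithmetic:

```python
# ~15T GPT-2 tokens divided by an assumed ~1.3 tokens per English word.
gpt2_tokens = 15e12
tokens_per_word = 1.3  # typical ratio for English web text under GPT-2
words = gpt2_tokens / tokens_per_word  # ~11.5T, the figure used for English
```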

