FLORES-200 (CC BY-SA 4.0, managed by OLDI / Meta FAIR): 1,012
professionally translated sentences across 200 languages. The public benchmark
uses the dev split. The devtest split is kept as a private held-out
validation set to monitor benchmark drift across model versions. Raw sentences
are never published; only aggregated metrics are.
metrics
Chars/token — total characters divided by total tokens across
all benchmark sentences. Higher is better: more characters per token means the
tokenizer packs the same text into fewer units.
Fertility — tokens per word. Lower is better: fewer fragments
per word means less fragmentation before the model sees the text.
RTC (Relative Tokenization Cost) — English chars/token divided
by the language's chars/token on the same model. A score of 2.8× means that
language requires 2.8× as many tokens as English to express the same content.
English baseline = 1.0× by definition. This is the multiplier you apply to
your own token costs.
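The three metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual script. The function names, the whitespace-based word count, the use of `len()` (Unicode code points, spaces included) as the character count, and the `toy_count` stand-in tokenizer are all assumptions made for the example.

```python
from typing import Callable, Sequence

# A token counter maps one sentence to its token count.
TokenCounter = Callable[[str], int]


def toy_count(sentence: str) -> int:
    """Hypothetical stand-in tokenizer: each word splits into 3-char chunks."""
    return sum(-(-len(word) // 3) for word in sentence.split())


def chars_per_token(sentences: Sequence[str], count_tokens: TokenCounter) -> float:
    """Total characters divided by total tokens across all sentences."""
    total_chars = sum(len(s) for s in sentences)
    total_tokens = sum(count_tokens(s) for s in sentences)
    return total_chars / total_tokens


def fertility(sentences: Sequence[str], count_tokens: TokenCounter) -> float:
    """Tokens per word, aggregated over the corpus.

    Assumption: words are whitespace-separated, which does not hold for
    scripts written without spaces.
    """
    total_tokens = sum(count_tokens(s) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words


def rtc(
    english: Sequence[str], other: Sequence[str], count_tokens: TokenCounter
) -> float:
    """English chars/token divided by the other language's chars/token."""
    return chars_per_token(english, count_tokens) / chars_per_token(
        other, count_tokens
    )
```

Passing the token counter in as a plain callable keeps the metric code independent of any particular tokenizer library, so a real counter can replace `toy_count` without touching the formulas.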
tokenizer coverage
The public benchmark is a curated starter set of token counters, not an
exhaustive model compatibility map. Local counters run through tiktoken or
Hugging Face tokenizer files and produce exact counts. The CLI can also compare
additional user-supplied Hugging Face model paths without adding them to the
bundled benchmark.
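As a rough illustration of what a local counter can look like, the sketch below wraps tiktoken and a Hugging Face `tokenizer.json` file behind the same one-argument interface. The helper names (`tiktoken_counter`, `hf_counter`, `compare`) are hypothetical and are not the project's API. Excluding special tokens from the count is also an assumption. Library imports are deferred so the module loads even when only one of the two packages is installed.

```python
from typing import Callable, Dict

TokenCounter = Callable[[str], int]


def tiktoken_counter(encoding_name: str) -> TokenCounter:
    """Exact local counts from a tiktoken encoding such as "cl100k_base"."""
    import tiktoken  # deferred import: only needed when this counter is used

    enc = tiktoken.get_encoding(encoding_name)
    # encode_ordinary treats any special-token text as plain text.
    return lambda text: len(enc.encode_ordinary(text))


def hf_counter(tokenizer_json_path: str) -> TokenCounter:
    """Exact local counts from a Hugging Face tokenizer.json file."""
    from tokenizers import Tokenizer  # deferred import

    tok = Tokenizer.from_file(tokenizer_json_path)
    # Assumption: special tokens (BOS/EOS and similar) are left out of the count.
    return lambda text: len(tok.encode(text, add_special_tokens=False).ids)


def compare(text: str, counters: Dict[str, TokenCounter]) -> Dict[str, int]:
    """Run every named counter on the same text."""
    return {name: count(text) for name, count in counters.items()}
```

A call such as `compare(sentence, {"cl100k_base": tiktoken_counter("cl100k_base")})` would then return one count per named counter, provided the corresponding library and tokenizer files are available locally.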
benchmark integrity
Tokenizer efficiency is determined by the tokenizer files: vocabulary,
merge rules, normalization, and special tokens. A model cannot change those
files at inference time, so having seen the benchmark sentences during model
training does not alter a token count. This makes tokenization benchmarks
significantly more contamination-resistant than comprehension or translation
benchmarks.
Raw FLORES+ sentences are never published in any public artifact. The
benchmark scripts are open source so results are fully reproducible, but
the corpus itself must be downloaded directly from Hugging Face after
accepting its terms of use, consistent with the FLORES maintainers' request
not to re-host plain text.
reproducibility
Each benchmark release is versioned by date and pins specific tokenizer
files and library versions. The benchmark script is published at
github.com/inimaz/mothertoken. To reproduce the results yourself, see the
benchmarking documentation.
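One way to picture what pinning tokenizer files and library versions involves is a small manifest that records a content hash per file and the installed version of each library. The manifest layout and the names `sha256_of` and `build_manifest` are assumptions for illustration. The project's real release format may differ.

```python
import hashlib
from importlib import metadata
from pathlib import Path
from typing import Dict, Iterable, Optional


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large tokenizer files are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(
    release_date: str,
    tokenizer_files: Iterable[str],
    libraries: Iterable[str],
) -> Dict[str, object]:
    """Record what a dated release was computed with (hypothetical layout)."""
    versions: Dict[str, Optional[str]] = {}
    for name in libraries:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # library not installed in this environment
    return {
        "release": release_date,
        "tokenizer_files": {p: sha256_of(Path(p)) for p in tokenizer_files},
        "library_versions": versions,
    }
```

Comparing two such manifests shows whether a change in a metric comes from a different tokenizer file, a different library version, or neither.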
open source
MIT licensed. Contributions welcome — especially better model coverage,
comparison workflows, and additional language coverage.