FLORES-200 (CC BY-SA 4.0, managed by OLDI / Meta FAIR). Professionally
translated parallel sentences across 200 languages (997 per language in
dev, 1,012 in devtest). We use the dev split for the public benchmark; the
devtest split is kept as a private held-out validation set to monitor for
benchmark drift across model versions. Raw sentences are never published,
only aggregated metrics.
metrics
Chars/token — total characters divided by total tokens across
all benchmark sentences. Higher is better: more meaning per token means the
model sees more coherent units.
Fertility — tokens per word. Lower is better: fewer fragments
per word means the model reasons over concepts rather than byte sequences.
RTC (Relative Tokenization Cost) — English chars/token divided
by the language's chars/token on the same model. A score of 2.8× means the
language needs 2.8× as many tokens to express the same content as English.
English baseline = 1.0× by definition. This is the multiplier you apply to
your own token costs.
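
For concreteness, here is a minimal sketch of how all three metrics fall
out of raw counts. It is illustrative rather than the benchmark's actual
code: the function names are made up, GPT-4o's o200k_base tokenizer stands
in for "the model", and words are approximated by whitespace splitting,
which a real script must refine for languages written without spaces.

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's tokenizer

    def chars_per_token(sentences):
        # total characters divided by total tokens; higher = denser tokens
        chars = sum(len(s) for s in sentences)
        tokens = sum(len(enc.encode(s)) for s in sentences)
        return chars / tokens

    def fertility(sentences):
        # tokens per word; whitespace split is a crude word proxy
        tokens = sum(len(enc.encode(s)) for s in sentences)
        words = sum(len(s.split()) for s in sentences)
        return tokens / words

    def rtc(english_sentences, lang_sentences):
        # Relative Tokenization Cost: English density over target density,
        # so English is 1.0x by construction
        return (chars_per_token(english_sentences)
                / chars_per_token(lang_sentences))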
tokenizer coverage
Public tokenizers (GPT-4o via tiktoken, LLaMA 3 / Mistral / Qwen / Gemma
via HuggingFace) are run locally — no API calls, exact counts. Closed
tokenizers (Claude, Gemini) use the official
count_tokens endpoints provided by Anthropic and Google respectively.
Results are cached to avoid redundant API calls.
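
The division of labor looks roughly like the sketch below. This is a
simplified illustration, not the project's code: the model identifiers are
assumptions, and the Claude call assumes the count_tokens shape of the
official anthropic Python SDK (Google's SDK exposes an analogous
count_tokens method). Remote endpoints count a whole message, so they
include some request-framing overhead on top of the raw text.

    from functools import lru_cache

    import anthropic
    import tiktoken
    from transformers import AutoTokenizer

    def count_gpt4o(text):
        # local and exact: no API call involved
        return len(tiktoken.get_encoding("o200k_base").encode(text))

    @lru_cache(maxsize=None)
    def _hf_tokenizer(repo):
        # gated repos need an accepted license and a Hugging Face login
        return AutoTokenizer.from_pretrained(repo)

    def count_llama3(text):
        tok = _hf_tokenizer("meta-llama/Meta-Llama-3-8B")
        return len(tok.encode(text, add_special_tokens=False))

    @lru_cache(maxsize=None)
    def count_claude(text, model="claude-3-5-sonnet-20241022"):
        # remote: cached so the same sentence never hits the API twice
        resp = anthropic.Anthropic().messages.count_tokens(
            model=model,
            messages=[{"role": "user", "content": text}],
        )
        return resp.input_tokens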
benchmark integrity
Tokenizer efficiency is determined at training time and frozen into the
model vocabulary. A model cannot overfit its tokenizer to FLORES at
inference time — the tokenizer either fragments a word efficiently or it
doesn't, regardless of prior exposure. This makes tokenization benchmarks
significantly more contamination-resistant than comprehension or
translation benchmarks.
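
That fixedness is easy to verify: a tokenizer's split of any string is a
pure function of its frozen vocabulary, identical no matter when or how
often you ask. A quick check (the word is arbitrary; the stable output
across runs is the point):

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")
    # decoding each token id individually shows exactly how the word
    # fragments; the split never changes because the vocabulary is frozen
    print([enc.decode([t]) for t in enc.encode("internationalization")])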
Raw FLORES+ sentences are never published in any public artifact. The
benchmark scripts are open source, so results are fully reproducible, but
the corpus itself must be downloaded directly from HuggingFace after
accepting its terms agreement, consistent with the FLORES maintainers'
request not to re-host the plain text.
reproducibility
Each benchmark release is versioned by date and pins specific tokenizer
library versions. The benchmark script is published at
github.com/inimaz/mothertoken. To reproduce: clone the repo, authenticate
with Hugging Face if needed, then run

    mothertoken-benchmark --dry-run
    mothertoken-benchmark
open source
MIT licensed. Contributions welcome — especially new model support and
additional language coverage.