FLORES-200 (CC BY-SA 4.0; created by Meta FAIR, now maintained by OLDI as FLORES+). The same set of professionally translated sentences rendered in 200 languages: 997 in the dev split, 1,012 in devtest. We use the dev split for the public benchmark; the devtest split is kept as a private held-out validation set to monitor for benchmark drift across model versions. Raw sentences are never published, only aggregated metrics.
Chars/token — total characters divided by total tokens across all benchmark sentences. Higher is better: more meaning per token means the model sees more coherent units.
Fertility — tokens per word. Lower is better: fewer fragments per word means the model reasons over concepts rather than byte sequences.
RTC (Relative Tokenization Cost) — English chars/token divided by the language's chars/token on the same model. A score of 2.8× means the language needs 2.8× as many tokens as English to express the same content. English baseline = 1.0× by definition. This is the multiplier to apply to your own token costs.
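To make the three definitions concrete, here is a minimal sketch of how they fit together, assuming whitespace word segmentation and a `tokenize` callable that returns a token list; the function names are illustrative, not the benchmark's actual code:

```python
def chars_per_token(sentences, tokenize):
    # Total characters over total tokens across the corpus.
    chars = sum(len(s) for s in sentences)
    tokens = sum(len(tokenize(s)) for s in sentences)
    return chars / tokens

def fertility(sentences, tokenize):
    # Tokens per word, with words approximated by whitespace splitting.
    tokens = sum(len(tokenize(s)) for s in sentences)
    words = sum(len(s.split()) for s in sentences)
    return tokens / words

def rtc(english_sentences, lang_sentences, tokenize):
    # Relative Tokenization Cost: English chars/token over the
    # language's chars/token, same tokenizer on both sides.
    return chars_per_token(english_sentences, tokenize) / chars_per_token(lang_sentences, tokenize)
```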
Public tokenizers (GPT-4o via tiktoken, LLaMA 3 / Mistral / Qwen / Gemma via HuggingFace) are run locally — no API calls, exact counts. Closed tokenizers (Claude, Gemini) use the official count_tokens endpoints provided by Anthropic and Google respectively. Results are cached to avoid redundant API calls.
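As a sketch of the two counting paths: the model IDs below are just examples, and `api_count_tokens` is a placeholder for the vendor SDK call (e.g. Anthropic's `messages.count_tokens`), not a real function in either SDK.

```python
# Local counting for open tokenizers: offline and exact.
import tiktoken
from transformers import AutoTokenizer

text = "Das ist ein Beispielsatz."

enc = tiktoken.encoding_for_model("gpt-4o")  # resolves to o200k_base
n_gpt4o = len(enc.encode(text))

llama = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
n_llama = len(llama.encode(text, add_special_tokens=False))

# Closed tokenizers go through the vendor count_tokens endpoints.
# A minimal cache so each (model, sentence) pair costs one API call.
_cache: dict[tuple[str, str], int] = {}

def cached_count(model: str, text: str, api_count_tokens) -> int:
    key = (model, text)
    if key not in _cache:
        _cache[key] = api_count_tokens(model, text)  # placeholder call
    return _cache[key]
```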
Tokenizer efficiency is determined at training time and frozen into the model vocabulary. A model cannot overfit its tokenizer to FLORES at inference time — the tokenizer either fragments a word efficiently or it doesn't, regardless of prior exposure. This makes tokenization benchmarks significantly more contamination-resistant than comprehension or translation benchmarks.
Raw FLORES+ sentences are never published in any public artifact. The benchmark scripts are open source, so results are fully reproducible, but the corpus itself must be downloaded directly from Hugging Face after accepting the dataset's terms, consistent with the FLORES maintainers' request not to re-host the plain text.
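A sketch of that gated download path; the dataset id and split name here are assumptions, so check the dataset card for the authoritative ones:

```python
# Assumes the FLORES+ dataset id on Hugging Face; accept the terms
# in the web UI once, then authenticate with a token.
from huggingface_hub import login
from datasets import load_dataset

login()  # or set the HF_TOKEN environment variable
flores_dev = load_dataset("openlanguagedata/flores_plus", split="dev")
```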
Each benchmark release is versioned by date and pins specific tokenizer library versions. The benchmark script is published at github.com/inimaz/mothertoken. To reproduce: clone the repo, authenticate with Hugging Face if needed, then run mothertoken-benchmark --dry-run and mothertoken-benchmark.
MIT licensed. Contributions welcome — especially new model support and additional language coverage.