The Multilingual Deficit: Tokenization Inequality in LLMs

Large Language Models systematically disadvantage non-English languages at every level of their architecture — from the first byte of tokenization to the final generated word.

The Starting Point

I grew up speaking Kannada at home and English at school. That linguistic duality made me acutely aware, early on, of how differently these two languages behave in digital systems. When I started working with LLMs professionally — first in research at SJSU, then building pipelines that processed multilingual text — a pattern kept surfacing that I couldn't explain away: Kannada inputs consistently underperformed. The model would truncate context, miss nuance, produce outputs that felt like they'd been translated twice.

It took a proper quantitative analysis to understand why. The answer starts at tokenization — and it cascades upward through every layer of the architecture.


The Tokenization Architecture

Many modern LLMs, including the GPT family, use byte-level Byte-Pair Encoding (BPE) tokenizers, which operate on raw UTF-8 bytes rather than characters or words. The tokenizer learns merge rules from its training corpus: the most frequent pair of adjacent symbols is merged into a new token, and the process repeats, building up vocabulary entries that range from single bytes and characters to full English words.
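To make the merge-learning step concrete, here is a minimal sketch of byte-level BPE training in plain Python. It is a toy, not any production tokenizer: the corpus and its 95/5 split are hypothetical, the function names are made up for illustration, and whole strings are treated as single sequences with no whitespace pre-splitting.

```python
from collections import Counter

def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return tuple(out)

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings."""
    # Every text starts as a tuple of single-byte symbols.
    sequences = [tuple(bytes([b]) for b in text.encode("utf-8")) for text in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges

def encode(text, merges):
    """Tokenize `text` by replaying the merges in the order they were learned."""
    seq = tuple(bytes([b]) for b in text.encode("utf-8"))
    for pair in merges:
        seq = apply_merge(seq, pair)
    return list(seq)

# Hypothetical toy corpus: 95 English samples for every 5 Kannada samples.
corpus = ["harvest festival"] * 95 + ["ಸಂಕ್ರಾಂತಿ"] * 5
merges = train_bpe(corpus, num_merges=15)

english_tokens = encode("harvest festival", merges)
kannada_tokens = encode("ಸಂಕ್ರಾಂತಿ", merges)
print(len(english_tokens), len(kannada_tokens))
```

Because every English byte pair outnumbers every Kannada byte pair in this corpus, the merge budget is spent on English first, and the Kannada word is left mostly as raw byte fragments.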

The problem is in what "frequently" means in practice.

[Figure: BPE tokenization, English vs. Indic script]

  • Input (English): "Harvest Festival", 16 ASCII bytes, 1 byte per character
  • Input (Kannada): Sankranti, 9 Unicode code points, 27 UTF-8 bytes, 3 bytes per character
  • BPE merge rules: learned from a corpus that is ~95% English, so Indic byte pairs appear with low frequency
  • Output (English): 2 tokens, [Harvest] [Festival]; semantic units preserved
  • Output (Kannada): 8–12 byte fragments, [0xE0][0xB2][0xB8] [0xE0][0xB2][0x82] [0xE0][0xB2][0x95][0xE0][0xB3][0x8D]...; raw UTF-8 bytes with no semantic meaning
  • Downstream consequences: 3–5× more API tokens consumed per equivalent semantic content; a 128K context window shrinks to ~25K effective tokens for Kannada; the English Pivot (the model reasons in English, then re-translates its output); a $262K–$365K annual cost premium for Hindi-language API services

Token fertility (tokens per word relative to English): English 1×, Hindi 2–3×, Kannada/Tamil 3–5×


The Fertility Crisis: Quantitative Analysis

Token fertility is the average number of tokens produced per input word. It's the cleanest single metric for tokenization efficiency, and the numbers are stark.
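As a rough illustration of the metric, the sketch below computes fertility for any tokenizer passed in as a function. Splitting words on whitespace and the `byte_encode` stand-in (one token per UTF-8 byte, the worst case for a byte-level tokenizer with no useful merges) are assumptions made for the example, not the tooling used for the benchmarks.

```python
def token_fertility(texts, encode):
    """Average number of tokens per whitespace-delimited word across `texts`."""
    total_tokens = sum(len(encode(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words

def byte_encode(text):
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))

english_fertility = token_fertility(["Harvest Festival"], byte_encode)
kannada_fertility = token_fertility(["ಸಂಕ್ರಾಂತಿ"], byte_encode)
print(english_fertility, kannada_fertility)
```

To measure a real tokenizer, one would pass its encode function instead, for example `tiktoken.get_encoding("cl100k_base").encode` if the tiktoken package is installed.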

My analysis benchmarked GPT-4 and GPT-4o tokenization across 17 languages using standardized corpora:

| Language | GPT-4 avg tokens | GPT-4o avg tokens | Reduction | vs English |
|:---|:---:|:---:|:---:|:---:|
| Malayalam | 4,775 | 957 | 79% | ~4–5× |
| Kannada | 3,681 | 766 | 79% | ~3–4× |
| Telugu | 4,097 | 893 | 77% | ~3–4× |
| Tamil | 3,949 | 948 | 74% | ~3–4× |
| Hindi | 2,090 | 655 | 64% | ~2–3× |
| Urdu | 2,428 | 854 | 62% | ~2–3× |
| English | ~500 | ~500 | — | 1× |

GPT-4o's dramatic improvement (64–79% reduction) shows that vocabulary expansion does help — but the relative disadvantage of Indic languages persists. Malayalam users went from paying a 9.5× token tax compared to English in GPT-4 to paying a 4–5× tax in GPT-4o. Progress, but still structurally inequitable.


Three Bytes Per Character: The Root Cause

The UTF-8 encoding of Indic scripts is the seed of the problem. A Latin character like "A" encodes as a single byte. A Kannada character like ಸ (sa) encodes as three bytes: 0xE0 0xB2 0xB8. The BPE algorithm, which learns its merge rules from a training corpus that was up to 95% English-language text (in early models like Llama 1 and 2), never saw Kannada byte sequences with enough frequency to form stable merge pairs.

The result: Kannada text gets tokenized at or near the byte level, not the character or morpheme level. A word that carries complete semantic meaning becomes 8–12 meaningless byte fragments. The model's attention mechanism then has to work across all of those fragments to reconstruct what a human reader perceives as one unit.

The Sankrānti example (ಸಂಕ್ರಾಂತಿ — a harvest festival of enormous cultural significance):

  • Unicode codepoints: 9 (ಸ + ಂ + ಕ + ್ + ರ + ಾ + ಂ + ತ + ಿ)
  • UTF-8 bytes: 27
  • BPE result in a poorly-optimized tokenizer: fragmented across 8–12 tokens
  • English equivalent ("Harvest Festival"): 2 tokens
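The code point and byte counts above can be checked with a few lines of standard-library Python. This is only a verification sketch; the variable names are illustrative.

```python
word = "ಸಂಕ್ರಾಂತಿ"

# One entry per Unicode code point in the word.
code_points = [f"U+{ord(ch):04X}" for ch in word]
utf8_bytes = word.encode("utf-8")

print(len(code_points))                                        # 9
print(len(utf8_bytes))                                         # 27
print(" ".join(f"0x{b:02X}" for b in "ಸ".encode("utf-8")))     # 0xE0 0xB2 0xB8
print(len("Harvest Festival".encode("utf-8")))                 # 16
```

Every code point in the Kannada block (U+0C80 to U+0CFF) falls in the range that UTF-8 encodes with three bytes, which is where the 3 bytes per character figure comes from.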

The model doesn't "know" ಸಂಕ್ರಾಂತಿ as a cultural concept. It knows a sequence of byte fragments that happen to co-occur in certain contexts.


The Economic Consequence: The Token Tax

The fertility disparity isn't just an academic concern about representation quality — it has direct economic consequences that fall disproportionately on the languages and regions that are already least served by AI systems.

The scaling deficit. A company building a Hindi-language customer service application pays an estimated $262,000–$365,000 more annually per million API calls than an identical English-language service at the same semantic scale. Every Hindi token costs the same as every English token. But delivering equivalent semantic content requires 2–3× more tokens in Hindi.
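The shape of that arithmetic can be sketched as follows. The call volume, tokens per call, and price used here are hypothetical placeholders, not the inputs behind the $262,000–$365,000 estimate, so the printed figures illustrate the mechanism only.

```python
def annual_token_cost(calls_per_year, tokens_per_call, usd_per_1k_tokens):
    """Yearly spend when every token is billed at the same flat rate."""
    return calls_per_year * tokens_per_call * usd_per_1k_tokens / 1000

def annual_premium(calls_per_year, english_tokens_per_call, fertility_ratio, usd_per_1k_tokens):
    """Extra yearly spend when the same content needs `fertility_ratio` times the tokens."""
    base = annual_token_cost(calls_per_year, english_tokens_per_call, usd_per_1k_tokens)
    return base * (fertility_ratio - 1)

# Hypothetical inputs: 1M calls per year, 500 English tokens per call, $0.01 per 1K tokens.
for ratio in (2.0, 3.0):
    premium = annual_premium(1_000_000, 500, ratio, 0.01)
    print(f"{ratio:.0f}x fertility -> ${premium:,.0f} extra per year")
```

The premium scales linearly with every input, so the same 2–3× ratio that looks small on a prototype becomes a large line item at production volume.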

Context window shrinkage. A 128K-token context window, which OpenAI advertises as a breakthrough in reasoning over long documents, effectively becomes 25,000–42,000 tokens of English-equivalent content for languages tokenized at 3–5× the English rate, such as Kannada or Tamil. At Hindi's 2–3×, the same window holds roughly 43,000–64,000. A legal document, a technical specification, a long-form narrative: all of these get truncated earlier for non-English users, degrading the model's ability to reason over the full text.
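The shrinkage is a single division of the window by the fertility ratio. A minimal sketch, assuming "128K" means 128,000 tokens and using a hypothetical helper name:

```python
def effective_context(window_tokens, fertility_ratio):
    """English-equivalent content capacity of a context window, in tokens."""
    if fertility_ratio <= 0:
        raise ValueError("fertility_ratio must be positive")
    return int(window_tokens / fertility_ratio)

# Fertility ratios relative to English, taken from the ranges quoted in this post.
for ratio in (1, 2, 3, 5):
    print(f"{ratio}x -> {effective_context(128_000, ratio):,} tokens")
```

At 3× and 5× this gives about 42,700 and 25,600 tokens, which is where the 25,000–42,000 range comes from.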

The HDI correlation. There's a −0.5 correlation between a country's Human Development Index and the LLM token cost for its dominant languages. The countries least able to absorb the cost penalty are paying the highest per-semantic-unit price. This isn't incidental — it's a structural feature of how these systems were built.


Three Architectural Bottlenecks

The tokenization problem isn't isolated. It propagates upward through the model architecture in three specific ways:

1. The English Pivot

Decoder-only LLMs (GPT, Llama, Mistral) often convert non-English inputs into English-like internal representations before reasoning. The model was trained overwhelmingly on English text, so its "native" reasoning substrate is English-latent space. When processing Kannada, the model encodes the fragmented tokens, internally pivots toward an English representation, reasons in that space, then generates output back in Kannada.

This intermediate translation creates an information bottleneck: linguistic structures that have no clean English analogue — Kannada's compound verbs, agglutinative suffixes, script-level ligatures — don't survive the pivot intact. The output is technically in Kannada but structurally calqued on English syntax.

2. Positional Encoding Depletion

Fixed-length transformer architectures allocate positional encodings across the token sequence. If a Kannada user needs 4× as many tokens to express the same content as an English user, they consume 4× as many positions. In a 4,096-token context window, that leaves a Kannada user room for only as much content as an English user fits in about 1,024 tokens. Attention patterns also grow more diffuse as the sequence lengthens, compounding the quality degradation.

3. Morphological Mismatch in Arabic and Urdu

Semitic languages like Arabic use non-concatenative morphology: meaning changes through vowel patterns woven between root consonants (e.g., the root k-t-b produces كَتَبَ kataba "he wrote," كِتَاب kitāb "book," مَكْتَب maktab "desk"). Urdu, though not Semitic, inherits many such forms through its Arabic-derived vocabulary. BPE merges only adjacent symbols, which is fundamentally at odds with this interleaving structure. It can learn common whole-word forms, but it cannot systematically decompose a word into root and pattern, so it handles familiar words adequately but falls apart on low-frequency or derived forms.
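A small sketch makes the interleaving visible. It works on the Latin transliterations from the example above; the slot notation (digits 1 to 3 marking root consonants) and the helper names are invented for illustration.

```python
def apply_pattern(root, pattern):
    """Fill the digit slots 1, 2, 3 of `pattern` with the consonants of `root`."""
    return "".join(root[int(ch) - 1] if ch.isdigit() else ch for ch in pattern)

def is_subsequence(needle, haystack):
    """True if the characters of `needle` appear in `haystack` in order, gaps allowed."""
    remaining = iter(haystack)
    return all(ch in remaining for ch in needle)

root = "ktb"
patterns = {"1a2a3a": "he wrote", "1i2ā3": "book", "ma12a3": "desk"}

for pattern, gloss in patterns.items():
    form = apply_pattern(root, pattern)
    contiguous = root in form                 # what an adjacent-merge tokenizer could capture
    scattered = is_subsequence(root, form)    # how the root is actually present
    print(f"{form:8} {gloss:10} contiguous={contiguous} scattered={scattered}")
```

In all three forms the root is present only as a scattered subsequence, never as a contiguous substring, so no sequence of adjacent-pair merges can produce a single token that stands for k-t-b across the whole word family.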


What the Frontier Looks Like

The research community is building solutions, and some of them are genuinely exciting:

MAGNET (NeurIPS 2024) replaces fixed BPE tokenizers with language-script-specific boundary predictors. Instead of learning from byte frequency, MAGNET learns where character boundaries actually are in each script, then uses that to guide merging. Reported 3× token reduction for Indic languages without vocabulary expansion.

EEVE (2024) enables zero-shot cross-lingual vocabulary transfer. A model trained on English can "learn" new vocabulary from a small amount of target-language data by transplanting token embeddings rather than retraining the full tokenizer. The approach allows models to extend their effective vocabulary without the cost of a full training run.

IndicTrans2 is the most pragmatic response: build a dedicated translation model for all 22 constitutionally scheduled Indian languages, trained on 230 million bitext pairs. The model doesn't pretend BPE works for Indic scripts — it bypasses the deficit by routing through translation as an explicit intermediate step rather than relying on cross-lingual transfer through a shared tokenizer.

Sovereign Indic AI (Sarvam, Krutrim, AI4Bharat, and the government-backed Bhashini initiative) is taking the most structural approach: build new foundation models from scratch on Indic-language-first corpora, with tokenizers trained on those languages rather than adapted to them. The vocabulary distributions look fundamentally different. The fertility ratios are dramatically better. The quality ceiling is higher.


What This Research Is Actually About

The technical framing is tokenization and fertility ratios and positional encoding depletion. But the actual subject is who gets to participate in the AI era on equal terms.

Kannada is spoken by 45 million people. Telugu by 80 million. Hindi by 600 million. These aren't niche languages or edge cases — they're primary languages for a significant fraction of the world's population. When LLMs systematically disadvantage these languages at the byte level, the effects compound: lower-quality outputs create lower trust, lower trust drives lower adoption, lower adoption means less training data for the next model generation, which further entrenches the disadvantage.

The path forward requires treating vocabulary diversity as a first-class design constraint rather than an afterthought. That means multilingual tokenizers trained with proportional corpus representation. It means fertility ratios as a tracked metric alongside perplexity and BLEU scores. And it means being honest that "our model supports 50 languages" doesn't mean those 50 languages experience equivalent quality — not until the infrastructure underneath the model reflects that commitment.


Stack and Benchmarks

  • Tokenization benchmarks: MILU, IndicMMLU-Pro, MMLU-ProX
  • Fertility analysis: custom Python tooling using tiktoken, sentencepiece, and transformers
  • Key sources: Petrov et al. (NeurIPS 2023), Ahia et al. (EMNLP 2023), AI4Bharat
  • Technologies analyzed: SentencePiece, Byte-Pair Encoding, LaBSE, mE5, IndicTrans2

Full Research Paper

The complete quantitative analysis, including fertility benchmarks across 17 languages, cost modeling methodology, and architectural recommendations: