← Projects
NLP · Indigenous Languages

QuechuaTok

Morphology-Aware Tokenization for Southern Quechua
NLPLow-Resource LanguagesTokenizationIndigenous AI
1.7971
Fertility Score (lower = better)
6.3%
Improvement over BPE 8k
8M+
Quechua speakers served
0%
OOV Rate (guaranteed)
01 · problem

Southern Quechua is spoken by approximately 8 million people across Peru, Bolivia, and Argentina — making it the most widely spoken indigenous language in the Americas. Yet standard BPE tokenizers fail it completely. A single Quechua word like mikhusqaykichikpim — meaning 'they say that all of you ate' — gets fragmented into morphologically meaningless pieces. High tokenizer fertility means language models see fewer words per context window, hide grammatical relationships, and degrade every downstream NLP task.

02 · approach

PRPE (Prefix-Root-Postfix Encoding) is a two-stage morphology-aware tokenization strategy. Stage 1: a rule-based morphological analyzer segments each Quechua word into its constituent morphemes — root, derivational suffixes, inflectional suffixes. Stage 2: BPE then operates within morpheme boundaries, never across them. The result: token splits that align with how Quechua grammar actually works. OOV rate is 0% by design — any unrecognized word falls back to character-level tokenization.

03 · results

PRPE achieves a fertility of 1.7971 — a 6.3% reduction over BPE 8k (1.9177), which uses double the vocabulary size. Unigram 4k performs worst at 2.3178. The gap between vocabulary scaling (BPE 4k → 8k = 0.315 points) is larger than the gap between BPE 8k and PRPE (0.120 points), suggesting morphological induction is complementary to vocabulary scaling — not a substitute.

ModelFertilityOOV%Δ vs PRPE
BPE 4k2.23310.0%+24.3%
BPE 8k1.91770.0%+6.7%
Unigram 4k2.31780.0%+29.0%
Unigram 8k1.95330.0%+8.7%
PRPE morphological1.79710.0%*
04 · impact

Southern Quechua is one of hundreds of agglutinative languages — Finnish, Turkish, Swahili, Nahuatl, Aymara — that share this problem. The NLP infrastructure stack was built for analytic languages, and morphologically rich languages pay the penalty in every downstream metric. QuechuaTok is a proof of concept: a small amount of linguistic knowledge outperforms a double-vocabulary statistical tokenizer. The approach is transferable to any agglutinative language.

“The infrastructure of indigenous language NLP doesn't need to wait for massive labeled datasets. It needs the right inductive biases.”
Read the full paper →View on GitHub →