Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages

Abstract

Large Language Models (LLMs) based on transformer architectures have revolutionized a variety of domains, with tokenization playing a pivotal role in their pre-processing and fine-tuning stages. In multilingual models, particularly those tailored for Indic languages, effective tokenization is crucial for optimizing performance. This paper presents a comprehensive evaluation of tokenizers used by 12 LLMs across all 22 official languages of India, with a focus on comparing the efficiency of their tokenization processes. We employed the Normalized Sequence Length (NSL) as a key metric in our analysis. Our findings reveal that the SUTRA tokenizer outperforms all other models, including several Indic-specific models, excelling in 14 languages. Notable insights include the SUTRA tokenizer's superior handling of Indic languages, GPT-4o's advancement over its predecessor GPT-4 in processing Indian languages, and the limited performance of Project Indus in certain languages. This study underscores the critical importance of developing targeted tokenization strategies for multilingual and Indic-centric models, laying the groundwork for future improvements in tokenizer design to enhance linguistic coverage and model efficiency.
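The abstract names NSL without defining it. As a rough illustration, the sketch below assumes the common definition: for each text sample, divide the number of tokens produced by the tokenizer under test by the number produced by a baseline tokenizer, then average the ratios over all samples. The function name `normalized_sequence_length` and the two toy tokenizers are hypothetical stand-ins chosen for this example; they are not the tokenizers or the exact procedure evaluated in the paper.

```python
from typing import Callable, List, Sequence

# A tokenizer is modeled here as any callable mapping a string to a token list.
Tokenizer = Callable[[str], List[str]]


def normalized_sequence_length(
    tokenizer: Tokenizer,
    baseline: Tokenizer,
    texts: Sequence[str],
) -> float:
    """Average per-sample ratio of tokenizer length to baseline length.

    Assumed definition: a value below 1.0 means the tokenizer under test
    produces fewer tokens than the baseline on these samples.
    """
    ratios = []
    for text in texts:
        baseline_length = len(baseline(text))
        if baseline_length == 0:
            # Skip samples the baseline cannot tokenize (e.g. empty strings).
            continue
        ratios.append(len(tokenizer(text)) / baseline_length)
    if not ratios:
        raise ValueError("no non-empty samples to score")
    return sum(ratios) / len(ratios)


# Toy stand-ins (hypothetical): a whitespace tokenizer as the tokenizer
# under test and a per-code-point tokenizer as the baseline.
def whitespace_tokenizer(text: str) -> List[str]:
    return text.split()


def character_tokenizer(text: str) -> List[str]:
    return [ch for ch in text if not ch.isspace()]


samples = ["नमस्ते दुनिया", "tokenizers matter"]
nsl = normalized_sequence_length(whitespace_tokenizer, character_tokenizer, samples)
print(f"NSL relative to the character baseline: {nsl:.3f}")
```

In practice the two callables would wrap real model tokenizers, and the samples would be drawn per language so that one NSL score can be reported for each tokenizer and language pair.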
