Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark

Abstract

Tokenization is a fundamental preprocessing step in NLP, directly impactinglarge language models' (LLMs) ability to capture syntactic, morphosyntactic,and semantic structures. This paper introduces a novel framework forsystematically evaluating tokenization strategies, addressing challenges inmorphologically rich and low-resource languages. Using a Turkish dataset of6,200 multiple-choice questions from the Massive Multitask LanguageUnderstanding (MMLU) benchmark, the framework assesses tokenizers across fivekey metrics: vocabulary size, token count, processing time, language-specifictoken percentages (\%TR), and token purity. These metrics provide a structuredapproach to evaluating how well tokenizers preserve linguistic structures.While \%TR measures the proportion of valid words in the target language,\%Pure assesses the alignment of tokens with meaningful linguistic units, suchas roots and valid morphemes, minimizing semantic fragmentation. The findingsreveal that \%TR, introduced as a critical metric, exhibits a strongercorrelation with downstream performance (e.g., MMLU scores) than token purity,emphasizing its role in improving model accuracy. Additionally, larger modelparameters do not necessarily yield better tokenization quality or enhancedresults, highlighting the importance of tailored tokenization strategies thatprioritize linguistic alignment. This framework sets a new standard fordeveloping robust tokenization methods optimized for morphologically complexand low-resource languages. Future work will refine morphological analysis,explore domain-specific customizations, and conduct cross-linguisticevaluations to further enhance tokenization practices.

Quick Read (beta)

loading the full paper ...