Abstract
This paper introduces Thunder-Tok, a new Korean tokenizer designed to reduce token fertility without compromising model performance. Our approach uses a rule-based pre-tokenization method that aligns with the linguistic structure of the Korean language. We also create a seed vocabulary containing tokens that resemble linguistic units and employ a branching entropy-based selection algorithm. These techniques increase the average token length, thus lowering fertility while preserving linguistic information. Experimental results indicate that Thunder-Tok reduces fertility by approximately 10% (i.e., reduces the number of tokens by 10%, improving the inference speed by 10%) compared to BPE without compromising performance across various downstream tasks. These findings demonstrate that our linguistically informed approach is effective and practical for designing efficient tokenizers for language models.
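To make the fertility claim concrete, the following minimal sketch shows one way such a measurement could be computed. It assumes that fertility means the average number of tokens per whitespace-delimited word, which is a common convention but is not spelled out in the abstract. The names `fertility`, `relative_reduction`, `char_tokenizer`, and `chunk_tokenizer` are hypothetical and exist only for illustration; the two toy tokenizers are stand-ins, not BPE or Thunder-Tok.

```python
from typing import Callable, List


def fertility(texts: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word (assumed definition)."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    if n_words == 0:
        raise ValueError("corpus contains no words")
    return n_tokens / n_words


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` lowers fertility relative to `baseline`."""
    return 1.0 - candidate / baseline


# Hypothetical toy tokenizers, used only to exercise the metric.
def char_tokenizer(text: str) -> List[str]:
    # One token per non-space character.
    return [c for c in text if not c.isspace()]


def chunk_tokenizer(text: str, size: int = 2) -> List[str]:
    # Splits each word into pieces of at most `size` characters.
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[i:i + size] for i in range(0, len(word), size))
    return tokens


corpus = ["한국어 토크나이저"]
f_char = fertility(corpus, char_tokenizer)
f_chunk = fertility(corpus, chunk_tokenizer)
print(f_char, f_chunk, relative_reduction(f_char, f_chunk))
```

Under this definition, longer average tokens mean fewer tokens per word, so a tokenizer that lowers fertility by about 10% emits about 10% fewer tokens for the same text.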