Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models

Abstract

This paper introduces Thunder-Tok, a new Korean tokenizer designed to reduce token fertility without compromising model performance. Our approach uses a rule-based pre-tokenization method that aligns with the linguistic structure of the Korean language. We also create a seed vocabulary containing tokens that resemble linguistic units and employ a branching entropy-based selection algorithm. These techniques increase the average token length, thus lowering fertility while preserving linguistic information. Experimental results indicate that Thunder-Tok reduces fertility by approximately 10% (i.e., reduces the number of tokens by 10%, improving the inference speed by 10%) compared to BPE without compromising performance across various downstream tasks. These findings demonstrate that our linguistically informed approach is effective and practical for designing efficient tokenizers for language models.
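
To make the fertility metric concrete, the following is a minimal sketch that computes tokens per word and the relative reduction between two tokenizers. It assumes that a "word" is a whitespace-delimited unit, which the abstract does not specify. The two toy tokenizers (`char_tokenize` and `chunk_tokenize`) and the helper names are hypothetical stand-ins for illustration only; they are not Thunder-Tok or BPE.

```python
from typing import Callable, List


def char_tokenize(text: str) -> List[str]:
    """Toy baseline (hypothetical): one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def chunk_tokenize(text: str, size: int = 2) -> List[str]:
    """Toy tokenizer (hypothetical): split each word into chunks of `size` characters."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[i:i + size] for i in range(0, len(word), size))
    return tokens


def fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word (assumed word unit)."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` fertility is lower than `baseline` fertility."""
    return 1.0 - candidate / baseline


sample = "나는 학교에 갔다"  # three whitespace-delimited words
base = fertility(sample, char_tokenize)   # 7 tokens over 3 words
cand = fertility(sample, chunk_tokenize)  # 4 tokens over 3 words
print(f"baseline fertility:  {base:.3f}")
print(f"candidate fertility: {cand:.3f}")
print(f"relative reduction:  {relative_reduction(base, cand):.1%}")
```

Because the number of words in a text is fixed, a lower fertility means proportionally fewer tokens for the same text, which is the quantity the abstract ties to inference cost.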
