SuperBPE: Space Travel for Language Models

Abstract

The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., "by the way"), crosslingual variation in the number of words needed to express a concept (e.g., "spacesuit helmet" in German is "raumanzughelm"), and languages that do not use whitespace at all (e.g., Chinese). To explore the potential of tokenization beyond subwords, we introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm to first learn subwords, then superwords that bridge whitespace. This brings dramatic improvements in encoding efficiency: when fixing the vocabulary size to 200k, SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. In experiments, we pretrain 8B transformer LMs from scratch while fixing the model size, vocabulary size, and train compute, varying *only* the algorithm for learning the vocabulary. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks (including +8.2% on MMLU), while simultaneously requiring 27% less compute at inference time. In analysis, we find that SuperBPE results in segmentations of text that are more uniform in per-token difficulty. Qualitatively, this may be because SuperBPE tokens often capture common multi-word expressions that function semantically as a single unit. SuperBPE is a straightforward, local modification to tokenization that improves both encoding efficiency and downstream performance, yielding better language models overall.
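The two-stage curriculum described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it works on characters rather than bytes, uses a simplified whitespace pretokenizer, and the names `pretokenize`, `learn_merges`, `train_two_stage`, and the `transition` parameter are all hypothetical, chosen only for illustration. Stage 1 learns merges inside whitespace-delimited chunks (ordinary BPE); stage 2 drops the chunk boundaries and keeps learning, so later merges may bridge whitespace.

```python
import re
from collections import Counter


def pretokenize(text):
    # Simplified whitespace pretokenization (an assumption of this sketch):
    # each chunk is a word with its leading space attached.
    return re.findall(r" ?\S+|\s+", text)


def merge_pair(symbols, pair):
    # Replace every adjacent occurrence of `pair` with the concatenated symbol.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_merges(chunks, num_merges):
    # Greedy BPE: repeatedly merge the most frequent adjacent pair.
    # Pairs are only counted inside a chunk, never across chunk boundaries.
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for chunk in chunks:
            counts.update(zip(chunk, chunk[1:]))
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges.append(best)
        chunks = [merge_pair(chunk, best) for chunk in chunks]
    return merges, chunks


def train_two_stage(text, num_merges, transition):
    # Stage 1: subword merges, restricted to whitespace-delimited chunks.
    chunks = [list(chunk) for chunk in pretokenize(text)]
    stage1, chunks = learn_merges(chunks, transition)
    # Stage 2: remove chunk boundaries so merges may span whitespace.
    flat = [[symbol for chunk in chunks for symbol in chunk]]
    stage2, flat = learn_merges(flat, num_merges - transition)
    return stage1 + stage2, flat[0]


text = "by the way by the way by the way"
_, bpe_tokens = train_two_stage(text, num_merges=9, transition=9)
_, superword_tokens = train_two_stage(text, num_merges=9, transition=8)
print(len(bpe_tokens), bpe_tokens)
print(len(superword_tokens), superword_tokens)
```

Setting `transition` equal to `num_merges` reduces the sketch to plain BPE with whitespace pretokenization, while a smaller `transition` reserves part of the same merge budget for superwords. The toy string is far too small to say anything about the efficiency figures reported in the abstract; it only shows the mechanism by which a token can come to span several words.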
