Byte Pair Encoding is Suboptimal for Language Model Pretraining

Abstract

The success of pretrained transformer language models (LMs) in natural language processing has led to a wide range of pretraining setups. In particular, these models employ a variety of subword tokenization methods, most notably byte-pair encoding (BPE) (Sennrich et al., 2016; Gage, 1994), the WordPiece method (Schuster and Nakajima, 2012), and unigram language modeling (Kudo, 2018), to segment text. However, to the best of our knowledge, the literature does not contain a direct evaluation of the impact of tokenization on language model pretraining. We analyze differences between BPE and unigram LM tokenization, finding that the latter method recovers subword units that align more closely with morphology and avoids problems stemming from BPE's greedy construction procedure. We then compare the fine-tuned task performance of identical transformer masked language models pretrained with these tokenizations. Across downstream tasks and two languages (English and Japanese), we find that the unigram LM tokenization method matches or outperforms BPE. We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
