Causal Estimation of Tokenisation Bias

Abstract

Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser -- which maps character-strings to subwords -- should not affect the probability assigned to the underlying character-string; in practice, it does. We define this mismatch as tokenisation bias. In this work, we quantify one particular type of tokenisation bias: the effect of including or not a subword (e.g., $\langle hello \rangle$) in a tokeniser's vocabulary on the probability a trained model assigns to the corresponding characters (i.e., \textit{``hello''}). Estimating this effect is challenging because each model is trained with only one tokeniser. We address this by framing tokenisation bias as a causal effect and estimating it using the regression discontinuity design. Specifically, we exploit the fact that tokenisation algorithms rank subwords and add the first $K$ to a tokeniser's vocabulary, where $K$ is an arbitrary cutoff point. As such, we can estimate a causal effect by comparing similar subwords around this cutoff. Experimentally, we find that tokenisation consistently affects models' outputs across scales, vocabularies, and tokenisers. Notably, a subword's presence in a small model's vocabulary may increase its characters' probability by up to 17 times, highlighting tokenisation as a key design choice in language modelling.
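
To make the regression discontinuity idea concrete, the following is a minimal sketch, not the estimator used in the paper. It assumes each candidate subword has a rank and an outcome (taken here to be the log-probability a model assigns to the subword's characters), fits a separate line on each side of the cutoff $K$ within a bandwidth, and reports the jump between the two fitted lines at the cutoff. The function names, the bandwidth, and the data are all hypothetical; the data is synthetic and noise-free, generated with a known jump only so that the sketch can be checked.

```python
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rdd_estimate(
    ranks: Sequence[int],
    outcomes: Sequence[float],
    cutoff: int,
    bandwidth: int,
) -> float:
    """Local-linear regression discontinuity estimate at the vocabulary cutoff.

    Subwords with rank <= cutoff are treated as in-vocabulary. Ranks are
    centred halfway between `cutoff` and `cutoff + 1`, so each fitted
    intercept is that side's predicted outcome at the threshold.
    """
    threshold = cutoff + 0.5
    in_vocab = [
        (r - threshold, y)
        for r, y in zip(ranks, outcomes)
        if cutoff - bandwidth < r <= cutoff
    ]
    out_vocab = [
        (r - threshold, y)
        for r, y in zip(ranks, outcomes)
        if cutoff < r <= cutoff + bandwidth
    ]
    in_x, in_y = zip(*in_vocab)
    out_x, out_y = zip(*out_vocab)
    in_intercept, _ = fit_line(in_x, in_y)
    out_intercept, _ = fit_line(out_x, out_y)
    # The estimated effect is the jump between the two sides at the cutoff.
    return in_intercept - out_intercept


# Synthetic illustration only (not data from the paper): the outcome drifts
# linearly with rank and jumps by TRUE_EFFECT for in-vocabulary subwords.
K = 100
TRUE_EFFECT = 2.0
ranks: List[int] = list(range(1, 201))
outcomes: List[float] = [
    -0.01 * r + (TRUE_EFFECT if r <= K else 0.0) for r in ranks
]

estimate = rdd_estimate(ranks, outcomes, cutoff=K, bandwidth=50)
print(f"estimated jump in outcome at the cutoff: {estimate:.3f}")
```

If the outcome is a log-probability, exponentiating the estimated jump gives the multiplicative change in probability. Fitting each side separately lets the outcome trend with rank without that trend being mistaken for an effect of vocabulary membership; a real analysis would also need noise-aware bandwidth selection and uncertainty estimates, which this sketch omits.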
