Abstract
Tasks that require character-level reasoning, such as counting or locating characters within words, remain challenging for contemporary language models. A common conjecture is that language models' reliance on subword units, rather than characters, contributes to their struggles with character-level tasks, yet recent studies offer conflicting conclusions about the role of tokenization, leaving its impact unclear. To address this gap, we introduce CharBench, a comprehensive benchmark of character-level tasks that is two orders of magnitude larger than existing alternatives. We evaluate a diverse range of leading open-weight and proprietary models on CharBench and find that it presents a significant challenge to modern LLMs, with an average accuracy of 43.6% and 32.3% on some tasks. We present an in-depth analysis of how intrinsic properties of words and their segmentations into tokens correspond to model performance. For counting tasks, we find that tokenization properties are weakly correlated with correctness, while the length of the queried word and the actual character count play a more significant part. In contrast, for tasks requiring intra-word positional understanding, performance is negatively correlated with the length of the token containing the queried character, suggesting that longer tokens obscure character position information for LLMs. We encourage future work to build on the benchmark and evaluation methodology introduced here as tools for improving model performance on such tasks.
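
To make the quantities named above concrete, the following is a minimal sketch, not CharBench's actual code. It computes the ground truth for a counting query and a positional query, and the length of the token containing the queried character. The function names, the choice of "first occurrence" as the positional target, and the example segmentation are all assumptions made for illustration; a real tokenizer may split the word differently.

```python
def count_char(word: str, ch: str) -> int:
    """Ground truth for a counting query: occurrences of ch in word."""
    return word.count(ch)


def first_index(word: str, ch: str) -> int:
    """Ground truth for a positional query (assumed here to be the
    0-based index of the first occurrence; -1 if ch is absent)."""
    return word.find(ch)


def containing_token_length(tokens: list[str], position: int) -> int:
    """Length of the token that covers the character at `position`,
    assuming the tokens concatenate exactly to the original word."""
    offset = 0
    for tok in tokens:
        if offset <= position < offset + len(tok):
            return len(tok)
        offset += len(tok)
    raise IndexError("position is outside the tokenized word")


# Hypothetical segmentation, chosen only for illustration.
word = "strawberry"
tokens = ["str", "aw", "berry"]

n_r = count_char(word, "r")                         # counting task
pos_b = first_index(word, "b")                      # positional task
tok_len = containing_token_length(tokens, pos_b)    # token covering 'b'
```

Under this segmentation the queried character "b" falls inside the longest token, which is the kind of case the positional analysis associates with lower accuracy.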