From Bytes to Ideas: Language Modeling with Autoregressive U-Nets

Abstract

Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and leave the model stuck with that choice. We relax this rigidity by introducing an autoregressive U-Net that learns to embed its own tokens as it trains. The network reads raw bytes, pools them into words, then pairs of words, then up to 4 words, giving it a multi-scale view of the sequence. At deeper stages, the model must predict further into the future, anticipating the next few words rather than the next byte, so deeper stages focus on broader semantic patterns while earlier stages handle fine details. With careful tuning and controlled pretraining compute, shallow hierarchies tie strong BPE baselines, and deeper hierarchies show a promising trend. Because tokenization now lives inside the model, the same system can handle character-level tasks and carry knowledge across low-resource languages.
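
To make the multi-scale pooling concrete, the following is a minimal sketch of how pooling positions for the three stages (words, pairs of words, groups of 4 words) might be derived from a raw byte sequence. It is an illustration only, not the paper's implementation: the whitespace-based word boundary rule, the choice to pool at the last byte of each group, the dropping of incomplete trailing groups, and the function names `word_end_positions` and `pooling_stages` are all assumptions made for this example.

```python
WHITESPACE = b" \t\n"  # assumed word delimiters (not specified in the abstract)


def word_end_positions(data: bytes) -> list[int]:
    """Return the index of the last byte of each whitespace-delimited word."""
    ends = []
    for i, byte in enumerate(data):
        at_end = i + 1 == len(data)
        # A word ends where a non-space byte is followed by a space or the end.
        if byte not in WHITESPACE and (at_end or data[i + 1] in WHITESPACE):
            ends.append(i)
    return ends


def pooling_stages(data: bytes, group_sizes=(1, 2, 4)) -> dict[int, list[int]]:
    """Map each group size k to the byte positions that close every k-th word.

    Incomplete trailing groups are dropped (an assumption of this sketch).
    """
    ends = word_end_positions(data)
    return {k: ends[k - 1 :: k] for k in group_sizes}


stages = pooling_stages(b"the cat sat on the mat")
for k, positions in stages.items():
    print(f"groups of {k} word(s): pool at byte positions {positions}")
```

Each deeper stage keeps a subset of the positions of the stage above it, so the sequence gets shorter as the hierarchy gets deeper, which is what lets deeper stages operate on, and predict, larger spans of text.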
