Abstract
Autoregressive transformers are spectacular models for short sequences butscale poorly to long sequences such as high-resolution images, podcasts, code,or books. We proposed Megabyte, a multi-scale decoder architecture that enablesend-to-end differentiable modeling of sequences of over one million bytes.Megabyte segments sequences into patches and uses a local submodel withinpatches and a global model between patches. This enables sub-quadraticself-attention, much larger feedforward layers for the same compute, andimproved parallelism during decoding -- unlocking better performance at reducedcost for both training and generation. Extensive experiments show that Megabyteallows byte-level models to perform competitively with subword models on longcontext language modeling, achieve state-of-the-art density estimation onImageNet, and model audio from raw files. Together, these results establish theviability of tokenization-free autoregressive sequence modeling at scale.