Adaptive Length Image Tokenization via Recurrent Allocation

Abstract

Current vision systems typically assign fixed-length representations to images, regardless of the information content. This contrasts with human intelligence - and even large language models - which allocate varying representational capacities based on entropy, context and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction loss and FID metrics, demonstrating that token count aligns with image entropy, familiarity and downstream task requirements. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object / part discovery.
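
As a rough illustration of the recurrent rollout described above, the following toy sketch grows a set of 1D latent tokens from 32 to 256 while alternately updating the latents and the 2D image tokens. It is not the authors' implementation: the function names (`attend`, `recurrent_allocate`), the fixed step of 32 tokens per iteration, the order of operations within an iteration, and the parameter-free attention used in place of a learned encoder are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, context):
    # Hypothetical stand-in for a learned attention layer:
    # single-head dot-product attention with no trainable weights.
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ context


def recurrent_allocate(image_tokens, tokens_per_step=32, max_tokens=256, seed=0):
    """Return the latent token set after each rollout iteration.

    image_tokens: array of shape (num_patches, dim), the flattened 2D tokens.
    tokens_per_step is an assumed increment; the abstract only states the
    overall range of 32 to 256 tokens.
    """
    rng = np.random.default_rng(seed)
    dim = image_tokens.shape[-1]
    latents = np.empty((0, dim))
    rollouts = []
    while latents.shape[0] < max_tokens:
        # Increase representational capacity by appending new latent tokens.
        new_tokens = rng.standard_normal((tokens_per_step, dim))
        latents = np.concatenate([latents, new_tokens], axis=0)
        # Update all 1D latents (existing and new) from the 2D image tokens.
        latents = latents + attend(latents, image_tokens)
        # Refine the 2D image tokens using the current latents.
        image_tokens = image_tokens + attend(image_tokens, latents)
        rollouts.append(latents.copy())
    return rollouts


# Toy input: an 8x8 grid of patch tokens with 16 channels.
patches = np.random.default_rng(1).standard_normal((64, 16))
rollouts = recurrent_allocate(patches)
token_counts = [r.shape[0] for r in rollouts]
```

In a setup like this, a variable-length representation would be obtained by keeping the latents from whichever iteration first meets a chosen quality criterion, such as a reconstruction-loss threshold; that selection rule is likewise an assumption and is not shown here.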
