Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space

Abstract

We propose AdapTok, an adaptive, temporally causal video tokenizer that can flexibly allocate tokens to different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops the tail tokens of each block during training, and a block causal scorer that predicts the reconstruction quality of video frames under different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming adjusts token usage given the predicted scores. This design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments on video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.
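
To make the budgeted allocation step concrete, the following is a minimal sketch, not the implementation used in the paper. It assumes that each block may keep one of a few candidate token counts, that a scorer has already produced a predicted quality for each (block, count) pair with higher meaning better, and that the goal is to maximize the summed score subject to a total token budget. The abstract formulates this as an integer linear program; the sketch instead solves the same small selection problem exactly with dynamic programming so that it needs no solver. The names `allocate_tokens`, `scores`, `options`, and `budget` are hypothetical.

```python
from typing import List, Sequence


def allocate_tokens(scores: Sequence[Sequence[float]],
                    options: Sequence[int],
                    budget: int) -> List[int]:
    """Choose one token count per block, maximizing total predicted score
    while keeping the summed token count at or below `budget`.

    scores[b][k] is the (assumed) predicted quality of block b when it
    keeps options[k] tokens; higher is better.
    """
    neg_inf = float("-inf")
    n = len(scores)
    # best[b][t]: best total score over the first b blocks using exactly t tokens.
    best = [[neg_inf] * (budget + 1) for _ in range(n + 1)]
    # choice[b][t]: index into `options` chosen for block b-1 to reach best[b][t].
    choice = [[-1] * (budget + 1) for _ in range(n + 1)]
    best[0][0] = 0.0

    for b in range(n):
        for t in range(budget + 1):
            if best[b][t] == neg_inf:
                continue  # state not reachable
            for k, cost in enumerate(options):
                nt = t + cost
                if nt > budget:
                    continue  # would exceed the overall budget
                cand = best[b][t] + scores[b][k]
                if cand > best[b + 1][nt]:
                    best[b + 1][nt] = cand
                    choice[b + 1][nt] = k

    # Pick the best reachable total token count, then backtrack.
    t_end = max(range(budget + 1), key=lambda t: best[n][t])
    if best[n][t_end] == neg_inf:
        raise ValueError("budget too small for any allocation")

    alloc: List[int] = []
    t = t_end
    for b in range(n, 0, -1):
        k = choice[b][t]
        alloc.append(options[k])
        t -= options[k]
    alloc.reverse()
    return alloc


# Toy example: block 0 is easy (quality saturates early), block 1 is hard.
toy_scores = [[0.90, 0.92, 0.93],
              [0.30, 0.60, 0.90]]
toy_alloc = allocate_tokens(toy_scores, options=[1, 2, 4], budget=5)
print(toy_alloc)
```

In this toy setting the hard block receives most of the budget, which is the content-aware behavior the abstract describes. A real system would obtain `scores` from the learned scorer and might prefer a dedicated ILP solver for larger problems.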
