Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers

Abstract

Video diffusion transformers (vDiTs) have made impressive progress in text-to-video generation, but their high computational demands present major challenges for practical deployment. While existing acceleration methods reduce workload at various granularities, they often rely on heuristics, limiting their applicability. We introduce ASTRAEA, an automatic framework that searches for near-optimal configurations for vDiT-based video generation. At its core, ASTRAEA proposes a lightweight token selection mechanism and a memory-efficient, GPU-parallel sparse attention strategy, enabling linear reductions in execution time with minimal impact on generation quality. To determine optimal token reduction for different timesteps, we further design a search framework that leverages a classic evolutionary algorithm to automatically determine the distribution of the token budget effectively. Together, ASTRAEA achieves up to 2.4x inference speedup on a single GPU with great scalability (up to 13.2x speedup on 8 GPUs) while retaining better video quality compared to the state-of-the-art methods (<0.5% loss on the VBench score compared to the baseline vDiT models).
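
The idea of distributing a token budget across timesteps with an evolutionary algorithm can be illustrated with a toy sketch. This is not the ASTRAEA implementation: the function names (`repair`, `toy_quality`, `evolve`), the per-timestep `sensitivity` weights, and the square-root quality proxy are all hypothetical stand-ins. A real search would presumably score each candidate by running the vDiT and measuring video quality. Each candidate is a list of per-timestep keep ratios (fraction of tokens retained) whose mean must not exceed the budget.

```python
import random


def repair(ratios, budget):
    """Clip keep ratios to [0, 1], then scale down if the mean exceeds the budget."""
    clipped = [min(1.0, max(0.0, r)) for r in ratios]
    mean = sum(clipped) / len(clipped)
    if mean > budget:
        scale = budget / mean
        clipped = [r * scale for r in clipped]
    return clipped


def toy_quality(ratios, sensitivity):
    """Hypothetical quality proxy (an assumption, not a measured metric):
    keeping more tokens helps with diminishing returns, weighted per timestep."""
    return sum(w * r ** 0.5 for w, r in zip(sensitivity, ratios))


def evolve(num_steps, budget, sensitivity,
           pop_size=32, generations=200, sigma=0.05, seed=0):
    """Elitist evolutionary search over per-timestep keep ratios."""
    rng = random.Random(seed)
    # Seed the population with the uniform allocation as a baseline candidate.
    population = [[budget] * num_steps]
    while len(population) < pop_size:
        candidate = [rng.random() for _ in range(num_steps)]
        population.append(repair(candidate, budget))

    for _ in range(generations):
        # Selection: keep the top quarter unchanged (elitism).
        population.sort(key=lambda c: toy_quality(c, sensitivity), reverse=True)
        parents = population[: pop_size // 4]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover followed by Gaussian mutation.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            child = [g + rng.gauss(0.0, sigma) for g in child]
            children.append(repair(child, budget))
        population = parents + children

    return max(population, key=lambda c: toy_quality(c, sensitivity))


if __name__ == "__main__":
    # Assumed weights: earlier timesteps are treated as more sensitive.
    weights = [4.0, 3.0, 2.0, 1.0]
    print(evolve(num_steps=4, budget=0.5, sensitivity=weights))
```

Because the top candidates survive each generation and the uniform allocation is in the initial population, the returned schedule never scores below the uniform baseline under the proxy. The `repair` step keeps every candidate within the overall budget.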
