Abstract
Diffusion generative models have become the standard for producing high-quality, coherent video content, yet their slow inference speeds and high computational demands hinder practical deployment. Although both quantization and sparsity can independently accelerate inference while maintaining generation quality, naively combining these techniques in existing training-free approaches leads to significant performance degradation due to the lack of joint optimization. We introduce FPSAttention, a novel training-aware co-design of FP8 quantization and sparsity for video generation, with a focus on the 3D bi-directional attention mechanism. Our approach features three key innovations: 1) A unified 3D tile-wise granularity that simultaneously supports both quantization and sparsity; 2) A denoising step-aware strategy that adapts to the noise schedule, addressing the strong correlation between quantization/sparsity errors and denoising steps; 3) A native, hardware-friendly kernel that leverages FlashAttention and is implemented with optimized Hopper architecture features for highly efficient execution. Trained on Wan2.1's 1.3B and 14B models and evaluated on the VBench benchmark, FPSAttention achieves a 7.09x kernel speedup for attention operations and a 4.96x end-to-end speedup for video generation compared to the BF16 baseline at 720p resolution, without sacrificing generation quality.
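To make the idea of a shared 3D tile granularity concrete, the following is a minimal NumPy sketch, not the FPSAttention kernel. It assumes a token grid of shape (T, H, W, D), simulates FP8 E4M3 rounding by keeping four significant bits and ignoring subnormals, and uses a hypothetical local-window rule for the tile-level sparsity mask. The function names, tile size, and window radius are all illustrative assumptions.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def to_tiles(x, tile):
    """Reshape (T, H, W, D) into (nT, nH, nW, t*h*w, D) tiles."""
    T, H, W, D = x.shape
    t, h, w = tile
    assert T % t == 0 and H % h == 0 and W % w == 0
    x = x.reshape(T // t, t, H // h, h, W // w, w, D)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(T // t, H // h, W // w, t * h * w, D)


def round_to_e4m3_grid(v):
    """Crude E4M3 rounding: keep 4 significant bits, ignore subnormals."""
    m, e = np.frexp(v)  # v = m * 2**e with 0.5 <= |m| < 1
    return np.ldexp(np.round(m * 16.0) / 16.0, e)


def fake_quant_tiles(x, tile):
    """Scale each 3D tile into FP8 range, round, and scale back."""
    tiles = to_tiles(x, tile)
    amax = np.abs(tiles).max(axis=(-2, -1), keepdims=True)
    scale = np.maximum(amax, 1e-12) / FP8_E4M3_MAX  # one scale per tile
    q = np.clip(tiles / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return round_to_e4m3_grid(q) * scale, scale.squeeze((-2, -1))


def local_tile_mask(grid, radius):
    """Hypothetical sparsity rule: a query tile attends only to key tiles
    within `radius` (Chebyshev distance) on the same 3D tile grid."""
    axes = [np.arange(n) for n in grid]
    coords = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    dist = np.abs(coords[:, None, :] - coords[None, :, :]).max(axis=-1)
    return dist <= radius  # shape (n_tiles, n_tiles), True = keep


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 8, 16)).astype(np.float32)
deq, scales = fake_quant_tiles(x, tile=(2, 4, 4))
mask = local_tile_mask(scales.shape, radius=1)
print(scales.shape, mask.shape, mask.mean())
```

Because the quantization scales and the sparsity mask are indexed by the same tile grid, a kernel can skip a masked tile pair without ever loading its scale, which is the kind of alignment the unified granularity is meant to provide.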