Taming Diffusion Transformer for Real-Time Mobile Video Generation

Abstract

Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones, and real-time generation is even more challenging. In this work, we propose a series of novel optimizations to significantly accelerate video generation and enable real-time performance on mobile platforms. First, we employ a highly compressed variational autoencoder (VAE) to reduce the dimensionality of the input data without sacrificing visual quality. Second, we introduce a KD-guided, sensitivity-aware tri-level pruning strategy to shrink the model to a size suitable for mobile platforms while preserving critical performance characteristics. Third, we develop an adversarial step distillation technique tailored for DiT, which allows us to reduce the number of inference steps to four. Combined, these optimizations enable our model to achieve over 10 frames per second (FPS) generation on an iPhone 16 Pro Max, demonstrating the feasibility of real-time, high-quality video generation on mobile devices.
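
To make the real-time target concrete, the following is a minimal sketch of the latency arithmetic implied by generating at 10 FPS with four denoising steps. The function name, the 40-frame clip length, and the overhead term are illustrative assumptions, not values from the paper; only the 10 FPS and four-step figures come from the abstract.

```python
def per_step_budget_ms(num_frames: int, target_fps: float,
                       num_steps: int, overhead_ms: float = 0.0) -> float:
    """Upper bound on per-step latency (ms) so that generating a clip of
    `num_frames` frames keeps pace with `target_fps`.

    `overhead_ms` is a hypothetical lump sum for everything outside the
    denoising loop (e.g. text encoding and VAE decoding).
    """
    if num_frames <= 0 or target_fps <= 0 or num_steps <= 0:
        raise ValueError("num_frames, target_fps and num_steps must be positive")
    # Total wall-clock time available for the whole clip, in milliseconds.
    total_budget_ms = 1000.0 * num_frames / target_fps
    # Whatever remains after fixed overhead is shared equally by the steps.
    return (total_budget_ms - overhead_ms) / num_steps


# Hypothetical 40-frame clip at the abstract's 10 FPS with four steps.
budget = per_step_budget_ms(num_frames=40, target_fps=10.0, num_steps=4)
print(f"per-step budget: {budget:.0f} ms")
```

Under these assumed numbers, each of the four denoising passes has about one second of budget for the whole clip. This shows why cutting the step count and shrinking the model both matter: the budget scales inversely with the number of steps.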
