Turbo-VAED: Fast and Stable Transfer of Video-VAEs to Mobile Devices

Abstract

There is a growing demand for deploying large generative AI models on mobile devices. For recent popular video generative models, however, the Variational AutoEncoder (VAE) represents one of the major computational bottlenecks. Both large parameter sizes and mismatched kernels cause out-of-memory errors or extremely slow inference on mobile devices. To address this, we propose a low-cost solution that efficiently transfers widely used video VAEs to mobile devices. (1) We analyze redundancy in existing VAE architectures and derive empirical design insights. By integrating 3D depthwise separable convolutions into our model, we significantly reduce the number of parameters. (2) We observe that the upsampling techniques in mainstream video VAEs are poorly suited to mobile hardware and form the main bottleneck. In response, we propose a decoupled 3D pixel shuffle scheme that slashes end-to-end delay. Building upon these, we develop a universal mobile-oriented VAE decoder, Turbo-VAED. (3) We propose an efficient VAE decoder training method. Since only the decoder is used during deployment, we distill it to Turbo-VAED instead of retraining the full VAE, enabling fast mobile adaptation with minimal performance loss. To our knowledge, our method enables real-time 720p video VAE decoding on mobile devices for the first time. This approach is widely applicable to most video VAEs. When integrated into four representative models, with training cost as low as $95, it accelerates original VAEs by up to 84.5x at 720p resolution on GPUs, uses as little as 17.5% of the original parameter count, and retains 96.9% of the original reconstruction quality. Compared to mobile-optimized VAEs, Turbo-VAED achieves a 2.9x speedup in FPS and better reconstruction quality on the iPhone 16 Pro. The code and models will soon be available at https://github.com/hustvl/Turbo-VAED.
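
As a rough illustration of why 3D depthwise separable convolutions reduce parameter count, the sketch below compares a standard 3D convolution with its depthwise-plus-pointwise factorization for a single layer. The channel width (128) and kernel size (3) are assumptions chosen for illustration and are not taken from the paper. The resulting per-layer ratio is not the 17.5% whole-model figure quoted in the abstract.

```python
def conv3d_params(c_in, c_out, k, bias=False):
    """Parameter count of a standard 3D convolution with a k x k x k kernel."""
    return c_in * c_out * k ** 3 + (c_out if bias else 0)


def depthwise_separable_conv3d_params(c_in, c_out, k, bias=False):
    """Parameter count of a depthwise k x k x k conv followed by a 1 x 1 x 1 pointwise conv."""
    # Depthwise stage: one k x k x k filter per input channel.
    depthwise = c_in * k ** 3 + (c_in if bias else 0)
    # Pointwise stage: 1 x 1 x 1 convolution that mixes channels.
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer shape, chosen only for illustration.
standard = conv3d_params(128, 128, 3)
separable = depthwise_separable_conv3d_params(128, 128, 3)
ratio = separable / standard

print(f"standard:  {standard} parameters")
print(f"separable: {separable} parameters")
print(f"ratio:     {ratio:.3f}")
```

For this assumed layer shape, the factorized form needs under 5% of the parameters of the standard convolution. The saving grows with kernel size and with the number of output channels.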
