Boosting Camera Motion Control for Video Diffusion Transformers

Abstract

Recent advancements in diffusion models have significantly enhanced the quality of video generation. However, fine-grained control over camera pose remains a challenge. While U-Net-based models have shown promising results for camera control, transformer-based diffusion models (DiT), the preferred architecture for large-scale video generation, suffer from severe degradation in camera motion accuracy. In this paper, we investigate the underlying causes of this issue and propose solutions tailored to DiT architectures. Our study reveals that camera control performance depends heavily on the choice of conditioning method rather than on the camera pose representation, contrary to common belief. To address the persistent motion degradation in DiT, we introduce Camera Motion Guidance (CMG), based on classifier-free guidance, which boosts camera control by over 400%. Additionally, we present a sparse camera control pipeline, significantly simplifying the process of specifying camera poses for long videos. Our method applies universally to both U-Net and DiT models, offering improved camera control for video generation tasks.
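
Since CMG is described as being based on classifier-free guidance, the general idea can be illustrated with the standard classifier-free guidance combination applied to a camera condition. This is only a minimal sketch under stated assumptions: the function name, argument names, and the single guidance scale are hypothetical, and the exact CMG formulation is not given in this abstract and may differ.

```python
import numpy as np


def guided_prediction(pred_with_camera, pred_without_camera, guidance_scale):
    """Combine two denoiser outputs with the standard classifier-free guidance rule.

    Hypothetical helper for illustration only; not taken from the paper.
    pred_with_camera:    model output when conditioned on the camera poses
    pred_without_camera: model output with the camera condition dropped
    guidance_scale:      0 ignores the camera condition, 1 reproduces the
                         conditioned output, values above 1 extrapolate
                         further along the camera-conditioned direction
    """
    pred_with_camera = np.asarray(pred_with_camera, dtype=float)
    pred_without_camera = np.asarray(pred_without_camera, dtype=float)
    return pred_without_camera + guidance_scale * (pred_with_camera - pred_without_camera)


# Toy arrays standing in for the model's noise predictions.
pred_cam = np.array([1.0, 2.0])
pred_null = np.array([0.5, 1.0])
boosted = guided_prediction(pred_cam, pred_null, guidance_scale=2.0)
```

In this sketch, a scale above 1 amplifies the difference that the camera condition makes to the prediction, which is the general mechanism by which guidance strengthens a conditioning signal.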
