CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation

Abstract

In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to give users controllability comparable to that of professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware conditional signals by positioning object bounding boxes and defining camera movements within the 3D space. In the second stage, these control signals, comprising rendered depth maps, camera trajectories, and object class labels, serve as guidance for a text-to-video diffusion model, ensuring that it generates the user-intended video content. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and achieves prominent 3D-aware text-to-video generation. Project page: https://cinemaster-dev.github.io/.
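
To make the first-stage control signals more concrete, the following is a minimal sketch of how a 3D bounding box and a camera trajectory might be turned into per-frame 2D footprints and depths. It is an illustration only, not the CineMaster renderer: it assumes an axis-aligned box, a pinhole camera with no rotation looking down +z, and hypothetical names (`box_corners`, `project`, `box_depth_footprint`) and default intrinsics that do not come from the paper.

```python
from itertools import product


def box_corners(center, size):
    """Return the 8 corners of an axis-aligned 3D box (assumed box model)."""
    cx, cy, cz = center
    sx, sy, sz = size
    return [
        (cx + dx * sx / 2, cy + dy * sy / 2, cz + dz * sz / 2)
        for dx, dy, dz in product((-1, 1), repeat=3)
    ]


def project(point, cam_pos, focal, principal):
    """Pinhole projection for a camera at cam_pos looking down +z, no rotation."""
    x, y, z = (p - c for p, c in zip(point, cam_pos))
    if z <= 0:
        return None  # point is behind the camera
    u = focal * x / z + principal[0]
    v = focal * y / z + principal[1]
    return u, v, z


def box_depth_footprint(center, size, cam_pos,
                        focal=500.0, principal=(320.0, 240.0)):
    """Return ((u_min, v_min, u_max, v_max), nearest_depth) of the projected box."""
    pts = [project(c, cam_pos, focal, principal)
           for c in box_corners(center, size)]
    pts = [p for p in pts if p is not None]
    if not pts:
        return None  # box is entirely behind the camera
    us, vs, zs = zip(*pts)
    return (min(us), min(vs), max(us), max(vs)), min(zs)


# Hypothetical camera trajectory: a dolly-in along +z toward a unit cube.
trajectory = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 2.0)]
frames = [
    box_depth_footprint((0.0, 0.0, 5.0), (1.0, 1.0, 1.0), cam)
    for cam in trajectory
]
```

As the camera moves toward the box, the nearest depth shrinks and the projected footprint grows, which is the kind of per-frame geometric cue a rendered depth map carries. A full renderer would also handle camera rotation, occlusion between boxes, and dense per-pixel depth.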
