DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation

Abstract

Recent advancements in generative models have provided promising solutions for synthesizing realistic driving videos, which are crucial for training autonomous driving perception models. However, existing approaches often struggle with multi-view video generation due to the challenges of integrating 3D information while maintaining spatial-temporal consistency and effectively learning from a unified model. In this paper, we propose an end-to-end framework named DriveScape for multi-view, 3D condition-guided video generation. DriveScape not only streamlines the process by integrating camera data to ensure comprehensive spatial-temporal coverage, but also introduces a Bi-Directional Modulated Transformer module to effectively align 3D road structural information. As a result, our approach enables precise control over video generation, significantly enhancing realism and providing a robust solution for generating multi-view driving videos. Our framework achieves state-of-the-art results on the nuScenes dataset, demonstrating impressive generative quality metrics with an FID score of 8.34 and an FVD score of 76.39, as well as superior performance across various perception tasks. This paves the way for more accurate environmental simulations in autonomous driving. Code will be available at our project homepage (https://metadrivescape.github.io/papers_project/drivescapev1/index.html).
