Abstract
This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experiences. Leveraging foundation models as priors, our approach overcomes the limitations of traditional methods and improves performance to deliver the high-fidelity generation that display devices require. The proposed system consists of two main steps: depth-based video splatting, which warps the input video and extracts an occlusion mask, and stereo video inpainting. We use pre-trained Stable Video Diffusion as the backbone and introduce a fine-tuning protocol for the stereo video inpainting task. To handle input videos of varying lengths and resolutions, we explore auto-regressive strategies and tiled processing. Finally, we develop a sophisticated data processing pipeline to reconstruct a large-scale, high-quality dataset to support our training. Our framework demonstrates significant improvements in 2D-to-3D video conversion, offering a practical solution for creating immersive content for 3D devices such as Apple Vision Pro and 3D displays. In summary, this work contributes to the field by presenting an effective method for generating high-quality stereoscopic videos from monocular input, potentially transforming how we experience digital media.
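To make the first step concrete, the following is a minimal single-frame sketch of depth-based splatting in NumPy. It is an illustration under stated assumptions, not the implementation used in this work: the function name `splat_right_view` and its arguments are hypothetical, depth is assumed to be already converted to a per-pixel horizontal disparity in pixels, and each pixel is splatted to its nearest target column rather than with a soft kernel. Where several source pixels land on the same target, the one with the largest disparity (the nearest surface) is kept; targets that receive no pixel form the occlusion mask that the inpainting step would fill.

```python
import numpy as np

def splat_right_view(frame, disparity):
    """Forward-warp one left-view frame to a right-eye view.

    frame:     (H, W) or (H, W, C) array, the input view (hypothetical layout).
    disparity: (H, W) array of non-negative horizontal shifts in pixels
               (assumed to be derived from an estimated depth map).
    Returns (warped, hole_mask); hole_mask is True where nothing landed.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    h, w = disparity.shape
    ys, xs = np.mgrid[0:h, 0:w]

    # In the right-eye view, content shifts left by its disparity.
    # floor(x + 0.5) rounds to the nearest column without half-to-even ties.
    xt = np.floor(xs - disparity + 0.5).astype(np.int64)
    valid = (xt >= 0) & (xt < w)

    # Z-buffer: for every target pixel, record the largest incoming disparity.
    best = np.full((h, w), -np.inf)
    np.maximum.at(best, (ys[valid], xt[valid]), disparity[valid])

    # A source pixel wins if it is the nearest surface at its target location.
    wins = valid & (disparity == best[ys, np.clip(xt, 0, w - 1)])

    warped = np.zeros_like(frame)
    warped[ys[wins], xt[wins]] = frame[wins]

    # Targets that no source pixel reached are disocclusions (holes).
    hole_mask = ~np.isfinite(best)
    return warped, hole_mask
```

Applied frame by frame to a video, a routine of this kind yields a warped video and a matching mask video. Nearest-pixel splatting is chosen here only for brevity.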