Abstract
We address the problem of generating videos from unposed internet photos. A handful of input images serve as keyframes, and our model interpolates between them to simulate a path moving between the cameras. Given random images, a model's ability to capture underlying geometry, recognize scene identity, and relate frames in terms of camera position and orientation reflects a fundamental understanding of 3D structure and scene layout. However, existing video models such as Luma Dream Machine fail at this task. We design a self-supervised method that takes advantage of the consistency of videos and variability of multiview internet photos to train a scalable, 3D-aware video model without any 3D annotations such as camera parameters. We validate that our method outperforms all baselines in terms of geometric and appearance consistency. We also show our model benefits applications that enable camera control, such as 3D Gaussian Splatting. Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.