Abstract
As virtual reality gains popularity, the demand for controllable creation of immersive and dynamic omnidirectional videos (ODVs) is increasing. While previous text-to-ODV generation methods achieve impressive results, they struggle with content inaccuracies and inconsistencies due to reliance solely on textual inputs. Although recent motion control techniques provide fine-grained control for video generation, directly applying these methods to ODVs often results in spatial distortion and unsatisfactory performance, especially with complex spherical motions. To tackle these challenges, we propose OmniDrag, the first approach enabling both scene- and object-level motion control for accurate, high-quality omnidirectional image-to-video generation. Building on pretrained video diffusion models, we introduce an omnidirectional control module, which is jointly fine-tuned with temporal attention layers to effectively handle complex spherical motion. In addition, we develop a novel spherical motion estimator that accurately extracts motion-control signals and allows users to perform drag-style ODV generation by simply drawing handle and target points. We also present a new dataset, named Move360, addressing the scarcity of ODV data with large scene and object motions. Experiments demonstrate the significant superiority of OmniDrag in achieving holistic scene-level and fine-grained object-level control for ODV generation. The project page is available at https://lwq20020127.github.io/OmniDrag.