Abstract
Recent models for video generation have achieved remarkable progress and are now deployed in film, social media production, and advertising. Beyond their creative potential, such models also hold promise as world simulators for robotics and embodied decision making. Despite strong advances, however, current approaches still struggle to generate physically plausible object interactions and lack physics-grounded control mechanisms. To address this limitation, we introduce KineMask, an approach for physics-guided video generation that enables realistic rigid body control, interactions, and effects. Given a single image and a specified object velocity, our method generates videos with inferred motions and future object interactions. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Using this strategy we train video diffusion models (VDMs) on synthetic scenes of simple interactions and demonstrate significant improvements of object interactions in real scenes. Furthermore, KineMask integrates low-level motion control with high-level textual conditioning via predictive scene descriptions, leading to effective support for synthesis of complex dynamical phenomena. Extensive experiments show that KineMask achieves strong improvements over recent models of comparable size. Ablation studies further highlight the complementary roles of low- and high-level conditioning in VDMs. Our code, model, and data will be made publicly available.