Abstract
Acquiring physically plausible motor skills across diverse and unconventional morphologies, including humanoid robots, quadrupeds, and animals, is essential for advancing character simulation and robotics. Traditional methods, such as reinforcement learning (RL), are task- and body-specific, require extensive reward function engineering, and do not generalize well. Imitation learning offers an alternative but relies heavily on high-quality expert demonstrations, which are difficult to obtain for non-human morphologies. Video diffusion models, on the other hand, are capable of generating realistic videos of various morphologies, from humans to ants. Leveraging this capability, we propose a data-independent approach for skill acquisition that learns 3D motor skills from generated 2D videos and generalizes to unconventional and non-human forms. Specifically, we guide the imitation learning process with vision transformers, comparing videos by calculating the pairwise distance between their embeddings. Along with this video-encoding distance, we also use a computed similarity between segmented video frames as a guidance reward. We validate our method on locomotion tasks involving unique body configurations. In humanoid robot locomotion tasks, we demonstrate that 'No-data Imitation Learning' (NIL) outperforms baselines trained on 3D motion-capture data. Our results highlight the potential of leveraging generative video models for physically plausible skill learning with diverse morphologies, effectively replacing data collection with data generation for imitation learning.
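
As a rough illustration of the two guidance signals described above, the following sketch combines a distance between video embeddings with a similarity between segmentation masks. It is not the authors' implementation. The abstract does not specify the distance metric, the mask similarity measure, or how the terms are weighted, so the L2 distance, the intersection-over-union (IoU) measure, and the names `embedding_distance`, `mask_iou`, `guidance_reward`, `w_video`, and `w_mask` are all assumptions made for illustration. The embeddings are assumed to come from some video encoder that is not shown here.

```python
import numpy as np


def embedding_distance(emb_a, emb_b):
    """L2 distance between two video embedding vectors (assumed metric)."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float(np.linalg.norm(a - b))


def mask_iou(mask_a, mask_b):
    """Intersection over union of two segmentation masks (assumed similarity)."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Both masks are empty, so treat them as identical.
        return 1.0
    return float(np.logical_and(a, b).sum() / union)


def guidance_reward(emb_sim, emb_ref, masks_sim, masks_ref,
                    w_video=1.0, w_mask=1.0):
    """Hypothetical reward: negative embedding distance plus mean per-frame IoU.

    emb_sim / emb_ref: embeddings of the simulated and generated videos.
    masks_sim / masks_ref: per-frame segmentation masks of the two videos.
    w_video / w_mask: illustrative weights, not values from the paper.
    """
    video_term = -embedding_distance(emb_sim, emb_ref)
    iou_term = float(np.mean([mask_iou(a, b)
                              for a, b in zip(masks_sim, masks_ref)]))
    return w_video * video_term + w_mask * iou_term
```

Under these assumptions, a simulated rollout whose embedding and masks match the generated video exactly receives the maximum reward of `w_mask`, and the reward decreases as the embeddings drift apart or the masks stop overlapping.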