NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models

Abstract

Acquiring physically plausible motor skills across diverse and unconventional morphologies, including humanoid robots, quadrupeds, and animals, is essential for advancing character simulation and robotics. Traditional methods, such as reinforcement learning (RL), are task- and body-specific, require extensive reward function engineering, and do not generalize well. Imitation learning offers an alternative but relies heavily on high-quality expert demonstrations, which are difficult to obtain for non-human morphologies. Video diffusion models, on the other hand, are capable of generating realistic videos of various morphologies, from humans to ants. Leveraging this capability, we propose a data-independent approach for skill acquisition that learns 3D motor skills from 2D-generated videos, with generalization capability to unconventional and non-human forms. Specifically, we guide the imitation learning process with vision transformers, comparing videos by calculating the pair-wise distance between video embeddings. Along with video-encoding distance, we also use a computed similarity between segmented video frames as a guidance reward. We validate our method on locomotion tasks involving unique body configurations. In humanoid robot locomotion tasks, we demonstrate that 'No-data Imitation Learning' (NIL) outperforms baselines trained on 3D motion-capture data. Our results highlight the potential of leveraging generative video models for physically plausible skill learning with diverse morphologies, effectively replacing data collection with data generation for imitation learning.
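
As a rough illustration of the two guidance signals described in the abstract, the following minimal sketch combines a distance between video embeddings with a similarity between segmentation masks. It is not the authors' implementation: the Euclidean distance, the use of intersection-over-union (IoU) as the mask similarity, the weights `w_video` and `w_mask`, and all function names are assumptions made for the example. The embeddings would in practice come from a vision transformer video encoder, and the masks from a segmentation model; here both are passed in as plain lists.

```python
import math


def embedding_distance(emb_a, emb_b):
    """Euclidean distance between two video embeddings (assumed metric)."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(emb_a, emb_b)))


def mask_iou(mask_a, mask_b):
    """IoU of two binary masks given as nested lists of 0/1 (assumed similarity)."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            intersection += 1 if (pixel_a and pixel_b) else 0
            union += 1 if (pixel_a or pixel_b) else 0
    # Two empty masks are treated as identical.
    return intersection / union if union else 1.0


def guidance_reward(emb_ref, emb_sim, masks_ref, masks_sim,
                    w_video=1.0, w_mask=1.0):
    """Hypothetical combined reward: closer embeddings and higher mask overlap
    both increase the reward. The weights are illustrative, not from the paper."""
    if not masks_ref or len(masks_ref) != len(masks_sim):
        raise ValueError("mask sequences must be non-empty and equally long")
    video_term = -embedding_distance(emb_ref, emb_sim)
    mask_term = sum(mask_iou(a, b) for a, b in zip(masks_ref, masks_sim)) / len(masks_ref)
    return w_video * video_term + w_mask * mask_term
```

In this sketch, a simulated rollout whose embedding matches the generated reference video and whose per-frame masks overlap perfectly receives the maximum reward of `w_mask`; any embedding mismatch or mask misalignment lowers it.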
