Abstract
Learning robot control policies from human videos is a promising direction for scaling up robot learning. However, how to extract action knowledge (or action representations) from videos for policy learning remains a key challenge. Existing action representations such as video frames, pixel flow, and pointcloud flow have inherent limitations such as modeling complexity or loss of information. In this paper, we propose to use an object-centric 3D motion field to represent actions for robot learning from human videos, and present a novel framework for extracting this representation from videos for zero-shot control. We introduce two novel components in its implementation. First, a novel training pipeline for training a "denoising" 3D motion field estimator to robustly extract fine object 3D motions from human videos with noisy depth. Second, a dense object-centric 3D motion field prediction architecture that favors both cross-embodiment transfer and policy generalization to background. We evaluate the system in real-world setups. Experiments show that our method reduces 3D motion estimation error by over 50% compared to the latest method, achieves a 55% average success rate in diverse tasks where prior approaches fail ($\lesssim 10$%), and can even acquire fine-grained manipulation skills like insertion.