Learning by Watching: Physical Imitation of Manipulation Skills from Human Videos

Abstract

We present an approach for physical imitation from human videos for robot manipulation tasks. The key idea of our method lies in explicitly exploiting the kinematics and motion information embedded in the video to learn structured representations that endow the robot with the ability to imagine how to perform manipulation tasks in its own context. To achieve this, we design a perception module that learns to translate human videos to the robot domain followed by unsupervised keypoint detection. The resulting keypoint-based representations provide semantically meaningful information that can be directly used for reward computing and policy learning. We evaluate the effectiveness of our approach on five robot manipulation tasks, including reaching, pushing, sliding, coffee making, and drawer closing. Detailed experimental evaluations demonstrate that our method performs favorably against previous approaches.
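
To make the idea of computing a reward from keypoint-based representations concrete, the following is a minimal sketch, not the method as specified in the paper. It assumes that keypoints are 2D image coordinates, that the keypoints of the robot's current observation and of the translated demonstration frame are already available and listed in matching order, and that the reward is the negative mean Euclidean distance between them. The function name `keypoint_reward` and the `Keypoint` type are hypothetical.

```python
import math
from typing import Sequence, Tuple

# Hypothetical type: one keypoint as (x, y) image coordinates.
Keypoint = Tuple[float, float]


def keypoint_reward(robot_kps: Sequence[Keypoint],
                    demo_kps: Sequence[Keypoint]) -> float:
    """Negative mean Euclidean distance between matched keypoints.

    Assumption: robot_kps[i] and demo_kps[i] refer to the same keypoint,
    detected in the current robot observation and in the translated
    demonstration frame respectively.
    """
    if len(robot_kps) != len(demo_kps):
        raise ValueError("keypoint sets must have the same length")
    if not robot_kps:
        raise ValueError("at least one keypoint is required")
    total = sum(math.dist(p, q) for p, q in zip(robot_kps, demo_kps))
    # Closer keypoints give a reward nearer to zero (the maximum).
    return -total / len(robot_kps)


if __name__ == "__main__":
    current = [(0.0, 0.0), (1.0, 1.0)]
    target = [(3.0, 4.0), (1.0, 1.0)]
    print(keypoint_reward(current, target))
```

Under these assumptions the reward reaches its maximum of zero when every robot keypoint coincides with its counterpart in the demonstration, so a policy that maximizes it is driven toward the configuration shown in the translated video.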
