Abstract
We study the problem of imitating object interactions from Internet videos. This requires understanding the hand-object interactions in 4D, spatially in 3D and over time, which is challenging due to mutual hand-object occlusions. In this paper we make two main contributions: (1) a novel reconstruction technique RHOV (Reconstructing Hands and Objects from Videos), which reconstructs 4D trajectories of both the hand and the object using 2D image cues and temporal smoothness constraints; (2) a system for imitating object interactions in a physics simulator with reinforcement learning. We apply our reconstruction technique to 100 challenging Internet videos. We further show that we can successfully imitate a range of different object interactions in a physics simulator. Our object-centric approach is not limited to human-like end-effectors and can learn to imitate object interactions using different embodiments, like a robotic arm with a parallel jaw gripper.