Look Ma, No Hands! Agent-Environment Factorization of Egocentric Videos

Abstract

The analysis and use of egocentric videos for robotic tasks is made challenging by occlusion due to the hand and the visual mismatch between the human hand and a robot end-effector. In this sense, the human hand presents a nuisance. However, hands often also provide a valuable signal, e.g., the hand pose may suggest what kind of object is being held. In this work, we propose to extract a factored representation of the scene that separates the agent (human hand) and the environment. This alleviates both occlusion and mismatch while preserving the signal, thereby easing the design of models for downstream robotics tasks. At the heart of this factorization is our proposed Video Inpainting via Diffusion Model (VIDM) that leverages both a prior on real-world images (through a large-scale pre-trained diffusion model) and the appearance of the object in earlier frames of the video (through attention). Our experiments demonstrate the effectiveness of VIDM at improving inpainting quality on egocentric videos and the power of our factored representation for numerous tasks: object detection, 3D reconstruction of manipulated objects, and learning of reward functions, policies, and affordances from videos.
