Pre-training Auto-regressive Robotic Models with 4D Representations

Abstract

Foundation models pre-trained on massive unlabeled datasets have revolutionized natural language and computer vision, exhibiting remarkable generalization capabilities, thus highlighting the importance of pre-training. Yet, efforts in robotics have struggled to achieve similar success, limited by either the need for costly robotic annotations or the lack of representations that effectively model the physical world. In this paper, we introduce ARM4R, an Auto-regressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better pre-trained robotic model. Specifically, we focus on utilizing 3D point tracking representations from videos derived by lifting 2D representations into 3D space via monocular depth estimation across time. These 4D representations maintain a shared geometric structure between the points and robot state representations up to a linear transformation, enabling efficient transfer learning from human video data to low-level robotic control. Our experiments show that ARM4R can transfer efficiently from human video data to robotics and consistently improves performance on tasks across various robot environments and configurations.
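
To make the lifting step concrete, the following is a minimal sketch of back-projecting 2D point tracks into 3D using per-point depth. It assumes a pinhole camera model with known intrinsics; the abstract does not specify the camera model, the point tracker, or the depth estimator, so the function name `lift_tracks_to_3d` and the parameters `fx`, `fy`, `cx`, `cy` are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np


def lift_tracks_to_3d(tracks_2d, depths, fx, fy, cx, cy):
    """Back-project 2D point tracks to 3D camera coordinates.

    Hypothetical helper, assuming a pinhole camera model.

    tracks_2d: array of shape (T, N, 2), pixel coordinates (u, v) of
               N tracked points over T frames.
    depths:    array of shape (T, N), depth of each tracked point in
               each frame (e.g. sampled from a monocular depth map).
    fx, fy:    focal lengths in pixels (assumed known).
    cx, cy:    principal point in pixels (assumed known).

    Returns an array of shape (T, N, 3): 3D points tracked over time.
    """
    tracks_2d = np.asarray(tracks_2d, dtype=float)
    depths = np.asarray(depths, dtype=float)

    u = tracks_2d[..., 0]
    v = tracks_2d[..., 1]

    # Standard pinhole back-projection: scale the normalized image
    # coordinates by the depth of each point.
    x = (u - cx) * depths / fx
    y = (v - cy) * depths / fy

    return np.stack([x, y, depths], axis=-1)
```

Stacking the result over the time axis gives 3D point trajectories, which is one way to read "4D" here: three spatial dimensions plus time.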
