Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets

Abstract

The pre-training of visual representations has enhanced the efficiency of robot learning. Due to the lack of large-scale in-domain robotic datasets, prior works utilize in-the-wild human videos to pre-train robotic visual representation. Despite their promising results, representations from human videos are inevitably subject to distribution shifts and lack the dynamics information crucial for task completion. We first evaluate various pre-trained representations in terms of their correlation to the downstream robotic manipulation tasks (i.e., manipulation centricity). Interestingly, we find that the "manipulation centricity" is a strong indicator of success rates when applied to downstream tasks. Drawing from these findings, we propose Manipulation Centric Representation (MCR), a foundation representation learning framework capturing both visual features and the dynamics information such as actions and proprioceptions of manipulation tasks to improve manipulation centricity. Specifically, we pre-train a visual encoder on the DROID robotic dataset and leverage motion-relevant data such as robot proprioceptive states and actions. We introduce a novel contrastive loss that aligns visual observations with the robot's proprioceptive state-action dynamics, combined with a behavior cloning (BC)-like actor loss to predict actions during pre-training, along with a time contrastive loss. Empirical results across 4 simulation domains with 20 tasks verify that MCR outperforms the strongest baseline method by 14.8%. Moreover, MCR boosts the performance of data-efficient learning with a UR5e arm on 3 real-world tasks by 76.9%. Project website: https://robots-pretrain-robots.github.io/.
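The three-part objective described above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption: both contrastive terms are written as a standard InfoNCE loss, the BC-like actor loss is written as a mean squared error, and the function names, the temperature, and the equal loss weights are hypothetical. The sketch uses only the Python standard library and operates on plain lists of embedding vectors; it is not the authors' implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: anchors[i] should match positives[i].

    All other rows of `positives` in the batch act as negatives.
    Assumption: the paper's contrastive terms may differ in detail.
    """
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def bc_loss(predicted_actions, expert_actions):
    """Mean squared error between predicted and demonstrated actions.

    Assumption: MSE stands in for the "BC-like actor loss".
    """
    count = sum(len(a) for a in expert_actions)
    squared_error = sum(
        (p - e) ** 2
        for pred, expert in zip(predicted_actions, expert_actions)
        for p, e in zip(pred, expert)
    )
    return squared_error / count


def combined_pretraining_loss(img_emb, dyn_emb, nearby_img_emb,
                              predicted_actions, expert_actions,
                              weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms (the weights are hypothetical).

    img_emb:        visual embeddings of the current frames
    dyn_emb:        embeddings of the matching state-action snippets
    nearby_img_emb: visual embeddings of temporally close frames
    """
    w_dyn, w_bc, w_time = weights
    dynamics_term = info_nce(img_emb, dyn_emb)      # image <-> dynamics alignment
    actor_term = bc_loss(predicted_actions, expert_actions)
    time_term = info_nce(img_emb, nearby_img_emb)   # temporal consistency
    return w_dyn * dynamics_term + w_bc * actor_term + w_time * time_term


if __name__ == "__main__":
    # Toy batch of two samples with 2-D embeddings and 1-D actions.
    images = [[1.0, 0.0], [0.0, 1.0]]
    dynamics = [[0.9, 0.1], [0.1, 0.9]]
    nearby = [[1.0, 0.1], [0.1, 1.0]]
    print(combined_pretraining_loss(images, dynamics, nearby,
                                    [[0.2], [0.4]], [[0.25], [0.35]]))
```

In this sketch the loss is small when each image embedding is closest to its own dynamics embedding and its own temporally nearby frame, and when predicted actions match the demonstrations. A practical version would compute the embeddings with trainable encoders in a tensor library and backpropagate through this sum.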
