Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers

Abstract

One of the roadblocks for training generalist robotic models today is heterogeneity. Previous robot learning methods often collect data to train with one specific embodiment for one task, which is expensive and prone to overfitting. This work studies the problem of learning policy representations through heterogeneous pre-training on robot data across different embodiments and tasks at scale. We propose Heterogeneous Pre-trained Transformers (HPT), which pre-train a large, shareable trunk of a policy neural network to learn a task- and embodiment-agnostic shared representation. This general architecture aligns the specific proprioception and vision inputs from distinct embodiments to a short sequence of tokens and then processes these tokens to map them to robot controls for different tasks. Leveraging recent large-scale multi-embodiment real-world robotic datasets as well as simulation, deployed robots, and human video datasets, we investigate pre-training policies across heterogeneity. We conduct experiments to investigate the scaling behaviors of training objectives, to the extent of 52 datasets. HPTs outperform several baselines and enhance the fine-tuned policy performance by over 20% on unseen tasks in multiple simulator benchmarks and real-world settings. See the project website (https://liruiw.github.io/hpt/) for code and videos.
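
The architecture described above (embodiment-specific inputs aligned to a short token sequence, a shared trunk, and task-specific outputs) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. The abstract does not state how inputs are aligned to tokens, so the use of learned query tokens with cross-attention is an assumption, as are all names (`Stem`, `Trunk`, `Head`, `policy`), the token width `D`, the token count `N_TOKENS`, and the two example embodiments.

```python
import math
import random

random.seed(0)

D = 8          # shared token width (assumed for illustration)
N_TOKENS = 4   # tokens emitted per modality (assumed for illustration)

def rand_matrix(rows, cols):
    """Small random weight matrix stored as a list of rows."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear(x, w):
    """Compute x @ w for a single vector x whose length equals the rows of w."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def attend(queries, keys_values):
    """Scaled dot-product attention; keys double as values for brevity."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(wt / total * kv[j] for wt, kv in zip(weights, keys_values))
                    for j in range(d)])
    return out

class Stem:
    """Embodiment-specific: maps any number of raw feature vectors to N_TOKENS tokens."""
    def __init__(self, in_dim):
        self.proj = rand_matrix(in_dim, D)
        self.queries = rand_matrix(N_TOKENS, D)  # hypothetical learned queries

    def __call__(self, features):
        projected = [linear(f, self.proj) for f in features]
        return attend(self.queries, projected)

class Trunk:
    """Shared across embodiments: one self-attention pass, a projection, a residual."""
    def __init__(self):
        self.w = rand_matrix(D, D)

    def __call__(self, tokens):
        mixed = [linear(m, self.w) for m in attend(tokens, tokens)]
        return [[t + m for t, m in zip(tok, mix)] for tok, mix in zip(tokens, mixed)]

class Head:
    """Task-specific: mean-pools trunk tokens and projects to an action vector."""
    def __init__(self, action_dim):
        self.proj = rand_matrix(D, action_dim)

    def __call__(self, tokens):
        pooled = [sum(col) / len(tokens) for col in zip(*tokens)]
        return linear(pooled, self.proj)

def policy(trunk, proprio_stem, vision_stem, head, proprio, vision):
    """Concatenate per-modality tokens, run the shared trunk, decode an action."""
    tokens = proprio_stem(proprio) + vision_stem(vision)
    return head(trunk(tokens))

trunk = Trunk()  # the only component shared between the two embodiments below

# Two hypothetical embodiments with different input sizes and action spaces.
arm = (Stem(7), Stem(16), Head(7))
mobile_base = (Stem(3), Stem(32), Head(2))

arm_action = policy(trunk, *arm,
                    proprio=[[0.1] * 7],
                    vision=[[0.5] * 16 for _ in range(10)])
base_action = policy(trunk, *mobile_base,
                     proprio=[[0.2] * 3, [0.3] * 3],
                     vision=[[0.5] * 32 for _ in range(5)])
```

In this sketch the two embodiments differ in proprioception size, number of vision features, and action dimension, yet both pass through the same `trunk` object. That is the property the abstract attributes to the shareable trunk; only the stems and heads would need to be created for a new embodiment or task.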
