Abstract
Modern machine learning systems rely on large datasets to attain broad generalization, and this often poses a challenge in robot learning, where each robotic platform and task might have only a small dataset. By training a single policy across many different kinds of robots, a robot learning method can leverage much broader and more diverse datasets, which in turn can lead to better generalization and robustness. However, training a single policy on multi-robot data is challenging because robots can have widely varying sensors, actuators, and control frequencies. We propose CrossFormer, a scalable and flexible transformer-based policy that can consume data from any embodiment. We train CrossFormer on the largest and most diverse dataset to date, 900K trajectories across 20 different robot embodiments. We demonstrate that the same network weights can control vastly different robots, including single and dual arm manipulation systems, wheeled robots, quadcopters, and quadrupeds. Unlike prior work, our model does not require manual alignment of the observation or action spaces. Extensive experiments in the real world show that our method matches the performance of specialist policies tailored for each embodiment, while also significantly outperforming the prior state of the art in cross-embodiment learning.