Abstract
Learning visuomotor policies via imitation has proven effective across a wide range of robotic domains. However, the performance of these policies is heavily dependent on the number of training demonstrations, which requires expensive data collection in the real world. In this work, we aim to reduce data collection efforts when learning visuomotor robot policies by leveraging existing or cost-effective data from a wide range of embodiments, such as public robot datasets and the datasets of humans playing with objects (human data from play). Our approach leverages two key insights. First, we use optic flow as an embodiment-agnostic action representation to train a World Model (WM) across multi-embodiment datasets, and finetune it on a small amount of robot data from the target embodiment. Second, we develop a method, Latent Policy Steering (LPS), to improve the output of a behavior-cloned policy by searching in the latent space of the WM for better action sequences. In real-world experiments, we observe significant improvements in the performance of policies trained with a small amount of data (over 50% relative improvement with 30 demonstrations and over 20% relative improvement with 50 demonstrations) by combining the policy with a WM pretrained on two thousand episodes sampled from the existing Open X-Embodiment dataset across different robots or a cost-effective human dataset from play.