Abstract
Large policies pretrained on diverse robot datasets have the potential to transform robotic learning: instead of training new policies from scratch, such generalist robot policies may be finetuned with only a little in-domain data, yet generalize broadly. However, to be widely applicable across a range of robotic learning scenarios, environments, and tasks, such policies need to handle diverse sensors and action spaces, accommodate a variety of commonly used robotic platforms, and finetune readily and efficiently to new domains. In this work, we aim to lay the groundwork for developing open-source, widely applicable, generalist policies for robotic manipulation. As a first step, we introduce Octo, a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset, the largest robot manipulation dataset to date. It can be instructed via language commands or goal images and can be effectively finetuned to robot setups with new sensory inputs and action spaces within a few hours on standard consumer GPUs. In experiments across 9 robotic platforms, we demonstrate that Octo serves as a versatile policy initialization that can be effectively finetuned to new observation and action spaces. We also perform detailed ablations of design decisions for the Octo model, from architecture to training data, to guide future research on building generalist robot models.