Abstract
Imitation learning is a powerful tool for training robot manipulation policies, allowing them to learn from expert demonstrations without manual programming or trial-and-error. However, common methods of data collection, such as human supervision, scale poorly, as they are time-consuming and labor-intensive. In contrast, Task and Motion Planning (TAMP) can autonomously generate large-scale datasets of diverse demonstrations. In this work, we show that the combination of large-scale datasets generated by TAMP supervisors and flexible Transformer models to fit them is a powerful paradigm for robot manipulation. To that end, we present a novel imitation learning system called OPTIMUS that trains large-scale visuomotor Transformer policies by imitating a TAMP agent. OPTIMUS introduces a pipeline for generating TAMP data that is specifically curated for imitation learning and can be used to train performant transformer-based policies. In this paper, we present a thorough study of the design decisions required to imitate TAMP and demonstrate that OPTIMUS can solve a wide variety of challenging vision-based manipulation tasks with over 70 different objects, ranging from long-horizon pick-and-place tasks, to shelf and articulated object manipulation, achieving 70 to 80% success rates. Video results at https://mihdalal.github.io/optimus/