Abstract
Developing robust and correctable visuomotor policies for robotic manipulation is challenging due to the lack of self-recovery mechanisms from failures and the limitations of simple language instructions in guiding robot actions. To address these issues, we propose a scalable data generation pipeline that automatically augments expert demonstrations with failure recovery trajectories and fine-grained language annotations for training. We then introduce Rich languAge-guided failure reCovERy (RACER), a supervisor-actor framework, which combines failure recovery data with rich language descriptions to enhance robot control. RACER features a vision-language model (VLM) that acts as an online supervisor, providing detailed language guidance for error correction and task execution, and a language-conditioned visuomotor policy as an actor to predict the next actions. Our experimental results show that RACER outperforms the state-of-the-art Robotic View Transformer (RVT) on RLBench across various evaluation settings, including standard long-horizon tasks, dynamic goal-change tasks and zero-shot unseen tasks, achieving superior performance in both simulated and real-world environments. Videos and code are available at: https://rich-language-failure-recovery.github.io.