Grounding Language with Visual Affordances over Unstructured Data

Abstract

Recent works have shown that Large Language Models (LLMs) can be applied to ground natural language to a wide variety of robot skills. However, in practice, learning multi-task, language-conditioned robotic skills typically requires large-scale data collection and frequent human intervention to reset the environment or help correct the current policies. In this work, we propose a novel approach to efficiently learn general-purpose language-conditioned robot skills from unstructured, offline and reset-free data in the real world by exploiting a self-supervised visuo-lingual affordance model, which requires annotating as little as 1% of the total data with language. We evaluate our method in extensive experiments on both simulated and real-world robotic tasks, achieving state-of-the-art performance on the challenging CALVIN benchmark and learning over 25 distinct visuomotor manipulation tasks with a single policy in the real world. We find that when paired with LLMs to break down abstract natural language instructions into subgoals via few-shot prompting, our method is capable of completing long-horizon, multi-tier tasks in the real world, while requiring an order of magnitude less data than previous approaches. Code and videos are available at http://hulc2.cs.uni-freiburg.de
