Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models

Abstract

In recent years, much progress has been made in learning robotic manipulation policies that follow natural language instructions. Such methods typically learn from corpora of robot-language data that was either collected with specific tasks in mind or expensively re-labelled by humans with rich language descriptions in hindsight. Recently, large-scale pretrained vision-language models (VLMs) like CLIP or ViLD have been applied to robotics for learning representations and scene descriptors. Can these pretrained models serve as automatic labelers for robot data, effectively importing Internet-scale knowledge into existing datasets to make them useful even for tasks that are not reflected in their ground truth annotations? To accomplish this, we introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL): we utilize semi-supervised language labels leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data and then train language-conditioned policies on the augmented datasets. This method enables cheaper acquisition of useful language descriptions compared to expensive human labels, allowing for more efficient label coverage of large-scale datasets. We apply DIAL to a challenging real-world robotic manipulation domain where 96.5% of the 80,000 demonstrations do not contain crowd-sourced language annotations. DIAL enables imitation learning policies to acquire new capabilities and generalize to 60 novel instructions unseen in the original dataset.
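
As a rough illustration of the label-propagation idea, the sketch below assigns each unlabelled demonstration the candidate instructions whose embeddings are most similar to the demonstration's embedding. It is a minimal sketch under stated assumptions, not the DIAL implementation: the toy vectors stand in for embeddings that a vision-language model such as CLIP might produce, and the names `cosine` and `augment_instructions`, the top-k selection rule, and the example instructions are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def augment_instructions(
    episode_embs: Dict[str, Sequence[float]],
    instruction_embs: Dict[str, Sequence[float]],
    top_k: int = 1,
) -> Dict[str, List[str]]:
    """Assign each episode its top_k most similar candidate instructions.

    Assumption: both dictionaries hold embeddings from the same (hypothetical)
    vision-language encoder, so that cosine similarity is meaningful.
    """
    labels: Dict[str, List[str]] = {}
    for ep_id, ep_vec in episode_embs.items():
        ranked = sorted(
            instruction_embs,
            key=lambda text: cosine(ep_vec, instruction_embs[text]),
            reverse=True,
        )
        labels[ep_id] = ranked[:top_k]
    return labels


# Toy stand-ins for encoder outputs; real embeddings would be much larger.
episodes = {"ep_0": [0.9, 0.1, 0.0], "ep_1": [0.0, 0.2, 0.9]}
instructions = {
    "pick up the can": [1.0, 0.0, 0.0],
    "open the drawer": [0.0, 0.0, 1.0],
    "move the sponge left": [0.0, 1.0, 0.0],
}

labels = augment_instructions(episodes, instructions, top_k=1)
print(labels)
```

The resulting instruction labels could then be paired with the corresponding demonstrations to form an augmented dataset for training a language-conditioned policy. How candidates are generated and how many labels are kept per demonstration are design choices that this sketch leaves open.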
