Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation

Abstract

We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction. In order to accomplish this, humans need easy and effective ways of specifying tasks to the robot. Goal images are one popular form of task specification, as they are already grounded in the robot's observation space. However, goal images also have a number of drawbacks: they are inconvenient for humans to provide, they can over-specify the desired behavior, leading to a sparse reward signal, or under-specify task information in the case of non-goal-reaching tasks. Natural language provides a convenient and flexible alternative for task specification, but comes with the challenge of grounding language in the robot's observation space. To scalably learn this grounding, we propose to leverage offline robot datasets (including highly sub-optimal, autonomously collected data) with crowd-sourced natural language labels. With this data, we learn a simple classifier which predicts whether a change in state completes a language instruction. This provides a language-conditioned reward function that can then be used for offline multi-task RL. In our experiments, we find that on language-conditioned manipulation tasks our approach outperforms both goal-image specifications and language-conditioned imitation techniques by more than 25%, and is able to perform visuomotor tasks from natural language, such as "open the right drawer" and "move the stapler", on a Franka Emika Panda robot.
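
To make the reward construction concrete, the following is a minimal sketch of how a classifier over state changes could be turned into a language-conditioned reward for relabeling offline data. It is an illustration under stated assumptions, not the model used in this work: the logistic form, the hand-set weights, the one-dimensional features, the 0.5 threshold, and the names `success_probability`, `language_reward`, and `relabel_trajectory` are all hypothetical. In practice the classifier would operate on images and text and be trained on the crowd-sourced labels.

```python
import numpy as np


def success_probability(weights, bias, initial_feat, current_feat, instr_feat):
    """Hypothetical classifier: probability that the change from the
    initial state to the current state completes the instruction.
    Assumption: a logistic model over concatenated feature vectors."""
    x = np.concatenate([initial_feat, current_feat, instr_feat])
    logit = float(weights @ x + bias)
    return 1.0 / (1.0 + np.exp(-logit))


def language_reward(weights, bias, initial_feat, current_feat, instr_feat,
                    threshold=0.5):
    """Binary reward: 1.0 if the classifier judges the instruction
    completed, else 0.0. The threshold is an assumed value."""
    p = success_probability(weights, bias, initial_feat, current_feat, instr_feat)
    return 1.0 if p > threshold else 0.0


def relabel_trajectory(weights, bias, state_feats, instr_feat):
    """Assign a reward to every state after the first in an offline
    trajectory, always comparing against the trajectory's initial state."""
    initial = state_feats[0]
    return [language_reward(weights, bias, initial, s, instr_feat)
            for s in state_feats[1:]]


# Toy data (assumed): one scalar state feature, e.g. how far a drawer is open,
# and a one-dimensional stand-in for an instruction embedding.
weights = np.array([-4.0, 4.0, 0.0])  # hand-set, not learned
bias = -2.0
instr = np.array([1.0])
trajectory = [np.array([0.0]), np.array([0.1]), np.array([0.2]), np.array([1.0])]

rewards = relabel_trajectory(weights, bias, trajectory, instr)
print(rewards)
```

With these toy values, only the final state, where the feature has changed substantially from its initial value, receives a nonzero reward. Rewards relabeled in this way could then be passed to an offline RL method together with the instruction.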
