Abstract
Natural language is perhaps the most versatile and intuitive way for humans to communicate tasks to a robot. Prior work on Learning from Play (LfP) [Lynch et al., 2019] provides a simple approach for learning a wide variety of robotic behaviors from general sensors. However, each task must be specified with a goal image, something that is not practical in open-world environments. In this work we present a simple and scalable way to condition policies on human language instead. We extend LfP by pairing short robot experiences from play with relevant human language after-the-fact. To make this efficient, we introduce multicontext imitation, which allows us to train a single agent to follow image or language goals, then use just language conditioning at test time. This reduces the cost of language pairing to less than 1% of collected robot experience, with the majority of control still learned via self-supervised imitation. At test time, a single agent trained in this manner can perform many different robotic manipulation skills in a row in a 3D environment, directly from images, and specified only with natural language (e.g. "open the drawer...now pick up the block...now press the green button..."). Finally, we introduce a simple technique that transfers knowledge from large unlabeled text corpora to robotic learning. We find that transfer significantly improves downstream robotic manipulation. It also allows our agent to follow thousands of novel instructions at test time in zero shot, in 16 different languages. See videos of our experiments at language-play.github.io.