CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-horizon Robot Manipulation Tasks

Abstract

General-purpose robots coexisting with humans in their environment must learn to relate human language to their perceptions and actions to be useful in a range of daily tasks. Moreover, they need to acquire a diverse repertoire of general-purpose skills that allow composing long-horizon tasks by following unconstrained language instructions. In this paper, we present CALVIN (Composing Actions from Language and Vision), an open-source simulated benchmark to learn long-horizon language-conditioned tasks. Our aim is to make it possible to develop agents that can solve many robotic manipulation tasks over a long horizon, from onboard sensors, and specified only via human language. CALVIN tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets, and the benchmark supports flexible specification of sensor suites. We evaluate the agents zero-shot on novel language instructions and on novel environments and objects. We show that a baseline model based on multi-context imitation learning performs poorly on CALVIN, suggesting that there is significant room for developing innovative agents that learn to relate human language to their world models with this benchmark.
