Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics

Abstract

Natural language is often the easiest and most convenient modality for humans to specify tasks for robots. However, learning to ground language to behavior typically requires impractical amounts of diverse, language-annotated demonstrations collected on each target robot. In this work, we aim to separate the problem of what to accomplish from how to accomplish it, as the former can benefit from substantial amounts of external observation-only data, and only the latter depends on a specific robot embodiment. To this end, we propose Video-Language Critic, a reward model that can be trained on readily available cross-embodiment data using contrastive learning and a temporal ranking objective, and use it to score behavior traces from a separate reinforcement learning actor. When trained on Open X-Embodiment data, our reward model enables 2x more sample-efficient policy training on Meta-World tasks than a sparse reward only, despite a significant domain gap. Using in-domain data but in a challenging task generalization setting on Meta-World, we further demonstrate more sample-efficient training than is possible with prior language-conditioned reward models that are either trained with binary classification, use static images, or do not leverage the temporal information present in video data.
