Abstract
Natural language is often the easiest and most convenient modality for humans to specify tasks for robots. However, learning to ground language to behavior typically requires impractical amounts of diverse, language-annotated demonstrations collected on each target robot. In this work, we aim to separate the problem of what to accomplish from how to accomplish it, as the former can benefit from substantial amounts of external observation-only data, and only the latter depends on a specific robot embodiment. To this end, we propose Video-Language Critic, a reward model that can be trained on readily available cross-embodiment data using contrastive learning and a temporal ranking objective, and use it to score behavior traces from a separate actor. When trained on Open X-Embodiment data, our reward model enables 2x more sample-efficient policy training on Meta-World tasks than a sparse reward only, despite a significant domain gap. Using in-domain data but in a challenging task generalization setting on Meta-World, we further demonstrate more sample-efficient training than is possible with prior language-conditioned reward models that are either trained with binary classification, use static images, or do not leverage the temporal information present in video data.