LIV: Language-Image Representations and Rewards for Robotic Control

  • 2023-06-01 18:52:23
  • Yecheng Jason Ma, William Liang, Vaidehi Som, Vikash Kumar, Amy Zhang, Osbert Bastani, Dinesh Jayaraman

Abstract

We present Language-Image Value learning (LIV), a unified objective for vision-language representation and reward learning from action-free videos with text annotations. Exploiting a novel connection between dual reinforcement learning and mutual information contrastive learning, the LIV objective trains a multi-modal representation that implicitly encodes a universal value function for tasks specified as language or image goals. We use LIV to pre-train the first control-centric vision-language representation from large human video datasets such as EpicKitchen. Given only a language or image goal, the pre-trained LIV model can assign dense rewards to each frame in videos of unseen robots or humans attempting that task in unseen environments. Further, when some target domain-specific data is available, the same objective can be used to fine-tune and improve LIV and even other pre-trained representations for robotic control and reward specification in that domain. In our experiments on several simulated and real-world robot environments, LIV models consistently outperform the best prior input state representations for imitation learning, as well as reward specification methods for policy synthesis. Our results validate the advantages of joint vision-language representation and reward learning within the unified, compact LIV framework.

 
