Abstract
We present Language-Image Value learning (LIV), a unified objective for vision-language representation and reward learning from action-free videos with text annotations. Exploiting a novel connection between dual reinforcement learning and mutual information contrastive learning, the LIV objective trains a multi-modal representation that implicitly encodes a universal value function for tasks specified as language or image goals. We use LIV to pre-train the first control-centric vision-language representation from large human video datasets such as EpicKitchen. Given only a language or image goal, the pre-trained LIV model can assign dense rewards to each frame in videos of unseen robots or humans attempting that task in unseen environments. Further, when some target domain-specific data is available, the same objective can be used to fine-tune and improve LIV and even other pre-trained representations for robotic control and reward specification in that domain. In our experiments on several simulated and real-world robot environments, LIV models consistently outperform the best prior input state representations for imitation learning, as well as reward specification methods for policy synthesis. Our results validate the advantages of joint vision-language representation and reward learning within the unified, compact LIV framework.