Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts

Abstract

For a general-purpose robot to operate in reality, executing a broad range of instructions across various environments is imperative. Central to the reinforcement learning and planning for such robotic agents is a generalizable reward function. Recent advances in vision-language models, such as CLIP, have shown remarkable performance in the domain of deep learning, paving the way for open-domain visual recognition. However, collecting data on robots executing various language instructions across multiple environments remains a challenge. This paper aims to transfer video-language models with robust generalization into a generalizable language-conditioned reward function, only utilizing robot video data from a minimal number of tasks in a single environment. Unlike common robotic datasets used for training reward functions, human video-language datasets rarely contain trivial failure videos. To enhance the model's ability to distinguish between successful and failed robot executions, we cluster failure video features to enable the model to identify patterns within. For each cluster, we integrate a newly trained failure prompt into the text encoder to represent the corresponding failure mode. Our language-conditioned reward function shows outstanding generalization to new environments and new instructions for robot planning and reinforcement learning.
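The clustering idea above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the clustering algorithm, the reward formula, or how the failure prompts are trained. The sketch assumes plain k-means over failure video features, uses each cluster centroid as a stand-in for a trained failure prompt embedding, and scores a video by a softmax over its cosine similarity to the task text versus the failure prompts. All function names, the `temperature` parameter, and the toy 2-D features are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(features, k)]
    assignments = [0] * len(features)
    for _ in range(iters):
        # Assign each failure feature to its nearest centroid.
        for i, f in enumerate(features):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(f, centroids[c])),
            )
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [features[i] for i, a in enumerate(assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


def reward(video_feat, task_text_feat, failure_prompt_feats, temperature=0.1):
    """Softmax probability that the video matches the task text
    rather than any failure prompt (hypothetical reward form)."""
    candidates = [task_text_feat] + list(failure_prompt_feats)
    logits = [cosine(video_feat, c) / temperature for c in candidates]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    return exps[0] / sum(exps)


# Toy failure features forming two failure modes (made-up numbers).
failure_feats = [[5.0, 0.0], [5.2, 0.1], [0.0, 5.0], [0.1, 5.3]]
# Centroids stand in for the trained failure prompts (assumption).
failure_prompts, modes = kmeans(failure_feats, k=2)

task_text = [-1.0, -1.0]  # hypothetical task instruction embedding
print(reward([-1.0, -0.9], task_text, failure_prompts))  # close to 1
print(reward([5.0, 0.05], task_text, failure_prompts))   # close to 0
```

In this toy setup, a video whose feature lies near the task text receives a reward near 1, while a video near either failure cluster receives a reward near 0, which is the separation between successful and failed executions that the failure prompts are meant to provide.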
