VIRAL: Vision-grounded Integration for Reward design And Learning

Abstract

The alignment between humans and machines is a critical challenge in artificial intelligence today. Reinforcement learning, which aims to maximize a reward function, is particularly vulnerable to the risks associated with poorly designed reward functions. Recent advances have shown that Large Language Models (LLMs) used for reward generation can outperform humans in this context. We introduce VIRAL, a pipeline for generating and refining reward functions through the use of multi-modal LLMs. VIRAL autonomously creates and interactively improves reward functions based on a given environment and a goal prompt or annotated image. The refinement process can incorporate human feedback or be guided by a description generated by a video LLM, which describes the agent's policy as observed in video. We evaluated VIRAL in five Gymnasium environments, demonstrating that it accelerates the learning of new behaviors while ensuring improved alignment with user intent. The source code and demo video are available at: https://github.com/VIRAL-UCBL1/VIRAL and https://youtu.be/Hqo82CxVT38.
