RLZero: Direct Policy Inference from Language Without In-Domain Supervision

Abstract

The reward hypothesis states that all goals and purposes can be understood as the maximization of a received scalar reward signal. However, in practice, defining such a reward signal is notoriously difficult, as humans are often unable to predict the optimal behavior corresponding to a reward function. Natural language offers an intuitive alternative for instructing reinforcement learning (RL) agents, yet previous language-conditioned approaches either require costly supervision or test-time training given a language instruction. In this work, we present a new approach that uses a pretrained RL agent trained using only unlabeled, offline interactions--without task-specific supervision or labeled trajectories--to get zero-shot test-time policy inference from arbitrary natural language instructions. We introduce a framework comprising three steps: imagine, project, and imitate. First, the agent imagines a sequence of observations corresponding to the provided language description using video generative models. Next, these imagined observations are projected into the target environment domain. Finally, an agent pretrained in the target environment with unsupervised RL instantly imitates the projected observation sequence through a closed-form solution. To the best of our knowledge, our method, RLZero, is the first approach to show direct language-to-behavior generation abilities on a variety of tasks and environments without any in-domain supervision. We further show that components of RLZero can be used to generate policies zero-shot from cross-embodied videos, such as those available on YouTube, even for complex embodiments like humanoids.
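
The project and imitate steps described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the abstract: frames are represented by fixed-size embedding vectors, projection is done by cosine-similarity nearest-neighbor retrieval against in-domain frames, and the closed-form imitation step is taken to be the normalized mean of a hypothetical per-state embedding. The function names `project` and `imitate` and all array shapes are hypothetical, and the imagined frames are simulated with noise rather than produced by a video generative model.

```python
import numpy as np


def project(imagined, dataset):
    """Return, for each imagined frame embedding, the index of the most
    similar in-domain frame embedding (cosine similarity)."""
    a = imagined / np.linalg.norm(imagined, axis=1, keepdims=True)
    b = dataset / np.linalg.norm(dataset, axis=1, keepdims=True)
    return np.argmax(a @ b.T, axis=1)


def imitate(projected_state_embeddings):
    """Illustrative closed form (an assumption): the task vector is the
    normalized mean of the state embeddings of the projected frames."""
    z = projected_state_embeddings.mean(axis=0)
    return z / np.linalg.norm(z)


rng = np.random.default_rng(0)

# Hypothetical embeddings of 100 observations from the offline dataset.
dataset_frames = rng.normal(size=(100, 16))
# Hypothetical agent-side embeddings of the same 100 observations.
state_embeddings = rng.normal(size=(100, 8))

# Stand-in for the "imagine" step: slightly perturbed copies of three
# dataset frames play the role of frames from a video generative model.
imagined_frames = dataset_frames[[3, 7, 42]] + 0.01 * rng.normal(size=(3, 16))

idx = project(imagined_frames, dataset_frames)  # indices of in-domain frames
z = imitate(state_embeddings[idx])              # vector to condition a policy on
```

In this toy setup no gradient step is taken at test time: the task vector `z` is computed directly from the retrieved frames, which is the sense of "closed-form" that the sketch is meant to convey.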
