Abstract
Upside Down Reinforcement Learning (UDRL) is a promising framework for solving reinforcement learning problems which focuses on learning command-conditioned policies. In this work, we extend UDRL to the task of learning a command-conditioned generator of deep neural network policies. We accomplish this using Hypernetworks, a variant of Fast Weight Programmers, which learn to decode input commands representing a desired expected return into command-specific weight matrices. Our method, dubbed Upside Down Reinforcement Learning with Policy Generators (UDRLPG), streamlines comparable techniques by removing the need for an evaluator or critic to update the weights of the generator. To counteract the increased variance in last returns caused by not having an evaluator, we decouple the sampling probability of the buffer from the absolute number of policies in it, which, together with a simple weighting strategy, improves the empirical convergence of the algorithm. Compared with existing algorithms, UDRLPG achieves competitive performance and high returns, sometimes outperforming more complex architectures. Our experiments show that a trained generator can generalize to create policies that achieve unseen returns zero-shot. The proposed method appears to be effective in mitigating some of the challenges associated with learning highly multimodal functions. Altogether, we believe that UDRLPG represents a promising step forward in achieving greater empirical sample efficiency in RL. A full implementation of UDRLPG is publicly available at https://github.com/JacopoD/udrlpg_
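
To make the idea of a command-conditioned policy generator concrete, the following is a minimal sketch, not the released implementation. It assumes a scalar, pre-scaled return command, a one-hidden-layer generator, and a linear policy. All names (`init_generator`, `generate_policy`, `act`) and all sizes are hypothetical. The generator is untrained, and the buffer sampling and weighting strategy are not shown.

```python
import math
import random


def init_generator(hidden, n_out, seed=0):
    """Randomly initialise a one-hidden-layer generator (hypothetical layout)."""
    rng = random.Random(seed)
    return {
        "w1": [rng.gauss(0.0, 1.0) for _ in range(hidden)],  # command -> hidden
        "b1": [0.0] * hidden,
        "w2": [[rng.gauss(0.0, 0.1) for _ in range(hidden)] for _ in range(n_out)],
        "b2": [0.0] * n_out,
    }


def generate_policy(gen, command, obs_dim, act_dim):
    """Decode a scalar return command into the weights of a linear policy."""
    h = [math.tanh(w * command + b) for w, b in zip(gen["w1"], gen["b1"])]
    flat = [sum(wi * hi for wi, hi in zip(row, h)) + b
            for row, b in zip(gen["w2"], gen["b2"])]
    # The first obs_dim * act_dim outputs form the weight matrix, the rest the bias.
    weights = [flat[i * obs_dim:(i + 1) * obs_dim] for i in range(act_dim)]
    bias = flat[act_dim * obs_dim:]
    return weights, bias


def act(policy, obs):
    """Return the greedy action of a generated linear policy."""
    weights, bias = policy
    logits = [sum(w * o for w, o in zip(row, obs)) + b
              for row, b in zip(weights, bias)]
    return max(range(len(logits)), key=logits.__getitem__)


OBS_DIM, ACT_DIM = 4, 2  # assumed sizes, chosen only for illustration
gen = init_generator(hidden=8, n_out=OBS_DIM * ACT_DIM + ACT_DIM)

# Different commands yield different policy weights from the same generator.
low = generate_policy(gen, command=0.1, obs_dim=OBS_DIM, act_dim=ACT_DIM)
high = generate_policy(gen, command=1.0, obs_dim=OBS_DIM, act_dim=ACT_DIM)
action = act(high, [0.1, 0.2, 0.3, 0.4])
```

In this sketch the command is the only input to the generator, so asking for a different return means decoding a different set of policy weights, rather than feeding the command to a single fixed policy.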