Learning Relative Return Policies With Upside-Down Reinforcement Learning

Abstract

Lately, there has been a resurgence of interest in using supervised learning to solve reinforcement learning problems. Recent work in this area has largely focused on learning command-conditioned policies. We investigate the potential of one such method -- upside-down reinforcement learning -- to work with commands that specify a desired relationship between some scalar value and the observed return. We show that upside-down reinforcement learning can learn to carry out such commands online in a tabular bandit setting and in CartPole with non-linear function approximation. By doing so, we demonstrate the power of this family of methods and open the way for their practical use under more complicated command structures.
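To make the idea of a relative return command concrete, the following is a minimal sketch of a tabular bandit learner that is conditioned on commands of the form (relation, threshold), such as "obtain a return greater than 1.5". It is an illustration under stated assumptions, not the procedure used in the paper: the function names, the two relations, the deterministic arm returns, the epsilon-greedy exploration, and the relabeling of each observed return under every command it fulfils are all hypothetical choices made here. The behavior table is fitted in a supervised fashion, simply by counting which arm was pulled when a command turned out to be satisfied.

```python
import random
from collections import defaultdict

# Hypothetical command vocabulary: the return should be greater/less than a threshold.
RELATIONS = ("greater", "less")


def satisfies(relation, threshold, ret):
    """True if the observed return stands in the commanded relation to the threshold."""
    return ret > threshold if relation == "greater" else ret < threshold


def greedy_arm(counts, command, n_arms):
    """Arm most often recorded as fulfilling this command (ties go to the lowest index)."""
    row = counts[command]
    return max(range(n_arms), key=row.__getitem__)


def act(counts, command, n_arms, epsilon, rng):
    """Explore at random with probability epsilon, or when nothing is known for the command."""
    if rng.random() < epsilon or sum(counts[command]) == 0:
        return rng.randrange(n_arms)
    return greedy_arm(counts, command, n_arms)


def train_bandit(arm_returns, thresholds, episodes=2000, epsilon=0.1, seed=0):
    """Online loop: sample a command, pull an arm, relabel the outcome in hindsight."""
    rng = random.Random(seed)
    n_arms = len(arm_returns)
    # counts[(relation, threshold)][arm] = times this arm fulfilled that command
    counts = defaultdict(lambda: [0] * n_arms)
    for _ in range(episodes):
        command = (rng.choice(RELATIONS), rng.choice(thresholds))
        arm = act(counts, command, n_arms, epsilon, rng)
        ret = arm_returns[arm]  # assumption: each arm pays a fixed return
        # Supervised update: store the arm under every command the return satisfies.
        for relation in RELATIONS:
            for threshold in thresholds:
                if satisfies(relation, threshold, ret):
                    counts[(relation, threshold)][arm] += 1
    return counts
```

For example, `counts = train_bandit([0.0, 1.0, 2.0], [0.5, 1.5])` followed by `greedy_arm(counts, ("greater", 1.5), 3)` queries the learned table for a command. Acting greedily on the counts is a simplification; sampling arms in proportion to the counts would be closer to treating the table as a learned conditional distribution over actions.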
