Abstract
Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL problems that does not require value functions and uses only supervised learning, where the targets for given inputs in a dataset do not change over time. Ghosh et al. proved that Goal-Conditional Supervised Learning (GCSL), which can be viewed as a simplified version of UDRL, optimizes a lower bound on goal-reaching performance. This raises expectations that such algorithms may enjoy guaranteed convergence to the optimal policy in arbitrary environments, similar to certain well-known traditional RL algorithms. Here we show that for a specific episodic UDRL algorithm (eUDRL, including GCSL), this is not the case, and give the causes of this limitation. To do so, we first introduce a helpful rewrite of eUDRL as a recursive policy update. This formulation helps to disprove its convergence to the optimal policy for a wide class of stochastic environments. Finally, we provide a concrete example of a very simple environment where eUDRL diverges. Since the primary aim of this paper is to present a negative result, and the best counterexamples are the simplest ones, we restrict all discussions to finite (discrete) environments, ignoring issues of function approximation and limited sample size.