Cliff Diving: Exploring Reward Surfaces in Reinforcement Learning Environments

Abstract

Visualizing optimization landscapes has led to many fundamental insights in numeric optimization, and novel improvements to optimization techniques. However, visualizations of the objective that reinforcement learning optimizes (the "reward surface") have only ever been generated for a small number of narrow contexts. This work presents reward surfaces and related visualizations of 27 of the most widely used reinforcement learning environments in Gym for the first time. We also explore reward surfaces in the policy gradient direction and show for the first time that many popular reinforcement learning environments have frequent "cliffs" (sudden large drops in expected return). We demonstrate that A2C often "dives off" these cliffs into low reward regions of the parameter space while PPO avoids them, confirming a popular intuition for PPO's improved performance over previous methods. We additionally introduce a highly extensible library that allows researchers to easily generate these visualizations in the future. Our findings provide new intuition to explain the successes and failures of modern RL methods, and our visualizations concretely characterize several failure modes of reinforcement learning agents in novel ways.
