The Courage to Stop: Overcoming Sunk Cost Fallacy in Deep Reinforcement Learning

Abstract

Off-policy deep reinforcement learning (RL) typically leverages replay buffers for reusing past experiences during learning. This can help improve sample efficiency when the collected data is informative and aligned with the learning objectives; when that is not the case, it can have the effect of "polluting" the replay buffer with data which can exacerbate optimization challenges in addition to wasting environment interactions due to wasteful sampling. We argue that sampling these uninformative and wasteful transitions can be avoided by addressing the sunk cost fallacy, which, in the context of deep RL, is the tendency towards continuing an episode until termination. To address this, we propose learn to stop (LEAST), a lightweight mechanism that enables strategic early episode termination based on Q-value and gradient statistics, which helps agents recognize when to terminate unproductive episodes early. We demonstrate that our method improves learning efficiency on a variety of RL algorithms, evaluated on both the MuJoCo and DeepMind Control Suite benchmarks.
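
The abstract does not spell out the stopping rule, so the following is only an illustrative sketch of what early termination driven by Q-value and gradient statistics might look like. The class name EarlyStopper, the use of running medians as thresholds, the warm-up parameters, and the requirement that both signals be low are all assumptions made for this example, not the criterion used by LEAST.

```python
from collections import deque
from statistics import median


class EarlyStopper:
    """Hypothetical early-termination rule (NOT the paper's exact criterion).

    Assumption: an episode is considered unproductive when both the current
    Q-value and the current gradient norm fall below the median of recently
    observed values.
    """

    def __init__(self, window=100, min_history=5, min_steps=10):
        self.q_history = deque(maxlen=window)     # recent Q-value estimates
        self.grad_history = deque(maxlen=window)  # recent gradient norms
        self.min_history = min_history            # samples needed before stopping is allowed
        self.min_steps = min_steps                # never stop earlier than this step

    def update(self, q_value, grad_norm):
        """Record statistics gathered during training."""
        self.q_history.append(q_value)
        self.grad_history.append(grad_norm)

    def should_stop(self, q_value, grad_norm, step):
        """Return True if the current episode should be cut short."""
        if step < self.min_steps or len(self.q_history) < self.min_history:
            return False  # warm-up: not enough evidence yet
        low_value = q_value < median(self.q_history)
        low_signal = grad_norm < median(self.grad_history)
        return low_value and low_signal
```

In a training loop, such a rule would be consulted after each environment step, and a positive answer would be treated like a truncation: the episode is reset instead of being run to its natural termination.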
