Count-Based Temperature Scheduling for Maximum Entropy Reinforcement Learning

Abstract

Maximum Entropy Reinforcement Learning (MaxEnt RL) algorithms such as Soft Q-Learning (SQL) and Soft Actor-Critic trade off reward and policy entropy, which has the potential to improve training stability and robustness. Most MaxEnt RL methods, however, use a constant tradeoff coefficient (temperature), contrary to the intuition that the temperature should be high early in training to avoid overfitting to noisy value estimates and decrease later in training as we increasingly trust high value estimates to truly lead to good rewards. Moreover, our confidence in value estimates is state-dependent, increasing every time we use more evidence to update an estimate. In this paper, we present a simple state-based temperature scheduling approach, and instantiate it for SQL as Count-Based Soft Q-Learning (CBSQL). We evaluate our approach on a toy domain as well as in several Atari 2600 domains and show promising results.
