Reinforcement Learning with Dynamic Boltzmann Softmax Updates

Abstract

Value function estimation, i.e., prediction, is an important task in reinforcement learning. The operator commonly used for prediction in Q-learning is the hard max operator, which always commits to the maximum action-value under the current estimate. Such a 'hard' updating scheme results in pure exploitation and may lead to misbehavior due to noise in stochastic environments. Thus, it is critical to balance exploration and exploitation in value function estimation. The Boltzmann softmax operator has a greater capability for exploring potential action-values. However, it does not satisfy the non-expansion property, and its direct use may fail to converge even in value iteration. In this paper, we propose to update the value function with the dynamic Boltzmann softmax (DBS) operator, which has good convergence properties in the settings of planning and learning. Moreover, we prove that dynamic Boltzmann softmax updates can eliminate the overestimation phenomenon introduced by the hard max operator. Experimental results on GridWorld show that the DBS operator enables convergence and a better trade-off between exploration and exploitation in value function estimation. Finally, we propose the DBS-DQN algorithm by generalizing the dynamic Boltzmann softmax update to deep Q-networks; it substantially outperforms DQN in 40 out of 49 Atari games.
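
To make the contrast between the hard max and Boltzmann softmax operators concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard form of the Boltzmann softmax operator (an exponentially weighted average of action-values with temperature parameter beta). The function names `boltzmann_softmax` and `dbs_backup`, and the schedule `beta_t = t`, are hypothetical choices for illustration; the abstract does not specify how beta is varied over time.

```python
import math


def boltzmann_softmax(values, beta):
    """Boltzmann-weighted average of a list of action-values.

    beta = 0 gives the plain mean; as beta grows, the result
    approaches max(values), i.e. the hard max operator.
    """
    m = max(values)
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the resulting weights.
    weights = [math.exp(beta * (v - m)) for v in values]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def dbs_backup(reward, gamma, next_values, t, schedule=lambda t: float(t)):
    """One backup target using a time-dependent beta.

    The schedule beta_t = t is an assumption made for this sketch only.
    """
    beta_t = schedule(t)
    return reward + gamma * boltzmann_softmax(next_values, beta_t)


if __name__ == "__main__":
    q_next = [1.0, 2.0, 3.0]
    for t in (0, 1, 10, 100):
        print(t, dbs_backup(reward=0.0, gamma=0.9, next_values=q_next, t=t))
```

With a small beta the target averages over action-values, which tempers the effect of noisy estimates; with a large beta it behaves like the hard max. Letting beta grow over iterations is one way to move from the former regime to the latter.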
