An Advantage-based Optimization Method for Reinforcement Learning in Large Action Space

Abstract

Reinforcement learning tasks in real-world scenarios often involve large, high-dimensional action spaces, leading to challenges such as convergence difficulties, instability, and high computational complexity. It is widely acknowledged that traditional value-based reinforcement learning algorithms struggle to address these issues effectively. A prevalent approach involves generating independent sub-actions within each dimension of the action space. However, this method introduces bias, hindering the learning of optimal policies. In this paper, we propose an advantage-based optimization method and an algorithm named Advantage Branching Dueling Q-network (ABQ). ABQ incorporates a baseline mechanism to tune the action value of each dimension, leveraging the advantage relationship across different sub-actions. With this approach, the learned policy can be optimized for each dimension. Empirical results demonstrate that ABQ outperforms BDQ, achieving 3%, 171%, and 84% more cumulative rewards in HalfCheetah, Ant, and Humanoid environments, respectively. Furthermore, ABQ exhibits competitive performance when compared against two continuous action benchmark algorithms, DDPG and TD3.
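To make the per-dimension idea concrete, the following is a minimal sketch of choosing one sub-action independently in each action dimension, with a mean-value baseline subtracted per dimension. It is not the ABQ update itself, which the abstract does not specify. The function names, the choice of the mean as baseline, the number of bins, and the [-1, 1] action range are all assumptions made for illustration.

```python
from typing import List


def branch_advantages(q_values: List[List[float]]) -> List[List[float]]:
    """Subtract a per-dimension baseline from each sub-action value.

    Assumption: the baseline is the mean Q-value of the dimension.
    The abstract only says a baseline mechanism is used.
    """
    advantages = []
    for branch in q_values:
        baseline = sum(branch) / len(branch)
        advantages.append([q - baseline for q in branch])
    return advantages


def greedy_sub_actions(q_values: List[List[float]]) -> List[int]:
    """Pick the highest-valued sub-action index in every dimension."""
    return [max(range(len(branch)), key=branch.__getitem__) for branch in q_values]


def to_continuous(indices: List[int], bins: int,
                  low: float = -1.0, high: float = 1.0) -> List[float]:
    """Map discrete sub-action indices onto an evenly spaced grid (hypothetical range)."""
    step = (high - low) / (bins - 1)
    return [low + i * step for i in indices]


# Two action dimensions, three discretized sub-actions each (made-up values).
q = [[0.2, 0.5, 0.1],
     [1.0, 0.0, 2.0]]

adv = branch_advantages(q)
chosen = greedy_sub_actions(q)
action = to_continuous(chosen, bins=3)

# Output sizes: one value per joint action versus one value per sub-action.
joint_outputs = 3 ** 2      # grows exponentially with the number of dimensions
branched_outputs = 3 * 2    # grows linearly with the number of dimensions
```

Subtracting a constant within a dimension leaves the greedy choice in that dimension unchanged; it only re-centers the values, which is why a baseline can adjust value estimates without changing which sub-action ranks highest.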
