Abstract
Reinforcement learning tasks in real-world scenarios often involve large, high-dimensional action spaces, leading to challenges such as convergence difficulties, instability, and high computational complexity. It is widely acknowledged that traditional value-based reinforcement learning algorithms struggle to address these issues effectively. A prevalent approach involves generating independent sub-actions within each dimension of the action space. However, this method introduces bias, hindering the learning of optimal policies. In this paper, we propose an advantage-based optimization method and an algorithm named Advantage Branching Dueling Q-network (ABQ). ABQ incorporates a baseline mechanism to tune the action value of each dimension, leveraging the advantage relationship across different sub-actions. With this approach, the learned policy can be optimized for each dimension. Empirical results demonstrate that ABQ outperforms BDQ, achieving 3%, 171%, and 84% more cumulative rewards in HalfCheetah, Ant, and Humanoid environments, respectively. Furthermore, ABQ exhibits competitive performance when compared against two continuous action benchmark algorithms, DDPG and TD3.