PG-Rainbow: Using Distributional Reinforcement Learning in Policy Gradient Methods

Abstract

This paper introduces PG-Rainbow, a novel algorithm that incorporates a distributional reinforcement learning framework into a policy gradient algorithm. Existing policy gradient methods are sample inefficient and rely on the mean of returns when calculating the state-action value function, neglecting the distributional nature of returns in reinforcement learning tasks. To address this issue, we use an Implicit Quantile Network that provides the quantile information of the distribution of rewards to the critic network of the Proximal Policy Optimization algorithm. We show empirically that integrating reward distribution information into the policy network enables the policy agent to evaluate the consequences of potential actions in a given state more comprehensively, supporting better-informed decision-making. We evaluate the performance of the proposed algorithm on the Atari-2600 game suite, simulated via the Arcade Learning Environment (ALE).
