Decoupling Value and Policy for Generalization in Reinforcement Learning

Abstract

Standard deep reinforcement learning algorithms use a shared representation for the policy and value function. However, we argue that more information is needed to accurately estimate the value function than to learn the optimal policy. Consequently, the use of a shared representation for the policy and value function can lead to overfitting. To alleviate this problem, we propose two approaches which are combined to create IDAAC: Invariant Decoupled Advantage Actor-Critic. First, IDAAC decouples the optimization of the policy and value function, using separate networks to model them. Second, it introduces an auxiliary loss which encourages the representation to be invariant to task-irrelevant properties of the environment. IDAAC shows good generalization to unseen environments, achieving a new state-of-the-art on the Procgen benchmark and outperforming popular methods on DeepMind Control tasks with distractors. Moreover, IDAAC learns representations, value predictions, and policies that are more robust to aesthetic changes in the observations that do not change the underlying state of the environment.
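
The decoupling idea can be illustrated with a toy sketch. This is not the IDAAC implementation: it swaps the deep networks for linear models, omits the auxiliary invariance loss and the advantage term entirely, and every name in it (`DecoupledActorCritic`, `value_step`, and so on) is hypothetical. It only shows that when the policy and value function have separate parameters, a value-loss update leaves the policy untouched.

```python
import random


def linear(weights, features):
    """Dot product of a weight vector and a feature vector."""
    return sum(w * x for w, x in zip(weights, features))


class DecoupledActorCritic:
    """Toy actor-critic with separate (unshared) policy and value parameters."""

    def __init__(self, n_features, n_actions, seed=0):
        rng = random.Random(seed)
        # Policy parameters: one weight vector per action.
        self.policy_w = [
            [rng.uniform(-0.1, 0.1) for _ in range(n_features)]
            for _ in range(n_actions)
        ]
        # Value parameters: a separate weight vector, not shared with the policy.
        self.value_w = [rng.uniform(-0.1, 0.1) for _ in range(n_features)]

    def action_logits(self, obs):
        return [linear(w, obs) for w in self.policy_w]

    def value(self, obs):
        return linear(self.value_w, obs)

    def value_step(self, obs, target_return, lr=0.1):
        """One gradient step on 0.5 * (V(obs) - target)^2; returns the loss before the step."""
        error = self.value(obs) - target_return
        # The gradient only flows into value_w, so policy_w is unaffected.
        self.value_w = [w - lr * error * x for w, x in zip(self.value_w, obs)]
        return 0.5 * error ** 2


model = DecoupledActorCritic(n_features=3, n_actions=2)
obs = [1.0, 0.5, -0.2]
logits_before = model.action_logits(obs)
loss_before = model.value_step(obs, target_return=1.0)
loss_after = 0.5 * (model.value(obs) - 1.0) ** 2
```

With a shared representation, the same value-loss step would also move the features the policy depends on, which is the coupling the abstract argues can lead to overfitting.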
