Representations for Stable Off-Policy Reinforcement Learning

Abstract

Reinforcement learning with function approximation can be unstable and even divergent, especially when combined with off-policy learning and Bellman updates. In deep reinforcement learning, these issues have been dealt with empirically by adapting and regularizing the representation, in particular with auxiliary tasks. This suggests that representation learning may provide a means to guarantee stability. In this paper, we formally show that there are indeed nontrivial state representations under which the canonical TD algorithm is stable, even when learning off-policy. We analyze representation learning schemes that are based on the transition matrix of a policy, such as proto-value functions, along three axes: approximation error, stability, and ease of estimation. In the most general case, we show that a Schur basis provides convergence guarantees, but is difficult to estimate from samples. For a fixed reward function, we find that an orthogonal basis of the corresponding Krylov subspace is an even better choice. We conclude by empirically demonstrating that these stable representations can be learned using stochastic gradient descent, opening the door to improved techniques for representation learning with deep networks.
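
As a rough illustration of the stability claim, the sketch below compares two one-dimensional representations on a two-state chain. The chain, the uniform off-policy weighting, the discount factor, and all function and variable names are assumptions chosen for illustration; they are not taken from the paper. The sketch uses the standard condition that expected linear TD(0) is stable when the matrix Phi^T Xi (I - gamma P) Phi has eigenvalues with positive real part; with a single feature this matrix is a scalar. The second representation is a normalized eigenvector of P, which spans an invariant subspace of P and is what a one-column Schur basis reduces to.

```python
# Illustrative two-state chain (an assumption, not from the paper):
# under the target policy, both states transition to the second state.
P = [[0.0, 1.0],
     [0.0, 1.0]]
GAMMA = 0.9
XI = [0.5, 0.5]  # uniform off-policy state weighting (not the stationary distribution of P)


def td_key_scalar(phi, P, xi, gamma):
    """Return phi^T Xi (I - gamma P) phi for a single feature (d = 1)."""
    n = len(phi)
    # (P phi)[s] is the expected next-state feature when starting from state s.
    p_phi = [sum(P[s][t] * phi[t] for t in range(n)) for s in range(n)]
    return sum(xi[s] * phi[s] * (phi[s] - gamma * p_phi[s]) for s in range(n))


def expected_td(phi, P, xi, gamma, steps=200, lr=0.1, theta=1.0):
    """Iterate the expected TD(0) update with zero reward: theta <- theta - lr * a * theta."""
    a = td_key_scalar(phi, P, xi, gamma)
    for _ in range(steps):
        theta -= lr * a * theta
    return theta


# Representation 1: an arbitrary feature; the key scalar is negative, so TD is unstable.
phi_bad = [1.0, 2.0]

# Representation 2: normalized eigenvector of P for eigenvalue 1 (P @ [1, 1] = [1, 1]).
# It spans an invariant subspace of P; the key scalar is positive, so TD is stable.
phi_good = [2 ** -0.5, 2 ** -0.5]

for name, phi in [("arbitrary feature", phi_bad), ("invariant-subspace feature", phi_good)]:
    a = td_key_scalar(phi, P, XI, GAMMA)
    theta = expected_td(phi, P, XI, GAMMA)
    print(f"{name}: key scalar = {a:+.3f}, |theta| after 200 steps = {abs(theta):.3f}")
```

With the true value function equal to zero (zero reward), the weight should shrink toward zero. Under these assumptions it grows for the arbitrary feature and decays for the invariant-subspace feature, even though the state weighting is off-policy in both cases.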
