Multi-step Greedy Policies in Model-Free Deep Reinforcement Learning

Abstract

Multi-step greedy policies have been extensively used in model-based Reinforcement Learning (RL) and in settings where a model of the environment is available (e.g., in the game of Go). In this work, we explore the benefits of multi-step greedy policies in model-free RL when employed in the framework of multi-step Dynamic Programming (DP): multi-step Policy and Value Iteration. These algorithms iteratively solve short-horizon decision problems and converge to the optimal solution of the original one. By using model-free algorithms as solvers of the short-horizon problems, we derive fully model-free algorithms that are instances of the multi-step DP framework. As model-free algorithms are prone to instabilities w.r.t. the decision problem horizon, this simple approach can help mitigate these instabilities and yields improved model-free algorithms. We test this approach and show results on both discrete and continuous control problems.
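To make the idea of "iteratively solving short-horizon problems" concrete, the following is a minimal tabular sketch, not the model-free method of this paper. It assumes one common formalization from the multi-step DP literature, the kappa-greedy formulation, in which each outer iteration solves a surrogate problem with the reduced discount gamma * kappa and a reward shaped by the current value estimate. The abstract itself does not specify this formulation. The MDP (P, R, GAMMA) and the function names solve and kappa_vi are hypothetical and chosen only for illustration; the inner solver here is plain value iteration with a known model, where the paper would plug in a model-free algorithm.

```python
# Hypothetical 2-state, 2-action MDP (illustrative numbers, not from the paper).
# P[s][a] lists next-state probabilities; R[s][a] is the immediate reward.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]
R = [
    [0.0, 1.0],
    [0.5, 2.0],
]
GAMMA = 0.9


def expected_value(s, a, v):
    """Expected next-state value under P[s][a]."""
    return sum(p * v[t] for t, p in enumerate(P[s][a]))


def solve(reward, discount, tol=1e-10):
    """Value iteration for the decision problem (P, reward, discount).

    Stands in for the short-horizon solver; assumes the model is known.
    """
    v = [0.0] * len(P)
    while True:
        new_v = [
            max(reward[s][a] + discount * expected_value(s, a, v)
                for a in range(len(P[s])))
            for s in range(len(P))
        ]
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v


def kappa_vi(kappa, outer_iters=200):
    """Multi-step value iteration under the assumed kappa formulation."""
    v = [0.0] * len(P)
    for _ in range(outer_iters):
        # Surrogate reward: original reward plus a (1 - kappa) share of the
        # discounted current value estimate.
        shaped = [
            [R[s][a] + GAMMA * (1.0 - kappa) * expected_value(s, a, v)
             for a in range(len(P[s]))]
            for s in range(len(P))
        ]
        # Solve the shorter-horizon problem with discount GAMMA * kappa.
        v = solve(shaped, GAMMA * kappa)
    return v


v_star = solve(R, GAMMA)   # optimal values of the original problem
v_kappa = kappa_vi(0.5)    # values reached via surrogate problems
```

In this sketch, kappa = 0 reduces each outer step to a standard one-step Bellman backup, and kappa = 1 solves the original problem in a single outer step. Intermediate values trade a shorter effective horizon in the inner problem against more outer iterations, and both v_star and v_kappa should agree up to numerical tolerance.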
