Abstract
Multi-step (also called n-step) methods in reinforcement learning (RL) have been shown, both theoretically and empirically, to be more efficient than the 1-step method in tasks that use a tabular representation of the value function, owing to faster propagation of the reward signal. Recent research in Deep Reinforcement Learning (DRL) also shows that multi-step methods improve learning speed and final performance in applications where the value function and policy are represented with deep neural networks. However, it is not well understood what actually contributes to this boost in performance. In this work, we analyze the effect of multi-step methods on alleviating the overestimation problem in DRL, where multi-step experiences are sampled from a replay buffer. Specifically, building on Deep Deterministic Policy Gradient (DDPG), we propose Multi-step DDPG (MDDPG), where different step sizes are manually set, and its variant, Mixed Multi-step DDPG (MMDDPG), where an average over different multi-step backups is used as the update target of the Q-value function. Empirically, we show that both MDDPG and MMDDPG are significantly less affected by the overestimation problem than DDPG with 1-step backup, which consequently results in better final performance and learning speed. We also discuss the advantages and disadvantages of different ways of doing multi-step expansion in order to reduce approximation error, and expose the tradeoff between overestimation and underestimation that underlies offline multi-step methods. Finally, we compare the computational resource needs of Twin Delayed Deep Deterministic Policy Gradient (TD3), a state-of-the-art algorithm proposed to address overestimation in actor-critic methods, with those of our proposed methods, since they show comparable final performance and learning speed.
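
To make the two update targets concrete, the following is a minimal sketch of an n-step backup and of an average over several such backups. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the choice to average over step sizes 1 through `max_n`, and the omission of terminal-state handling are all assumptions, and `bootstrap_values[n - 1]` stands in for the target critic's estimate of the Q-value at the state reached after n steps.

```python
from typing import Sequence


def n_step_target(rewards: Sequence[float], bootstrap_value: float,
                  gamma: float, n: int) -> float:
    """n-step backup: discounted sum of the first n rewards plus
    gamma**n times a bootstrapped Q-value estimate.

    Assumption: no terminal state occurs within the n steps.
    """
    discounted_rewards = sum((gamma ** k) * rewards[k] for k in range(n))
    return discounted_rewards + (gamma ** n) * bootstrap_value


def mixed_multi_step_target(rewards: Sequence[float],
                            bootstrap_values: Sequence[float],
                            gamma: float, max_n: int) -> float:
    """Average of the 1-step through max_n-step targets.

    Assumption: the set of step sizes is {1, ..., max_n};
    bootstrap_values[n - 1] is the (hypothetical) target-critic
    estimate at the state reached after n steps.
    """
    targets = [
        n_step_target(rewards, bootstrap_values[n - 1], gamma, n)
        for n in range(1, max_n + 1)
    ]
    return sum(targets) / len(targets)


# Toy example: constant reward of 1.0 and a constant bootstrap estimate of 10.0.
rewards = [1.0, 1.0, 1.0]
bootstrap_values = [10.0, 10.0, 10.0]
one_step = n_step_target(rewards, bootstrap_values[0], gamma=0.5, n=1)
three_step = n_step_target(rewards, bootstrap_values[2], gamma=0.5, n=3)
mixed = mixed_multi_step_target(rewards, bootstrap_values, gamma=0.5, max_n=3)
```

In this toy setting, a longer backup gives less weight to the bootstrapped estimate (it is multiplied by a smaller power of the discount factor), which is one intuition for why an overestimated critic value has less influence on multi-step targets.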