Adaptive and Multiple Time-scale Eligibility Traces for Online Deep Reinforcement Learning

Abstract

Deep reinforcement learning (DRL) is one promising approach to teaching robots to perform complex tasks. Because methods that directly reuse the stored experience data cannot follow the change of the environment in robotic problems with a time-varying environment, online DRL is required. The eligibility traces method is well known as an online learning technique for improving sample efficiency in traditional reinforcement learning with linear regressors rather than DRL. The dependency between parameters of deep neural networks would destroy the eligibility traces, which is why they are not integrated with DRL. Although replacing the gradient with the most influential one rather than accumulating the gradients as the eligibility traces can alleviate this problem, the replacing operation reduces the number of reuses of previous experiences. To address these issues, this study proposes a new eligibility traces method that can be used even in DRL while maintaining high sample efficiency. When the accumulated gradients differ from those computed using the latest parameters, the proposed method takes into account the divergence between the past and latest parameters to adaptively decay the eligibility traces. Bregman divergences between outputs computed by the past and latest parameters are exploited due to the infeasible computational cost of the divergence between the past and latest parameters. In addition, a generalized method with multiple time-scale traces is designed for the first time. This design allows for the replacement of the most influential adaptively accumulated (decayed) eligibility traces.
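To make the idea of adaptive decay concrete, the following is a minimal sketch for a linear value function, not the formulation used in the paper. It assumes a standard accumulating-trace TD(lambda) update and multiplies the usual decay by a hypothetical factor exp(-beta * d), where d is the squared-error Bregman divergence between the outputs of the past and latest parameters. The names td_lambda_step, squared_divergence, w_prev, and beta are illustrative assumptions, as is the exponential form of the decay.

```python
import numpy as np


def squared_divergence(v_old, v_new):
    """Bregman divergence generated by 0.5 * x**2 (half squared error)."""
    return 0.5 * (v_old - v_new) ** 2


def td_lambda_step(w, trace, phi, phi_next, reward, w_prev,
                   alpha=0.1, gamma=0.99, lam=0.9, beta=1.0):
    """One online TD(lambda) step with an assumed adaptive trace decay.

    w, w_prev : latest and past weight vectors of a linear value function
    trace     : accumulated eligibility trace (same shape as w)
    phi       : feature vector of the current state
    phi_next  : feature vector of the next state
    beta      : hypothetical sensitivity of the decay to the divergence
    """
    # Divergence between the outputs of the past and latest parameters.
    d = squared_divergence(w_prev @ phi, w @ phi)
    # Assumed decay factor in (0, 1]; equals 1 when the outputs agree.
    adaptive = np.exp(-beta * d)
    # For a linear model the gradient of the value w.r.t. w is phi.
    trace = gamma * lam * adaptive * trace + phi
    td_error = reward + gamma * (w @ phi_next) - (w @ phi)
    w_new = w + alpha * td_error * trace
    return w_new, trace


# Usage: identical past and latest parameters leave the decay unchanged.
w = np.zeros(2)
trace = np.array([1.0, 0.0])
phi = np.array([0.0, 1.0])
w_new, new_trace = td_lambda_step(w, trace, phi, np.zeros(2), 1.0, w_prev=w.copy())
```

When w_prev and w produce the same output, the sketch reduces to ordinary accumulating traces. The larger the output divergence, the faster the old gradients are forgotten, which mirrors the intuition stated in the abstract.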
