# Merging Deterministic Policy Gradient Estimations with Varied Bias-Variance Tradeoff for Effective Deep Reinforcement Learning

###### Abstract

Deep reinforcement learning (DRL) on Markov decision processes (MDPs) with continuous action spaces is often approached by directly updating parametric policies along the direction of estimated policy gradients (PGs). Previous research revealed that the performance of these PG algorithms depends heavily on the bias-variance tradeoff involved in estimating and using PGs. A notable approach towards balancing this tradeoff is to merge both on-policy and off-policy gradient estimations for the purpose of training stochastic policies. However, this method cannot be utilized directly by sample-efficient off-policy PG algorithms such as Deep Deterministic Policy Gradient (DDPG) and twin-delayed DDPG (TD3), which have been designed to train deterministic policies. It is hence important to develop new techniques to merge multiple off-policy estimations of deterministic PG (DPG). Driven by this research question, this paper introduces elite DPG, which will be estimated differently from conventional DPG to emphasize variance reduction at the expense of increased learning bias. To mitigate the extra bias, policy consolidation techniques will be developed to distill policy behavioral knowledge from elite trajectories and use the distilled generative model to further regularize policy training. Moreover, we will study both theoretically and experimentally two different DPG merging methods, i.e., interpolation merging and two-step merging, with the aim to induce varied bias-variance tradeoff through combined use of both conventional DPG and elite DPG. Experiments on six benchmark control tasks confirm that these two merging methods can noticeably improve the learning performance of TD3, significantly outperforming several state-of-the-art DRL algorithms.

Gang Chen, School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand

## 1 Introduction

Research on *reinforcement learning* (RL), in particular *deep reinforcement learning* (DRL) algorithms, has made remarkable progress in solving many challenging *Markov Decision Processes* (MDPs) with continuous state spaces and continuous action spaces [4, 23, 26, 25, 10, 46, 48]. Among all the state-of-the-art DRL algorithms, an important family aims at directly learning *action-selection policies* modelled as *deep neural networks* (DNNs) by iteratively applying *policy gradient* (PG) updates to the DNNs based on on-policy samples, off-policy samples, or both, collected from the learning environment. On-policy samples are collected by using the same policy as the policy being trained by DRL algorithms, whereas off-policy samples are collected by using a policy that is different from the policy being trained. In comparison to DRL algorithms that indirectly derive policies from learned *value functions*, policy gradient techniques have well-understood convergence properties and are naturally suited to handle continuous action spaces while achieving high sample efficiency on large-scale MDPs [5].

Previous research revealed that PG algorithms can be highly unstable and ineffective even with small changes to hyper-parameter settings [17, 14]. The unreliable learning performance can be attributed to several major issues, for example, ineffective environment exploration [30, 28, 29], misleading loss functions [19], and the catastrophic forgetting phenomenon [21]. Numerous techniques have been proposed to address these issues. For instance, entropy-regularized PG algorithms have been developed to promote effective environment exploration and reshape the policy optimization landscape [8, 1, 13, 14]. Evolving differentiable loss functions for guided and fast DRL has been introduced in [19]. Meanwhile, *policy consolidation* mechanisms have been examined in [21] to consolidate policy training over multiple timescales to mitigate catastrophic forgetting.

In addition to the above studies, it is widely known in the research community that DRL algorithms can be seriously affected by the bias-variance tradeoff involved in estimating and using PGs [45, 36, 31, 38] (more discussion of related research can be found in Section 2). A notable approach towards tackling this problem is to merge multiple gradient estimations, as demonstrated in [12, 11, 46]. In fact, researchers have considered the benefits of merging on-policy estimation of PG with off-policy estimation to unleash the advantages of both [12]. Since on-policy PG applies primarily to stochastic policies, the merged PG is also utilized to train the same type of policies. Moreover, although on-policy PG is deemed stable and relatively easy to use, it suffers from poor data efficiency due to one-time use of environment samples. Consequently, existing PG merging methods cannot be directly utilized by sample-efficient off-policy PG algorithms such as *Deep Deterministic Policy Gradient* (DDPG) [23] and its recent variation known as twin-delayed DDPG (TD3) [7], which have been designed to train deterministic policies. In view of this, it is important to explore the possibility of merging multiple off-policy estimations of *deterministic PG* (DPG), i.e., the gradient of the expected performance of a parametric deterministic policy with respect to its trainable parameters, for enhanced learning efficiency and effectiveness. To the best of our knowledge, this research question has not attracted sufficient attention in the past.

Guided by this understanding, we seek to investigate two inter-related questions in this paper: (Q1) how to estimate DPG in an off-policy fashion with varied bias-variance tradeoff; and (Q2) how to merge multiple DPG estimations to effectively train a policy network. To answer Q1, we maintain two separate *experience replay buffers* (ERBs), i.e., the *full ERB* and the *elite ERB* inspired by an *elitism mechanism*. Similar to DDPG and TD3, the full ERB keeps track of all state samples obtained so far during the entire learning process whereas the elite ERB maintains an elite group of sampled *trajectories* with the highest *cumulative rewards* (the concepts of trajectories and cumulative rewards will be discussed in Section 3). The *deterministic policy gradient theorem* [39] subsequently enables us to obtain two separate estimations of DPG based on random samples collected from the full ERB and the elite ERB respectively. For ease of discussion, they will be called the *conventional DPG* and *elite DPG* accordingly.

Because conventional DPG is estimated by considering all environment samples obtained since the beginning of the learning process, it is expected to have lower bias but much higher variance than elite DPG. On the other hand, elite DPG aims at exploiting and improving the best trajectories sampled so far. However, the behavior of the policy under training might be very different from the policy behavior demonstrated by *elite trajectories*, depending on when and how these trajectories were sampled. Such behavioral deviation will inevitably introduce extra bias into the estimated elite DPG. To tackle this problem, inspired by [21], we develop a policy consolidation technique to distill policy behavioral knowledge from the elite ERB with the help of a generative model such as the *variational autoencoder* (VAE) [6]. Elite DPG is further constrained by the behavioral deviation of trained policies from the VAE model.

To answer Q2, we study both theoretically and experimentally two different DPG merging methods. The first merging method follows the common *interpolation* technique that linearly combines conventional DPG with elite DPG [12, 11]. The second merging method introduces a new two-step iterative process to amplify the variance reduction effect of elite DPG. Theoretically, under suitable conditions and assumptions, we show that the *two-step merging* method can reduce the variance of trained *policy parameters* (i.e., trainable parameters belonging to a policy network) with controllable bias, contributing to improved learning reliability and performance. Meanwhile, the *interpolation merging* method is expected to reduce the variances involved in policy training, in comparison to DDPG and TD3 that rely on conventional DPG alone. Our theoretical analysis is supported by experimental evaluations. On six difficult benchmark control tasks, we show that TD3 enhanced by two-step merging (TD3-2M) can noticeably outperform TD3 enhanced by interpolation merging (TD3-IM), TD3 and other cutting-edge DRL algorithms, including Soft Actor-Critic (SAC) [14], Proximal Policy Optimization (PPO) [37] and Interpolated Policy Gradient (IPG), which is designed to merge on-policy and off-policy PG estimations [12]. TD3-2M and TD3-IM are two new algorithms to be developed in Section 4.

## 2 Related Work

Merging multiple DRL methods has been frequently attempted in the literature. Q-Prop [11], PGQ [29], ACER [46] and IPG [12] are some of the recent examples with the aim to combine on-policy with off-policy learning. Q-Prop adopts DPG to train the mean action output of a stochastic policy based on an off-policy fitted critic. IPG follows the same vein and generalizes DPG with respect to the average performance of the stochastic policy being trained. Experimental study of IPG will be reported in Section 5. Different from these research works, this paper considers the task of learning deterministic policies by merging multiple off-policy estimations of DPG without involving on-policy learning. Moreover, varied sampling strategies (e.g., the elitism mechanism) will be employed to collect environment samples from ERBs. This is related to the prioritized experience replay technique introduced in [18]. However, rather than focusing learning on “surprising experiences”, we adopt prioritized sampling to estimate DPG with different bias-variance tradeoffs. Another related idea is to mix different data sources, for example both on-policy and off-policy data, for reliable DRL [26, 24]. Specifically, to cope with distribution drift caused by off-policy data, *importance sampling* and *resampling* techniques are often exploited to control the bias-variance tradeoff during policy evaluation and training [34, 43]. These techniques, however, are not necessary for our algorithms since we estimate DPG consistently in an off-policy fashion.

Considerable effort has been made to control the noise in PG estimation. This is often approached through enhancing the quality of critic learning. For example, a standard tool called *target network* has been widely adopted to tackle critic overestimation [16, 15]. Regularization techniques based on averaged value estimates can also effectively reduce the variance of learned value functions [2]. Meanwhile, [27] showed that stochastic policies can be trained with reduced variance by using a smoothed critic. In this paper, we rely on time-tested variance reduction techniques adopted by TD3 to learn value functions. Any further investigation along this direction will be treated as future work. Besides critic training, the variance involved in policy training can be noticeably reduced with the help of some variance control techniques [38]. For example, a *stochastic variance-reduced policy gradient* (SVRPG) algorithm has been proposed in [31] to reuse past gradient computations to reduce the variance of the current estimation. SVRPG is relevant to TD3-2M to be developed in this paper since TD3-2M also introduces an iterative procedure to merge multiple DPG estimations. In comparison, SVRPG estimates PGs in the same way at each learning iteration by using REINFORCE [47] (and G(PO)MDP [3]) over multiple newly sampled trajectories. However, REINFORCE is known to be sample inefficient and the PG estimation can exhibit high variance on MDPs with extremely long trajectories (including benchmark problems examined in this paper). Our experimental study on SVRPG does not reveal superior performance of this algorithm in comparison to cutting-edge DRL algorithms such as PPO and SAC. Hence experiment results in relation to SVRPG will not be reported in this paper.

This paper adopts policy consolidation and knowledge distillation technologies to estimate elite DPGs. In particular, policy consolidation is utilized as a regularization method to mitigate the estimation bias caused by behavioral deviation of the trained policy from elite trajectories. In the past, *KL-divergence* has commonly been used to regularize behavioral deviation of stochastic policies [21, 35]. However, since KL-divergence does not apply to deterministic policies, we propose to directly measure the deviation in the multi-dimensional continuous action space as a new regularizer. To reduce correlation among environment samples for reliable off-policy learning and to enforce *smoothness* of trained policies (i.e., the trained policy is expected to behave similarly in nearby states; more information can be found in Section 4), we build a VAE-based generative model to capture essential behavioral knowledge embedded in the elite ERB and use the model to produce reference actions for regularized policy training. In the literature, VAE and knowledge distillation techniques have been frequently explored to support meta-RL, transfer learning and multi-task learning [32, 20, 42]. We, instead, propose a new use of VAE for estimating DPGs.

## 3 Preliminaries

This section introduces the RL problem and the key components of an RL/DRL algorithm. TD3 will also be introduced briefly as the baseline algorithm in this paper.

### 3.1 Markov Decision Process and Actor-Critic Reinforcement Learning

An RL problem is often described in the form of an MDP. We are mainly interested in MDPs with continuous state spaces and continuous action spaces. At any time $t$, an agent in such an MDP can observe the current state of its learning environment, denoted as ${s}_{t}\in \mathbb{S}\subset {\mathbb{R}}^{n}$, in an $n$-dimensional state space. Based on its state observation, the agent can perform an action ${a}_{t}$ selected from an $m$-dimensional action space $\mathbb{A}\subset {\mathbb{R}}^{m}$. This causes an environment transition to a new state ${s}_{t+1}$ at time $t+1$, governed by an unknown state-transition probability $\mathrm{Pr}({s}_{t},{s}_{t+1},{a}_{t})$. The environment also produces a scalar and bounded reward $r({s}_{t},{a}_{t})$ as its immediate feedback to the agent. Guided by a deterministic policy $\pi :\mathbb{S}\to \mathbb{A}$ that specifies the action $a$ to be performed in any state $s$, the agent can generate a trajectory $\tau ={\{({s}_{t}^{\tau},{a}_{t}^{\tau},{r}_{t}^{\tau})\}}_{t=0}^{\mathrm{\infty}}$ involving a sequence of consecutive state transitions over time, starting from an initial state ${s}_{0}$. The goal of RL is to identify the *optimal policy* ${\pi}^{*}$ that maximizes the expected long-term cumulative reward over all trajectories, as defined below:

$${\pi}^{*}=\underset{\pi}{\mathrm{arg}\mathrm{max}}J(\pi )=\underset{\pi}{\mathrm{arg}\mathrm{max}}\underset{\tau \sim \pi}{\mathbb{E}}\left[\sum _{t=0}^{\mathrm{\infty}}{\gamma}^{t}r({s}_{t}^{\tau},{a}_{t}^{\tau})\right],$$ | (1) |

with $\gamma \in (0,1)$ being a discount factor. Many DRL algorithms including DDPG and TD3 have been developed to learn optimal policies under the *actor-critic* (AC) framework [14, 7, 9]. Following this framework, a DRL algorithm can be structured into two main components, i.e. the *actor* and the *critic*. The actor maintains and constantly improves a parametric policy ${\pi}_{\theta}$ with trainable parameters $\theta $. The critic is responsible for learning the value functions of ${\pi}_{\theta}$. Specifically the *Q-function* and *V-function* of ${\pi}_{\theta}$ are defined respectively as:

$${Q}^{{\pi}_{\theta}}(s,a)=J({\pi}_{\theta}|{s}_{0}=s,{a}_{0}=a)$$ | (2) |

and

$${V}^{{\pi}_{\theta}}(s)={Q}^{{\pi}_{\theta}}(s,{\pi}_{\theta}(s)).$$ | (3) |
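As a concrete illustration of the quantity inside the expectation in (1), the sketch below computes the discounted cumulative reward of a single finite trajectory. Truncating the infinite sum at the trajectory length and the function name `discounted_return` are assumptions made purely for illustration.

```python
def discounted_return(rewards, gamma):
    """Discounted cumulative reward sum_t gamma^t * r_t of one finite trajectory."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    total = 0.0
    # Accumulate backwards so each step needs a single multiplication by gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
```

Averaging this quantity over many trajectories generated by a policy gives a Monte Carlo estimate of $J(\pi )$.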

In practice, off-policy training of value functions (e.g., the Q-function) can be conducted by the critic to minimize the Bellman loss below over a batch of environment samples $\mathcal{B}$ collected from the full ERB:

$${J}_{B}=\frac{1}{\parallel \mathcal{B}\parallel}\sum _{(s,{s}^{\prime},a,r)\in \mathcal{B}}{\left({Q}_{\omega}^{{\pi}_{\theta}}(s,a)-r(s,a)-\gamma {\widehat{Q}}_{\omega}^{{\pi}_{\theta}}({s}^{\prime},{\pi}_{\theta}({s}^{\prime}))\right)}^{2}.$$ | (4) |

Here we assume ${Q}_{\omega}^{{\pi}_{\theta}}$ is a parametric approximation of the Q-function with trainable parameters $\omega $. ${\widehat{Q}}_{\omega}^{{\pi}_{\theta}}$ in (4) represents the target Q-network for stable training of ${Q}_{\omega}^{{\pi}_{\theta}}$. Guided by the value functions learned by the critic, the actor in DDPG and TD3 proceeds to train policy ${\pi}_{\theta}$ by using DPGs estimated according to the deterministic policy gradient theorem below [39]:

$${\nabla}_{\theta}J({\pi}_{\theta})=\underset{{s}_{t}\sim {\rho}^{\beta}}{\mathbb{E}}\left[{{\nabla}_{\theta}{\pi}_{\theta}({s}_{t}){\nabla}_{a}{Q}^{{\pi}_{\theta}}({s}_{t},a)|}_{a={\pi}_{\theta}({s}_{t})}\right],$$ | (5) |

where $\beta $ refers to a *stochastic behavior policy* utilized to collect environment samples. It is usually defined in the form:

$$\forall s\in \mathbb{S},\beta (s)={\pi}_{\theta}(s)+{\mu}_{a},$$ | (6) |

where ${\mu}_{a}\sim \mathcal{N}(0,{\mathrm{\Sigma}}_{a})$ and ${\mathrm{\Sigma}}_{a}$ is the $m\times m$ diagonal covariance matrix that controls the scale of environment exploration. Hence ${\rho}^{\beta}$ in (5) stands for the *discounted state visitation distribution* associated with policy $\beta $ [40]. In practice, DDPG and TD3 use environment samples stored in the full ERB to estimate ${\nabla}_{\theta}J({\pi}_{\theta})$. Policy ${\pi}_{\theta}$ is therefore trained with conventional DPG alone in the two algorithms.
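To make the chain rule in (5) tangible, the following sketch estimates the DPG as a batch average on a one-dimensional toy problem. The linear policy, the quadratic critic and the name `dpg_estimate` are hypothetical choices for illustration and are not taken from DDPG or TD3, where both functions are neural networks and the gradients come from automatic differentiation.

```python
def dpg_estimate(theta, states):
    """Batch-average estimate of (5) for a toy 1-D problem.

    Assumed toy model (not from the paper):
      policy  pi_theta(s) = theta * s
      critic  Q(s, a)     = -(a - 2 * s) ** 2   (the best action is a = 2s)
    """
    grad = 0.0
    for s in states:
        a = theta * s                     # action chosen by the policy
        dpi_dtheta = s                    # gradient of pi_theta(s) w.r.t. theta
        dq_da = -2.0 * (a - 2.0 * s)      # gradient of Q(s, a) w.r.t. a at a = pi_theta(s)
        grad += dpi_dtheta * dq_da
    return grad / len(states)


batch = [0.5, -1.0, 2.0]
theta = 0.0
for _ in range(200):
    theta += 0.05 * dpg_estimate(theta, batch)   # gradient ascent on J
print(round(theta, 3))  # approaches 2.0, the optimum of the toy critic
```

In this toy setting the ascent direction vanishes exactly when the policy selects the action preferred by the critic in every sampled state.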

### 3.2 Twin-Delayed Deep Deterministic Policy Gradient

The performance of DDPG can be seriously affected by function approximation errors in the critic [17, 7]. In fact, the failure of DDPG is commonly associated with overestimation of the Q-function, which may cause divergence of the actor as it tends to magnify the error at successive learning iterations. To improve the learning stability of DDPG, TD3 introduces three key innovations: (1) Clipped double Q-learning: TD3 learns two Q-functions in parallel and uses the smaller of the two to form the target Q-value in the Bellman loss defined in (4). This helps to reduce the risk of overestimating the Q-function. (2) Delayed policy update: TD3 updates the policy less frequently than the Q-function. As a rule of thumb, every two updates of the Q-function will be followed by one update of the policy. According to [7], the less frequent policy updates can benefit from low-variance estimation of the Q-function, thereby improving the quality of policy training. (3) Target policy smoothing: TD3 adds noise to the target action ${\pi}_{\theta}({s}^{\prime})$ in (4) to explicitly smooth the trained Q-function for the purpose of mitigating the impact of critic error on policy training.

The added noise in target policy smoothing is clipped to keep the target action close to the original action. This mechanism essentially serves as a regularizer for TD3. It enforces the notion that any two similar actions $a$ and ${a}^{\prime}$ should have similar Q-values in any state $s$. As a result, any error involved in approximating ${Q}^{\pi}(s,a)$ can be smoothed out across similar actions including ${a}^{\prime}$, encouraging an RL agent to select actions that are resistant to perturbations. In order to estimate elite DPGs in Section 4, we will explore a similar idea while building a regularizer to control the behavioral deviation of trained policies from elite trajectories. However, instead of adding small noise to actions, we will propose a new concept called *noisy sampled states*.
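The interplay between clipped double Q-learning and target policy smoothing can be sketched as a single target computation. The function below is a minimal illustration for scalar actions; the name `td3_target`, the default noise scale and clip range, and the toy policy and critics are assumptions of this sketch, and clipping of the target action to the valid action range is omitted.

```python
import random


def td3_target(r, s_next, policy, q1_target, q2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, rng=random):
    """Target Q-value using target policy smoothing and clipped double Q-learning."""
    # Target policy smoothing: perturb the target action with clipped Gaussian noise.
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    a_next = policy(s_next) + noise
    # Clipped double Q-learning: use the smaller of the two target critics.
    q_min = min(q1_target(s_next, a_next), q2_target(s_next, a_next))
    return r + gamma * q_min


# Toy stand-ins for the target policy and the two target critics (illustrative only).
policy = lambda s: 0.5 * s
q1 = lambda s, a: s + a
q2 = lambda s, a: s + a + 1.0   # always larger, so q1 is the one selected

y = td3_target(r=1.0, s_next=2.0, policy=policy, q1_target=q1, q2_target=q2,
               rng=random.Random(0))
print(y)
```

Both critics would then be regressed towards the same target `y`, replacing the single-critic target used in (4).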

## 4 Merge Deterministic Policy Gradient Estimations with Varied Bias-Variance Tradeoff

This section is structured into three parts. We first study the approaches for estimating conventional DPG and elite DPG. Eligible methods for merging conventional DPG and elite DPG will be proposed and examined subsequently. Finally, the complete design of the TD3-IM and TD3-2M algorithms will be presented.

### 4.1 Estimate Deterministic Policy Gradient

As highlighted in the introduction, this paper studies two different approaches to estimate DPG. Similar to DDPG and TD3, the first approach aims at estimating DPG with low bias but potentially high variance. For this purpose, we randomly sample a batch of records denoted as ${\mathcal{B}}^{f}$ from the full ERB. Because these samples were obtained by using many different policies, we can effectively mitigate the possible learning bias induced by any specific policies. Hence, conventional DPG ${\nabla}_{\theta}^{c}J({\pi}_{\theta})$ can be estimated through (5) by replacing the expectation ${\mathbb{E}}_{{s}_{t}\sim {\rho}^{\beta}}$ in (5) with an average over all sampled states contained in ${\mathcal{B}}^{f}$. It has been widely reported that DDPG and TD3 can successfully train a policy towards its local optima by using ${\nabla}_{\theta}^{c}J({\pi}_{\theta})$ [23, 7]. In view of this, we can safely assume that the estimation bias in ${\nabla}_{\theta}^{c}J({\pi}_{\theta})$ is negligible. However, since only a small collection ${\mathcal{B}}^{f}$ is sampled from thousands or millions of records stored in the full ERB (in our experiments, the full ERB contains up to 1M environment samples, whereas the size of ${\mathcal{B}}^{f}$ is at most 256 as recommended in some related works [14]), ${\nabla}_{\theta}^{c}J({\pi}_{\theta})$ can vary substantially depending on the actual samples included in ${\mathcal{B}}^{f}$. Such high variance may noticeably slow down the learning progress.

To address this issue, a central focus of this subsection is to develop a new way of estimating DPG with low variance at the cost of increased bias. For this purpose, we must change the sampling strategy used to create the training batch. Specifically, instead of considering all records stored in the full ERB, only a restricted sub-collection of records should be considered. Meanwhile, it is desirable for the newly estimated DPG to increase the chance for the trained policy to generate trajectories with high cumulative rewards. A straightforward approach towards meeting both requirements is to maintain a separate elite ERB and sample the *elite batch* denoted as ${\mathcal{B}}^{e}$ from the elite ERB to estimate DPG. Specifically, after sampling every new trajectory $\tau $ from the learning environment by using policy $\beta $, $\tau $ will be compared with all trajectories ${\tau}^{\prime}$ currently stored in the elite ERB in terms of their cumulative rewards. If $\tau $ has higher cumulative reward than some trajectory ${\tau}^{\prime}$, $\tau $ will replace ${\tau}^{\prime}$ in the elite ERB. In other words, the elite ERB always contains the top $\kappa $ trajectories with the highest cumulative rewards ever observed. We set $\kappa $ to 30 for all benchmark problems to be studied experimentally in Section 5. We found that changing $\kappa $ to 20 or 40 did not seem to have any real impact on algorithm performance. Given that every trajectory can contain a maximum of 1000 state transitions and the maximum size of the full ERB is 1M samples, only a very small portion of samples in the full ERB will also appear in the elite ERB.
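The elitism mechanism described above can be sketched with a min-heap that always exposes the weakest elite trajectory. The class name `EliteBuffer`, the use of the undiscounted reward sum as the ranking score, and the choice to evict the lowest-scoring trajectory are assumptions of this sketch rather than details confirmed by the text.

```python
import heapq
import itertools


class EliteBuffer:
    """Keeps the kappa trajectories with the highest cumulative reward seen so far."""

    def __init__(self, kappa):
        self.kappa = kappa
        self._heap = []                      # min-heap keyed by cumulative reward
        self._counter = itertools.count()    # tie-breaker so trajectories are never compared

    def add(self, trajectory):
        """trajectory: list of (state, action, reward) tuples."""
        total = sum(r for _, _, r in trajectory)
        entry = (total, next(self._counter), trajectory)
        if len(self._heap) < self.kappa:
            heapq.heappush(self._heap, entry)
        elif total > self._heap[0][0]:
            # The new trajectory beats the current worst elite and replaces it.
            heapq.heapreplace(self._heap, entry)

    def returns(self):
        """Cumulative rewards of the stored trajectories, best first."""
        return sorted((total for total, _, _ in self._heap), reverse=True)


buf = EliteBuffer(kappa=2)
for rewards in ([1.0, 1.0], [5.0], [0.5], [3.0, 3.0]):
    buf.add([(None, None, r) for r in rewards])
print(buf.returns())  # [6.0, 5.0]
```

With a heap, each insertion costs time logarithmic in $\kappa $, which is negligible for the buffer sizes mentioned above.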

Using sampled states found on elite trajectories to compute the elite DPG ${\nabla}_{\theta}^{e}J({\pi}_{\theta})$ according to (5) does not necessarily increase the chance for an RL agent to re-produce these trajectories or trajectories with even higher cumulative rewards. Note that in (5) the action $a$ in any sampled state ${s}_{t}$ is determined completely by policy ${\pi}_{\theta}$ and ${\pi}_{\theta}({s}_{t})$ may deviate significantly from the sampled action ${a}_{t}$ recorded as part of the elite trajectories. Such behavioral deviation introduces non-negligible bias in the estimation of ${\nabla}_{\theta}^{e}J({\pi}_{\theta})$. It also poses difficulties for the RL agent to progressively improve on its past success (i.e., elite trajectories), which is essential for reliable DRL. In view of this, it is necessary for us to design a new regularizer based on the policy consolidation mechanism to control behavioral deviation of trained policies. Specifically, since ${\pi}_{\theta}(s)$ produces an $m$-dimensional continuous action to be performed in any state $s$, the regularized ${\nabla}_{\theta}^{e}J({\pi}_{\theta})$ can be calculated as below:

$${\nabla}_{\theta}^{e}J({\pi}_{\theta})\approx {\frac{1}{\parallel {\mathcal{B}}^{e}\parallel}\sum _{(s,{s}^{\prime},a)\in {\mathcal{B}}^{e}}{\nabla}_{\theta}{\pi}_{\theta}(s){\nabla}_{b}{Q}^{{\pi}_{\theta}}(s,b)|}_{b={\pi}_{\theta}(s)}+\lambda {\nabla}_{\theta}{\parallel {\pi}_{\theta}(s)-a\parallel}_{2}^{2},$$ | (7) |

with ${\parallel a\parallel}_{2}$ representing the *${l}^{\mathrm{2}}$-norm* of any $m$-dimensional vector $a$ and $\lambda \ge 0$ serving as the scalar *regularization factor*. There are three possible problems associated with using (7). First, any sampled action recorded in the elite ERB was generated by a stochastic policy $\beta $ defined in (6) and is subject to a certain level of noise. Upon forcing the policy ${\pi}_{\theta}$ under training to follow such noisy actions, action-selection error made in any sampled state may be preserved by (7) over many learning iterations, resulting in slow performance improvement. Second, due to the limited number of elite trajectories stored in the elite ERB, state samples retrieved from these trajectories are more correlated than those collected from the full ERB. This may heavily bias policy training, resulting in unstable learning behavior. Third, (7) does not encourage *smoothness* over trained policies. It dictates a policy to follow sampled actions in sampled states but imposes no restriction on smooth policy behavior in nearby states. Hence the trained policy may not perform reliably well, given small perturbations to the learning environment.
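A toy one-dimensional sketch of the regularized estimate in (7) follows. The linear policy, the quadratic critic and the function name are hypothetical. Since the learning rules considered in this paper ascend $J$, the sketch subtracts the gradient of the squared deviation so that the term acts as a penalty pulling the policy towards recorded actions; this sign handling is an assumption of the sketch rather than a statement about the exact form of (7).

```python
def elite_dpg_estimate(theta, elite_batch, lam):
    """Toy 1-D analogue of (7); elite_batch holds (state, recorded_action) pairs.

    Assumed toy model (not from the paper):
      policy  pi_theta(s) = theta * s
      critic  Q(s, b)     = -(b - 2 * s) ** 2
    """
    grad = 0.0
    for s, a_rec in elite_batch:
        b = theta * s
        dpg_term = s * (-2.0 * (b - 2.0 * s))     # grad_theta pi * grad_b Q at b = pi_theta(s)
        reg_grad = 2.0 * (b - a_rec) * s          # grad_theta of (pi_theta(s) - a_rec)^2
        grad += dpg_term - lam * reg_grad         # assumption: deviation enters as a penalty
    return grad / len(elite_batch)


elite_batch = [(1.0, 1.0)]      # recorded action 1.0, while the toy critic prefers 2.0
theta = 0.0
for _ in range(200):
    theta += 0.05 * elite_dpg_estimate(theta, elite_batch, lam=1.0)
print(round(theta, 3))  # settles between the critic optimum (2.0) and the recorded action (1.0)
```

Larger values of $\lambda $ move the solution closer to the recorded action, which illustrates how the regularizer trades critic-driven improvement against fidelity to elite behavior.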

To address the three problems above, we define *noisy sampled state* as a state sample retrieved from the elite ERB with added clipped noise, as described below:

$$\forall s\in {\mathcal{B}}^{e},\quad \widehat{s}=s+\mathrm{clip}({\mu}_{s},-{\mathrm{\Delta}}_{s},+{\mathrm{\Delta}}_{s}).$$

Here ${\mu}_{s}\sim \mathcal{N}(0,{\mathrm{\Sigma}}_{s})$ and ${\mathrm{\Sigma}}_{s}$ is an $n\times n$ diagonal covariance matrix that governs the level of noise to be added to any state samples. By clipping ${\mu}_{s}$, we ensure that $\widehat{s}$ remains in the nearby region of the state space centered around $s$. Behavioral regularization can now be performed on $\widehat{s}$. However, the reference action for $\widehat{s}$ cannot be determined immediately from the elite ERB. This issue can be resolved by distilling behavioral knowledge from the elite ERB in the form of a VAE model. In fact, any generative model other than VAE could be used to distill behavioral knowledge from the elite ERB; however, VAE is frequently utilized for knowledge extraction and is considered suitable for our learning problem. We can proceed to use the VAE model to generate reference actions required by the regularizer. Accordingly, (7) can be re-formulated as below:

$${\nabla}_{\theta}^{e}J({\pi}_{\theta})\approx {\frac{1}{\parallel {\mathcal{B}}^{e}\parallel}\sum _{(s,{s}^{\prime},a)\in {\mathcal{B}}^{e}}{\nabla}_{\theta}{\pi}_{\theta}(\widehat{s}){\nabla}_{b}{Q}^{{\pi}_{\theta}}(\widehat{s},b)|}_{b={\pi}_{\theta}(\widehat{s})}+\lambda {\nabla}_{\theta}{\parallel {\pi}_{\theta}(\widehat{s})-\mathcal{V}(\widehat{s})\parallel}_{2}^{2},$$ | (8) |

with $\mathcal{V}$ referring to the VAE model, which will be trained regularly by using random batches of samples retrieved from the elite ERB. By replacing the sampled action $a$ in (7) with the generated action $\mathcal{V}(\widehat{s})$ in (8), we can effectively reduce the long-term impact of any error embedded in sampled actions. Moreover, by using the noisy state sample $\widehat{s}$ instead of $s$ in (8), states employed for training policy ${\pi}_{\theta}$ become less correlated and the smoothness of the trained policy can be enforced via the regularizer simultaneously.
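The two ingredients of (8) that differ from (7), namely the noisy sampled state and the deviation from a generated reference action, can be sketched as follows. The function names, the use of one shared noise scale and clip range for every state dimension, and the toy callables standing in for ${\pi}_{\theta}$ and the VAE model $\mathcal{V}$ are all assumptions made for illustration.

```python
import random


def noisy_sampled_state(state, sigma, delta, rng=random):
    """Return s_hat = s + clip(mu_s, -delta, +delta), with mu_s ~ N(0, sigma^2) per dimension."""
    return [x + max(-delta, min(delta, rng.gauss(0.0, sigma))) for x in state]


def deviation_penalty(policy, reference_model, s_hat):
    """Squared l2 distance between the policy action and the reference action at s_hat."""
    return sum((p - v) ** 2 for p, v in zip(policy(s_hat), reference_model(s_hat)))


# Hypothetical stand-ins: a linear policy and a fixed "generative" reference model.
policy = lambda s: [0.5 * s[0], 0.5 * s[1]]
reference_model = lambda s: [s[0], 0.0]

rng = random.Random(0)
s = [1.0, -2.0]
s_hat = noisy_sampled_state(s, sigma=0.1, delta=0.05, rng=rng)
print(s_hat, deviation_penalty(policy, reference_model, s_hat))
```

Because the reference action is queried at $\widehat{s}$ rather than looked up in the buffer, the penalty is defined for every perturbed state, which is what allows the regularizer to act on a neighborhood of each elite state.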

### 4.2 Merge Deterministic Policy Gradient

Using the techniques developed in Subsection 4.1, we can estimate conventional DPG ${\nabla}_{\theta}^{c}J({\pi}_{\theta})$ and elite DPG ${\nabla}_{\theta}^{e}J({\pi}_{\theta})$ with varied bias-variance tradeoff during each learning iteration. Due to the estimation bias inherent in ${\nabla}_{\theta}^{e}J({\pi}_{\theta})$, it cannot be utilized alone to train ${\pi}_{\theta}$. Considering the possible joint use of both ${\nabla}_{\theta}^{c}J({\pi}_{\theta})$ and ${\nabla}_{\theta}^{e}J({\pi}_{\theta})$, we can develop two distinct methods to merge them. We call these methods respectively the *interpolation merging* and the *two-step merging* methods. Interpolation merging linearly combines ${\nabla}_{\theta}^{c}J({\pi}_{\theta})$ and ${\nabla}_{\theta}^{e}J({\pi}_{\theta})$ with the help of a scalar *weight factor* $0\le \upsilon \le 1$, as described below:

$${\nabla}_{\theta}^{IM}J({\pi}_{\theta})=(1-\upsilon ){\nabla}_{\theta}^{c}J({\pi}_{\theta})+\upsilon {\nabla}_{\theta}^{e}J({\pi}_{\theta}).$$ | (9) |

In the sequel, we will use ${\nabla}_{\theta}^{IM}J({\pi}_{\theta})$ to denote DPG obtained through interpolation merging. Different from (9), it is possible for us to construct a two-step iterative process to merge conventional DPG and elite DPG. This is described in (10) as follows:

$${\nabla}_{\theta}^{2M}J({\pi}_{\theta})=(1-\upsilon ){\nabla}_{\theta}^{c}J({\pi}_{\theta})+\upsilon {\nabla}_{{\theta}^{\prime}}^{e}J({\pi}_{{\theta}^{\prime}}),$$ | (10) |

with

$${\theta}^{\prime}=\theta +\alpha (1-\upsilon ){\nabla}_{\theta}^{c}J({\pi}_{\theta}).$$ |

Here $\alpha$ stands for the *learning rate* for policy training. Note that ${\nabla}_{{\theta}^{\prime}}^{e}J({\pi}_{{\theta}^{\prime}})$ in (10) merges conventional and elite DPGs in an iterative manner via ${\theta}^{\prime}$. It is subsequently linearly merged with ${\nabla}_{\theta}^{c}J({\pi}_{\theta})$ to obtain ${\nabla}_{\theta}^{2M}J({\pi}_{\theta})$ as the outcome of two-step merging. In addition to ${\nabla}_{\theta}^{IM}J({\pi}_{\theta})$ and ${\nabla}_{\theta}^{2M}J({\pi}_{\theta})$, we will also consider the baseline of using ${\nabla}_{\theta}^{c}J({\pi}_{\theta})$ alone to train policy ${\pi}_{\theta}$ in order to understand the theoretical and practical benefits of merging. Consider specifically three separate learning rules:

$$\begin{array}{cc}\hfill {\theta}_{i+1}^{c}& \leftarrow {\theta}_{i}^{c}+\alpha {\nabla}_{{\theta}_{i}}^{c}J({\pi}_{{\theta}_{i}^{c}}),\hfill \\ \hfill {\theta}_{i+1}^{IM}& \leftarrow {\theta}_{i}^{IM}+\alpha {\nabla}_{{\theta}_{i}}^{IM}J({\pi}_{{\theta}_{i}^{IM}}),\hfill \\ \hfill {\theta}_{i+1}^{2M}& \leftarrow {\theta}_{i}^{2M}+\alpha {\nabla}_{{\theta}_{i}}^{2M}J({\pi}_{{\theta}_{i}^{2M}}),\hfill \end{array}$$ | (11) |

which will be used in the $i$-th learning iteration to obtain policy parameters ${\theta}_{i+1}$ through updating ${\theta}_{i}$. By studying the asymptotic trend of ${\theta}_{i}$ across a long sequence of learning iterations, we can deepen our understanding of the interpolation merging and two-step merging methods. Inspired by [50], a *noisy quadratic analysis* will be performed based on two *noisy quadratic models* defined below:

$$\begin{array}{c}\hfill {J}^{c}({\pi}_{\theta})\approx {J}^{*}-\frac{1}{2}{(\theta -{c}_{1})}^{T}A(\theta -{c}_{1}),\\ \hfill {J}^{e}({\pi}_{\theta})\approx {J}^{*}-\frac{1}{2}{(\theta -{c}_{2})}^{T}A(\theta -{c}_{2}).\end{array}$$ | (12) |

${J}^{c}({\pi}_{\theta})$ and ${J}^{e}({\pi}_{\theta})$ stand for noisy estimations of $J({\pi}_{\theta})$ when $\theta $ is close to ${\theta}^{*}$. We assume that the optimal policy parameters ${\theta}^{*}=0$, similar to [49, 33]. Based on ${J}^{c}({\pi}_{\theta})$ and ${J}^{e}({\pi}_{\theta})$, we can quantify the varied bias-variance tradeoffs that are inherent to ${\nabla}_{\theta}^{c}J({\pi}_{\theta})$ and ${\nabla}_{\theta}^{e}J({\pi}_{\theta})$ respectively. They also enable us to analyze the bias-variance tradeoff involved in using the interpolation merging and two-step merging methods, giving theoretical clues about their effectiveness. Moreover, despite being simple, ${J}^{c}({\pi}_{\theta})$ and ${J}^{e}({\pi}_{\theta})$ remain challenging optimization targets in the literature [50, 33].

In (12), ${c}_{1}\sim \mathcal{N}(0,{\mathrm{\Sigma}}_{1})$ since ${\nabla}_{\theta}^{c}J({\pi}_{\theta})$ is considered unbiased in DDPG and TD3. On the other hand, ${c}_{2}\sim \mathcal{N}(\epsilon,{\mathrm{\Sigma}}_{2})$ with $\epsilon\ne 0$ that reflects the bias involved in estimating ${\nabla}_{\theta}^{e}J({\pi}_{\theta})$. Both ${\mathrm{\Sigma}}_{1}$ and ${\mathrm{\Sigma}}_{2}$ are diagonal matrices with positive diagonal elements. Let $Diag(\mathrm{\Sigma})$ denote the vector of the diagonal elements of matrix $\mathrm{\Sigma}$. Since the variance involved in estimating ${\nabla}_{\theta}^{e}J({\pi}_{\theta})$ is assumed to be far less than that of ${\nabla}_{\theta}^{c}J({\pi}_{\theta})$, we have $Diag({\mathrm{\Sigma}}_{1})\gg Diag({\mathrm{\Sigma}}_{2})$. Matrix $A$ is also diagonal with non-negative diagonal elements. It controls the sensitivity of $J({\pi}_{\theta})$ with respect to the DNN model of ${\pi}_{\theta}$. Given that the DNN model is identical in this analysis regardless of which learning rule in (11) is used, $A$ remains the same for both ${J}^{c}({\pi}_{\theta})$ and ${J}^{e}({\pi}_{\theta})$ in (12). Based on (12), the key findings of our noisy quadratic analysis have been summarized in Proposition 1 below. A proof of this proposition can be found in Appendix A.
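To make the setup concrete, the following Python sketch simulates a one-dimensional instance of the noisy quadratic models in (12) under the three learning rules in (11). It is only an illustration: the scalar values chosen for $A$, $\alpha$, $\upsilon$, the bias, ${\mathrm{\Sigma}}_{1}$ and ${\mathrm{\Sigma}}_{2}$ are assumptions made here, not settings taken from the paper, and the function names are hypothetical.

```python
import random

# Illustrative scalar settings (assumptions, not values from the paper).
A = 1.0          # curvature shared by both quadratic models
ALPHA = 0.1      # learning rate
UPSILON = 0.25   # weight factor
EPS = 0.5        # bias of the elite model (mean of c2)
SIGMA1 = 1.0     # variance of c1 (conventional DPG)
SIGMA2 = 0.01    # variance of c2 (elite DPG), much smaller than SIGMA1


def grad_c(theta, rng):
    """Noisy gradient of J^c in (12): -A (theta - c1), c1 ~ N(0, SIGMA1)."""
    c1 = rng.gauss(0.0, SIGMA1 ** 0.5)
    return -A * (theta - c1)


def grad_e(theta, rng):
    """Noisy gradient of J^e in (12): -A (theta - c2), c2 ~ N(EPS, SIGMA2)."""
    c2 = rng.gauss(EPS, SIGMA2 ** 0.5)
    return -A * (theta - c2)


def update(theta, rule, rng):
    """One iteration of the chosen learning rule from (11)."""
    g_c = grad_c(theta, rng)
    if rule == "c":
        return theta + ALPHA * g_c
    if rule == "IM":
        # Interpolation merging (9): both gradients evaluated at theta.
        merged = (1.0 - UPSILON) * g_c + UPSILON * grad_e(theta, rng)
        return theta + ALPHA * merged
    if rule == "2M":
        # Two-step merging (10): elite gradient evaluated at theta'.
        theta_prime = theta + ALPHA * (1.0 - UPSILON) * g_c
        merged = (1.0 - UPSILON) * g_c + UPSILON * grad_e(theta_prime, rng)
        return theta + ALPHA * merged
    raise ValueError(f"unknown rule: {rule}")


def simulate(rule, n_iters=100_000, burn_in=1_000, seed=0):
    """Return the empirical mean and variance of theta after burn-in."""
    rng = random.Random(seed)
    theta = 1.0
    kept = []
    for i in range(n_iters):
        theta = update(theta, rule, rng)
        if i >= burn_in:
            kept.append(theta)
    mean = sum(kept) / len(kept)
    var = sum((x - mean) ** 2 for x in kept) / len(kept)
    return mean, var


if __name__ == "__main__":
    for rule in ("c", "IM", "2M"):
        mean, var = simulate(rule)
        print(f"{rule:>2}: mean={mean:.4f} var={var:.4f}")
```

In such a run, the conventional rule should hover around zero with the largest spread, while the two merged rules settle near a small non-zero value with a visibly smaller spread.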

###### Proposition 1

Based on the noisy quadratic models for ${\nabla}_{\theta}^{c}J({\pi}_{\theta})$ and ${\nabla}_{\theta}^{e}J({\pi}_{\theta})$ in (12) and provided that $\alpha $ is sufficiently small such that the expected parameter updates are contraction mappings (see Appendix A), policy parameters trained by using the three learning rules in (11) have the following convergence properties in expectation:

$$\begin{array}{cc}& \underset{i\to \mathrm{\infty}}{lim}\mathbb{E}[{\theta}_{i}^{c}]=0,\hfill \\ & \underset{i\to \mathrm{\infty}}{lim}\mathbb{E}[{\theta}_{i}^{IM}]=\upsilon \epsilon,\hfill \\ & \underset{i\to \mathrm{\infty}}{lim}\mathbb{E}[{\theta}_{i}^{2M}]=\alpha \upsilon {\left[I-(I-\alpha \upsilon A)(I-\alpha (1-\upsilon )A)\right]}^{-1}A\epsilon.\hfill \end{array}$$ |

Moreover, the variances of trained policy parameters converge to the following fixed points:

$$\begin{array}{cc}& \underset{i\to \mathrm{\infty}}{lim}\mathbb{V}[{\theta}_{i}^{c}]={\alpha}^{2}{\left[I-{(I-\alpha A)}^{2}\right]}^{-1}{A}^{2}{\mathrm{\Sigma}}_{1},\hfill \\ & \underset{i\to \mathrm{\infty}}{lim}\mathbb{V}[{\theta}_{i}^{IM}]={\alpha}^{2}{\left[I-{(I-\alpha A)}^{2}\right]}^{-1}{A}^{2}\left({(1-\upsilon )}^{2}{\mathrm{\Sigma}}_{1}+{\upsilon}^{2}{\mathrm{\Sigma}}_{2}\right),\hfill \\ & \underset{i\to \mathrm{\infty}}{lim}\mathbb{V}[{\theta}_{i}^{2M}]={\alpha}^{2}{\left[I-{(I-\alpha \upsilon A)}^{2}{(I-\alpha (1-\upsilon )A)}^{2}\right]}^{-1}{A}^{2}\left({(1-\upsilon )}^{2}{(I-\alpha \upsilon A)}^{2}{\mathrm{\Sigma}}_{1}+{\upsilon}^{2}{\mathrm{\Sigma}}_{2}\right).\hfill \end{array}$$ |

Proposition 1 indicates that bias exists while training ${\theta}^{IM}$ and ${\theta}^{2M}$ according to (11) due to biased estimation of elite DPG ${\nabla}_{\theta}^{e}J({\pi}_{\theta})$. However, the level of bias can be controlled via $\upsilon $. This suggests that we can gradually diminish $\upsilon $ during the learning process to eventually reduce the bias to its minimum. Having said that, our experiments show that it is safe to use fixed $\upsilon $ without hurting learning performance. Meanwhile, a comparison between $\mathbb{E}[{\theta}^{IM}]$ and $\mathbb{E}[{\theta}^{2M}]$ confirms that ${\theta}^{IM}$ enjoys less bias than ${\theta}^{2M}$. However, this comes at the expense of relatively higher variance. In fact, because $Diag({\mathrm{\Sigma}}_{2})\ll Diag({\mathrm{\Sigma}}_{1})$, it can be easily verified that $\mathbb{V}[{\theta}_{i}^{IM}]>\mathbb{V}[{\theta}_{i}^{2M}]$ for $0<\upsilon <1$. Meanwhile $\mathbb{V}[{\theta}_{i}^{IM}]<\mathbb{V}[{\theta}_{i}^{c}]$. Therefore it can be concluded that the three learning rules in (11) provide varied bias-variance tradeoffs while training policy ${\pi}_{\theta}$. Their effectiveness will be experimentally studied in Section 5.
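The limits in Proposition 1 can be evaluated directly in the scalar case, where every matrix reduces to a number. The sketch below does this for one set of illustrative values (all of them assumptions chosen here, not settings from the paper) and is meant only as a way to inspect how bias and variance trade off across the three learning rules.

```python
def proposition1_limits(alpha, A, upsilon, eps, sigma1, sigma2):
    """Scalar version of the limiting means and variances in Proposition 1."""
    a = alpha * A
    # Contraction factor of the two-step rule: (1 - alpha*ups*A)(1 - alpha*(1-ups)*A).
    rho_2m = (1.0 - upsilon * a) * (1.0 - (1.0 - upsilon) * a)
    mean = {
        "c": 0.0,
        "IM": upsilon * eps,
        "2M": alpha * upsilon * A * eps / (1.0 - rho_2m),
    }
    base = a ** 2 / (1.0 - (1.0 - a) ** 2)
    var = {
        "c": base * sigma1,
        "IM": base * ((1.0 - upsilon) ** 2 * sigma1 + upsilon ** 2 * sigma2),
        "2M": a ** 2
        * ((1.0 - upsilon) ** 2 * (1.0 - upsilon * a) ** 2 * sigma1
           + upsilon ** 2 * sigma2)
        / (1.0 - rho_2m ** 2),
    }
    return mean, var


if __name__ == "__main__":
    # Illustrative values only (assumed for this example).
    mean, var = proposition1_limits(
        alpha=0.1, A=1.0, upsilon=0.25, eps=0.5, sigma1=1.0, sigma2=0.01
    )
    for rule in ("c", "IM", "2M"):
        print(f"{rule:>2}: bias={mean[rule]:.5f} variance={var[rule]:.5f}")
```

For these particular values the bias grows from the conventional rule to interpolation merging to two-step merging, while the variance shrinks in the same order.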

### 4.3 Twin-delayed DDPG with Merged Deterministic Policy Gradient

Driven by the learning rules introduced in (11) for updating ${\theta}^{IM}$ and ${\theta}^{2M}$, two new DRL algorithms based on TD3 can be developed. We name them TD3-IM and TD3-2M respectively. Algorithm 4.3 summarizes the execution procedure of the two algorithms as well as TD3. As shown in Algorithm 4.3, TD3, TD3-IM and TD3-2M adopt the same method to train Q-networks. Since Q-network training follows exactly the techniques developed in [7], the details are omitted. Meanwhile, only TD3-IM and TD3-2M need to maintain the elite ERB for the purpose of estimating the elite DPG. The two algorithms differ in the DPG merging method used. Other parts of the three algorithms remain identical.

###### Algorithm 4.3

- Input: Initial policy network ${\pi}_{\theta}$, Q-network ${Q}_{\omega}$, and the corresponding target networks, the VAE generative model, full ERB and elite ERB
- repeat
  - Observe current state $s$ and sample a noisy action $a$ according to policy $\beta (s)$ in (6).
  - Execute action $a$ in the environment.
  - Observe state transition $(s,{s}^{\prime},a,r)$ and keep the sampled transition in the full ERB.
  - if end of trajectory is reached:
    - Update the elite ERB and reset the environment.
  - if time to update:
    - for $j$ in range(number of updates):
      - Sample batch ${\mathcal{B}}^{f}$ from full ERB.
      - Sample batch ${\mathcal{B}}^{e}$ from elite ERB.
      - Use TD3 to train ${Q}_{\omega}$ based on ${\mathcal{B}}^{f}$.
      - Train the VAE generative model for TD3-IM and TD3-2M based on ${\mathcal{B}}^{e}$.
      - if $j\%\text{policy\_delay}=0$:
        - Estimate ${\nabla}_{\theta}^{c}J({\pi}_{\theta})$ on ${\mathcal{B}}^{f}$.
        - TD3: Update ${\theta}^{c}$ according to (11).
        - TD3-IM: Estimate ${\nabla}_{\theta}^{e}J({\pi}_{\theta})$ based on ${\mathcal{B}}^{e}$ and merge ${\nabla}_{\theta}^{c}J({\pi}_{\theta})$ and ${\nabla}_{\theta}^{e}J({\pi}_{\theta})$ via (9). Update ${\theta}^{IM}$ according to (11).
        - TD3-2M: Compute ${\theta}^{\prime}$ according to (10) and estimate ${\nabla}_{{\theta}^{\prime}}^{e}J({\pi}_{{\theta}^{\prime}})$ based on ${\mathcal{B}}^{e}$. Merge ${\nabla}_{\theta}^{c}J({\pi}_{\theta})$ and ${\nabla}_{{\theta}^{\prime}}^{e}J({\pi}_{{\theta}^{\prime}})$ via (10) and update ${\theta}^{2M}$ according to (11).
        - Update target policy networks and target Q-networks.
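The control flow of the inner update loop can be sketched in Python as follows. This is a schematic only: the `hooks` dictionary and its entries are hypothetical placeholders for the actual training routines, batch sampling is omitted, and the nesting of the target-network update inside the delayed branch follows the usual TD3 convention, which is assumed here.

```python
def run_update_phase(variant, num_updates, policy_delay, hooks):
    """Run the inner update loop for one variant: 'TD3', 'TD3-IM' or 'TD3-2M'.

    `hooks` maps names to zero-argument callables (hypothetical stand-ins
    for the real critic, VAE, policy and target-network updates).
    """
    for j in range(num_updates):
        hooks["train_q"]()                  # critic update on the full-ERB batch
        if variant in ("TD3-IM", "TD3-2M"):
            hooks["train_vae"]()            # VAE update on the elite-ERB batch
        if j % policy_delay == 0:
            hooks["update_policy"]()        # learning rule from (11) for this variant
            hooks["update_targets"]()       # target policy and target Q-networks


def count_calls(variant, num_updates=10, policy_delay=2):
    """Count how often each hook fires, using counting stubs as hooks."""
    counts = {"train_q": 0, "train_vae": 0, "update_policy": 0, "update_targets": 0}

    def make_hook(name):
        def hook():
            counts[name] += 1
        return hook

    hooks = {name: make_hook(name) for name in counts}
    run_update_phase(variant, num_updates, policy_delay, hooks)
    return counts


if __name__ == "__main__":
    for variant in ("TD3", "TD3-IM", "TD3-2M"):
        print(variant, count_calls(variant))
```

With ten updates and a policy delay of two, the critic is trained ten times while the policy and the target networks are updated five times; the VAE is trained only for the two merged variants.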

## 5 Experiments

Experiments have been performed on six benchmark control tasks, including Ant, Half Cheetah, Hopper, Lunar Lander, Walker2D and Bipedal Walker. We adopt a popular implementation of these benchmarks provided by OpenAI GYM (https://gym.openai.com) and powered by the PyBullet physics engine [41]. Many previous studies utilized a different version of these benchmarks driven by the MuJoCo physics engine with varied system dynamics and reward schemes [44]. We prefer to use the PyBullet version since the PyBullet software package is publicly available with increasing popularity, whereas using MuJoCo is restricted to its license holders. Moreover PyBullet benchmarks are widely considered to be tougher to solve than their MuJoCo counterparts [41].

There are six main competing algorithms in our experiments. Two of them, TD3-IM and TD3-2M, are newly developed in this paper. The remaining four, i.e., IPG, TD3, SAC and PPO, are cutting-edge DRL algorithms with leading performance on many continuous action benchmarks, including the six problems to be investigated in this section. Besides the six algorithms, experimental evaluations will also be conducted on two variations of TD3-2M (and TD3-IM). Particularly, one variation enables us to compare the performance of using either (7) or (8) to estimate elite DPG, giving empirical evidence of the practical usefulness of adopting noisy sampled states and the VAE generative model to produce reference actions in the regularizer. To avoid confusion, we note that both TD3-IM and TD3-2M use (8) to estimate elite DPG by default. The special case of using (7) will be highlighted whenever necessary. Another variation allows us to verify whether the learning performance will be noticeably affected after dropping the regularizer component completely from (8).

Our experiments rely on the high-quality open source implementations of TD3, SAC and PPO provided by OpenAI Spinning Up (https://spinningup.openai.com/en/latest/). The TD3 code is also expanded to implement TD3-IM and TD3-2M according to Algorithm 4.3. Moreover, we adopt the reference implementation of IPG published by its inventors (https://github.com/shaneshixiang/rllabplusplus). In our experiments, each learning episode contains 1000 sampled state transitions. A learning algorithm can learn through 1000 episodes to find the best possible policy. To evaluate the learning performance reliably, 10 independent tests of each learning algorithm have been performed on every benchmark with 10 random seeds.

We follow closely [7] to determine the hyper-parameter settings of TD3. Meanwhile, [14], [37] and [12] provide guidance on hyper-parameter settings of SAC, PPO and IPG respectively. Although we prefer to use hyper-parameter settings recommended by the respective algorithm inventors, efforts have also been made to fine-tune some highly sensitive hyper-parameters of these algorithms to ensure that good learning performance can be obtained whenever possible. Since TD3-IM and TD3-2M are derived from TD3, the two new algorithms follow the hyper-parameter settings of TD3. TD3-IM and TD3-2M also introduce two additional hyper-parameters, i.e., the regularization factor $\lambda $ in (7) and (8) and the weight factor $\upsilon $ in (9) and (10). The influence of $\lambda $ and $\upsilon $ on learning performance will be empirically studied in this section. For more information on hyper-parameter settings of all competing algorithms, please refer to Appendix B.

Figure 1 compares the learning performance of all algorithms on six benchmark problems. As evidenced in this figure, TD3-2M consistently achieved the best performance on all benchmarks. Specifically, on three problems, i.e., Half Cheetah, Walker2D and Bipedal Walker, the final performance of TD3-2M is much better than that of the other algorithms. Meanwhile, on problems such as Ant and Lunar Lander, TD3-2M achieved performance similar to TD3-IM and outperformed the rest. On Hopper, the two new algorithms also managed to achieve slightly better performance than SAC and TD3. Furthermore, by comparing TD3-IM and TD3-2M with TD3 as the baseline algorithm, we found that TD3-IM and TD3-2M either outperformed or performed similarly to TD3. This observation confirms that merging different DPG estimations with varied bias-variance tradeoff offers a clear performance advantage over an algorithm that trains policy networks with conventional DPG alone. We also noticed that TD3-2M performed more effectively than TD3-IM on all benchmarks. This suggests that two-step merging can nurture a more desirable bias-variance tradeoff for policy training than interpolation merging. On the other hand, interpolation merging is more intuitive to use and has also been utilized to merge on-policy PG with off-policy PG in IPG [12]. Despite using the same interpolation merging method, by merging conventional and elite DPGs, TD3-IM can significantly outperform IPG on most of the benchmarks.

In Figure 1, the performance of TD3-IM and TD3-2M was obtained by setting $\lambda $ to 0.1 and $\upsilon $ to 0.25. In addition, we have examined other settings of the two hyper-parameters. As demonstrated in Figure 2, on Half Cheetah, the performance of TD3-2M is not heavily influenced by $\upsilon $. Nevertheless, we prefer to set $\upsilon $ to a smaller value such as 0.25 since the corresponding performance curve appears to be smoother. In comparison, when setting $\upsilon $ to a larger value such as 0.5 or 0.75, the learning performance may temporarily drop (e.g., the performance curve between learning episodes 400 and 600 when $\upsilon =0.5$) or stagnate (e.g., the performance curve between learning episodes 600 and 800 when $\upsilon =0.75$). This is likely to be caused by focusing too much on elite trajectories during policy training. Meanwhile, the influence of $\lambda $ seems to be stronger than that of $\upsilon $. With $\lambda =0.2$, TD3-2M achieved clearly better performance than with $\lambda =0.1$ or $\lambda =0.4$. Nevertheless, the performance may noticeably drop for a short while (e.g., between learning episodes 500 and 600 with $\lambda =0.2$). For reliable learning, we prefer to set $\lambda $ to 0.1 although further performance improvement is possible as evidenced in Figure 2. While we studied only the influence of hyper-parameters $\lambda $ and $\upsilon $ on Half Cheetah in Figure 2, similar observations have been made on other benchmarks. Meanwhile, we found that TD3-IM can achieve good performance by using the same settings of $\lambda $ and $\upsilon $ as TD3-2M.

Figure 3(a) compares the performance of TD3-2M upon using either (7) or (8) to estimate elite DPG. Our experiments clearly show the importance of using noisy sampled states and VAE-generated reference actions in (8). In particular, with the help of (8), steady performance improvement can be realized throughout the entire learning process. In comparison, learning slows down quickly after the initial phase when (7) is adopted by TD3-2M, suggesting that this algorithm version is more vulnerable to the bias and action sampling noise involved in using the elite trajectories directly. Meanwhile Figure 3(b) studies the usefulness of the regularizer term in (8). Particularly, without the regularizer, learning appears to be more biased, resulting in slow improvement. On the other hand, the regularizer mitigates the impact of learning bias, allowing an RL agent to leverage its past success to achieve better performance. These observations have been made consistently on other benchmarks too.

## 6 Conclusion

In this paper we studied an important research question on how to estimate and merge DPGs with varied bias-variance tradeoff for effective DRL. Driven by this question, we introduced the elite DPG which will be estimated differently from conventional DPG by using an elitism mechanism. To cope with the potential learning bias caused by using elite DPG, we proposed a model-driven policy consolidation technique with the aim to extract policy behavioral knowledge from the elite trajectories to build a VAE generative model. This model was further employed to support noisy sampled states and regularized policy training. Meanwhile, we have theoretically and experimentally studied two DPG merging methods, i.e., interpolation merging and two-step merging. Both merging methods can be employed by a deep deterministic policy gradient algorithm such as TD3 to balance the bias-variance tradeoff during policy training. On six popularly used benchmark control tasks, we have also shown that these merging methods can noticeably improve the learning performance of TD3, significantly outperforming several state-of-the-art DRL algorithms such as PPO, SAC, IPG and TD3. While interpolation merging and two-step merging were exploited to merge DPGs in this paper, their efficacy in merging various PGs (e.g., on-policy PGs and off-policy PGs) in a wider context deserves further investigation. Moreover, the basic techniques we developed to compute elite DPG may possibly be extended to estimate PGs with respect to stochastic policies as well. More efforts are needed to explore the benefits of using such extended techniques in various PG algorithms.

## References

- [1] (2019) Understanding the impact of entropy on policy optimization. In International Conference on Machine Learning, pp. 151–160.
- [2] (2017) Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 176–185.
- [3] (2001) Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research 15, pp. 319–350.
- [4] (2009) Natural actor-critic algorithms. Journal Automatica 45 (11), pp. 2471–2482.
- [5] (2013) A survey on policy search for robotics. Foundations and Trends in Robotics 2 (1–2), pp. 1–142.
- [6] (2016) Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
- [7] (2018) Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477.
- [8] (2019) A theory of regularized markov decision processes. arXiv preprint arXiv:1901.11275.
- [9] (2012) A survey of actor-critic reinforcement learning: standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42 (6), pp. 1291–1307.
- [10] (2016) Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In IEEE International Conference on Robotics and Automation.
- [11] (2016) Q-prop: sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247.
- [12] (2017) Interpolated policy gradient: merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in neural information processing systems, pp. 3846–3855.
- [13] (2017) Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165.
- [14] (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In 35th International Conference on Machine Learning.
- [15] (2010) Double Q-learning. In Advances in Neural Information Processing Systems, pp. 2613–2621.
- [16] (2016) Deep Reinforcement Learning with Double Q-Learning. In Thirtieth AAAI Conference on Artificial Intelligence, Vol. 16, pp. 2094–2100.
- [17] (2017) Deep reinforcement learning that matters. CoRR abs/1709.06560.
- [18] (2018) Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933.
- [19] (2018) Evolved policy gradients. In Advances in Neural Information Processing Systems, pp. 5400–5409.
- [20] (2018) Continual reinforcement learning with complex synapses. arXiv preprint arXiv:1802.07239.
- [21] (2019) Policy consolidation for continual reinforcement learning. arXiv preprint arXiv:1902.00255.
- [22] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [23] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- [24] (2014) Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pp. 3014–3022.
- [25] (2016) Asynchronous methods for deep reinforcement learning. In 33rd International Conference on Machine Learning.
- [26] (2016) Safe and efficient off-policy reinforcement learning. In 30th Conference on Neural Information Processing Systems.
- [27] (2018) Smoothed action value functions for learning gaussian policies. arXiv preprint arXiv:1803.02348.
- [28] (2017) Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2772–2782.
- [29] (2017) Combining policy gradient and Q-learning. arXiv preprint arXiv:1611.01626.
- [30] (2016) Deep exploration via bootstrapped DQN. In 30th Conference on Neural Information Processing Systems.
- [31] (2018) Stochastic variance-reduced policy gradient. arXiv preprint arXiv:1806.05618.
- [32] (2019) Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv preprint arXiv:1903.08254.
- [33] (2013) No more pesky learning rates. In International Conference on Machine Learning, pp. 343–351.
- [34] (2019) Importance resampling for off-policy prediction. arXiv preprint arXiv:1906.04328.
- [35] (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
- [36] (2015) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
- [37] (2017) Proximal policy optimization algorithms. CoRR.
- [38] (2019) Regularized anderson acceleration for off-policy deep reinforcement learning. arXiv preprint arXiv:1909.03245.
- [39] (2014) Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on International Conference on Machine Learning.
- [40] (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063.
- [41] (2018) Sim-to-real: learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332.
- [42] (2017) Distral: robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4496–4506.
- [43] (2016) Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148.
- [44] (2012) Mujoco: a physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033.
- [45] (2018) The mirage of action-dependent baselines in reinforcement learning. arXiv preprint arXiv:1802.10031.
- [46] (2016) Sample efficient actor-critic with experience replay. CoRR abs/1611.01224.
- [47] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256.
- [48] (2017) Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In 31st Conference on Neural Information Processing Systems.
- [49] (2018) Understanding short-horizon bias in stochastic meta-optimization. arXiv preprint arXiv:1803.02021.
- [50] (2019) Lookahead optimizer: k steps forward, 1 step back. arXiv preprint arXiv:1907.08610.

## Appendix A

This appendix presents a proof of Proposition 1. We start with the learning rule for updating ${\theta}^{c}$ in (11). Considering the $i$-th learning iteration with arbitrary $i\ge 1$ and taking the expectation on both sides of the learning rule, we obtain

$$\mathbb{E}\left[{\theta}_{i+1}^{c}\right]=(I-\alpha A)\mathbb{E}\left[{\theta}_{i}^{c}\right].$$ |

Given the assumption that $\alpha $ is sufficiently small, the equation above is a *contraction mapping* from $\mathbb{E}\left[{\theta}_{i}^{c}\right]$ to $\mathbb{E}\left[{\theta}_{i+1}^{c}\right]$. Hence $\mathbb{E}\left[{\theta}_{i}^{c}\right]$ converges to ${\theta}^{*}=0$ as $i$ approaches $\mathrm{\infty}$. Therefore ${\theta}^{c}$ is trained bias-free. Furthermore,

$$\begin{array}{cc}\hfill \mathbb{V}\left[{\theta}_{i+1}^{c}\right]& =\mathbb{E}\left[{\left((I-\alpha A)({\theta}_{i}^{c}-\mathbb{E}[{\theta}_{i}^{c}])+\alpha A{c}_{1}\right)}^{2}\right]\hfill \\ & ={(I-\alpha A)}^{2}\mathbb{V}\left[{\theta}_{i}^{c}\right]+{\alpha}^{2}{A}^{2}{\mathrm{\Sigma}}_{1}.\hfill \end{array}$$ |

With $i$ approaching $\mathrm{\infty}$, the fixed point of $\mathbb{V}\left[{\theta}_{\mathrm{\infty}}^{c}\right]$ must satisfy

$$\mathbb{V}\left[{\theta}_{\mathrm{\infty}}^{c}\right]={(I-\alpha A)}^{2}\mathbb{V}\left[{\theta}_{\mathrm{\infty}}^{c}\right]+{\alpha}^{2}{A}^{2}{\mathrm{\Sigma}}_{1}.$$ |

Solving the above equation gives rise to

$$\mathbb{V}\left[{\theta}_{\mathrm{\infty}}^{c}\right]={\alpha}^{2}{\left[I-{(I-\alpha A)}^{2}\right]}^{-1}{A}^{2}{\mathrm{\Sigma}}_{1}.$$ |
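As a reading aid, this fixed point can be simplified coordinate by coordinate because $A$ and ${\mathrm{\Sigma}}_{1}$ are diagonal. Writing $a$ and ${\sigma}_{1}$ for matching diagonal elements of $A$ and ${\mathrm{\Sigma}}_{1}$ (symbols introduced only for this illustration) and assuming $a>0$, the corresponding element of the limiting variance is

$$\frac{{\alpha}^{2}{a}^{2}}{1-{(1-\alpha a)}^{2}}{\sigma}_{1}=\frac{{\alpha}^{2}{a}^{2}}{\alpha a(2-\alpha a)}{\sigma}_{1}=\frac{\alpha a}{2-\alpha a}{\sigma}_{1}.$$

For small $\alpha a$ this is approximately $\frac{1}{2}\alpha a{\sigma}_{1}$, so in this model the stationary variance of ${\theta}^{c}$ shrinks roughly linearly with the learning rate.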

Following the same procedure, we can proceed to analyze the learning rule for ${\theta}^{IM}$ in (11). Specifically,

$$\mathbb{E}\left[{\theta}_{i+1}^{IM}\right]=(I-\alpha A)\mathbb{E}\left[{\theta}_{i}^{IM}\right]+\alpha \upsilon A\epsilon.$$ |

We can re-write the above as

$$\mathbb{E}\left[{\theta}_{i+1}^{IM}\right]-\upsilon \epsilon=(I-\alpha A)\left(\mathbb{E}\left[{\theta}_{i}^{IM}\right]-\upsilon \epsilon\right).$$ |

This is a contraction mapping from $\left(\mathbb{E}\left[{\theta}_{i}^{IM}\right]-\upsilon \epsilon\right)$ to $\left(\mathbb{E}\left[{\theta}_{i+1}^{IM}\right]-\upsilon \epsilon\right)$ that converges to 0. As a result, when $i\to \mathrm{\infty}$, $\mathbb{E}\left[{\theta}_{i}^{IM}\right]$ approaches $\upsilon \epsilon$. Meanwhile, we can determine $\mathbb{V}\left[{\theta}_{\mathrm{\infty}}^{IM}\right]$ as follows:

$$\begin{array}{cc}\hfill \mathbb{V}\left[{\theta}_{i+1}^{IM}\right]& =\mathbb{E}\left[{\left((I-\alpha A)({\theta}_{i}^{IM}-\mathbb{E}[{\theta}_{i}^{IM}])+\alpha (1-\upsilon )A{c}_{1}+\alpha \upsilon A({c}_{2}-\epsilon)\right)}^{2}\right]\hfill \\ & ={(I-\alpha A)}^{2}\mathbb{V}\left[{\theta}_{i}^{IM}\right]+{\alpha}^{2}{(1-\upsilon )}^{2}{A}^{2}{\mathrm{\Sigma}}_{1}+{\alpha}^{2}{\upsilon}^{2}{A}^{2}{\mathrm{\Sigma}}_{2}.\hfill \end{array}$$ |

By solving this equation, we obtain

$$\mathbb{V}[{\theta}_{\mathrm{\infty}}^{IM}]={\alpha}^{2}{\left[I-{(I-\alpha A)}^{2}\right]}^{-1}{A}^{2}\left({(1-\upsilon )}^{2}{\mathrm{\Sigma}}_{1}+{\upsilon}^{2}{\mathrm{\Sigma}}_{2}\right).$$ |
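The recursions used for ${\theta}^{IM}$ follow from differentiating the quadratic models in (12), which gives ${\nabla}_{\theta}{J}^{c}=-A(\theta -{c}_{1})$ and ${\nabla}_{\theta}{J}^{e}=-A(\theta -{c}_{2})$. The intermediate step, spelled out here as a reading aid rather than as part of the original proof, is

$$\begin{aligned}{\nabla}_{\theta}^{IM}J({\pi}_{\theta})&=-(1-\upsilon )A(\theta -{c}_{1})-\upsilon A(\theta -{c}_{2})=-A\theta +(1-\upsilon )A{c}_{1}+\upsilon A{c}_{2},\\ {\theta}_{i+1}^{IM}&=(I-\alpha A){\theta}_{i}^{IM}+\alpha (1-\upsilon )A{c}_{1}+\alpha \upsilon A{c}_{2}.\end{aligned}$$

Taking expectations with $\mathbb{E}[{c}_{1}]=0$ and $\mathbb{E}[{c}_{2}]=\epsilon$ yields the mean recursion, and subtracting the mean leaves the zero-mean noise terms $\alpha (1-\upsilon )A{c}_{1}$ and $\alpha \upsilon A({c}_{2}-\epsilon)$ that appear in the variance recursion.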

Finally the analysis on the learning rule for ${\theta}^{2M}$ in (11) can be conducted in the same way. Particularly,

$$\mathbb{E}\left[{\theta}_{i+1}^{2M}\right]=(I-\alpha \upsilon A)(I-\alpha (1-\upsilon )A)\mathbb{E}\left[{\theta}_{i}^{2M}\right]+\alpha \upsilon A\epsilon.$$ |

Letting $\mathrm{\Omega}={\left[I-(I-\alpha \upsilon A)(I-\alpha (1-\upsilon )A)\right]}^{-1}\alpha \upsilon $, the above can be re-written as

$$\begin{array}{cc}\hfill \mathbb{E}\left[{\theta}_{i+1}^{2M}\right]-\mathrm{\Omega}A\epsilon& =(I-\alpha \upsilon A)(I-\alpha (1-\upsilon )A)\mathbb{E}\left[{\theta}_{i}^{2M}\right]+\alpha \upsilon A\epsilon-\mathrm{\Omega}A\epsilon\hfill \\ & =(I-\alpha \upsilon A)(I-\alpha (1-\upsilon )A)\left(\mathbb{E}\left[{\theta}_{i}^{2M}\right]-\mathrm{\Omega}A\epsilon\right).\hfill \end{array}$$ |

Clearly, under the same assumption on $\alpha $, the product $(I-\alpha \upsilon A)(I-\alpha (1-\upsilon )A)$ is also contractive and the mapping above is hence a contraction. This implies that $\mathbb{E}\left[{\theta}_{i}^{2M}\right]$ converges to $\mathrm{\Omega}A\epsilon$ with $i\to \mathrm{\infty}$. In the meantime, the fixed point of $\mathbb{V}\left[{\theta}^{2M}\right]$ can be determined by solving the equation below:

$$\begin{array}{cc}\hfill \mathbb{V}\left[{\theta}_{i+1}^{2M}\right]& =\mathbb{E}\left[{\left(\begin{array}{c}(I-\alpha \upsilon A)(I-\alpha (1-\upsilon )A)({\theta}_{i}^{2M}-\mathbb{E}[{\theta}_{i}^{2M}])+\alpha (1-\upsilon )A(I-\alpha \upsilon A){c}_{1}\hfill \\ +\alpha \upsilon A({c}_{2}-\epsilon)\hfill \end{array}\right)}^{2}\right]\hfill \\ & ={(I-\alpha \upsilon A)}^{2}{(I-\alpha (1-\upsilon )A)}^{2}\mathbb{V}\left[{\theta}_{i}^{2M}\right]+{\alpha}^{2}{(1-\upsilon )}^{2}{A}^{2}{(I-\alpha \upsilon A)}^{2}{\mathrm{\Sigma}}_{1}+{\alpha}^{2}{\upsilon}^{2}{A}^{2}{\mathrm{\Sigma}}_{2}.\hfill \end{array}$$ |

Consequently, we have

$$\mathbb{V}[{\theta}_{\mathrm{\infty}}^{2M}]={\alpha}^{2}{\left[I-{(I-\alpha \upsilon A)}^{2}{(I-\alpha (1-\upsilon )A)}^{2}\right]}^{-1}{A}^{2}\left({(1-\upsilon )}^{2}{(I-\alpha \upsilon A)}^{2}{\mathrm{\Sigma}}_{1}+{\upsilon}^{2}{\mathrm{\Sigma}}_{2}\right).$$ |
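The product of the two contraction factors in the analysis of ${\theta}^{2M}$ comes from unrolling the two steps of (10) on the quadratic models in (12). As a reading aid (this expansion is reconstructed here from (10) and (12), not quoted from the original proof):

$$\begin{aligned}{\theta}^{\prime}&={\theta}_{i}^{2M}-\alpha (1-\upsilon )A({\theta}_{i}^{2M}-{c}_{1})=(I-\alpha (1-\upsilon )A){\theta}_{i}^{2M}+\alpha (1-\upsilon )A{c}_{1},\\ {\theta}_{i+1}^{2M}&={\theta}^{\prime}-\alpha \upsilon A({\theta}^{\prime}-{c}_{2})=(I-\alpha \upsilon A){\theta}^{\prime}+\alpha \upsilon A{c}_{2}\\ &=(I-\alpha \upsilon A)(I-\alpha (1-\upsilon )A){\theta}_{i}^{2M}+\alpha (1-\upsilon )(I-\alpha \upsilon A)A{c}_{1}+\alpha \upsilon A{c}_{2}.\end{aligned}$$

The second line uses the fact that the update in (11) with the merged gradient (10) equals ${\theta}^{\prime}+\alpha \upsilon {\nabla}_{{\theta}^{\prime}}^{e}J({\pi}_{{\theta}^{\prime}})$. Because the conventional noise ${c}_{1}$ passes through the extra factor $(I-\alpha \upsilon A)$, its contribution to the variance is damped relative to interpolation merging.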

## Appendix B

This appendix provides detailed information about hyper-parameter settings of four competing algorithms, i.e., SAC, PPO, TD3 and IPG, which have been evaluated empirically in Section 5. We do not discuss hyper-parameter settings of TD3-IM and TD3-2M because they adopt the same settings as TD3. For all algorithms, the value function network and the policy network have been implemented as DNNs with two hidden layers. A total of 128 ReLU hidden units have been deployed into each hidden layer. This is a commonly used network architecture that has been exploited to solve successfully many challenging continuous action benchmarks [14]. The Adam optimizer was also used consistently to train both the value function network and the policy network in all competing algorithms [22].

For SAC, the reward discount factor $\gamma =0.99$. The maximum size of the replay buffer is 1M samples. We checked several different learning rate settings ranging from 0.0001 to 0.003 and found that the algorithm can achieve reliable performance over all benchmarks by setting the learning rate to 0.001. We also set the size of the batch of samples used for training the actor and the critic, i.e., $\parallel \mathcal{B}\parallel $, to 100. In fact SAC is not sensitive to this hyper-parameter and exhibited similar performance with other optional batch sizes including 64, 128 and 256. On the other hand, as demonstrated in [14], the performance of SAC is heavily influenced by the entropy regularization factor. Hence we tested several different settings of this hyper-parameter, including 0.2, 0.5, 1.0, 2.0, 5.0, and 10.0. The best performance of SAC achieved with respect to the most suitable setting of the entropy regularization factor on every benchmark has been reported in Section 5.

As in SAC, the reward discount factor in PPO is $\gamma =0.99$. Each learning iteration is performed on 4000 newly collected environment samples. The learning rate setting for PPO differs from that of SAC. In particular, for PPO to learn reliably, the learning rate for training the value function is set to 0.001, which is larger than the learning rate of 0.0003 used for training the policy. The clipping factor of PPO is set to either 0.1 or 0.2, depending on the resulting performance on each benchmark. Only the better of the two results has been reported in Section 5. Meanwhile, PPO in our experiments adopted the Generalized Advantage Estimation (GAE) technique introduced in [36] with hyper-parameter $\lambda =0.95$.
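To make the roles of $\gamma$, $\lambda$ and the clipping factor concrete, the following sketch computes GAE advantages and the standard PPO clipped surrogate term for a single sample. It is a minimal illustration under stated assumptions: the function names are hypothetical, the trajectory segment is assumed to contain no episode termination, and the code is not taken from the experiments reported in Section 5.

```python
GAMMA = 0.99       # reward discount factor, from the text
GAE_LAMBDA = 0.95  # GAE hyper-parameter for PPO, from the text


def gae_advantages(rewards, values, last_value,
                   gamma=GAMMA, lam=GAE_LAMBDA):
    """Compute GAE advantages for one segment without terminal states.

    rewards[t] and values[t] belong to step t; last_value is the value
    estimate of the state that follows the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference residual.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def ppo_clipped_term(ratio, advantage, clip=0.2):
    """Clipped surrogate objective for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


advantages = gae_advantages(rewards=[1.0, 1.0], values=[0.0, 0.0],
                            last_value=0.0)
```

Setting `clip` to 0.1 instead of 0.2 narrows the interval in which the probability ratio can influence the objective, which is the choice tuned per benchmark above.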

As for TD3, following [7], besides setting $\gamma $ to 0.99 and the maximum replay buffer size to 1M samples, the learning rate for both the actor and the critic has been set to 0.001. We checked other possible settings for the learning rate, ranging from 0.0001 to 0.003, and found that TD3 achieves the best tradeoff between learning performance and reliability when the learning rate equals 0.001. Unlike in SAC, the batch size in TD3 is 256. Setting the batch size to 100, as recommended in [7], does not noticeably reduce the average learning performance. Nevertheless, with a batch size of 256, TD3 appears to enjoy slightly better reliability. In addition to the above, we set the standard deviation for ${\mu}_{a}$ in (6) to 0.2. This setting can slightly improve learning reliability in comparison to the recommended setting of 0.1 [7]. Meanwhile, the standard deviation of the action sampling noise for the target action smoothing mechanism is set to 0.2, and the noise is clipped to the range between -0.5 and 0.5.
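The target action smoothing settings above can be illustrated with a short sketch. The noise standard deviation (0.2) and the clipping range (-0.5 to 0.5) come from the text; the function name, the use of Gaussian noise drawn per action dimension, and the action bounds of -1 and 1 are assumptions made for illustration only.

```python
import random

TARGET_NOISE_STD = 0.2   # from the text
TARGET_NOISE_CLIP = 0.5  # from the text: noise clipped to [-0.5, 0.5]


def smoothed_target_action(target_action, rng, low=-1.0, high=1.0):
    """Add clipped Gaussian noise to each dimension of a target action.

    The bounds low and high are hypothetical action limits.
    """
    smoothed = []
    for a in target_action:
        noise = rng.gauss(0.0, TARGET_NOISE_STD)
        # Clip the noise before adding it to the action.
        noise = max(-TARGET_NOISE_CLIP, min(TARGET_NOISE_CLIP, noise))
        # Keep the perturbed action inside the assumed action bounds.
        smoothed.append(max(low, min(high, a + noise)))
    return smoothed


rng = random.Random(0)
noisy_action = smoothed_target_action([0.0, 0.3, -0.9], rng)
```

Clipping the noise keeps the perturbed target action close to the original one, so the critic target is averaged over a small neighborhood of actions rather than over arbitrary ones.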

Most of the hyper-parameter settings for IPG are similar to those of PPO. Nevertheless, there are some notable differences. In particular, $\lambda =0.97$ for GAE, which was found to produce slightly better performance than setting $\lambda $ to 0.95. The batch size for policy and value function updates is 64. We tested several settings of the interpolation coefficient $\nu $, including 0.0, 0.2, 0.5 and 0.8. IPG appears to achieve consistently better performance when $\nu =0.2$. Experiment results corresponding to this setting have been reported in Section 5.
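For quick reference, the settings stated in this appendix can be collected into one configuration structure. The sketch below is a convenience summary only: the dictionary layout and all key names are hypothetical, and only values given explicitly in the text are listed. For IPG, the settings inherited from PPO are deliberately not filled in, since the text does not enumerate them.

```python
# Settings shared by all four algorithms, as stated in the text.
SHARED = {
    "hidden_layers": 2,
    "hidden_units": 128,
    "activation": "relu",
    "optimizer": "adam",
}

HYPERPARAMS = {
    "SAC": {
        "gamma": 0.99,
        "replay_buffer_size": 1_000_000,
        "learning_rate": 0.001,
        "batch_size": 100,
        # Tuned per benchmark; the best result is reported.
        "entropy_factor_candidates": [0.2, 0.5, 1.0, 2.0, 5.0, 10.0],
    },
    "PPO": {
        "gamma": 0.99,
        "samples_per_iteration": 4000,
        "value_learning_rate": 0.001,
        "policy_learning_rate": 0.0003,
        # Tuned per benchmark; the better result is reported.
        "clip_candidates": [0.1, 0.2],
        "gae_lambda": 0.95,
    },
    "TD3": {
        "gamma": 0.99,
        "replay_buffer_size": 1_000_000,
        "learning_rate": 0.001,
        "batch_size": 256,
        "mu_a_std": 0.2,  # standard deviation for mu_a in (6)
        "target_noise_std": 0.2,
        "target_noise_clip": 0.5,
    },
    "IPG": {
        # Other settings are similar to PPO and are not listed here.
        "gae_lambda": 0.97,
        "batch_size": 64,
        "interpolation_candidates": [0.0, 0.2, 0.5, 0.8],
        "interpolation_coefficient": 0.2,
    },
}

# TD3-IM and TD3-2M adopt the same settings as TD3.
HYPERPARAMS["TD3-IM"] = dict(HYPERPARAMS["TD3"])
HYPERPARAMS["TD3-2M"] = dict(HYPERPARAMS["TD3"])
```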