Abstract
In recent years significant progress has been made in dealing withchallenging problems using reinforcement learning.Despite its great success,reinforcement learning still faces challenge in continuous control tasks.Conventional methods always compute the derivatives of the optimal goal with acostly computation resources, and are inefficient, unstable and lack ofrobustness when dealing with such tasks. Alternatively, derivativebasedmethods treat the optimization process as a blackbox and show robustness andstability in learning continuous control tasks, but not data efficient inlearning. The combination of both methods so as to get the best of the both hasraised attention. However, most of the existing combination works adopt complexneural networks (NNs) as the policy for control. The doubleedged sword of deepNNs can yield better performance, but also makes it difficult for parametertuning and computation. To this end, in this paper we presents a novel methodcalled FiDiRL, which incorporates deep RL with FiniteDifference (FiDi) policysearch.FiDiRL combines Deep Deterministic Policy Gradients (DDPG)with AugmentRandom Search (ARS) and aims at improving the data efficiency of ARS. Theempirical results show that FiDiRL can improves the performance and stabilityof ARS, and provide competitive results against some existing deepreinforcement learning methods
Quick Read (beta)
FiDiRL: Incorporating Deep Reinforcement Learning with FiniteDifference Policy Search for Efficient Learning of Continuous Control
Abstract
In recent years significant progress has been made in dealing with challenging problems using reinforcement learning. Despite its great success, reinforcement learning still faces challenge in continuous control tasks. Conventional methods always compute the derivatives of the optimal goal with a costly computation resources, and are inefficient, unstable and lack of robustness when dealing with such tasks. Alternatively, derivativebased methods treat the optimization process as a blackbox and show robustness and stability in learning continuous control tasks, but not data efficient in learning. The combination of both methods so as to get the best of the both has raised attention. However, most of the existing combination works adopt complex neural networks (NNs) as the policy for control. The doubleedged sword of deep NNs can yield better performance, but also makes it difficult for parameter tuning and computation. To this end, in this paper we presents a novel method called FiDiRL, which incorporates deep RL with FiniteDifference (FiDi) policy search. FiDiRL combines Deep Deterministic Policy Gradients (DDPG) with Augment Random Search (ARS) and aims at improving the data efficiency of ARS. The empirical results show that FiDiRL can improves the performance and stability of ARS, and provide competitive results against some existing deep reinforcement learning methods.
I Introduction
Recent years have witnessed the thriving development in reinforcement learning RL. Significant progress has been made in areas like: playing the Game Go [1, 2], Atari [3], robotics [4], and realtime strategy games [5, 6]. Despite its great success, reinforcement learning still faces challenge in physical control tasks, which have realvalued continuous action spaces. Conventional reinforcement learning methods for continuous policy search seek to approximate the optimal policy by estimating the gradient via expected return obtained from sampled trajectories. Famous methods such as vanilla policy gradient [7], (deep) deterministic policy gradient (DPG/DDPG) [8, 4], Asynchronous Advantage ActorCritic (A3C) [9], Trust Region Policy Optimization [10] and Proximal Policy Optimization (PPO) [11] are widely adopted in practical use. Those methods have shown promising results in continuous control tasks such as Mujoco locomotion tasks [12]. Unfortunately, existing gradient methods are not robust to the environment and are fragile to the hyperparameters or even random seeds [13, 14]. In addition, the estimation of gradient are always computation costly and are not easy to parallelize [15], which narrowed its application.
In contrast, an alternative approach for solving continuous control tasks without computing the gradients is based on random search theory. Those methods, which are also known as ”blackbox optimization”, treat the optimization process as a black box and only use the return of each simulation to search the optimal policy directly, i.e, the direct policy search [16]. Gradientfree methods such as CrossEntropy Method (CEM) [17], Evolutional Strategies (ES) [18, 15] and Finite Difference Method (FDM) [14, 19] offer low computation cost, considerable result, fast training speed while also being easy to understand [10]. Although gradientfree methods show robustness to the environment and are easy to conduct [14, 15], they also suffer from low sample efficiency. Unlike derivativebased methods that can learn from each elementary step back and forth from the simulated path [9], in gradientfree methods numerous generated trajectories are discarded after evaluate the estimated return.
Recently, the combination of gradientfree methods and gradientbased methods so as to get the best of the both methods has raised attention. For example, GEPPG [20] adopts goal exploration process to fill the replay buffer and then applies DDPG to learn the policy. Their algorithm is more sample efficient than DDPG. Evolutional reinforcement learning (ERL) [21] presents an efficient combination of ES and DDPG. The DDPG policy is periodically inserted into the population of ES and the performance outperforms the both methods. CEMRL [22] combines Covariance Adaption Method (CMA) with deep RL methods and the results show that the learning can be faster and better. Maheswaranathan et al. [23] theoretically analyzes the optimization problems with a surrogate gradient and incorporate the ES with such surrogate gradient. However, most of the existing combination works adopt complex neural networks (NNs) as the policy for control. The doubleedged sword of deep NNs can yield better performance, but also introduce numerous of hyperparameters defining the network structure, which makes it difficult for parameter tuning. In addition, The backpropagation of such networks are always computation costly when performing gradientbased learning. Nevertheless, in [24, 14], they have shown that simple linear representation of policy is enough for control those tasks.
To address this issue, in this paper we propose FiDiRL, which incorporates deep RL insights with FiniteDifference (FiDi) policy search based on linear representation of policies. FiDiRL combines DDPG with ARS and aims at improving the data efficiency of ARS. We evaluate FiDiRL in the wellknown Mujoco locomotion tasks [12]. Our empirical results show that FiDiRL can accelerate the learning comparing to the original ARS and DDPG, and can improve performance and stability of ARS as well. In addition, FiDiRL also shows competitive result against popular policy search methods such as CEM, PPO, TRPO.
The rest of this paper is organized as follows: Section II discusses the related work. Section III presents the preliminary of reinforcement learning for continuous control and problem definition of this paper. Section IV introduce the proposed framework. Section V provides experimental analysis, followed by conclusion and discussion in Section VII.
II Related Works
Here, we summarize some related works. We divide the related works into 3 categories:
IIA Gradientbased Methods for Continuous Control
Various existing works on derivativebased reinforcement learning were proposed for solving the continuous control tasks. In this part we mainly focus on the Policy Gradient methods. Since conventional policy gradient methods such as REINFORCE [7] are too fragile to scale to difficult problems [25] [26], many new policy gradient methods are proposed with the development of deep learning. For example, [27] and [28] use an actorcritic framework to train stochastic policies with experience replay. Inspired by offpolicy actorcritic [29], Deterministic Policy Gradient (DPG) [8] and its derivatives play an important role for RL methods in continuous control. However, the original DPG only evaluated on several simple toy environments and did not evaluate on complex environments with high dimensional observation space. The most popular version, Deep Deterministic Policy Gradient algorithm (DDPG) [4] integrates DPG with the insights of DQN [3], has been successful adopted into many robotic control tasks. Despite its success, DDPG is suffered from low stability and lack of sample efficiency. To address this issue, Qprop [30] was proposed using a Taylor expansion of the offpolicy critic as a control variate, which improves the stability and sample efficiency of DDPG. TD3 [31] was designed to overcome the overestimation problem in DDPG, and improves the performance as well. In addition, Soft actorcritic [32] was proposed based on the maximum entropy reinforcement learning framework, which further improves the stability of DDPG.
Another approach is Trust Region Policy Optimization (TRPO) [10], which directly builds stochastic policies without the actorcritic framework. TRPO produces near monotonic improvements in return by making carefully updates to policy. However, as no actionvalue function is learned, TRPO appears to be less data efficient [4]. Based on TRPO, ACKTR [33] uses Kroneckerfactored approximation and propose an actorcritic based TRPO, which improves the performance and data efficiency in TRPR. Furthermore, Proximal Policy Optimization (PPO) [11] improves TRPO by alternately sampling and optimizing the policy, improves the sample efficiency of TRPO.
To conclude, most of the current gradientbased reinforcement learning methods for continuous control face problems such as instability and computation costly. To overcome those issues the stateoftheart methods become more and more complicated.
IIB Gradientfree Methods for Continuous Control
As an alternative method to gradientbased RL, most of the gradientfree methods for continuous control can be categorized into 3 classes:

1.
Evolutional strategies (ES): As a popular blackbox optimization approach, ES adopts genetic algorithm to optimize the policy. The simplest ES form, optimize the policy directly using ES [34][35]. Covariance Adaption Method (CMA), is another famous ES method which combine ES with ideas of derandomization and cumulation [36][37]. [38] gives an review of ES for learning various tasks. Recent works with ES [15] shows that ES can learn continuous tasks with high scalability and efficiency.

2.
Crossentropy methods (CEM): Similar to CMA, CEM evolves the policy by generating the parameters with high reward [39][40]. For example, [41] proposes a CEM method with adaptive basis function. [42] uses CEM to learn policies in decentralized partial observable Markovian Decision Process (POMDP) environments. CEM method can learn very fast, but also easy trapped in local optima.

3.
Finite difference methods (FDM): FDM uses finite difference to calculate the approximation of the gradients and therefore optimize the policy by gradient ascent/descent. For example, Pegasus is proposed based on FDM for solving POMDP problems. [19]. Recently, [14] propose ARS, an efficient method for learning continuous control tasks with low computation cost and high efficiency.
In summary, although gradientfree method can learn the policy fast with low computation cost, the generated trajectories are only used once to obtain the rewards for evaluation.
IIC Combining Gradientbased Methods with Gradientfree Methods
Several works have explored the combination of gradientbased methods and gradientfree methods. For example, Goal Exploration ProcessPolicy Gradient (GEPPG) [20] adopts goal exploration process to fill the replay buffer and then uses DDPG to learn the policy. The GEP is very close to evolutional methods. Their experiments show that GEPPG is more sampleefficient and have a low variance comparing to DDPG. However, their combination does not improve the efficiency of gradient update of DDPG. Evolutional Reinforcement Learning (ERL) [21] introduce a hybrid algorithm that periodically insert the DDPG agent to the evolutional optimization process, and improves the stability and efficiency in learning and exploration. However, the ERL does not benefit from the search efficiency of ES. Their experiment result also shows high variance in some tasks. Similar to our work, in [23], the authors analyze the optimization problems with a surrogate gradient, and show that incorporating ES to surrogate gradient can improve the performance and efficiency of traditional RL methods. However, their experiments only performed on simple tasks and lack of practical demonstrations. In addition, the CEMRL [22] is also very close to our work. In this work, the CMA and gradientbased method DDPG/TD3 is combined together. Their result shows that learning can be accelerated with CEMRL. However, in CEMRL neural networks were used for policy representation. Their experiment results show that the structure of policy networks are sensitive to the performance.
III Preliminaries and Problem Settings
In this section, we introduce some basic concepts, notions and our target problems.
IIIA Preliminaries
Reinforcement learning problems can be formulated as a Markovian Decision Process (MDP): $(S,A,\gamma ,P,R)$, where $S$ is the state space, $A$ is the action space, $\gamma \in [0,1]$ is the discount factor and $P$ is the transition function that maps each stateaction pair $(s,a)\in (S,A)$ to some distribution over $S$. We consider the standard reinforcement learning setup: an agent interacting with the environment in discrete time steps, at each time step $t$, the agent observes a state $s\in S$, takes an action under some policy $\pi $, and receives a scalar reward $r\in R$. A policy $\pi $ describes the agent’s behavior, which is a probability distribution that maps a state to an action: $\pi (as):S\times A\to [0,1]$.
The return from a state $s$ is defined as the total discounted future reward: ${G}_{t}={\sum}_{i=t}^{T}{\gamma}^{it}r({s}_{i},{a}_{i})$, where $T$ is the terminal state. The stateaction value $Q$ is a mapping on $S\times A$ to $\mathbb{R}$, which describes the expected discounted future reward when taking action $a$ at observation $s$ following policy $\pi $:
$$Q(s,a)={\mathbb{E}}_{\pi}({r}_{1}+\gamma {r}_{2}+\mathrm{\dots}+{\gamma}^{T1}{r}_{T}{s}_{0}=s,{a}_{0}=a)$$ 
The goal of reinforcement learning is to find an optimal policy which maximize the expected return from the starting state:
$$J={\mathbb{E}}_{{s}_{t}\sim \rho}[{{G}_{t}}_{{s}_{t},{a}_{t}}]$$ 
Here $\rho $ is the state distribution under policy $\pi $.
The Bellman equation describes the recursive relationship in stateaction value:
$$Q({s}_{t},{a}_{t})=\mathbb{E}[r({s}_{t},{a}_{t})+\gamma \mathbb{E}Q({s}_{t+1},{a}_{t+1})]$$ 
One popular RL method for dealing with continuous task is Deep Deterministic Policy Gradient (DDPG) [4], which adopts actorcritic in policy gradient. The actor of DDPG optimize the policy directly using the deterministic policy gradient theorem [8]. Denoting ${\pi}_{{\theta}^{\pi}}(s{\theta}^{\pi})$ and $Q(s,a{\theta}^{Q})$ as the parameterized policy $\pi $ and the actionvalue function $Q$ respectively, the gradient of the loss function ${J}_{DDPG}$ can be calculated as:
${\nabla}_{{\theta}^{\pi}}{J}_{DDPG}=$  ${\mathbb{E}}_{{s}_{t}\sim \rho}[\nabla Q(s,a{\theta}^{Q}){}_{s={s}_{t},a=\pi ({s}_{t})}\cdot $  (1)  
${\nabla}_{{\theta}^{\pi}}\pi (s{\theta}^{\pi}){}_{s={s}_{t}}]$ 
The loss ${J}_{Q}$ of the critic function can be calculated:
$${J}_{Q}={\mathbb{E}}_{{s}_{t}\sim \rho}[{({r}_{t}+{Q({s}_{t+1},{a}_{t+1})}_{{\theta}^{Q}}{Q({s}_{t},{a}_{t})}_{{\theta}^{Q}})}^{2}]$$  (2) 
During learning, gradient ascent is performed on the actor function to maximize ${J}_{DDPG}$, while gradient descent is performed on the critic function to minimize ${J}_{Q}$.
Recently, a new method named Augment Random Search (ARS) [14] was proposed, aiming at solving the continuous control task with high efficiency and low computation cost. As a gradientfree algorithm, ARS optimize the policy directly by calculating finite difference approximation along the random direction. Denoting $\xi $ as the unknown dynamics of the environment, $\sigma $ as the generated random variables, $r({\pi}_{\theta})$ as the total discount reward received by evaluating policy ${\pi}_{\theta}$ and $v$ as the exploration range, the loss function can be expressed with:
$${J}_{ARS}=\underset{{\theta}^{\pi}}{\mathrm{max}}{\mathbb{E}}_{\xi}[r({\pi}_{\theta},\xi )]$$  (3) 
The gradient can be approximate with:
$${\nabla}_{{\theta}^{\pi}}{J}_{ARS}\approx \frac{r({\pi}_{\theta +v\sigma},\xi )r({\pi}_{\theta v\sigma},\xi )}{2v}$$  (4) 
Based on the normalization to the observations and gradients, ARS contains 4 versions: ARS$v1$,ARS$v2$,ARS$v1t$ and ARS$v2t$. In ARS$v1$, the gradients from Equation 4 are normalized by dividing their standard deviation. In ARS$v2$, state normalization is added to ARS$v1$. In their version with $t$, only policies with top rewards are selected to compute the gradients.
IIIB Problem Settings
In this paper we consider the standard modelfree reinforcement learning setting in control, which an agent interacts with the environment. We specially focus on the continuous control tasks, where the action spaces are realvalued: $A\subseteq {\mathbb{R}}^{d}$. The policy is parameterized by some function: ${\pi}_{\theta}(s):\varphi (s\theta )\to A$. We are aiming at finding the optimal policies ${\pi}_{\theta}$ that can maximize the average return under some environment variables $\xi $:
$$\pi (s\theta ):\mathrm{arg}\underset{\theta}{\mathrm{max}}{\mathbb{E}}_{\xi}[r({\pi}_{\theta})]$$ 
To address the above issues in deep neural networks, our goal is to find an optimal policy with linear representation with gradientbased and gradientfree methods.
IV FiDiRL: Combining Deep Deterministic Policy Gradient with Augment Random Search
In this section, we will introduce FiDiRL in detail. We will first illustrate the overall architecture. Then the detail algorithm is described.
IVA Architecture
Proposed in 2018 by Mania et al., ARS has shown great potential in solving continuous control tasks with simple linear policies and low computation cost. Comparing to conventional RL methods, ARS optimizes the policy by performing random search on simulated trajectories. However, ARS can only learn from a complete episodes and lack of data efficiency. In contrast, offpolicy learning methods can learn the elementary steps repeatedly and can great improve the sample efficiency. Therefore, in this paper we incorporate the ARS with offpolicy RL methods. Since ARS optimize the policy directly, offpolicy policy gradient methods is mandatory for combination. Popular offpolicy policy gradient methods such as DPG, DDPG, TD3 and algorithms derived by those methods are capable for relearning the trajectories. Our FiDiRL architecture is able to integrate ARS with those kind of RL methods, but for convenience of expression, we here use DDPG as an example.
As illustrated before, ARS estimates the gradients of the policy by evaluating the policies with some designated random noise. In contrast, DDPG computes the gradients based on deterministic policy gradient theorem [8] through offpolicy learning. At each iteration, the two methods both use a single policy for optimization, and the overall optimization goal of the both methods is the same essentially. Therefore, we can use a linear combination to describe the integrated loss function of FiDiRL as below:
$J$  $={J}_{ARS}+\lambda {J}_{DDPG}$  (5) 
Since the loss functions of both methods are aiming at maximize the total expected discount reward, their linear combination $J$ holds the same optimization goal as well. A new involved parameter $\lambda $ denotes the update proportion of DDPG, which serves as a control variable for the tradeoff between the gradientfree learning and gradientbased learning. Based on this equation, the gradient of FiDiRL loss function can be calculated as a weighted summation of the two methods:
$\nabla J=\nabla {J}_{ARS}+\lambda \nabla {J}_{DDPG}$  (6) 
As DDPG is an offpolicy RL method, the loss function of actor and critic can be calculated using precollected data from replay buffer. Therefore, we can directly fill the replay buffer of DDPG using the generated trajectories by ARS. Since the ARS perform exploration by adding parameter space noise, such integration can also facilitate the exploration of DDPG as well [43]. The overall framework of FiDiRL is illustrated in Figure 1. Note that the critic function in our framework is not a mandatory. The return can also be evaluated by MonteCarlo method based on the sample trajectories [7, 8], which can further reduce the computation cost. By performing gradient ascent iteratively, the policy can be improved.
Our architecture is different with the existing combination works [20, 21, 22] in several ways. First, we use a single policy during the whole learning process. The absence of draw the elites from the populations can make the learning more efficient. Second, the combination between gradientbased and gradientfree methods is conduct through a collaborate loss function. The optimization is performed through both gradientfree and gradientbased learning, thus making the optimization process more directly.
IVB Algorithm
Based on the architecture, we give the algorithm in Algorithm IVB. We here use linear policies for learning, denoting $M$ as the policy matrix, which is a $p\times n$ matrix. The pseud code of prototype algorithm is described in Algorithm IVB.
For each episode of learning, we first generated rollouts and store the them into experience replay buffer. We then use ARS algorithm (either v1,v2 or v1t,v2t) to compute the gradients w.r.t current policy. Subsequently, the DDPG agent learns from the generated rollouts to optimize both actor and critic function. The initial actor function for learning is identical to the origin policy in the beginning of the episode. The gradients of policy is accumulated w.r.t the learning steps of derivativebased method. Finally the gradients of the two method is aggregated.
\[email protected]@algorithmic[1] \STATEInput:ARS learning rate $\alpha $, number of directions sampled per iteration of ARS $N$, standard deviation of exploration noise $v$, number of topperforming directions to use $b$ ($b\le N$), DDPG learning step ${T}_{d}$, DDPG update coefficient $\lambda $, DDPG critic learning rate ${l}_{c}$ experience replay refresh period $P$. \STATEInitialize: initial policy ${M}_{0}=0\in {\mathbb{R}}^{p\times n}$, critic network $C$ with random initialized, experience replay buffer $B$. \FOR$i=1:T$ \STATEGenerate $2N$ policies with i.i.d standard normal entries. \STATECollect $2N$ rollouts based on the policies and evaluate the rollouts with total discounted rewards. \STATEStore the $2N$ rollouts in the experience replay buffer $B$ of DDPG. \STATECompute gradients ${G}_{ARS}$ of policy ${M}_{i}$ based on Equation 4. \STATEInitialize DDPG gradients ${G}_{DDPG}=0\in {\mathbb{R}}^{p\times n}$ \FOR$j=1:{T}_{d}$ \STATECalculate gradient ${G}_{j}$ based on Equation 1 for policy ${M}_{i}$. \STATE${G}_{DDPG}={G}_{DDPG}+{G}_{j}$ \STATEUpdate critic network $C$ by gradient descent with learning rate ${l}_{c}$ with the loss function in Equation 2. \ENDFOR\STATE${M}_{i+1}={M}_{i}+{G}_{ARS}+\lambda {G}_{DDPG}$ \IF$imodP=0$ \STATEFlush the replay buffer $B$. \ENDIF\ENDFOR
Considering the integration of DDPG with ARS, we here make two improvements to the original DDPG algorithm:

1.
No soft target policy updates. In the original DDPG algorithm, soft target updates are used to improve the learning stability at the risk of slowing down the learning [4]. With our method, as the policy are updated by both ARS and DDPG, the DDPG update is controlled by the update step parameter ${T}_{d}$ and update proportion $\lambda $. Hence, it is unnecessary to use soft target policy update. In addition, since the critic network are critical to the policy update, the synchronous update of actor and critic are indispensable. In our experiment, we also find the critic network become vulnerable to diverge with softupdate.

2.
Periodically refresh the experience replay buffer. Experience replay is an important mechanism for reinforcement learning [1, 44]. With experience replay, the robustness and data efficiency of learning are improved. Instead of replace the old transitions in the replay buffer, we here use a more controllable version of original experience replay to make full use of the generated data and facilitate the learning as well. The experience replay buffer is periodically flushed under a refresh parameter $P$ w.r.t the episode number $i$.
By iteratively improve the policy with FiDiRL, we can finally obtain a policy that meet our goal.
V Experiment Result
VA Experiment Setup
We evaluate the FiDiRL method on the Mujoco environment [12], which is a popular environment for evaluating RL algorithms for continuous control tasks. The implementation is based on the OpenAI Gym [45]. Six robotic control tasks are used for evaluation: HalfCheetahv2, Antv2, Hopperv2, Swimmerv2, Wakler2dv2 and Humanoidv2:

•
HalfCheetahv2: Agent controls a cheetahlike body to run forward as quickly as possible. The state dimension is 17 and the action dimension is 6.

•
Antv2: Agent controls a 4leg ant to move forward as quickly as possible. The state dimension is 111 and the action dimension is 8.

•
Hopperv2: Agent controls a monoped to keep it from falling. The state dimension is 11 and the action dimension is 3.

•
Swimmerv2: Agent controls a snakelike robot to swim forward as fast as possible. The state dimension is 8 and the action dimension is 2.

•
Walker2dv2: Agent controls a bipedal walker to move forward as fast as possible. The state dimension is 17 and the action dimension is 6.

•
Humanoidv2: Agent control a humanlike robot to stand up. The state dimension is 376 and the action dimension is 17.
In performance evaluation part, We evaluate our method with stateoftheart RL methods: DDPG [4], TRPO [10] PPO [11], ARS [14] and CEM [17] under 5 random seeds to evaluate the performance of FiDiRL. In addition, we also evaluate the total learning steps for reaching a prescribed threshold against some derivativebased methods to evaluate the learning efficiency. The details of the experiment settings can be found in the Appendix part.
VB Performance Evaluation under 5 Random Seeds
We evaluate our method against several stateoftheart RL methods. For each method, we evaluated the policy periodically during training by testing the policy without random exploration noise. We also evaluate our method against some derivativebased methods: DDPG, TRPO and PPO.
For ARS, CEM and ARS+DDPG, each episode runs with a whole learning step: policy generation, simulation, evaluation and policy update. Then we test the policy without any randomness for evaluation. The policy used in ARS, FiDiRL and CEM are linear policies that simple map observation to action. For critic function in FiDiRL, we adopt a 2 layer neural network, with rectified linear unit as activate function for hidden layer and output directly without any activation function. Each layer contains 64 neurons. We run ARS and FiDiRL with 10 random fixed seeds and select 5 seeds with the best results. To make a fair comparison between ARS and FiDiRL, for each tasks the random seeds of both methods are identical. In addition we also give the learning curves of the baseline methods. We use OpenAI Spining Up^{1}^{1} 1 https://spinningup.openai.com/en/latest/ to run the baseline experiments. For the gradientbased methods, each epoch contains one episode including 1000 steps of iteration with hyperparameters in appendix. The evaluation results is shown in Figure 3. The solid lines represent the mean total reward of 5 independent runs and the shaded region represent the standard deviation of the results. All the results are smoothed using rightcentered moving average of 50 successive epochs. From the figures we can see the learning speed of FiDiRL is generally faster than others (except Humanoidv2). For most of the tasks, FiDiRL outperforms the rest algorithms. In addition, We find that with DDPG incorporated with ARS, the algorithm turns out to be more stable with smaller variance than the original ARS. In addition, we also notice that linear policies are also sufficient for training those tasks, which is neatly impossible with deep RL algorithm.
We also find that in Hopperv2 and Walker2dv2, the original ARS is always trapped in local optima, which caused the high variance in the performance. In Figure 4, we give the result of 20 times run under Hopperv2 environment with 20 arbitrary random seeds. In such setting, ARS converges to the score around 1000 19 times and only 1 time goes up to 3000, while FiDiRL converges above 3000 4 times. The FiDiRL can help the ARS to go the local optima in the learning process thus enhancing the stability of ARS.
VC Average Learning Steps to Reach the Threshold
We also compare the average learning steps to reach a prescribed threshold of FiDiRL against the gradientbased methods. The threshold we adopt here is the same as [30]. The hyperparameters of FiDiRL were chosen based on the same evaluations on Table II. We compare the FiDiRL against the gradientbased methods SAC, DDPG, PPO and TD3. The learning steps of FidiRL are calculated by sum of gradientbased iteration and finitedifferencebased iteration. In addition, the learning steps for SAC, DDPG, PPO and TD3 are estimated according to the learning curves in [32, 11]. Table I shows the results. From the results we can see with the help of gradientfree policy update, the FiDiRL requires fewer learning timesteps than the gradientbased methods to reach the threshold of each environment.
Enviroment  Threshold  SAC  DDPG  PPO  TD3  FiDiRL 

HalfCheetahv2  $4700$  $0.1\times {10}^{6}$  $0.6\times {10}^{6}$  $>1\times {10}^{6}$  $0.1\times {10}^{6}$  $25149$ 
Antv2  $3500$  $0.8\times {10}^{6}$  NA  NA  $0.6\times {10}^{6}$  $12423$ 
Hopperv2  $2000$  $0.3\times {10}^{6}$  $>1\times {10}^{6}$  $0.2\times {10}^{6}$  $0.35\times {10}^{6}$  $9999$ 
Swimmerv2  $90$  unknown  unknown  $0.6\times {10}^{6}$  unknown  $1414$ 
Waker2dv2  $3000$  $0.3\times {10}^{6}$  NA  $3.2\times {10}^{6}$  $0.6\times {10}^{6}$  $229775$ 
Humanoidv2  $2500$  $0.2\times {10}^{6}$  NA  $3.5\times {10}^{6}$  NA  $43935$ 
Environment  number of $\delta $  number of best $\delta $  noise  ARS learning rate  DDPG update portion  DDPG step  critic learning rate 

HalfCheetahv2  $16$  $0.03$  $0.02$  $0.02$  $0.0001$  $100$  $0.0001$ 
Antv2  $60$  $20$  $0.025$  $0.015$  $0.01$  $100$  $0.0001$ 
Hopperv2  $8$  $4$  $0.025$  $0.01$  $0.0001$  $100$  $0.0001$ 
Swimmerv2  $4$  $4$  $0.01$  $0.02$  $0.01$  $100$  $0.0001$ 
Walker2dv2  $40$  $30$  $0.025$  $0.03$  $0.01$  $100$  $0.0001$ 
Humanoidv2  $230$  $230$  $0.075$  $0.02$  $0.01$  $100$  $0.0001$ 
VD Learned Behavior
We also study the learned behavior of FiDiRL and ARS in HalfCheetah environment. We use the welltrained model of FiDiRL and ARS without exploration noise and record the performance in the Mujuco environment. The video can be found in https://youtu.be/a1Qtfg1kql8. From the video we can see FiDiRL can run much faster than ARS. The agent controlled by FiDiRL is also more energysaving in running, as the swing range of each leg is smaller than ARS. Moreover, in the beginning, the FiDiRL controlled agent shows high adaption capacity than ARS controlled agent as the agent accelerates faster in running. Through learning the elementary transitions via gradientbased learning, FiDiRL improves the performance of gradientfree learning.
VI Conclusion and Discussion
In this paper we propose FiDiRL, a novel method that incorporating deep reinforcement learning method DDPG with gradientfree policy search method ARS. The FiDiRL method can use simple linear method to cope with complex continuous control tasks through policy iteration by integration of gradientbased and gradientfree method. The tradeoff between those methods can also be adjusted by the new involved parameters. Empirical results show that the FiDiRL can improve the data efficiency, stability and performance of augment random search. Moreover, our result also provides a competitive alternate method to current deep reinforcement learning method.
Limitations and future works: Incorporating deep RL methods with ARS also involves new hyperparameters to the original ARS. Comparing to the original ARS, it’s more difficult for parameter tuning. In the experiment we also find that the those new involved parameters may be dynamic adjusted according to the learning process. We will study this issue in the future. In addition, inherited from ARS and DDPG, the FiDiRL is also more versatile to the random seeds. In the future we will focus on further improvement of stability of FiDiRL.
Appendix A hyperparameters for ARS and FiDiRL
Table II gives the parameters of FiDiRL and ARS. For each environment, the FiDiRL and ARS share the same parameters on gradientfree update. For all the experiments conducted in this paper, we use the discount factor of $1$. The batchsize of gradientbased learning in FiDiRL is set to $128$. All the experiments of FiDiRL and ARS are conducted under 10 fixed random seeds and we select 5 best ones to draw the performance comparison in Figure 3.
Appendix B Hyperparameters for CEM
For the CEM method, we here use the same simulation episodes as in FiDiRL and ARS. The parameters are sampled with Gaussian distribution. Other parameters are listed in Table III
Environment  Population size  Top %  Initial std of Parameters 

HalfCheetahv2  32  20%  0.5 
Antv2  120  20%  1 
Hopperv2  16  20%  1 
Swimmerv2  32  20%  1 
Walker2dv2  80  20%  1 
Humanoidv2  460  20%  1 
Appendix C Hyperparameters for Gradientbased Methods
As described before, we run the gradientbased methods using the OpenAI Spinning Up. For each algorithm the actor and critic function is a 2layers neural network with $32$ neurons in each layer. The discount factor is set to $1$. For each epoch we run $1000$ learning iterations among all the environments. The main parameter we tuned in experiment is learning rate of actor and critic networks. The other parameters is as default in the SpinningUp. Table IV gives the learning rates of different methods.
DDPG  

Environment  Actor learning rate  Critic learning rate 
HalfCheetah  0.0001  0.0001 
Antv2  0.001  0.0001 
Hopperv2  0.0001  0.0001 
Swimmerv2  0.0001  0.0001 
Walker2dv2  0.0001  0.0001 
Humanoidv2  0.0001  0.0001 
TRPO  
HalfCheetah  0.0001   
Antv2  0.0001   
Hopperv2  0.0001   
Swimmerv2  0.0001   
Walker2dv2  0.0001   
Humanoidv2  0.0001   
PPO  
HalfCheetah  0.0001   
Antv2  0.0001   
Hopperv2  0.0001   
Swimmerv2  0.0001   
Walker2dv2  0.0001   
Humanoidv2  0.0001   
Acknowledgment
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.
References
 [1] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” nature, vol. 529, no. 7587, p. 484, 2016.
 [2] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017.
 [3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Humanlevel control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 [4] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 [5] N. Usunier, G. Synnaeve, Z. Lin, and S. Chintala, “Episodic exploration for deep deterministic policies: An application to starcraft micromanagement tasks,” arXiv preprint arXiv:1609.02993, 2016.
 [6] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multiagent policy gradients,” in ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 [7] R. J. Williams, “Simple statistical gradientfollowing algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 34, pp. 229–256, 1992.
 [8] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in ICML, 2014.
 [9] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning, 2016, pp. 1928–1937.
 [10] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, 2015, pp. 1889–1897.
 [11] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
 [12] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for modelbased control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 5026–5033.
 [13] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” in ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 [14] H. Mania, A. Guy, and B. Recht, “Simple random search of static linear policies is competitive for reinforcement learning,” in Advances in Neural Information Processing Systems, 2018, pp. 1800–1809.
 [15] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, “Evolution strategies as a scalable alternative to reinforcement learning,” arXiv preprint arXiv:1703.03864, 2017.
 [16] J. Schmidhuber and J. Zhao, “Direct policy search and uncertain policy evaluation,” in AAAI spring symposium on search under uncertain and incomplete information, stanford univ, 1998, pp. 119–124.
 [17] I. Szita and A. Lörincz, “Learning tetris using the noisy crossentropy method,” Neural computation, vol. 18, no. 12, pp. 2936–2941, 2006.
 [18] F. Stulp and O. Sigaud, “Path integral policy improvement with covariance matrix adaptation,” in Proceedings of the 29th International Conference on Machine Learning. Omnipress, 2012, pp. 1547–1554.
 [19] A. Y. Ng and M. Jordan, “Pegasus: A policy search method for large mdps and pomdps,” in Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 2000, pp. 406–415.
 [20] C. Colas, O. Sigaud, and P.Y. Oudeyer, “Geppg: Decoupling exploration and exploitation in deep reinforcement learning algorithms,” in International Conference on Machine Learning, 2018, pp. 1038–1047.
 [21] S. Khadka and K. Tumer, “Evolutionguided policy gradient in reinforcement learning,” in Advances in Neural Information Processing Systems, 2018, pp. 1188–1200.
 [22] A. Pourchot and O. Sigaud, “Cemrl: Combining evolutionary and gradientbased methods for policy search,” arXiv preprint arXiv:1810.01222, 2018.
 [23] N. Maheswaranathan, L. Metz, G. Tucker, D. Choi, and J. SohlDickstein, “Guided evolutionary strategies: augmenting random search with surrogate gradients,” in International Conference on Machine Learning, 2019, pp. 4264–4273.
 [24] A. Rajeswaran, K. Lowrey, E. V. Todorov, and S. M. Kakade, “Towards generalization and simplicity in continuous control,” in Advances in Neural Information Processing Systems, 2017, pp. 6550–6561.
 [25] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “Endtoend training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
 [26] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
 [27] P. Wawrzyński, “Realtime reinforcement learning by sequential actor–critics and experience replay,” Neural Networks, vol. 22, no. 10, pp. 1484–1497, 2009.
 [28] P. WawrzyńSki and A. K. Tanwani, “Autonomous reinforcement learning with experience replay,” Neural Networks, vol. 41, pp. 156–167, 2013.
 [29] T. Degris, M. White, and R. S. Sutton, “Offpolicy actorcritic,” arXiv preprint arXiv:1205.4839, 2012.
 [30] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine, “Qprop: Sampleefficient policy gradient with an offpolicy critic,” arXiv preprint arXiv:1611.02247, 2016.
 [31] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actorcritic methods,” arXiv preprint arXiv:1802.09477, 2018.
 [32] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actorcritic: Offpolicy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.
 [33] Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, and J. Ba, “Scalable trustregion method for deep reinforcement learning using kroneckerfactored approximation,” in Advances in neural information processing systems, 2017, pp. 5279–5288.
 [34] J. Koutnik, F. Gomez, and J. Schmidhuber, “Evolving neural networks in compressed weight space,” in Proceedings of the 12th annual conference on Genetic and evolutionary computation. ACM, 2010, pp. 619–626.
 [35] J. Koutník, G. Cuccu, J. Schmidhuber, and F. Gomez, “Evolving largescale neural networks for visionbased reinforcement learning,” in Proceedings of the 15th annual conference on Genetic and evolutionary computation. ACM, 2013, pp. 1061–1068.
 [36] N. Hansen and A. Ostermeier, “Completely derandomized selfadaptation in evolution strategies,” Evolutionary computation, vol. 9, no. 2, pp. 159–195, 2001.
 [37] V. HeidrichMeisner and C. Igel, “Evolution strategies for direct policy search,” in International Conference on Parallel Problem Solving from Nature. Springer, 2008, pp. 428–437.
 [38] S. Risi and J. Togelius, “Neuroevolution in games: State of the art and open challenges,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 9, no. 1, pp. 25–41, 2015.
 [39] R. Y. Rubinstein and D. P. Kroese, The crossentropy method: a unified approach to combinatorial optimization, MonteCarlo simulation and machine learning. Springer Science & Business Media, 2013.
 [40] S. Mannor, R. Y. Rubinstein, and Y. Gat, “The cross entropy method for fast policy search,” in Proceedings of the 20th International Conference on Machine Learning (ICML03), 2003, pp. 512–519.
 [41] L. Busoniu, D. Ernst, B. De Schutter, and R. Babuska, “Crossentropy optimization of control policies with adaptive basis functions,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 41, no. 1, pp. 196–209, 2010.
 [42] F. A. Oliehoek, J. F. Kooij, and N. Vlassis, “The crossentropy method for policy search in decentralized pomdps,” Informatica, vol. 32, no. 4, pp. 341–357, 2008.
 [43] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz, “Parameter space noise for exploration,” arXiv preprint arXiv:1706.01905, 2017.
 [44] S. Zhang and R. S. Sutton, “A deeper look at experience replay,” CoRR, vol. abs/1712.01275, 2017. [Online]. Available: http://arxiv.org/abs/1712.01275
 [45] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.