A Deep Reinforcement Learning Architecture for Multistage Optimal Control
Abstract
Deep reinforcement learning for high-dimensional, hierarchical control tasks usually requires complex neural networks as function approximators, which can lead to inefficiency, instability, and even divergence in the training process. Here, we introduce stacked deep Q learning (SDQL), a flexible modularized deep reinforcement learning architecture that enables finding optimal control policies for control tasks consisting of multiple linear stages in a stable and efficient way. SDQL exploits the linear stage structure by approximating the Q function via a collection of deep Q subnetworks stacked along an axis marking the stagewise progress of the whole task. By backpropagating the learned state values from later stages to earlier stages, all subnetworks coadapt to maximize the total reward of the whole task, although each subnetwork is responsible for learning the optimal control policy for its own stage only. This modularized architecture offers considerable flexibility in environment and policy modeling, as it allows a different choice of state space, action space, reward structure, and Q network for each stage. Further, the backward stagewise training procedure of SDQL offers additional transparency, stability, and flexibility to the training process, thus facilitating model fine-tuning and hyperparameter search. We demonstrate that SDQL is capable of learning competitive strategies for problems characterized by high-dimensional state spaces, heterogeneous action spaces (both discrete and continuous), multiple scales, and sparse and delayed rewards.
A Preprint
Yuguang Yang
Johns Hopkins University
[email protected]
November 30, 2019
Keywords Multistage optimal control $\cdot $ Deep reinforcement learning
1 Introduction
We consider optimal control tasks that consist of multiple linear stages. Such multistage optimal control problems arise in a broad range of areas [1, 2, 3], including robotics, inventory planning, production planning, transportation and logistics, financial management, and others. The stage structure of a task need not refer to the passage of time; it can correspond to external specifications or result from a subjective modeling strategy. For example, in quantitative investment, personal investment activities naturally consist of several stages based on the investor’s age and future obligations, with each stage having a different preference for risk and return. Separately, a task that involves the sequential cooperation of multiple agents (i.e., the second agent continues from where the first agent left off, rather than working on the same task concurrently) has each stage associated with an individual agent. The stage structure can also result from a divide-and-conquer strategy that decomposes a complex, temporally extended task into multiple simpler ones. For example, a robot’s gripping task can be divided into a first stage of moving the bulk of the gripper near the object and a second stage of fine finger adjustments to pick up the object. Finally, the stage structure can be introduced as a modeling strategy to exploit model characteristics (i.e., state/action space and reward structure) at different stages. For example, in financial engineering, the optimal hedging strategy for options can be decomposed into a near-term stage and a long-term stage to exploit the different dominant dynamics in the two regimes.
Here, we aim to develop a general deep reinforcement learning architecture and training algorithm that exploit the stage structure and facilitate the learning process. We adopt a general multistage task setting, where different stages share a subset of the state space but are allowed to have different action spaces, environment dynamics, and reward signals. We propose a learning architecture, called Stacked Deep Q Learning (SDQL), in which we approximate the optimal control policy and its value function for a multistage task via a series of deep Q subnetworks stacked together [1]. Each subnetwork is exclusively responsible for learning and control at one stage; the subnetworks are therefore generally simpler and smaller neural networks that can be trained stably and efficiently. To ensure that all subnetworks coadapt to maximize the total reward of the whole task, rather than that of their own stage, we require each subnetwork to be trained to maximize the cumulative reward in its own stage plus the reward triggered by a stage transition (i.e., the end of one stage and the start of the next). The stage-transition reward is provided by the value function of the subnetwork responsible for the next stage. The corresponding training algorithm involves a stagewise training process from later stages to earlier stages together with state value backpropagation; in this way, all subnetworks are trained to maximize the total reward across all stages instead of the cumulative reward in their own stage.
The idea of stacking multiple neural networks and training from back to front is mainly inspired by backward induction in dynamic programming, especially its application to finite-horizon Markov decision processes [4]. Using multiple Q networks in SDQL offers a number of advantages over using one complex neural network for end-to-end learning [5]. First, the use of multiple simpler Q networks serves as a form of regularization on the network structure. Simpler networks can prevent overfitting and can usually be trained more efficiently and stably, since each subnetwork is usually responsible for learning over a smaller state space and action space. The stagewise training also allows easier monitoring of the training process, thus adding transparency. Second, SDQL allows stacking different types of Q learning subnetworks, such as the Deep Q Network (DQN) [6] and actor-critic architectures like the Deep Deterministic Policy Gradient (DDPG) network [7], which enables flexibility in modeling environment dynamics (homogeneous and heterogeneous) and action spaces (discrete and continuous) at different stages of the task. Third, the use of multiple neural networks also facilitates transfer learning. For example, a subnetwork that is responsible for moving the bulk of a gripper in an early stage can be composed with other subnetworks that are responsible for fine movements in later stages. Reusing and combining different subnetworks can produce control policies for different tasks. Finally, the backward stagewise training process resembles reverse curriculum learning [8], where the agent schedules the learning process to start with easier states (e.g., states near the goal state or states from which the agent can achieve the goal with high probability) and gradually increases the difficulty. By stacking multiple subnetworks together and training backward, SDQL can naturally address control tasks with delayed and sparse rewards.
Our main contribution is the introduction of SDQL, an abstract modeling framework and architecture that allows integrating different Q learning networks to tackle multistage optimal control tasks. We provide a stagewise backward training algorithm and demonstrate its use in several control tasks characterized by high-dimensional state spaces, heterogeneous action spaces (both discrete and continuous), multiple scales, and sparse and delayed rewards.
2 Theory
2.1 Preliminary: Markov decision process (MDP)
Consider an infinite-horizon Markov decision process (MDP) represented by a tuple $(\mathcal{S},\mathcal{A},\mathcal{P})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, and $\mathcal{P}=\{p({s}^{\prime}\mid s,a),s,{s}^{\prime}\in \mathcal{S},a\in \mathcal{A}\}$ is the state transition probability. The goal is to compute an optimal control policy ${\pi}^{*}:\mathcal{S}\to \mathcal{A}$ such that the expected total discounted reward in the process
$$J=\mathbf{E}[\sum _{t=0}^{\mathrm{\infty}}{\gamma}^{t}R({s}_{t+1},{a}_{t})]$$  (1) 
is maximized, where $R(s,a):\mathcal{S}\times \mathcal{A}\to \mathbf{R}$ is the one-step reward function and $\gamma $ is the discount factor. In the Q learning framework, we define the optimal state-action value function (also known as the ${Q}^{*}$ function) by
$${Q}^{*}(s,a)={\mathbf{E}}^{\pi}[\sum _{t=0}^{\mathrm{\infty}}{\gamma}^{t}R({s}_{t+1},{a}_{t})\mid {s}_{0}=s,{a}_{0}=a,{\pi}^{*}]$$  (2) 
which is the expected sum of discounted rewards along the process when following the optimal policy ${\pi}^{*}$ after visiting state ${s}_{0}$ and taking action ${a}_{0}$. With ${Q}^{*}$, the optimal policy ${\pi}^{*}$ is given by ${\pi}^{*}(s)=\mathrm{arg}{\mathrm{max}}_{a}{Q}^{*}(s,a)$.
2.2 MDP with stage structure
We introduce a stage structure into a goal-oriented control task in the following way. For a control task with $N$ linear stages, we define $N$ state space subsets ${\mathcal{S}}_{1},\mathrm{\dots},{\mathcal{S}}_{N}$ of $\mathcal{S}$ satisfying
$${\mathcal{S}}_{1}\subset {\mathcal{S}}_{2}\subset {\mathcal{S}}_{3}\subset \mathrm{\cdots}\subset {\mathcal{S}}_{N}=\mathcal{S}.$$ 
Given an initial state ${s}_{0}$, we define its associated stage index by $I({s}_{0})={\mathrm{min}}_{i}\{i\mid {s}_{0}\in {\mathcal{S}}_{i}\}$. As the system state evolves forward, ${s}_{0}\to {s}_{1}\to \mathrm{\cdots}$, a change in the stage index of a state ${s}_{t}$ is determined by a stage-transition event. Let $I({s}_{t})=i$. We say the transition tuple $({s}_{t},{a}_{t},{s}_{t+1})$ triggers a stage-transition event, such that $I({s}_{t+1})=i+1$, if ${s}_{t+1}\in {\mathcal{S}}_{i+1}\setminus {\mathcal{S}}_{i}$ (i.e., the system moves out of ${\mathcal{S}}_{i}$ and enters ${\mathcal{S}}_{i+1}$).
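As a minimal sketch of the stage index and the stage-transition test, assume the nested subsets are supplied as membership predicates over integer states. The names `stage_index` and `is_stage_transition`, as well as the three-stage example, are illustrative and not part of the original formulation.

```python
from typing import Callable, Sequence

State = int  # assumption: states are integers on a line


def stage_index(s: State, subsets: Sequence[Callable[[State], bool]]) -> int:
    """Return I(s) = min{i : s in S_i}, with stages numbered from 1."""
    for i, contains in enumerate(subsets, start=1):
        if contains(s):
            return i
    raise ValueError("state lies outside S_N = S")


def is_stage_transition(s: State, s_next: State, subsets) -> bool:
    """True when the step s -> s_next raises the stage index from i to i + 1."""
    return stage_index(s_next, subsets) == stage_index(s, subsets) + 1


# Hypothetical nested subsets of S = {0, ..., 8}:
# S_1 = {0, 1, 2}, S_2 = {0, ..., 5}, S_3 = S.
subsets = [lambda s: 0 <= s <= 2, lambda s: 0 <= s <= 5, lambda s: 0 <= s <= 8]
```

Because the subsets are nested, scanning them in order and returning the first match implements the minimum in the definition of $I$.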
In SDQL, we consider infinite-horizon episodic goal-oriented tasks that have terminal states ${\mathcal{S}}_{T}\subset {\mathcal{S}}_{N}$. We assume that a system state starting in stage $i$ must follow the consecutive stage transitions $i,i+1,\mathrm{\dots},N$ to reach the terminal states.
We seek the optimal control policies ${\pi}^{*}=\{{\pi}_{1}^{*},{\pi}_{2}^{*},\mathrm{\dots},{\pi}_{N}^{*}\}$ via $N$ stagewise optimal Q functions ${Q}_{1}^{*},\mathrm{\dots},{Q}_{N}^{*}$ such that ${\pi}_{i}^{*}(s)=\mathrm{arg}{\mathrm{max}}_{a}{Q}_{i}^{*}(s,a)$. In particular, we define ${Q}_{i}^{*}$ to be associated with an episodic MDP with a modified reward function. The optimal ${Q}_{i}^{*}$ function for an episode that starts with initial state ${s}_{0}\in {\mathcal{S}}_{i}$ and ends when the system enters stage $i+1$ (i.e., $s\in {\mathcal{S}}_{i+1}\setminus {\mathcal{S}}_{i}$) or reaches a terminal state $s\in {\mathcal{S}}_{T}$ is given by
$${Q}_{i}^{*}(s,a)=\underset{{\pi}_{i}}{\mathrm{max}}{\mathbf{E}}^{\pi}[\sum _{t=0}^{\mathrm{\infty}}{\gamma}^{t}{R}_{i}({s}_{t+1},{a}_{t})\mid {s}_{0}=s,{a}_{0}=a,{\pi}_{i}]$$  (3) 
where ${R}_{i}$ is the modified reward function for stage $i$. ${R}_{i}$ equals $R$ for every transition that does not trigger a stage transition, and equals $R$ plus an additional stage-transition reward otherwise. More formally, let ${s}_{t}$ belong to stage $i$; the stage reward for the transition tuple $({s}_{t},{a}_{t},{s}_{t+1})$ is defined by
$${R}_{i}({s}_{t+1},{a}_{t})=\left\{\begin{array}{cc}R({s}_{t+1},{a}_{t}),\hfill & \text{if no stage transition occurs}\hfill \\ R({s}_{t+1},{a}_{t})+\gamma {V}_{i+1}^{*}({s}_{t+1}),\hfill & \text{if a stage transition occurs}\hfill \end{array}\right.$$  (4) 
where ${V}_{i+1}^{*}({s}_{t+1})={\mathrm{max}}_{a}{Q}_{i+1}^{*}({s}_{t+1},a)$ is the optimal state value function for stage $i+1$, evaluated at the state ${s}_{t+1}$ in which stage $i+1$ begins. Because no stage-transition event occurs when the system is in the last stage $N$, we have ${R}_{N}=R$. As mentioned in the introduction, the stage-transition reward is the crucial ingredient that ensures all subnetworks/Q functions are trained to maximize the total reward across all stages instead of the cumulative reward in their own stage.
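The stage reward can be sketched as a small helper. This is only an illustration under assumptions: the next-stage value is passed in as a number `v_next`, and the dictionary of action values is a hypothetical stand-in for the stage-$(i+1)$ Q function.

```python
from typing import Mapping


def state_value(q_row: Mapping[str, float]) -> float:
    """V(s) = max_a Q(s, a) for one state, given its row of action values."""
    return max(q_row.values())


def stage_reward(r: float, transition: bool, v_next: float, gamma: float) -> float:
    """Stage reward R_i: the plain reward r, plus gamma * V_{i+1} on a stage transition."""
    return r + gamma * v_next if transition else r


# Hypothetical action values of the next stage's Q function at the handover state.
q_next_row = {"left": 0.2, "right": 0.5}
bonus_step = stage_reward(0.0, True, state_value(q_next_row), gamma=0.9)   # 0.9 * 0.5
plain_step = stage_reward(0.0, False, state_value(q_next_row), gamma=0.9)  # unchanged
```

Only the step that crosses the stage boundary receives the bonus, which is how the value learned by a later stage reaches the stage before it.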
Notably, an MDP with the stage structure defined above is consistent with an ordinary MDP in the following way:
Proposition 1.
Consider an $N$-stage MDP. We have

• ${Q}_{i}^{*}(s,a)={Q}^{*}(s,a),\forall s\in {\mathcal{S}}_{i},i=1,\mathrm{\dots},N.$

• ${\pi}_{i}^{*}(s)={\pi}^{*}(s),\forall s\in {\mathcal{S}}_{i},i=1,\mathrm{\dots},N.$
Proof.
See Appendix. ∎
Proposition 1 shows that ${Q}_{1}^{*},\mathrm{\dots},{Q}_{N}^{*}$ in the stacking architecture are equivalent to ${Q}^{*}$ on different portions of $\mathcal{S}$. Therefore, although each subnetwork ${Q}_{i}^{*}$ is only responsible for learning and controlling the system on ${\mathcal{S}}_{i}$, all subnetworks collectively learn the optimal control policy.
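The consistency claim can be checked numerically on a toy problem. The sketch below is not from the original experiments: it assumes a hypothetical deterministic chain of states 0..5 with the goal at 5, two stages, and a stage-transition bonus evaluated at the state reached after the transition (as in the term $\gamma^{M}V_{K}^{*}(s_{M})$ of the appendix proof). It uses exact value iteration rather than neural networks.

```python
GAMMA = 0.9
GOAL = 5                      # hypothetical chain of states 0..5, goal at 5
ACTIONS = (-1, +1)
S1 = {0, 1, 2}                # stage 1
S_ALL = {0, 1, 2, 3, 4}       # all non-terminal states


def step(s, a):
    """Deterministic move along the chain, clamped to [0, GOAL]."""
    return min(max(s + a, 0), GOAL)


def reward(s_next):
    return 1.0 if s_next == GOAL else 0.0


def solve(states, exit_value, sweeps=200):
    """Value iteration restricted to `states`; `exit_value(s)` is the value
    credited when a step leaves `states` (0 at the goal, V_2 at a handover)."""
    q = {(s, a): 0.0 for s in states for a in ACTIONS}
    for _ in range(sweeps):
        for (s, a) in q:
            s_next = step(s, a)
            if s_next in states:
                tail = max(q[(s_next, b)] for b in ACTIONS)
            else:
                tail = exit_value(s_next)
            q[(s, a)] = reward(s_next) + GAMMA * tail
    return q


q_star = solve(S_ALL, lambda s: 0.0)             # ordinary MDP
q2 = q_star                                      # last stage: R_N = R, so Q_2* = Q*
v2 = lambda s: max(q2[(s, b)] for b in ACTIONS)  # V_2*(s)
q1 = solve(S1, v2)                               # stage 1 with the stage-transition reward
max_gap = max(abs(q1[key] - q_star[key]) for key in q1)
```

In this toy setting `max_gap` is zero up to floating-point error, which is the statement of Proposition 1 restricted to stage 1.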
2.3 SDQL architecture and algorithm
In typical deep Q learning [6], we use a deep neural network, denoted by $Q(s,a;\theta )$, to approximate the optimal Q function, where $\theta $ denotes the neural network parameters. To improve sample efficiency, we usually employ off-policy learning via an experience replay buffer $D$ [6]. In each iteration, we perform gradient descent on a minibatch of samples to update the parameters $\theta $ with the loss function
$${L}_{i}({\theta}_{i})={\mathbf{E}}_{(s,a,r,{s}^{\prime})\sim U(D)}[{(r+\gamma \underset{{a}^{\prime}}{\mathrm{max}}Q({s}^{\prime},{a}^{\prime};{\theta}_{i}^{-})-Q(s,a;{\theta}_{i}))}^{2}]$$  (5) 
where the subscript $i$ is the iteration index, $(s,a,r,{s}^{\prime})$ is an experience tuple uniformly sampled from the stored experience $D$, and ${\theta}_{i}^{-}$ is the target network parameter, which slowly synchronizes with the Q network parameter ${\theta}_{i}$.
In SDQL, as we decompose the control task into $N$ linear stages, we approximate the $Q$ function by $N$ neural networks, denoted by ${Q}_{1}(s,a;{\theta}^{(1)}),\mathrm{\dots},{Q}_{N}(s,a;{\theta}^{(N)})$. We also use $N$ experience replay buffers ${D}_{1},\mathrm{\dots},{D}_{N}$, where ${D}_{i}$ stores the tuples $(s,a,r,{s}^{\prime})$ with $I(s)=i$. We perform gradient descent on a minibatch of samples to update the parameters ${\theta}^{(i)}$ with the loss function
$${L}^{(i)}({\theta}_{j}^{(i)})={\mathbf{E}}_{(s,a,r,{s}^{\prime})\sim U({D}_{i})}[{(r+\gamma \underset{{a}^{\prime}}{\mathrm{max}}{Q}_{i}({s}^{\prime},{a}^{\prime};{\theta}_{j}^{(i),-})-{Q}_{i}(s,a;{\theta}_{j}^{(i)}))}^{2}]$$ 
where the subscript $j$ is the iteration index, $(s,a,r,{s}^{\prime})$ is an experience tuple uniformly sampled from the experience buffer ${D}_{i}$, and ${\theta}_{j}^{(i),-}$ is the target network parameter, which slowly synchronizes with the Q network parameter ${\theta}_{j}^{(i)}$.
Because the reward in each stage depends on the learned optimal value function of the subsequent stage, the iterative updates of the Q functions should be performed backwards, from stage $N$ to stage 1.
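For concreteness, the per-stage loss can be sketched as a mean squared temporal-difference error over a minibatch. The callables `q` and `q_target` are hypothetical stand-ins for a stage network and its target copy; no particular deep learning library is assumed.

```python
def td_loss(batch, q, q_target, actions, gamma):
    """Mean squared TD error over a minibatch of (s, a, r, s_next) tuples."""
    total = 0.0
    for s, a, r, s_next in batch:
        # Bootstrapped target computed with the target copy of the stage network.
        target = r + gamma * max(q_target(s_next, b) for b in actions)
        total += (target - q(s, a)) ** 2
    return total / len(batch)


# Hypothetical stand-ins: a constant Q function and a target that prefers action 1.
q = lambda s, a: 0.5
q_target = lambda s, a: 1.0 if a == 1 else 0.0
batch = [(0, 1, 0.0, 1), (1, 1, 1.0, 2)]
loss = td_loss(batch, q, q_target, actions=(0, 1), gamma=0.9)
```

In SDQL the same expression would be evaluated once per stage, with the batch drawn from that stage's buffer and `r` already containing any stage-transition reward.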
The architecture and the training algorithm allow stacking different types of Q-learning networks, including single deep Q networks such as DQN [9], Dueling DQN [10], Double-Q DQN [11], Rainbow [12], and NAF [13], and actor-critic networks such as DDPG [7] and TD3 [14].
The complete algorithm is shown in Algorithm 1.
$${y}_{j}=\left\{\begin{array}{cc}{r}_{j},\hfill & \text{if the episode terminates or a stage transition occurs at step }j+1\hfill \\ {r}_{j}+\gamma {\mathrm{max}}_{{a}^{\prime}}{Q}_{i}({s}_{j+1},{a}^{\prime};{\theta}^{(i)}),\hfill & \text{otherwise}\hfill \end{array}\right.$$ 
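The backward stagewise procedure can be illustrated with a tabular stand-in. The sketch below is an assumption-laden toy, not the authors' Algorithm 1: it replaces the deep subnetworks with one Q table per stage, uses a hypothetical chain of states 0..8 split into three stages, and collects experience with a uniformly random behaviour policy.

```python
import random

GAMMA, ALPHA = 0.9, 0.5
GOAL = 8                               # hypothetical chain of states 0..8
ACTIONS = (-1, +1)
UPPER = {1: 2, 2: 5, 3: 8}             # nested subsets S_k = {0, ..., UPPER[k]}


def train_backward(episodes=200, max_steps=30, batch_size=8, seed=0):
    rng = random.Random(seed)
    q = {stage: {} for stage in UPPER}        # one table per stage, standing in for Q_k
    buffers = {stage: [] for stage in UPPER}  # one replay buffer D_k per stage

    def value(stage, s):
        return max(q[stage].get((s, a), 0.0) for a in ACTIONS)

    for stage in range(len(UPPER), 0, -1):    # later stages are trained first
        lower = UPPER[stage - 1] + 1 if stage > 1 else 0
        # Initial states are drawn from S_k minus S_{k-1} (goal excluded).
        starts = [s for s in range(lower, UPPER[stage] + 1) if s != GOAL]
        for _ in range(episodes):
            s = rng.choice(starts)
            for _ in range(max_steps):
                a = rng.choice(ACTIONS)       # random exploration (off-policy learning)
                s_next = min(max(s + a, 0), GOAL)
                r = 1.0 if s_next == GOAL else 0.0
                transition = s_next > UPPER[stage]
                if transition:                # stage-transition reward from the next stage
                    r += GAMMA * value(stage + 1, s_next)
                done = transition or s_next == GOAL
                buffers[stage].append((s, a, r, s_next, done))
                for bs, ba, br, bn, bd in rng.choices(buffers[stage], k=batch_size):
                    y = br if bd else br + GAMMA * value(stage, bn)
                    old = q[stage].get((bs, ba), 0.0)
                    q[stage][(bs, ba)] = old + ALPHA * (y - old)
                if done:
                    break
                s = s_next
    return q


def greedy(q_stage, s):
    return max(ACTIONS, key=lambda a: q_stage.get((s, a), 0.0))


q_tables = train_backward()
```

Because stage 3 is trained before stage 2, and stage 2 before stage 1, the bonus stored with each boundary-crossing transition is computed from an already trained table; after training, the greedy action in every stage points toward the goal.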
Remark.

• Initializing the system state within ${\mathcal{S}}_{k}\setminus {\mathcal{S}}_{k-1}$, instead of the whole ${\mathcal{S}}_{k}$, is similar to the initialization in reverse curriculum learning [15].

• In Proposition 1, we show that the stagewise Q functions ${Q}_{1}^{*},\mathrm{\dots},{Q}_{N}^{*}$ are equivalent to ${Q}^{*}$ on different parts of the state space. Because we use neural networks to approximate the Q functions and train them with the scheduled backward procedure, in practice the relation between ${Q}_{i}^{*}$ and ${Q}^{*}$ is more accurately described by
$${Q}_{i}^{*}(s,a)\approx {Q}^{*}(s,a),\quad \forall s\in {\mathcal{S}}_{i}\setminus {\mathcal{S}}_{i-1},i=2,\mathrm{\dots},N$$  (6) 
$${Q}_{1}^{*}(s,a)\approx {Q}^{*}(s,a),\quad \forall s\in {\mathcal{S}}_{1}$$  (7) 
3 Examples
3.1 Multistage Grid World navigation
To demonstrate our algorithm, we first consider a toy problem of multistage Grid World navigation. As shown in Fig. 2, an agent is placed in a 25$\times $25 Grid World and aims to navigate from the upper left corner to the target located at the lower right corner. The Grid World is spatially divided into five parts, as annotated by the dashed lines, corresponding to the five stages of the whole navigation process. Let $(i,j)$ denote the position of the agent. The agent has four actions, which move it to $(i-1,j),(i+1,j),(i,j-1),(i,j+1)$ in one step, respectively. If an action would move the agent out of the boundary, the agent does not move. The transition from one stage to the next is triggered when the agent crosses a dashed boundary.
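A minimal sketch of this environment is given below. The move rule follows the description above; the partition into five equal-width vertical strips in `grid_stage` is purely an assumption for illustration, since the actual layout of the dashed boundaries is only given in the figure.

```python
SIZE = 25
# Action k moves (i, j) by MOVES[k]: (i-1, j), (i+1, j), (i, j-1), (i, j+1).
MOVES = ((-1, 0), (1, 0), (0, -1), (0, 1))


def grid_step(pos, action):
    """Apply one action; the agent stays put if the move would leave the grid."""
    i, j = pos
    di, dj = MOVES[action]
    ni, nj = i + di, j + dj
    if 0 <= ni < SIZE and 0 <= nj < SIZE:
        return (ni, nj)
    return pos


def grid_stage(pos, n_stages=5):
    """Hypothetical stage index: equal-width strips along the second coordinate."""
    return pos[1] // (SIZE // n_stages) + 1
```

With these helpers, a stage transition is detected whenever `grid_stage` increases by one across a call to `grid_step`.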
We solve this navigation problem by using five DQN modules stacked together. The training process follows Algorithm 1 and proceeds from the back to the front. We consider two cases to examine how the learned control strategies adapt to different reward specifications. In the first case, we use homogeneous discount factors, that is, the same $\gamma =0.99$ for all stages; in the second case, we use heterogeneous discount factors, $\gamma =0.99$ for the fourth stage and $\gamma =0.9$ for the other stages.
The navigation trajectories from the trained multistage controller are shown in Fig. 2(b) and (c). In the homogeneous discount case, the agent either moves down or moves right in all stages, which agrees with expectation. By comparison, in the heterogeneous discount case, because we set the largest discount factor in the fourth stage, the agent goes up and right all the way from the bottom to the top in the fourth stage and moves straight right in the other stages. In this way, the agent spends minimal time in all stages except the fourth, which maximizes the final total reward. To further understand how the stagewise $Q$ functions adapt to the specified discount factors, we show the merged state value functions (defined by $V(s)={V}_{i}(s),\forall s\in {\mathcal{S}}_{i}\setminus {\mathcal{S}}_{i-1},i=2,\mathrm{\dots},N$ and $V(s)={V}_{1}(s),\forall s\in {\mathcal{S}}_{1}$) in Fig. 2(d) and (e). For the case of homogeneous discount factors, the value function agrees with expectation: the state value decreases as the distance to the target increases. For the case of heterogeneous discount factors, where the fourth stage has the largest discount factor, the state value also decreases as the distance to the target increases, but more slowly when the state is in the fourth stage.
3.2 Autonomous cooperative cargo transport
In the second example, we consider a problem in autonomous targeted nanodrug delivery and precision surgery [16, 17] using microrobots (with sizes typically ranging from 100 nm to 10 μm). State-of-the-art approaches to targeted drug delivery use micro/nanorobots to carry nanosized drugs to hard-to-reach locations (e.g., tissues, cancer cells) and unload the drugs there. Because of their miniaturized size, these robots in general do not carry their own energy supply but can directly utilize energy sources in the environment via many different mechanisms [18, 19, 20, 21], including external electric and magnetic fields, chemical catalysis, optical fields, and acoustics. Realistic applications generally require long-distance travel in a large-scale working environment (e.g., tissues, blood vessels) that exhibits strong variations in physical and chemical properties (e.g., temperature, pH, osmotic pressure). Such diverse working environments make it infeasible to carry out a drug delivery task using one type of micro/nanorobot. One strategy to overcome this limitation is to use multiple micro/nanorobots to cooperatively transport a cargo across different parts of the environment.
Fig. 3(a) shows a model obstacle environment in which multiple microrobots realize cooperative cargo transport. The model environment is divided into two parts (dashed line), corresponding to the two stages of the transport. In the first stage, one robot is responsible for carrying the cargo and unloading it at the border; in the second stage, a second robot continues from where the first robot left off and carries the cargo to the final specified target site. We also assume that the two microrobots rely on different transport mechanisms and thus have different equations of motion. Specifically, the equations of motion of the robots in the first and second stages are given by
$\begin{array}{cc}\hfill {\partial}_{t}x& ={v}_{\mathrm{max}}\mathrm{cos}(\theta )+{\xi}_{x}(t)\hfill \\ \hfill {\partial}_{t}y& ={v}_{\mathrm{max}}\mathrm{sin}(\theta )+{\xi}_{y}(t)\hfill \\ \hfill {\partial}_{t}\theta & =w+{\xi}_{\theta}(t)\hfill \end{array}$  (8) 
and
$\begin{array}{cc}\hfill {\partial}_{t}x& =v\mathrm{cos}(\theta )+{\xi}_{x}(t)\hfill \\ \hfill {\partial}_{t}y& =v\mathrm{sin}(\theta )+{\xi}_{y}(t)\hfill \\ \hfill {\partial}_{t}\theta & ={\xi}_{\theta}(t)\hfill \end{array}$  (9) 
where $x$, $y$, and $\theta $ denote the robot’s position and orientation. In the first equation of motion, $w$ is the control input that controls the self-propulsion direction; $w$ takes continuous values in $[-1,1]$. In the second equation of motion, $v$ is the control input that controls the self-propulsion speed; $v$ takes the binary values $0$ and ${v}_{\mathrm{max}}$, where ${v}_{\mathrm{max}}$ is the maximum allowed speed. ${\xi}_{x},{\xi}_{y},{\xi}_{\theta}$ are independent zero-mean white noise stochastic processes satisfying $E[{\xi}_{\theta}(t){\xi}_{\theta}({t}^{\prime})]=2{D}_{r}\delta (t-{t}^{\prime})$ and $E[{\xi}_{x}(t){\xi}_{x}({t}^{\prime})]=E[{\xi}_{y}(t){\xi}_{y}({t}^{\prime})]=2{D}_{t}\delta (t-{t}^{\prime})$, where ${D}_{t}$ and ${D}_{r}$ are the translational and rotational diffusivity coefficients. Compared with traditional human-size robots, micro/nanorobots are usually underactuated (i.e., not all degrees of freedom can be controlled) and subject to Brownian motion.
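One common way to simulate such stochastic dynamics is an Euler-Maruyama discretization. The sketch below is an illustration under assumptions, not the simulation code used for the figures: the time step `dt` and the function names are hypothetical, one robot steers through $w$ at constant speed, and the other switches its speed between 0 and its maximum.

```python
import math
import random


def em_step(state, v, w, dt, d_t, d_r, rng):
    """One Euler-Maruyama step of the overdamped dynamics for (x, y, theta)."""
    x, y, theta = state
    x += v * math.cos(theta) * dt + math.sqrt(2.0 * d_t * dt) * rng.gauss(0.0, 1.0)
    y += v * math.sin(theta) * dt + math.sqrt(2.0 * d_t * dt) * rng.gauss(0.0, 1.0)
    theta += w * dt + math.sqrt(2.0 * d_r * dt) * rng.gauss(0.0, 1.0)
    return (x, y, theta)


def slider_step(state, w, v_max, dt, d_t, d_r, rng):
    """Steering robot: constant speed v_max, steering input w clipped to [-1, 1]."""
    w = max(-1.0, min(1.0, w))
    return em_step(state, v_max, w, dt, d_t, d_r, rng)


def self_propelled_step(state, propel, v_max, dt, d_t, d_r, rng):
    """Speed-controlled robot: binary speed input (0 or v_max), no steering."""
    return em_step(state, v_max if propel else 0.0, 0.0, dt, d_t, d_r, rng)


rng = random.Random(0)
# Noise-free check (d_t = d_r = 0): the steering robot moves v_max * dt along its heading.
moved = slider_step((0.0, 0.0, 0.0), w=0.0, v_max=1.0, dt=0.1, d_t=0.0, d_r=0.0, rng=rng)
idle = self_propelled_step((1.0, 2.0, 0.3), propel=False, v_max=1.0,
                           dt=0.1, d_t=0.0, d_r=0.0, rng=rng)
```

The noise amplitudes $\sqrt{2D_t\,\Delta t}$ and $\sqrt{2D_r\,\Delta t}$ follow from the stated white-noise correlations.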
We refer to the robot in the first stage as the slider robot because it can directly control its moving direction but not its propulsion speed. We refer to the robot in the second stage as the self-propelled robot because it can only control its propulsion speed but not its orientation. Previous studies [20, 22] have shown that a self-propelled microrobot has to smartly exploit Brownian rotation and self-propulsion decisions to realize efficient transport, whereas a constantly self-propelled robot simply reduces to an effective random walk on longer time scales as its orientation fully randomizes [23]. In general, the slider robot has better mobility, and we expect it to contribute more to the cooperative transport process.
Our multistage controller consists of a TD3 module [14] and a DQN module stacked together. For a given microrobot state at step $n$, denoted by ${s}_{n}=({x}_{n},{y}_{n},{\theta}_{n})$, the observation $\varphi ({s}_{n})$ at ${s}_{n}$ consists of a $10\times 10$ pixel-level image around the microrobot and the target’s coordinates in the microrobot’s own coordinate system.
Fig. 3(b) shows example trajectories of the two microrobots transporting the cargo within the obstacle environment. As expected, the slider robot in the first stage travels a longer distance and finishes its travel close to the target. The self-propelled robot then continues from where the slider robot ends and finishes the rest of the transport. In this way, the two robots cooperatively transport the cargo in minimum time.
3.3 Precision manipulator control
We now demonstrate that SDQL can also be used to compute control policies for tasks involving multiscale dynamics. Consider the task of controlling a two-arm manipulator (Fig. 4) to reach a prescribed target with high precision. Specifically, we denote the lengths of the two arms by ${L}_{1}=1$ and ${L}_{2}=1$, the target by $({x}_{t},{y}_{t})$, and the position of the end effector by $({x}_{2},{y}_{2})$. The goal is achieved if the effector-target distance $d\triangleq \sqrt{{({x}_{t}-{x}_{2})}^{2}+{({y}_{t}-{y}_{2})}^{2}}\le 0.01$. The geometry (Fig. 4) gives the following relationships:
$\begin{array}{cc}\hfill {x}_{1}& ={L}_{1}\mathrm{cos}({\theta}_{1})\hfill \\ \hfill {y}_{1}& ={L}_{1}\mathrm{sin}({\theta}_{1})\hfill \\ \hfill {x}_{2}& ={x}_{1}+{L}_{2}\mathrm{cos}({\theta}_{1}+{\theta}_{2})\hfill \\ \hfill {y}_{2}& ={y}_{1}+{L}_{2}\mathrm{sin}({\theta}_{1}+{\theta}_{2})\hfill \end{array}$  (10) 
The control on the arm angle ${\theta}_{1},{\theta}_{2}$ can be formulated as a discretetime dynamic model given by
$${\theta}_{1}(t+\mathrm{\Delta}t)={\theta}_{1}+{u}_{1},{\theta}_{2}(t+\mathrm{\Delta}t)={\theta}_{2}+{u}_{2},$$  (11) 
where $\mathrm{\Delta}t$ is the control update time interval, ${u}_{1},{u}_{2}\in [-{u}_{\mathrm{max}},{u}_{\mathrm{max}}]$, and ${u}_{\mathrm{max}}=0.25$ is the maximum allowed angle movement in one step.
We aim to obtain a control strategy that can move the twoarm configuration from an arbitrary initial configuration to a configuration such that the end effector is sufficiently close to the target position. To make our test more general, we only use the natural reward where the reward is 1 if $d\le 0.01$ and 0 otherwise.
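The kinematics, the discrete-time angle update, and the natural reward can be put together in a short sketch. The function names and the return layout of `arm_step` are assumptions made for illustration.

```python
import math

L1, L2 = 1.0, 1.0     # arm lengths
U_MAX = 0.25          # maximum angle change per step
TOL = 0.01            # distance tolerance that defines success


def end_effector(theta1, theta2):
    """Forward kinematics of the two-arm manipulator."""
    x1, y1 = L1 * math.cos(theta1), L1 * math.sin(theta1)
    return (x1 + L2 * math.cos(theta1 + theta2),
            y1 + L2 * math.sin(theta1 + theta2))


def arm_step(theta1, theta2, u1, u2, target, u_max=U_MAX):
    """Apply clipped angle increments; return new angles, distance d, and reward."""
    u1 = max(-u_max, min(u_max, u1))
    u2 = max(-u_max, min(u_max, u2))
    theta1, theta2 = theta1 + u1, theta2 + u2
    x2, y2 = end_effector(theta1, theta2)
    d = math.hypot(target[0] - x2, target[1] - y2)
    reward = 1.0 if d <= TOL else 0.0
    return theta1, theta2, d, reward


# Example: one small move of the second joint lands the end effector on (1, 1).
result = arm_step(0.0, math.pi / 2 - 0.1, 0.0, 0.1, target=(1.0, 1.0))
```

The reward is nonzero only inside a disc of radius 0.01 around the target, which is what makes exploration with large steps unlikely to succeed.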
Therefore, two separate length scales exist in the control task: a large length scale when the end effector is far away from the target, where large step sizes are preferred to approach the target; and a small length scale when the end effector is near the target, where small steps are preferred for fine adjustments.
Using a single length scale in deep reinforcement learning can cause a sparse reward issue (reaching the target is quite rare) if the length scale is relatively large, or temporally extended delayed rewards (reaching the target takes considerably more steps) if the length scale is relatively small. The length-scale separation can be addressed by formulating the control task as a two-stage control task that uses a different length scale in each stage. We introduce the two-stage structure in the following way. The first-stage state space is characterized by $d\ge 0.2$, and the transition from the first stage to the second stage is triggered by $d<0.2$. In the second stage, we set the maximum allowed action to 0.05 and the normalizing factor for the distance to 0.05. The use of smaller actions can also be interpreted as evolving the dynamics at a proportionally smaller time scale.
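The stage-dependent action scale can be sketched as follows. The stage bounds come from the text (0.25 in the first stage, 0.05 in the second); treating the policy output as a raw value in $[-1,1]$ that is rescaled per stage is an assumption made for illustration.

```python
STAGE_THRESHOLD = 0.2            # the first stage is characterized by d >= 0.2
U_MAX_BY_STAGE = {1: 0.25, 2: 0.05}


def arm_stage(d):
    """Stage index of the manipulator task from the effector-target distance d."""
    return 1 if d >= STAGE_THRESHOLD else 2


def scaled_action(raw, d):
    """Map a raw policy output in [-1, 1] to an angle increment for the current stage."""
    raw = max(-1.0, min(1.0, raw))
    return raw * U_MAX_BY_STAGE[arm_stage(d)]
```

Near the target the same raw output produces a five times smaller angle change, which is the finer length scale used in the second stage.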
For the two-stage control task, we stack two DDPG actor-critic networks in the SDQL architecture to seek the optimal control policy. The input state is the concatenation of $({x}_{1},{y}_{1})$, $({x}_{2},{y}_{2})$, and the displacement from the end effector to the target, $(({x}_{t}-{x}_{2}),({y}_{t}-{y}_{2}))$.
By using a small length scale in the second stage, we increase the probability that the agent receives a positive signal in the second stage during its exploration and exploitation process. The learned value function of the second stage can then propagate to the first-stage Q network via bootstrapping.
Fig. 4(b) shows example trajectories of controlling the two-arm robot, starting from randomly initialized positions, to reach the target position (1,1) with high precision.
4 Related Work
End-to-end deep reinforcement learning has recently achieved remarkable performance in various domains, including games [6][24], robotics [5], biomechanics [25], medicine [26], physics, and chemical science [27]. For multistage control tasks, although it is possible to use one large end-to-end neural network that simultaneously obtains the policies for all stages, training such a giant complex neural network for control tasks with high-dimensional state spaces and sparse rewards can be considerably brittle due to the deadly triad of nonlinear function approximation, bootstrapping, and off-policy learning [28]. By introducing a stage structure into a single-stage control task or exploiting an existing stage structure, SDQL enables the decomposition of a long-horizon complex task into smaller ones, and therefore stabilizes the learning process compared to training a giant complex network.
The stacking structure and the backward training scheme resemble reverse curriculum learning [8, 15, 29]. Curriculum learning usually employs a single network for the whole task and schedules the learning of tasks of different difficulties to accelerate the training process. During learning, tasks of different levels have to be sampled constantly and at increasing difficulty levels to accelerate training and to prevent the agent from forgetting skills already learned. In SDQL, multiple small neural networks are responsible for learning different skills, starting with the easier ones (i.e., the later stages), which improves the stability of training and avoids the forgetting problem.
Finally, SDQL can be viewed as a regularized version of hierarchical reinforcement learning [30][31][32][33]. Hierarchical reinforcement learning has to learn the hierarchy and temporal abstraction from data, in the form of a high-level policy (meta controller) that selects subgoals for the low-level controller to achieve. SDQL, instead, directly exploits a given hierarchical structure.
5 Conclusion
We have introduced SDQL, a flexible stacking architecture that enables the application of deep Q learning to general multistage optimal control tasks. Our theoretical analysis shows that the decomposition of Q values in the stacking architecture can be achieved in a consistent way. A general off-policy training algorithm is developed to train the stacked networks via backward state-value propagation.
We used different examples to demonstrate the application of SDQL to learn competitive strategies in problems characterized by high-dimensional state spaces, heterogeneous action spaces (both discrete and continuous), multiple scales, and sparse and delayed rewards.
Different modifications to our architecture are possible. For example, different Q networks can share the parameters of their first several layers. Different enhancements to DQN and DDPG can be readily applied. Depending on the application, the parameters of later-stage networks can be copied to earlier-stage networks as pretrained parameters.
Appendix A Proof of Proposition 1
Proof.
We use mathematical induction to prove the proposition.
First consider the case of $I({s}_{0})=N$. Based on the definition of stage reward function ${R}_{N}=R$, we have
$$\begin{aligned}
Q^{*}(s,a) &= \max_{\pi}\mathbf{E}^{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t+1},a_{t}) \,\Big|\, s_{0}=s,\,a_{0}=a,\,\pi\Big] \\
&= \max_{\pi}\mathbf{E}^{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}R_{N}(s_{t+1},a_{t}) \,\Big|\, s_{0}=s,\,a_{0}=a,\,\pi\Big] \\
&= Q_{N}^{*}(s,a).
\end{aligned}$$
Similarly, we have ${\pi}^{*}={\pi}_{N}^{*}$.
Suppose $Q_{i}^{*}(s,a)=Q^{*}(s,a),\ \forall s\in\mathcal{S}_{i},\ i=K,\dots,N$, and $\pi_{i}^{*}(s)=\pi^{*}(s),\ i=K,\dots,N$. Consider $I(s_{0})=K-1$, and suppose the stage transition occurs at step $M$, i.e., $I(s_{M})=K$. We have
$$\begin{aligned}
Q^{*}(s,a) &= \max_{\pi}\mathbf{E}^{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t+1},a_{t}) \,\Big|\, s_{0}=s,\,a_{0}=a,\,\pi\Big] \\
&= \max_{\pi}\mathbf{E}^{\pi}\Big[\sum_{t=0}^{M-1}\gamma^{t}R(s_{t+1},a_{t})+\sum_{t=M}^{\infty}\gamma^{t}R(s_{t+1},a_{t}) \,\Big|\, s_{0}=s,\,a_{0}=a,\,\pi,\,I(s_{M})=K\Big] \\
&= \max_{\pi}\mathbf{E}^{\pi}\Big[\sum_{t=0}^{M-1}\gamma^{t}R(s_{t+1},a_{t})+\gamma^{M}\sum_{t=0}^{\infty}\gamma^{t}R(s_{t+M+1},a_{t+M}) \,\Big|\, s_{0}=s,\,a_{0}=a,\,\pi,\,I(s_{M})=K\Big] \\
&= \max_{\pi}\mathbf{E}^{\pi}\Big[\sum_{t=0}^{M-1}\gamma^{t}R(s_{t+1},a_{t})+\gamma^{M}V_{K}^{*}(s_{M}) \,\Big|\, s_{0}=s,\,a_{0}=a,\,\pi\Big] \\
&= \max_{\pi}\mathbf{E}^{\pi}\Big[\sum_{t=0}^{M}\gamma^{t}R_{K-1}(s_{t},a_{t}) \,\Big|\, s_{0}=s,\,a_{0}=a,\,\pi\Big] \\
&= Q_{K-1}^{*}(s,a),
\end{aligned}$$
where we use the fact that
$$\max_{\pi}\mathbf{E}^{\pi}\Big[\sum_{t=M}^{\infty}\gamma^{t-M}R(s_{t+1},a_{t}) \,\Big|\, s_{M},\,I(s_{M})=K\Big]=V^{*}(s_{M})=V_{K}^{*}(s_{M}),$$
which holds by the induction hypothesis, and the definition of $R_{K-1}$. ∎
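The statement of the proposition can be checked numerically on a toy problem. The sketch below is a hypothetical two-stage deterministic MDP (the transition table, rewards, and function names are invented for illustration). It assumes that the stage subproblem for stage 1 is solved with the optimal stage-2 state values held fixed at the stage boundary, mirroring the $\gamma^{M}V_{K}^{*}(s_{M})$ term in the derivation, and compares the result with the Q function of the undivided problem.

```python
import numpy as np

GAMMA = 0.9
# Hypothetical deterministic MDP with two actions per state.
# States 0, 1 form stage 1; states 2, 3 form stage 2; state 4 is terminal.
NEXT = np.array([[1, 2], [2, 0], [3, 4], [4, 2], [4, 4]])
REWARD = np.array([[0.0, -1.0], [0.5, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.0]])

def value_iteration(states, v_fixed, sweeps=500):
    """Optimal Q on `states`; values of all other states stay at v_fixed."""
    v = v_fixed.copy()
    q = np.zeros(NEXT.shape)
    for _ in range(sweeps):
        for s in states:
            q[s] = REWARD[s] + GAMMA * v[NEXT[s]]
            v[s] = q[s].max()
    return q, v

# Undivided problem: solve for all non-terminal states at once.
q_global, _ = value_iteration([0, 1, 2, 3], np.zeros(5))

# Stagewise, backward: stage 2 first, then stage 1 with stage-2 values fixed.
q_stage2, v_stage2 = value_iteration([2, 3], np.zeros(5))
q_stage1, _ = value_iteration([0, 1], v_stage2)

print(np.abs(q_global[[0, 1]] - q_stage1[[0, 1]]).max())
print(np.abs(q_global[[2, 3]] - q_stage2[[2, 3]]).max())
```

In this toy problem both differences should vanish up to numerical tolerance, which is the tabular counterpart of $Q_{i}^{*}=Q^{*}$ on $\mathcal{S}_{i}$.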
References
 [1] Stein W Wallace and William T Ziemba. Applications of stochastic programming. SIAM, 2005.
 [2] Oliver Kroemer, Christian Daniel, Gerhard Neumann, Herke Van Hoof, and Jan Peters. Towards learning hierarchical skills for multi-phase manipulation tasks. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 1503–1510, 2015.
 [3] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, second edition. 2015.
 [4] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
 [5] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. arXiv, page 6922, 2015.
 [6] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, and Georg Ostrovski. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 [7] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 [8] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.
 [9] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 [10] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
 [11] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 [12] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [13] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
 [14] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
 [15] Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse Curriculum Generation for Reinforcement Learning. (CoRL):1–14, 2017.
 [16] Jinxing Li, Berta Esteban-Fernández de Ávila, Wei Gao, Liangfang Zhang, and Joseph Wang. Micro/nanorobots for biomedicine: Delivery, surgery, sensing, and detoxification. (April 2018), 2017.
 [17] Joseph Wang and Wei Gao. Nano/microscale motors: Biomedical opportunities and challenges. ACS Nano, 6(7):5745–5751, 2012.
 [18] Thomas E Mallouk and Ayusman Sen. Powering Nanorobots. Scientific American, 300(5):72–77, 2009.
 [19] Jinxing Li, Isaac Rozen, and Joseph Wang. Rocket Science at the Nanoscale. 2016.
 [20] Yuguang Yang and Michael A Bevan. Optimal Navigation of Self-Propelled Colloids. ACS Nano, 12(11):10712–10724, 2018.
 [21] Yuguang Yang, Michael A Bevan, and Bo Li. Efficient Navigation of Colloidal Robots in an Unknown Environment via Deep Reinforcement Learning. Advanced Intelligent Systems, 1900106, 2019.
 [22] Tomoyuki Mano, Jean-Baptiste Delfau, Junichiro Iwasawa, and Masaki Sano. Optimal run-and-tumble–based transportation of a Janus particle with active steering. Proceedings of the National Academy of Sciences of the United States of America, page 201616013, 2017.
 [23] Jonathan R. Howse, Richard A.L. Jones, Anthony J. Ryan, Tim Gough, Reza Vafabakhsh, and Ramin Golestanian. Self-Motile Colloidal Particles: From Directed Propulsion to Random Walk. Physical Review Letters, 99(4):3–6, 2007.
 [24] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 [25] Siddhartha Verma, Guido Novati, and Petros Koumoutsakos. Efficient collective swimming by harnessing vortices through deep reinforcement learning. 2018.
 [26] Mariya Popova, Olexandr Isayev, and Alexander Tropsha. Deep reinforcement learning for de novo drug design. (July):1–15, 2018.
 [27] Zhenpeng Zhou, Xiaocheng Li, and Richard N Zare. Optimizing Chemical Reactions with Deep Reinforcement Learning. ACS Central Science, 3(12):1337–1344, 2017.
 [28] Joshua Achiam, Ethan Knight, and Pieter Abbeel. Towards Characterizing Divergence in Deep Q-Learning. arXiv preprint arXiv:1903.08894, 2019.
 [29] Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman. Teacher-Student Curriculum Learning. 2017.
 [30] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. 31st AAAI Conference on Artificial Intelligence, AAAI 2017, pages 1726–1734, 2017.
 [31] Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta Learning Shared Hierarchies. pages 1–11, 2017.
 [32] Vijay Kumar, Mariarosaria Taddeo, Pierre E. Dupont, Guang-Zhong Yang, Robert Wood, Zhong Lin Wang, Peer Fischer, Manuela Veloso, Neil Jacobstein, Robert Merrifield, Bradley J. Nelson, Brian Scassellati, Jim Bellingham, Marcia McNutt, Robert Full, Luciano Floridi, and Russell Taylor. The grand challenges of Science Robotics. Science Robotics, 3(14):eaar7650, 2018.
 [33] Ofir Nachum and Honglak Lee. Data-Efficient Hierarchical Reinforcement Learning. (NIPS), 2018.