### Abstract

Reinforcement Learning in domains with sparse rewards is a difficult problem,and a large part of the training process is often spent searching the statespace in a more or less random fashion for any learning signals. For controlproblems, we often have some controller readily available which might besuboptimal but nevertheless solves the problem to some degree. This controllercan be used to guide the initial exploration phase of the learning controllertowards reward yielding states, reducing the time before refinement of a viablepolicy can be initiated. In our work, the agent is guided through an auxiliarybehaviour cloning loss which is made conditional on a Q-filter, i.e. it is onlyapplied in situations where the critic deems the guiding controller to bebetter than the agent. The Q-filter provides a natural way to adjust theguidance throughout the training process, allowing the agent to exceed theguiding controller in a manner that is adaptive to the task at hand and theproficiency of the guiding controller. The contribution of this paper lies inidentifying shortcomings in previously proposed implementations of the Q-filterconcept, and in suggesting some ways these issues can be mitigated. Thesemodifications are tested on the OpenAI Gym Fetch environments, showing clearimprovements in adaptivity and yielding increased performance in all roboticenvironments tested.

### Quick Read (beta)

# Accelerating Reinforcement Learning with Suboptimal Guidance

###### Abstract:

Reinforcement Learning in domains with sparse rewards is a difficult problem, and a large part of the training process is often spent searching the state space in a more or less random fashion for any learning signals. For control problems, we often have some controller readily available which might be suboptimal but nevertheless solves the problem to some degree. This controller can be used to guide the initial exploration phase of the learning controller towards reward yielding states, reducing the time before refinement of a viable policy can be initiated. In our work, the agent is guided through an auxiliary behaviour cloning loss which is made conditional on a Q-filter, i.e. it is only applied in situations where the critic deems the guiding controller to be better than the agent. The Q-filter provides a natural way to adjust the guidance throughout the training process, allowing the agent to exceed the guiding controller in a manner that is adaptive to the task at hand and the proficiency of the guiding controller. The contribution of this paper lies in identifying shortcomings in previously proposed implementations of the Q-filter concept, and in suggesting some ways these issues can be mitigated. These modifications are tested on the OpenAI Gym Fetch environments, showing clear improvements in adaptivity and yielding increased performance in all robotic environments tested.

glossary

First]Eivind Bøhn First,Second]Signe Moe Second]Tor Arne Johansen

${}^{*}$SINTEF Digital, Oslo, Norway

(e-mail: {eivind.bohn, signe.moe}@sintef.no). ${}^{}$

${}^{\mathrm{\u2020}}$Norwegian University of Science and Technology, Trondheim, Norway (email: {signe.moe, tor.arne.johansen}@ntnu.no) ${}^{}$

Keywords: Deep Reinforcement Learning, Non-Linear Control Systems, Robotics

^{1}

^{1}footnotetext: The first author is financed by ”PhD Scholarships at SINTEF” from the Research Council of Norway (grant no. 272402). The third author was supported by the Research Council of Norway (grant no. 223254 NTNU AMOS).\@xsect\Gls

rl (sutton_reinforcement_nodate) is a field of machine learning concerned with sequential decision making problems. The theory and methods in \glsrl has produced some impressive results the last few years, ranging from high-level reasoning tasks such as game-playing to control of fast dynamics such as actuation in robotics. \glsrl has received much attention and interest due to its framework being very general, in theory capable of solving nearly any problem as long as one can define some notion of utility of different states of the system. {textblock}150(50,290) This work has been submitted to IFAC for possible publication

This utility measure, called the reward function in the \glsrl framework, is central to both the definition of the problem and the performance of the algorithm. Often, a sparse reward function\@footnotemark\@footnotetextA sparse reward function is one that only yield signals in some subset of the state space, e.g. in the subset defined as the set of goal states, and is typically constant elsewhere. is preferable, as they are easier to formulate, easier to estimate, and more importantly, there is less room for ambiguity and misinterpreting the objective of the task. The obvious downside is however that only a small region of the problem space actually gives a learning signal to the agent, and a significant portion of the training time is spent exploring the state space (often in a random manner) until a minimally viable strategy (i.e. one that consistently is able to reach reward-yielding states) is found, from which further progress can be made.

Often in robotics and other control applications one already has a controller, which due to e.g. nonlinearities in the dynamics, uncertainties in model parameters, or strict computational requirements might be suboptimal. This controller could be used to guide the learning controller towards this minimally viable strategy, thereby reducing the long initial exploration phase. In the rest of this paper guiding controller and pre-existing controller is used interchangeably to refer to this controller.

The most naive implementation of \glsil, that is, trying to directly copy the pre-existing controller is seldom successful, mainly due to a data distribution mismatch. Furthermore, the potential performance of the learning controller is bound by the performance of the existing controller. The data distribution problem arises as the data collected by the demonstrator will only consist of a subset of the state space, and will offer no guidance on how to course correct back into this subset when the learning controller inevitably makes a small deviation from the controller it is imitating. Small errors therefore tend to accumulate into trajectories that stray far from those produced by the demonstrator. Additionally, when combining \glsil with \glsrl there needs to be some mechanism in place to allow the learning controller to make different choices than the original controller when appropriate, in order to exceed the performance of the existing controller.

The problem of leveraging information encoded in previous controllers or demonstration data and combining it with further learning such as \glsrl has been widely studied. This has led to the development of several different approaches, such as inverse reinforcement learning, \glsbc, and pretraining learning controllers with demonstration data. In our work, we tackle the problem of deciding when the learning controller should imitate the pre-existing controller, and when it should develop its own behaviour. We base our method in the concept of the Q-filter (nair_overcoming_2017), which makes the learning controller imitate the pre-existing controller only when it is deemed to yield a higher reward than what the learning controller would achieve. In this work we identify issues with the original formulation of the Q-filter and propose modifications to mitigate these issues, resulting in improved adaptivity to the difficulty of the task and to the proficiency of the pre-existing controller, and thus also improved performance.

The rest of the paper is organized as follows. First, related work is presented and discussed in Section Accelerating Reinforcement Learning with Suboptimal Guidance, then the requisite background theory for \glsrl and the algorithms used in this paper is presented in Section Accelerating Reinforcement Learning with Suboptimal Guidance. Our method is outlined in Section Accelerating Reinforcement Learning with Suboptimal Guidance, and the experiments used to evaluate the methods are outlined in Section Accelerating Reinforcement Learning with Suboptimal Guidance. The results of the experiments are shown in Section Accelerating Reinforcement Learning with Suboptimal Guidance, which also discusses the strengths and weaknesses of the proposed method. Finally, Section Accelerating Reinforcement Learning with Suboptimal Guidance concludes with our thoughts on the matter and suggestions for further work.

The DAGGER algorithm (ross_reduction_nodate) is an approach to mitigating the data distribution mismatch problem. A learning controller is trained on a dataset of experience, and then this learning controller is used in the environment to collect more data. The newly collected data is appended to the training set, and the demonstrator is asked to label the new data with the appropriate outputs. Our method builds upon the DAGGER framework in the sense that we assume access to an online demonstrator, and iteratively expand the dataset with the demonstrators knowledge. However, our learning controller is only made to mimic the demonstrator in some selected states and this is only one of its optimization objectives. Deeply AggreVaTeD (sun_deeply_2017) is a further extension of DAGGER to continuous state and action spaces and nonlinear function approximation with \glsplnn. Both these approaches are pure \glsil methods and thus do not allow for the agent to surpass the demonstrator.

ddpgfd (vecerik_leveraging_2018) is a method for leveraging demonstration data for \glsrl algorithms in domains with continuous state and action spaces. In this work, human generated demonstration are included in the replay buffer for which a prioritized sampling technique is used to ensure a suitable mix of demonstration data and self collected data in each training batch. They further use a mix of 1-step and n-step return losses for training the Q-networks. They show increased performance over the demonstrations, as well as over regular \glsddpg even when the latter uses hand-crafted shaped rewards and \glsddpgfd uses sparse rewards.

nair_overcoming_2017 propose to address the problem of choosing when the learning controller should emulate the demonstrator with the use of the Q-filter. The Q-filter is an indicator function used to select when to apply a behaviour cloning loss on the action chosen by the demonstrator. The filter evaluates to true when the estimated value from the Q-function is higher for the demonstrator action than it is for the action chosen by the learning controller. This provides a natural way to anneal behaviour cloning loss during the training process that is adaptive to the task at hand. Their demonstrations come in the form of a fixed set of trajectories collected by a human demonstrator in a virtual reality version of the environment. In each training batch, a portion of the data is sampled from the regular replay buffer and a portion is sampled from the demonstration buffer, on which \glsbc is applied for the actions selected by the Q-filter. Our method borrows the concept of the Q-filter from this work, suggesting some important adjustments to the concept. Further, their source of demonstrations consists of a fixed dataset, while we assume online access to a demonstrator controller.

In \glsasddpg xie_learning_2018 propose a novel architecture for incorporating a simple controller to guide the initial exploration phase. In their architecture, the critic module has an additional branch that estimates the Q-values of the pre-existing controllers action and the actors action for the state in question. Based on these estimates the method greedily chooses the one with the highest Q-value, and applies this action as the exploration action during training. In this way, the guiding controller is used to show the actor some primitives it can use to aid exploration. As in our method, online access to the guiding controller is assumed, but the Q-filter is applied to select actions for exploration instead of to shape the gradients of the optimization procedure.

We consider the \glsrl problem in a \glsmdp framework. The \glsmdp is defined by the tuple $\u27e8\mathcal{S},\mathcal{A},R,\mathcal{T},\gamma \u27e9$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $R(s,a)$ is the reward function, $\mathcal{T}$ is the state transition function describing the evolution of the states as a function of time and actions, and $\gamma \in [0,1]$ is the discount factor, weighing the relative importance of immediate and future rewards. We denote a trajectory $\tau $ as a sequence of state and actions, e.g. one episode of the problem task, and the return as $R(\tau )={\sum}_{({s}_{t},{a}_{t})\sim \tau}{\gamma}^{t}R({s}_{t},{a}_{t})$, where $\sim $ signifies that the left hand side is distributed according to the distribution of the right hand side. The \glsrl objective is to find a policy $\pi $, i.e. a function that generates actions from states, which maximizes the sum of expected future rewards weighed by the discount factor $\gamma $, i.e. the expected return in an episodic setting. We denote this objective as ${\mathrm{max}}_{\theta}J(\theta )={\mathrm{\mathbb{E}}}_{\tau \sim {\pi}_{\theta}}\left[R(\tau )\right]$, where $\theta $ is the parameterization of the policy. Central to many \glsrl algorithm is the concept of the value of a state, given by the value function ${V}^{\pi}(s)={\mathrm{\mathbb{E}}}_{\tau \sim \pi}[R(\tau )|{s}_{0}=s]$. The value function measures the expected return of being in a given state and from there always acting according to the policy $\pi $. If we further leave the first action in the trajectory free, we get the state-action value function, often called the Q-function ${Q}^{\pi}(s,a)={\mathrm{\mathbb{E}}}_{\tau \sim \pi}[R(\tau )|{s}_{0}=s,{a}_{0}=a]$. We consider in this work a further extension to the \glsmdp framework in the form of a specified goal state, from a set of possible goals $\mathcal{G}$. schaul_universal_nodate show that value functions conditioned on this additional goal parameter can successfully be trained to generalize to unseen goals.

To address the challenges posed by tasks with continuous state and action spaces, lillicrap_continuous_2015 introduced \glsddpg. \glsddpg is an off-policy, model-free, policy-gradient, actor-critic algorithm which concurrently estimates the Q-function ${Q}^{{\pi}_{\theta}}$ and an actor that aims at solving for the actions which maximize the Q-function. These actor and critic functions are implemented as \glsplnn, and are trained off-policy from experience collected by the actor interacting with the environment. This experience is stored in a replay buffer $\mathcal{D}$, from which we sample minibatches $\mathcal{B}$ to train the actor and the critic. More concretely, the critic is trained in a supervised manner to regress on 1-step returns according to the Bellman optimality equation:

${y}_{t}$ | $={R}_{t}+\gamma Q({s}_{t+1},\pi ({s}_{t+1}|{\theta}^{Q}))$ | (1) | ||

$\mathcal{L}({\theta}^{Q})$ | $={\mathrm{\mathbb{E}}}_{({s}_{t},{a}_{t},{R}_{t},{s}_{t+1})\sim \mathcal{B}}\left[{({y}_{t}-Q({s}_{t},{a}_{t}|{\theta}^{Q}))}^{2}\right]$ | (2) |

where ${\theta}^{Q}$ is the parameterization of the Q-function. The actor attempts to find the optimal policy according to the critic through the relationship (Accelerating Reinforcement Learning with Suboptimal Guidance). Since this actor is deterministic, a stochastic behaviour policy (Accelerating Reinforcement Learning with Suboptimal Guidance) is used to collect experience in order to improve exploration, which is obtained by adding Gaussian noise with parameters selected to suit the problem. Finally, the actor is trained by optimizing the objective (Accelerating Reinforcement Learning with Suboptimal Guidance).

$\pi (s)$ | $=\text{arg}\underset{a}{\mathrm{max}}Q(s,a)$ | (3) | ||

${\pi}^{b}(s)$ | $=\pi (s)+\mathcal{N}(\mu ,\sigma )$ | (4) | ||

$\mathcal{L}(\theta )$ | $=\underset{\theta}{\mathrm{max}}{\mathrm{\mathbb{E}}}_{s\sim \mathcal{D}}\left[{Q}_{{\theta}^{Q}}(s,{\pi}_{\theta}(s))\right]$ | (5) |

td3 (fujimoto_addressing_2018) improves upon \glsddpg in several ways, notably by alleviating sources of overestimation in the critic network, and introduces more robustness towards hyperparameters. It does this by maintaining two separate networks estimating the critic Q-function, and using the lesser of these two as the target for regression (Accelerating Reinforcement Learning with Suboptimal Guidance). Further they suggest updating the policy networks less frequently to allow the value functions more time to converge before they are used to update the policy.

The environments considered in this paper have sparse reward structures, requiring a directed effort over several actions to reach regions of the state space where any learning signal is received at all. To reach learning signals in such scenarios more quickly, andrychowicz_hindsight_2017 proposed \glsher, in which the experience contained in the replay buffer is retroactively fitted with a new goal state that was actually achieved sometime during the episode. In this way, by pretending that some state the actor was successful in achieving was the actual goal state, the algorithm receives more frequent learning signals and convergence is accelerated.

In our setup we assume online access to a pre-existing sub-optimal controller which is capable of achieving goal states from a non-zero share of initial states. We also assume that we have access to the Q-function of this controller, denoted by ${Q}^{G}$. However, we do not assume prior access to demonstration trajectories from this controller.

We modify the actor objective function of \glsddpg style algorithms (Accelerating Reinforcement Learning with Suboptimal Guidance) by adding a \glsbc loss term, and as in nair_overcoming_2017 we only selectively apply this loss with the indicator function conditioned on the Q-filter:

${\mathcal{L}}_{BC}$ | $={\displaystyle \sum _{i=0}^{{N}_{b}}}{\parallel {a}_{i}-{\pi}_{\theta}({s}_{i})\parallel}_{2}{\mathbb{\U0001d7d9}}_{{Q}^{G}({s}_{i},{a}_{i})>{Q}^{{\pi}_{\theta}}({s}_{i},{\pi}_{\theta}({s}_{i}))}$ | (6) | ||

$\text{where}{\mathbb{\U0001d7d9}}_{AB}=\{\begin{array}{cc}1\mathit{\hspace{1em}}\text{if}AB\hfill & \\ 0\hspace{1em}\text{otherwise}\hfill & \end{array}$ | (7) | |||

${Q}^{{\pi}_{{\theta}_{0}}}$ | $\leftarrow {Q}^{G}$ | (8) |

Here ${N}_{b}$ is the number of samples in each training batch, i.e. we use the online availability of the pre-existing controller to apply this term to all samples in the batch. This does not add considerable overhead, as the pre-existing controllers action can be evaluated once and then saved to the replay buffer alongside the other data.\@footnotemark\@footnotetextTraining with a guiding controller increases the wall clock time of the training process by about 5% compared with normal training in our experiments. In this work we suggest to modify the original formulation of the Q-filter by replacing the left hand side of the condition with the pre-existing controllers own estimated Q-function, which is kept static throughout the training process. As the \glstd3 algorithm has two separate Q-networks, one can look at different ways of combining these to make up the Q-filter. In this work we have chosen to only use the Q-network that is used in the optimization objective (Accelerating Reinforcement Learning with Suboptimal Guidance) as a basis for the comparison.

In order for the actors and the guiding controllers Q-functions to have comparable magnitudes, the actors Q-network is initialized to that of the guiding controller (Accelerating Reinforcement Learning with Suboptimal Guidance). This also provides a starting point that should be closer to optimal than a random initialization.

The motivation behind this modification is based on two key insights about the formulation of the original Q-filter in nair_overcoming_2017: First, for their Q-filter to have its intended effects the Q-network need to have converged, otherwise the comparison of actions will be highly stochastic. Moreover, the action frequency in control applications is typically high, thereby only to a small degree affecting the subsequent state, thus the sides of the inequality will differ principally in the immediate reward, which in most cases will also be similar. The quantities that are compared will thus be very similar and the action discrimination is therefore further stochastisized. Second, the actor in \glsddpg style algorithms directly implements a maximization over actions of the Q-function, and will therefore quickly learn to optimize the unconverged Q-network, therefore seemingly (most times erroneously) offering better options than the pre-existing controller. Their implementation will henceforth be referred to as the naive method.

In practice, the Q-filter therefore tends to prefer the guiding controllers actions infrequently in the beginning, and converge to imitate the guiding controller with some non-zero frequency. We want the learning controller to aim to emulate the pre-existing controller to a large degree in the beginning, and then have the \glsbc loss taper off to allow the learning controller to surpass the pre-existing controller. With the original formulation of the Q-filter however, the \glsbc loss does not seem to be strategically applied. We therefore suggest using a different Q-function to assess the value of the pre-existing controllers actions, which is found by fitting a Q-network to data gathered by this controllers actions. This Q-function provides a more reliable estimate of the guiding controller actions true value right from the start, and also provides a static goal to compare against as the training progresses.

The guiding controllers are handcrafted proportional controllers with some conditional logic, e.g. move over block, then lower hand and grip block etc. The controllers exhibit varying proficiency levels, from a perfect\@footnotemark\@footnotetextPerfect in the sense that it is able to complete all tasks in the test set. controller capable of achieving any goal in the test set for the pick-and-place environment, to a controller achieving merely 21% of the goals in the slide task.

We train and evaluate our method on the OpenAI Gym Fetch environments (plappert_multi-goal_2018), which are based on the MuJoCo physics simulator (mujoco). The FetchReach environment is too easy to solve for the \glsrl algorithms even without the use of a guiding controller, and as such has been excluded from the experiments. For each environment a test set consisting of 100 random initial states and goal states has been constructed, and the agents are evaluated on this set every 20k time steps. As a baseline, we evaluate our method against the original formulation of the Q-filter, as well as against an approach where the weight of the \glsbc loss is linearly decayed with time steps. We run each experiment with five different random seeds and show mean results with one standard deviation confidence bounds.

Due to limited computational resources we employed the \glstd3 algorithm instead of \glsddpg, as convergence of the latter is often dependent on running several actors in parallel. Following the original paper introducing the Fetch environments (andrychowicz_hindsight_2017), we use a decay factor of $0.98$ and clip the regression Q-targets (Accelerating Reinforcement Learning with Suboptimal Guidance) to the ranges possible in the environment, and also continue bootstrapping on environment termination due to time limit. We use the hyperparameters from \glstd3, i.e. two fully-connected hidden layers of 400 and 300 nodes respectively, a learning rate of 0.001 for both actor and critic, and a replay buffer size of ${10}^{6}$. Training is run every 1000 time steps of the environment for 1000 gradient steps, with a batch size of 100. We apply a small ${\mathcal{L}}_{2}$ regularization term with strength $0.01$ to the pre-activation values of the actors output layer, in order to mitigate vanishing gradients issues. The \glsbc loss term (Accelerating Reinforcement Learning with Suboptimal Guidance) is weighted with a factor of 2.

The linear schedule is set to be fully decayed after 500k time steps, with an initial value equal to that of the other methods. This schedule is not optimal for all environments, but a static schedule was chosen to highlight some of the issues with this approach.

It is conceivable that the improvements of the proposed methods stem mainly from the knowledge encoded into ${Q}^{G}$ and the initialization of ${Q}^{{\pi}_{\theta}}$ to this. A pretrained ${Q}^{G}$ already contains information about which action would maximize the rewards when further following the same action choices as the pre-existing controller at each state, and the optimization procedure implemented by the actor could take advantage of this information to achieve some of the same benefits that the \glsbc loss offers. To test this hypothesis, we ran one version of our method without the \glsbc loss term, and one version of the naive method which was initialized to ${Q}^{G}$ as in (Accelerating Reinforcement Learning with Suboptimal Guidance).

Figure 1 illustrates the unwanted behaviours of the naive Q-filter as described in section Accelerating Reinforcement Learning with Suboptimal Guidance: it typically starts low and converges to some value in the range of 0.3 - 0.5. Our method on the other hand delivers on its promises of adaptivity: it copies a large portion of the actions produced by the perfect guiding controller in the pick-and-place environment, selects only some of the actions produced by the pre-existing controllers in the slide and push environments, and decreases the frequency of imitation as the agent improves. The two Q-filter approaches have the same asymptotic performance for the push and slide environments, but our Q-filter method has considerable more success in the pick-and-place environment. The linear method matches the performance of the adaptive methods for the push environment, but falls a bit short in the pick-and-place environment, and is only able to find any success for a single seed in the most difficult slide task. The linear scheduling method seems to be in general the method with the quickest initial learning, due to indiscriminately imitating all actions from the guiding controller at the start, while the two Q-filter approaches seem to share a similar time at which improvements begin.

While our implementation of the Q-filter does lead to more deliberate application of the \glsbc loss, it does introduce a credit assignment problem. Since the two Q-functions that are compared correspond to two different controllers, it is unclear which actions in the action sequences generated by the controllers are responsible for the observed difference in value. Imitating only the initial action is therefore a potential source of error, but introducing some randomness and perturbation to the training process in \glsrl is a known measure to avoid local minima, and therefore this might not be a significant issue.

Even though the linear scheduling method is less complex than the other methods it can sometimes achieve similar or better performance in terms of takeoff time and asymptotic performance. The slide environment clearly illustrates that this approach require careful tuning, as the method is in this case unable to represent a policy capable of achieving any goals for most seeds. The Q-filters on the other hand are adaptive and provide reasonable performance for any task with little tuning, and in particular our method will imitate a good guiding controller to a larger degree than a bad guiding controller, unlike the linear scheduling method. Our version of the Q-filter does however require additional information in the form of ${Q}^{E}$, and the linear method might therefore be preferable in some scenarios despite its limitations.

Figure 2 shows the results of the ablation experiments. This experiment shows that initializing to ${Q}^{G}$ is not enough in and of itself to produce the results in Figure 1, and that the \glsbc loss term is a vital part of the method. The knowledge contained in ${Q}^{G}$ is likely quickly lost through training on the limited data available in the beginning, and therefore does not significantly add to the performance. Figure 2 also shows that adding initialization to the naive version of the Q-filter does not alter the performance to a significant degree, further supporting the claim that the observed improvements in our method stem from an improved comparison in the Q-filter.

The Q-function of the pre-existing controller need not be be obtained in a complicated manner such as Q-learning. Often, one has some historical data of usage of this controller, which one could use to estimate the expected sum of rewards in an entirely offline manner. Furthermore, one might not need a function approximator as complicated as a \glsnn for this purpose, and some simpler methods such as linear regression might be sufficient. This would remove the possibility of directly initializing the Q-function of the actor to that of the pre-existing controller, but one could simply pretrain the Q-network in a supervised manner on outputs from the pre-existing controllers Q-function to achieve much of the same effect.

We have shown that our method is capable of accelerating learning using guiding controllers with a wide spread of proficiency levels. An important further work is looking into how the optimality of the guiding controller affects the rate of convergence and asymptotic performance of the learning controller.

Having a pre-existing controller available online opens up many possibilities. If one knowns the stability properties of this controller, one can have the agent explore freely while still in the \glsroa of the guiding controller, and then have the guiding controller assume control and stabilize the system when necessary. In this way one can achieve a form of safe training, in the sense that only safe regions of the state space is visited, and risk of damage to the system is minimized. An interesting further work is a study on what fraction of actions the agent needs to control itself for learning to be successful in such a scenario.