Combining learned skills and reinforcement learning
for robotic manipulations
Abstract
Manipulation tasks such as preparing a meal or assembling furniture remain highly challenging for robotics and vision. The supervised approach of imitation learning can handle short tasks but suffers from compounding errors and the need for many demonstrations for longer and more complex tasks. Reinforcement learning (RL) can find solutions beyond demonstrations but requires tedious and task-specific reward engineering for multi-step problems. In this work we address the difficulties of both methods and explore their combination. To this end, we propose RL policies operating on pretrained skills that can learn composite manipulations using no intermediate rewards and no demonstrations of full tasks. We also propose an efficient training of basic skills from few synthetic demonstrated trajectories by exploring recent CNN architectures and data augmentation. We show successful learning of policies for composite manipulation tasks such as making a simple breakfast. Notably, our method achieves high success rates on a real robot while using synthetic training data only.
I Introduction
In this work we consider visually guided robotic manipulation and aim to learn visuomotor control policies to solve particular tasks. One approach to this problem is imitation learning (IL) [1, 2, 3, 4], which aims to mimic sequences of actions from expert demonstrations. Such a supervised approach is efficient for learning short and simple tasks of limited variability. One drawback of IL is its difficulty in handling new states that have not been observed during demonstrations. While increasing the number of demonstrations helps to alleviate this issue, an exhaustive sampling of action sequences and scenarios becomes impractical for long and complex tasks.
Reinforcement learning (RL) is a complementary approach to IL that requires little supervision and achieves excellent results for some challenging tasks [5, 6]. RL explores previously unseen scenarios and, hence, can generalize beyond expert demonstrations. As full exploration is exponentially hard and becomes impractical for problems with long horizons, RL often relies on careful engineering of intermediate reward functions designed for specific tasks.
Common tasks such as preparing food or assembling furniture require long sequences of steps composed of many different actions. Such tasks have long horizons and, hence, are difficult to solve with either RL or IL methods. To address this issue, we propose a hierarchy of RL and imitation-based skills. Our approach aims to simplify RL by reducing its exploration to sequences with a limited number of primitive actions that we call skills.
Given a set of pretrained skills such as "grasp a cube" or "pour from a cup", we train RL with sparse binary rewards corresponding to the correct/incorrect execution of the full task. While hierarchical policies have been proposed in the past [7, 8], our approach can learn composite manipulations using no intermediate rewards and no demonstrations of full tasks. Given such properties, the proposed method can be directly applied to learn new tasks. See Figure 1 for an overview of our approach.
Our skills are low-level visuomotor controllers learned from synthetic demonstrated trajectories with behavioral cloning (BC) [1]. Examples of skills include go to the bowl, grasp the object, pour from the held object, release the held object, etc.
We automatically generate expert synthetic demonstrations and learn corresponding skills in simulated environments. We also minimize the number of required demonstrations by choosing appropriate CNN architectures and data augmentation methods. Our approach is shown to compare favorably to the state of the art [3] on the FetchPickPlace test environment [9]. Moreover, using recent techniques for domain adaptation [10] we demonstrate high success rates for task execution on a real robot while training all policies in a simulator.
In summary, this work makes the following contributions. (i) We propose a new hierarchical combination of RL policies and IL skills to address composite tasks. (ii) We present sample-efficient training of BC skills and demonstrate an improvement compared to the state of the art. (iii) We demonstrate successful learning of relatively complex manipulation tasks with neither intermediate rewards nor full demonstrations. (iv) We execute policies learned in simulation to solve multi-step tasks successfully on a real robot. Our simulation environments together with the code and models used in this work will become publicly available.
II Related work
Our work is related to robotic manipulation tasks such as grasping [11], opening doors [12], screwing the cap of a bottle [13] and cube stacking [14]. Such tasks have been addressed by various methods including imitation learning (IL) [15] and reinforcement learning (RL) [16].
Imitation Learning learns to solve a task by observing demonstrations. Approaches include behavioral cloning (BC) [1] and inverse reinforcement learning [4]. BC learns a function that maps states to expert actions [2, 3], whereas inverse reinforcement learning learns a reward function from demonstrations in order to solve the task with RL [17, 18, 14]. BC typically requires a large number of demonstrations and has issues with solving long-horizon problems. While these problems might be solved with additional expert supervision [2] or noise injection in expert demonstrations [19], we address them by improving the standard BC framework. We use recent state-of-the-art CNN architectures and data augmentation for expert trajectories. Combining these allows us to significantly reduce the number of required demonstrations and to improve performance.
Reinforcement Learning learns to solve a task without demonstrations using exploration. Despite impressive results in several domains [6, 5, 20, 12], RL methods show limited capabilities when operating in the long-horizon and sparse-reward environments common in robotics. Moreover, RL methods typically require prohibitively large amounts of interaction with the environment during training. Hierarchical RL (HRL) methods alleviate some of these problems by learning a high-level policy modulating low-level workers. HRL approaches are generally based either on options [21] or a feudal framework [22]. The option methods learn a master policy that switches between separate skill policies [23, 24, 25, 26]. The feudal approaches learn a master policy that modulates a low-level policy by a control signal [27, 28, 29, 30, 31]. Our approach is based on options but, in contrast to the cited methods, we pretrain the skills with IL. This allows us to solve long-horizon and sparse-reward problems using significantly fewer interactions with the environment during training.
Combinations of RL and IL have been introduced recently. Gao et al. [32] use demonstrations to initialize the RL agent. [33, 34] use RL to improve expert demonstrations, but do not learn hierarchical policies. Demonstrations have also been used to define RL objective functions [35, 36] and rewards [37]. Das et al. [7] combine IL and RL to learn a hierarchical policy. Unlike our method, however, [7] requires full task demonstrations and task-specific reward engineering. Moreover, the navigation problem addressed in [7] has a much shorter time horizon than our tasks. [7] also relies on pretrained CNN representations, which limits its application domain. Le et al. [8] train low-level skills with RL, while using demonstrations to switch between skills. In a reverse manner, we use IL to learn low-level control and then deploy RL to find appropriate sequences of pretrained skills. The advantage is that our method can learn a variety of long-horizon manipulations without full task demonstrations. Moreover, [7, 8] learn discrete actions and cannot be directly applied to robotic manipulations that require continuous control.
In summary, none of the methods [7, 8, 33, 34] is directly suitable for learning long-horizon robotic manipulations due to requirements of dense rewards [7, 34] and state inputs [33, 34], limitations to short horizons and discrete actions [7, 8], the requirement of full task demonstrations [7, 8, 33, 34] and the lack of learning of visual representations [7, 33, 34]. Moreover, our skills learned from synthetic demonstrated trajectories outperform RL-based methods, see Section IVF.
III Approach
Our HRLBC approach aims to learn multistep policies by combining hierarchical reinforcement learning (HRL) and pretrained skills obtained with behavioral cloning (BC). We present BC and HRLBC in Sections IIIA and IIIB. Implementation details are given in Section IIIC.
IIIA Skill learning with behavioral cloning
Our first goal is to learn basic skills that can be composed into more complex policies. Given observation-action pairs $\mathcal{D}=\{({o}_{t},{a}_{t})\}$ along expert trajectories, we follow the behavioral cloning approach [1] and learn a function approximating the conditional distribution of the expert policy ${\pi}_{\text{E}}({a}_{t}\mid {o}_{t})$ controlling a robot arm. Our observations ${o}_{t}\in \mathcal{O}={\mathbb{R}}^{\text{H}\times \text{W}\times \text{M}}$ are sequences of the last $M$ depth frames. Actions ${a}_{t}=({\mathbf{v}}_{t},{\bm{\omega}}_{t},{g}_{t})$, ${a}_{t}\in {\mathcal{A}}^{\text{BC}}$ are defined by the end-effector linear velocity ${\mathbf{v}}_{t}\in {\mathbb{R}}^{3}$ and angular velocity ${\bm{\omega}}_{t}\in {\mathbb{R}}^{3}$ as well as the gripper openness state ${g}_{t}\in \{0,1\}$.
We learn the deterministic skill policies ${\pi}_{s}:\mathcal{O}\to {\mathcal{A}}^{\text{BC}}$ approximating the expert policy ${\pi}_{\text{E}}$. Given observations ${o}_{t}$ with corresponding expert (ground truth) actions ${a}_{t}=({\mathbf{v}}_{t},{\bm{\omega}}_{t},{g}_{t})$, we represent ${\pi}_{s}$ with a convolutional neural network (CNN) and learn network parameters $\theta $ such that predicted actions ${\pi}_{s}({o}_{t})=({\widehat{\mathbf{v}}}_{t},{\widehat{\bm{\omega}}}_{t},\widehat{{g}_{t}})$ minimize the loss
${L}_{\text{BC}}({\pi}_{s}({o}_{t}),{a}_{t})=\lambda \,{\parallel [{\widehat{\mathbf{v}}}_{t},{\widehat{\bm{\omega}}}_{t}]-[{\mathbf{v}}_{t},{\bm{\omega}}_{t}]\parallel}_{2}^{2}-(1-\lambda )\left({g}_{t}\mathrm{log}\,\widehat{{g}_{t}}+(1-{g}_{t})\,\mathrm{log}(1-\widehat{{g}_{t}})\right)$  (1)
where $\lambda \in [0,1]$ is a scaling factor which we set to 0.9 (chosen empirically).
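As a minimal sketch of how the loss in Equation (1) could be evaluated for a single sample, the following pure-Python function combines the squared L2 error on the six velocity components with a binary cross-entropy on the gripper state. The function name `bc_loss`, its argument names and the clamping constant `eps` are illustrative assumptions and do not come from our implementation; the cross-entropy is written with its usual negative sign so that the loss is non-negative.

```python
import math


def bc_loss(pred_vel, pred_grip, expert_vel, expert_grip, lam=0.9, eps=1e-7):
    """Per-sample BC loss (illustrative sketch).

    pred_vel, expert_vel: sequences of 6 floats, [v, omega].
    pred_grip: predicted probability that the gripper is open, in (0, 1).
    expert_grip: expert gripper state, 0 or 1.
    lam: weight of the velocity term (0.9 in the text).
    """
    # Squared L2 distance between predicted and expert velocities.
    l2 = sum((p - e) ** 2 for p, e in zip(pred_vel, expert_vel))
    # Clamp the probability so that both logarithms stay finite (assumption).
    g_hat = min(max(pred_grip, eps), 1.0 - eps)
    # Binary cross-entropy between the expert state and the prediction.
    ce = -(expert_grip * math.log(g_hat)
           + (1 - expert_grip) * math.log(1.0 - g_hat))
    return lam * l2 + (1.0 - lam) * ce
```

With $\lambda = 0.9$ the velocity regression dominates the objective, and the gripper classification contributes the remaining tenth of the weight.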
Our network architecture is presented in Figure 1(right). When training several skill policies ${\pi}_{s}^{1},\mathrm{\dots},{\pi}_{s}^{K}$ such as reaching, grasping or pouring, we share parameters $\theta $ of the $CNN(\theta )$ and add a separate branch with convolutional layers $CNN({\eta}^{i})$ for each skill $i$. We use the same architecture to learn the master policy ${\pi}_{m}$ as explained in Section IIIB.
IIIB HRLBC approach
We wish to solve composite manipulations without full expert demonstrations and with a single sparse reward. For this purpose we design high-level master policies ${\pi}_{m}$ controlling the pretrained skill policies ${\pi}_{s}$ at a coarse timescale. To learn ${\pi}_{m}$, we follow the standard formulation of reinforcement learning and maximize the expected return ${\mathbb{E}}_{\pi}{\sum}_{k=0}^{\mathrm{\infty}}{\gamma}^{k}{r}_{t+k}$ given rewards ${r}_{t}$. Our reward function is sparse and returns $1$ upon successful termination of the task and $0$ otherwise. The RL master policy ${\pi}_{m}:\mathcal{O}\times {\mathcal{A}}^{\text{RL}}\to [0,1]$ chooses one of the $K$ skill policies to execute the low-level control, i.e., the action space of ${\pi}_{m}$ is discrete: ${\mathcal{A}}^{\text{RL}}=\{1,\mathrm{\dots},K\}$. Note that our sparse reward function makes the learning of deep visual representations challenging. We therefore train ${\pi}_{m}$ using visual features obtained from the BC pretrained $CNN(\theta )$ as illustrated in Figure 1(right).
To solve composite tasks with sparse rewards, we use a coarse timescale for the master policy. The selected skill policy controls the robot for $n$ consecutive timesteps before the master policy is activated again to choose a new skill. This allows the master to focus on high-level decisions rather than low-level control. At the same time, we expect the master policy to recover from unexpected events, for example, if an object slips out of the gripper, by activating an appropriate skill policy. Our hierarchical combination of the master and skill policies is illustrated in Figure 1(left).
HRLBC algorithm.
The pseudocode of the proposed approach is shown in Algorithm IIIB. The algorithm can be divided into three main steps. First, we collect a dataset of expert trajectories ${\mathcal{D}}_{k}$ for each skill policy ${\pi}_{s}^{k}$. For each policy, we use an expert script that has access to the full state of the environment. Next, we train a set of skill policies $\{{\pi}_{s}^{1},\mathrm{\dots},{\pi}_{s}^{K}\}$. We sample a batch of state-action pairs and update the parameters of the convolutional layers $\theta $ as well as the skill-specific parameters ${\eta}^{i}$ for skills $i=1,\mathrm{\dots},K$. Finally, we learn the master ${\pi}_{m}$ using the pretrained skill policies and the frozen parameters $\theta ,\eta $. We collect episode rollouts by first choosing a skill policy with the master and then applying the selected skill to the environment for $n$ timesteps (lines 13-14). We update the master policy weights $\mu $ by maximizing the expected sum of rewards.
1: *** Collect expert data ***
2: for $k\in \{1,\mathrm{\dots},K\}$ do
3:     Collect an expert dataset ${\mathcal{D}}_{k}$ for the skill policy ${\pi}_{s}^{k}$
4: *** Train $\{{\pi}_{s}^{1},\mathrm{\dots},{\pi}_{s}^{K}\}$ by solving: ***
5: $\theta ,\eta =\mathrm{arg}\,{\mathrm{min}}_{\theta ,\eta}{\sum}_{k=1}^{K}{\sum}_{({o}_{t},{a}_{t})\in {\mathcal{D}}_{k}}{L}_{\text{BC}}({\pi}_{s}^{k}({o}_{t}),{a}_{t})$
6: while task is not solved do
7:     *** Collect data for the master policy ***
8:     $\mathcal{E}=\{\}$  ▷ Empty storage for rollouts
9:     for $\text{episode\_id}\in \{1,\mathrm{\dots},\text{ppo\_num\_episodes}\}$ do
10:         ${o}_{0}=\text{new\_episode\_observation}()$
11:         $t=0$
12:         while episode is not terminated do
13:             ${k}_{t}\sim {\pi}_{m}({o}_{t})$  ▷ Choose the skill policy
14:             ${o}_{t+n},{r}_{t+n}=\text{perform\_skill}({\pi}_{s}^{{k}_{t}},{o}_{t})$
15:             $t=t+n$
16:         $\mathcal{E}=\mathcal{E}\cup \{({o}_{0},{k}_{1},{r}_{1},{o}_{1},{k}_{2},{r}_{2},{o}_{2},\mathrm{\dots})\}$
17:     *** Make a PPO step for the master policy on $\mathcal{E}$ ***
18:     $\mu =\text{ppo\_update}({\pi}_{m},\mathcal{E})$
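The rollout loop of the algorithm can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not our implementation: `ToyEnv` is a hypothetical stand-in for the simulator in which the task succeeds once three stages have been completed in order, each skill is a hypothetical function that always emits its own index as the action, and the PPO update is omitted. The sketch only shows the coarse timescale: the master picks a skill, and that skill controls the environment for up to `n` low-level steps.

```python
class ToyEnv:
    """Hypothetical environment: stage i is completed by low-level action i."""

    def __init__(self, num_stages=3, max_steps=60):
        self.num_stages = num_stages
        self.max_steps = max_steps

    def reset(self):
        self.t = 0
        self.stage = 0
        return self.stage  # the observation is simply the current stage

    def step(self, action):
        self.t += 1
        if action == self.stage:
            self.stage += 1
        success = self.stage == self.num_stages
        done = success or self.t >= self.max_steps
        # Sparse reward: 1 on successful termination, 0 otherwise.
        return self.stage, float(success), done


def run_episode(env, master, skills, n):
    """Collect one rollout of (skill index, reward, next observation)."""
    obs = env.reset()
    rollout, done = [], False
    while not done:
        k = master(obs)          # master chooses a skill
        reward = 0.0
        for _ in range(n):       # the skill acts for up to n timesteps
            obs, reward, done = env.step(skills[k](obs))
            if done:
                break
        rollout.append((k, reward, obs))
    return rollout


# Skill k always outputs action k (an assumption of this toy setup).
skills = [lambda obs, k=k: k for k in range(3)]
good = run_episode(ToyEnv(), lambda obs: obs, skills, n=5)
bad = run_episode(ToyEnv(), lambda obs: 0, skills, n=5)
```

In this toy setup a master that selects the skill matching the current stage finishes in three decisions, whereas a master that always selects the first skill runs until the episode limit without collecting any reward.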
IIIC Implementation details
Skill learning with BC. While training the skills, we predict several actions in the future to improve the policy performance and to stabilize the training. We learn to predict actions at times $t,t+10,t+20,t+30$ during training and use the action at time $t$ to control the robot at test time. We normalize the ground truth of the expert actions to have zero mean and unit variance and normalize the depth values of input frames to be in $[-1,1]$. For the optimization, we use Adam [38] with a learning rate of ${10}^{-3}$ and a batch size of 64. In all cases we use Batch Normalization [39].
Task learning with RL. We use PPO [40] as an off-the-shelf RL optimization algorithm for the master policy. We use an open-source implementation from [41] where we set the entropy coefficient to 0.05 and the value loss coefficient to 1. For the two environments considered in Section V, we use 8 episode rollouts for the PPO update. The rest of the PPO hyperparameters are set to the defaults of the open-source implementation. For the HRLBC method, the master and skill policies have CNN layers with parameters $\mu ,{\eta}^{1},\mathrm{\dots},{\eta}^{K}$ on top of the common $CNN(\theta )$. During pretraining of skill policies we update the parameters $\{\theta ,{\eta}^{1},\mathrm{\dots},{\eta}^{K}\}$. When training the master policy, we only update $\mu $ while keeping all other parameters fixed.
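The preprocessing described above can be sketched as follows. The helper names, the depth range arguments `d_min` and `d_max`, and the clamping of future indices at the end of a trajectory are assumptions made for illustration; the text does not specify how targets beyond the last timestep are handled.

```python
from statistics import mean, pstdev


def normalize_actions(actions):
    """Scale every action dimension to zero mean and unit variance."""
    dims = list(zip(*actions))
    mus = [mean(d) for d in dims]
    # Fall back to 1.0 for constant dimensions to avoid division by zero.
    sigmas = [pstdev(d) or 1.0 for d in dims]
    return [[(a - m) / s for a, m, s in zip(act, mus, sigmas)]
            for act in actions]


def normalize_depth(depth, d_min, d_max):
    """Map a depth value from [d_min, d_max] to [-1, 1]."""
    return 2.0 * (depth - d_min) / (d_max - d_min) - 1.0


def future_targets(actions, t, offsets=(0, 10, 20, 30)):
    """Expert actions at t, t+10, t+20 and t+30.

    Indices past the end of the trajectory are clamped to the last
    action (an assumption of this sketch).
    """
    last = len(actions) - 1
    return [actions[min(t + k, last)] for k in offsets]
```

At test time only the first of the four predicted actions would be sent to the robot; the remaining three serve as auxiliary training targets.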
IV Evaluation of BC skill learning
IVA Experimental setup
Robot and agent environment. Existing environments used for evaluating learning methods in a robotics scenario [9] are limited. Here, we design a set of tasks with the PyBullet physics simulator [42]. Our environment models a 6-DoF UR5 robotic arm and a 3-finger Robotiq gripper. The agent observes the environment from a depth camera in front of the arm, takes as input the three last depth frames ${o}_{t}\in {\mathbb{R}}^{224\times 224\times 3}$ and commands the robot with an action ${a}_{t}\in {\mathbb{R}}^{7}$. The control is performed at a frequency of 10 Hz.
UR5 tasks. For the skill learning we rely on the tasks of picking up a cube (UR5Pick) and pouring from a bottle into a bowl (UR5Pour). The goal of the UR5Pick task is to pick up a cube of a size between 3.5 cm and 8.0 cm and lift it above 7 cm, see Figure 2(a). The maximum episode length is 200 steps. The goal of the UR5Pour task is to pour from a filled bottle into an empty bowl, see Figure 2(b). The maximum episode length is 400 steps. In each episode, the bottle and the bowl are randomly chosen from the ShapeNet dataset [43]. We use distinct object instances for the training and test sets.
Synthetic dataset. We use our simulator environment to create a synthetic training and test set. For all our experiments, we collect expert synthetic demonstrated trajectories with random initial configurations where the objects and the end-effector are placed within a workspace of $80\times 40\times 20\,{\text{cm}}^{3}$. The synthetic demonstrations are collected using an expert script designed for each skill. The script has access to the full state of the system including the states of the robot and the objects. To generate synthetic demonstrations, we program end-effector trajectories and use inverse kinematics (IK) to generate corresponding trajectories in the robot joint space. For example, to pick up a cube, we draw a line from the robot initial position to a position above the cube, go down, close the gripper and finally go up. We collect up to $1000$ training demonstrations in our simulator. Each demonstration consists of multiple pairs of the three last camera observations and the robot control command performed by the expert script. For the experiments with the UR5 tasks, we use a reference viewpoint located 140 cm from the robot base, with a ${20}^{\circ}$ pitch angle.
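The pick-up example can be made concrete with a small sketch of a scripted end-effector trajectory. The function names and the `hover` and `lift` heights are hypothetical values chosen for illustration, and the conversion of these Cartesian waypoints to joint trajectories with IK is deliberately left out.

```python
def pick_waypoints(start, cube, hover=0.10, lift=0.10):
    """Return (position, gripper_open) waypoints for a scripted pick.

    start, cube: (x, y, z) positions in metres.
    hover, lift: heights above the cube in metres (illustrative values).
    """
    cx, cy, cz = cube
    return [
        (tuple(start), True),            # initial end-effector position
        ((cx, cy, cz + hover), True),    # straight line to a point above the cube
        ((cx, cy, cz), True),            # go down to the cube
        ((cx, cy, cz), False),           # close the gripper
        ((cx, cy, cz + lift), False),    # go up with the cube
    ]


def interpolate(p, q, steps):
    """Points on the straight line from p (excluded) to q (included)."""
    return [tuple(a + (b - a) * i / steps for a, b in zip(p, q))
            for i in range(1, steps + 1)]
```

Consecutive waypoints can be connected with `interpolate` to obtain one target position per control step; velocities then follow from the differences between successive targets.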
Training and evaluation. For each dataset, we train the policy for 200 epochs. During training, we evaluate the policy every 4 epochs by running it on a set of 100 validation configurations and measure its success rate. We pick the policy with the best success rate on the validation set. For testing we use 50 new configurations and report the success rate of completing the task correctly for these 50 configurations.
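The model selection protocol amounts to keeping the checkpoint with the highest validation success rate. A minimal sketch, assuming a hypothetical callable `evaluate(epoch)` that returns the success rate measured on the validation configurations after the given epoch:

```python
def select_best_epoch(evaluate, num_epochs=200, eval_every=4):
    """Return (best_epoch, best_success_rate) over the evaluated epochs.

    evaluate: hypothetical callable mapping an epoch index to the
    validation success rate in [0, 1].
    """
    best_epoch, best_rate = None, -1.0
    for epoch in range(eval_every, num_epochs + 1, eval_every):
        rate = evaluate(epoch)
        # Strict comparison keeps the earliest epoch in case of ties.
        if rate > best_rate:
            best_epoch, best_rate = epoch, rate
    return best_epoch, best_rate
```

With 200 epochs and an evaluation every 4 epochs, this compares 50 checkpoints per training run.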
IVB CNN architecture for BC skill learning
Given the synthetic training set of the UR5Pick task, see Figure 2(a) (left), we train policies with different CNN architectures. Table I compares the success rates of learned policies for different CNN architectures and varying number of expert demonstrations on the synthetic test set. For BC skill policies, we here directly predict the action from the output of $CNN(\theta )$.
Policies based on the VGG CNN architecture [44] obtain a success rate below $40\%$ with $100$ training demonstrations and reach $95\%$ with $1000$ demonstrations.
ResNet [45] based policies have a success rate above $60\%$ when trained on a dataset of $100$ demonstrations and reach $100\%$ with $1000$ demonstrations. Overall, ResNet101 has the best performance, closely followed by ResNet18, and both outperform VGG significantly. To conclude, we find that the network architecture has a fundamental impact on the BC performance. In the following experiments we use ResNet18 as it presents a good trade-off between performance and training time.
When examining why VGG-based BC has a lower success rate, we observe that it has higher validation errors compared to ResNet. This indicates that VGG performs worse on the level of individual steps and is hence expected to result in higher compounding errors. We find that architectures based on ResNet suffer from the same problem only when a small number of demonstrations is used (fewer than 100).
IVC Evaluation of data augmentation
We evaluate the impact of different types of data augmentation in Table II. We compare training without data augmentation with 3 variants: (1) random translations, rotations and crops (the standard approach for object detection), (2) recording each expert synthetic demonstration from 10 varying viewpoints and (3) the combination of (1) and (2). We sample the camera positions on a section of a sphere centered on the robot and with a radius of $140\,\text{cm}$. We uniformly sample the yaw angle in $[-{15}^{\circ},{15}^{\circ}]$, the pitch angle in $[{15}^{\circ},{30}^{\circ}]$, and the distance to the robot base in $[1.35,1.50]$ m.
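The viewpoint sampling can be sketched as a draw in spherical coordinates followed by a conversion to a Cartesian camera position. The sketch assumes that the yaw range is symmetric around the reference viewpoint (from -15 to 15 degrees), that the robot base is at the origin, and that x points towards the reference camera with z pointing up; the function name and this axis convention are illustrative assumptions.

```python
import math
import random


def sample_camera_position(rng):
    """Sample one camera position (x, y, z) in metres around the robot base."""
    yaw = math.radians(rng.uniform(-15.0, 15.0))
    pitch = math.radians(rng.uniform(15.0, 30.0))
    dist = rng.uniform(1.35, 1.50)
    # Spherical to Cartesian conversion (axis convention is an assumption).
    x = dist * math.cos(pitch) * math.cos(yaw)
    y = dist * math.cos(pitch) * math.sin(yaw)
    z = dist * math.sin(pitch)
    return x, y, z


rng = random.Random(0)
viewpoints = [sample_camera_position(rng) for _ in range(10)]
```

Recording every demonstration from 10 such viewpoints multiplies the number of training images by 10 without requiring additional expert trajectories.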
Success rates for UR5Pick on datasets with 20, 50 and 100 demonstrations are reported in Table II. We observe that data augmentation is particularly important when only a few demonstrations are available. For 20 demonstrations, the policy trained with no augmentation performs at 1% while the policy trained with standard and viewpoint augmentations together performs at 75%. The standard data augmentation brings more improvement than the viewpoint augmentation for small numbers of demonstrations (20 and 50). However, it does not reach 100% with 100 demonstrations, in contrast to multiple viewpoints. The policy trained with a combination of both augmentation types performs best and achieves 93% and 100% success rates for 50 and 100 demonstrations respectively. In summary, data augmentation allows a significant reduction in the number of expert trajectories required to solve the task.
IVD Evaluation of instance generalization
We evaluate how BC skills generalize from specific instances to object categories on the UR5Pour task shown in Figure 2 (left). In the following experiments we use ResNet18 trained with 200 synthetic demonstrations and the combination of standard and viewpoint data augmentations. The evaluation in our simulation environment compares the use of 1, 10 or 50 bottle instances during training. We use 50 different bottles to test the learned policies. All bottles are obtained from the ShapeNet dataset [43].
When trained with a single bottle, the policy achieves a success rate of 46% and does not generalize well to the test bottles. If trained on 10 bottles, the success rate is 84%. Further increasing the training set size to 50 bottles does not improve the results. This can be explained by the fact that the additional bottles do not add information, as they are similar to the initial set.
IVE Real robot experiments
We evaluate trained policies on a real robot, which has the same robotic arm and gripper as in simulation. We record depth images with a Microsoft Kinect 2 placed as in simulation. In order to apply the method on the real robot, we use a state-of-the-art sim2real technique based on data augmentation with domain randomization [10]. This method uses a proxy task of cube position prediction and a set of basic image transformations to learn a sim2real data augmentation function for depth images. We augment the depth frames from synthetic expert demonstrations with this method and then train skill policies. Once the skill policy is trained on these augmented simulation images, it is directly used on the real robot.
We evaluate our method on UR5Pick and UR5Pour described in Section IVA, see Figure 2. We use 20 different initial configurations for the real-world evaluation. We show that our approach, when trained with sim2real data augmentation, transfers well to the real robot. The learned policy picks up cubes of sizes 3.5, 4.7 and 8 cm correctly in 20 out of 20 trials. The pouring task is slightly more difficult, and 17 out of 20 trials are successful. For evaluation, we use 3 real bottles whose shapes were not present in the simulation training set.
IVF Comparison with stateoftheart methods
One of the few testbeds for robot manipulation is FetchPickPlace from OpenAI Gym [9] implemented in MuJoCo [46], see Figure 3(a). The goal for the agent is to pick up a cube and to move it to the red target. The agent observes the three last RGBD images from a camera placed in front of the robot ${o}_{t}\in {\mathbb{R}}^{100\times 100\times 4\times 3}$. The positions of the cube and the target are set at random for each trial. The reward of the task is a single sparse reward given if the cube is within an $\epsilon$ ball of the target (depicted in red in Figure 3(a)). The maximum length of the task is set to 50 timesteps.
For a fair comparison with [3], we follow the same procedure for the dataset recording and do not use any data augmentation. Instead, we record each synthetic demonstration from a viewpoint picked at random using the same sampling space described in Section IVC. We train a ResNet18 policy using between $10$ and ${10}^{4}$ expert demonstrations. The results are reported in Figure 3(b) where we test our approach on 200 different initial configurations. We follow [3] and plot the success rate of both RL and IL methods with respect to the number of episodes used (either trials or synthetic demonstrations).
The success rates indicate that we outperform policies trained with the imitation learning method DAgger [2] in terms of performance, and RL methods such as HER [47] and DDPG [48] in terms of data efficiency. According to the reported results, DAgger does not reach 100% even after $8\times {10}^{4}$ demonstrations and, unlike our method, requires an expert during training. HER reaches a success rate of 100% but requires about $4\times {10}^{4}$ episodes. Our approach achieves a 96% success rate using ${10}^{4}$ demonstrations.
V Evaluation of HRLBC
This section evaluates the proposed HRLBC approach of learning an RL master policy on top of pretrained BC skills. We evaluate the method on two tasks described in Section VA: UR5Bowl and UR5Breakfast, illustrated in Figure 4. Both tasks are long and are trained with a single sparse reward of success. Section VB compares our HRLBC approach with a strong "BCordered" baseline where the skills are executed in a fixed and predefined order. We report results for simulated and real versions of both tasks. Note that our real robot experiments are performed with skills and master policies that have been trained exclusively in simulation.
VA Experimental setup
HRLBC tasks. To evaluate HRLBC we consider two UR5 tasks. The goal of the UR5Bowl task is to grasp the cube and to place it in the bowl as shown in Figure 4(a). The maximum episode length is 600. The UR5Breakfast task contains a cup, a bottle and a bowl as shown in Figure 4(b). The agent needs to pour ingredients from the cup and the bottle into the bowl; the reward is positive if and only if all ingredients are in the bowl. The maximum episode length is 2000.
Skills and datasets. For each task we consider a set of skills defined by expert scripts. For the UR5Bowl task, we define four skills: (a) go to the cube, (b) go down and grasp, (c) go up, and (d) go to the bowl and open the gripper. To train BC skill policies, we collect a dataset of 250 synthetic demonstrated trajectories for each skill recorded from 5 viewpoints.
For the UR5Breakfast task, we define four skills: (a) go to the bottle, (b) go to the cup, (c) grasp an object and pour it into the bowl, and (d) release the held object. We collect a dataset with 250 demonstrations containing cup pouring, bottle pouring and object placement in random locations. Synthetic demonstrations are rendered from 5 viewpoints and contain different bottles and cups from ShapeNet. We emphasize that the expert dataset does not contain full task demonstrations and that all our training is done in simulation.
Training. Similar to Section IV, we learn BC skills to mimic synthetic demonstrations generated by the expert script. Unlike BC skill learning in Section IV, we here train all skills of the same task simultaneously using the multi-head CNN architecture illustrated in Figure 1(right). We represent each head by a convolutional block from ResNet18. Each head output is then spatially averaged with an average pooling layer and used to predict the skill action using a linear layer. When training the RL master, we execute selected skills for 60 consecutive timesteps for the UR5Bowl task and 220 timesteps for the UR5Breakfast task. The master policy keeps switching the skills until either the maximum length of the episode or the success condition is reached. We train HRLBC using 8 different random seeds in parallel. The rest of the RL hyperparameters are described in Section IIIC. During the RL and BC skill training we apply sim2real data augmentation [10] to enable successful execution of policies on a real robot.
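To illustrate the coarse timescale of the master policy, the following short calculation derives an upper bound on the number of skill choices per episode from the maximum episode lengths given in Section VA (600 and 2000 steps) and the skill durations above. The function name is an illustrative assumption, and the bound presumes that a skill started near the end of an episode is simply cut short.

```python
import math


def master_decisions(max_episode_length, skill_duration):
    """Upper bound on the number of skill choices in one episode."""
    return math.ceil(max_episode_length / skill_duration)


bowl = master_decisions(600, 60)          # UR5Bowl
breakfast = master_decisions(2000, 220)   # UR5Breakfast
```

Under these assumptions the master makes at most about ten decisions per episode in both tasks, compared with hundreds or thousands of low-level control steps.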


VB Results
UR5Bowl results.
We evaluate HRLBC and compare it to a strong baseline with a fixed sequence of BC skills following the manually predefined correct order (BCordered). Our results for the simulated "UR5Bowl (sim)" and real "UR5Bowl (real)" environments are reported in Table III. When tested in simulation, both HRLBC and BCordered put the cube into the bowl with a success rate of 96%. HRLBC thus manages to learn the sequence of skills given only sparse rewards.
We have also attempted to solve the UR5Bowl task without skills by learning an RL policy performing low-level control. We initialized ResNet18 on ImageNet, froze it and trained the same architecture used for the HRLBC master policy (3 convolutional layers with 64 filters of size $3\times 3$) on top of the visual representation with PPO. Whereas PPO did not solve the task a single time after ${10}^{4}$ episodes, HRLBC reaches 96% after 400 episodes.

Table III: Success rates of BCordered and HRLBC in simulation and on the real robot.

| Task | BCordered | HRLBC |
| --- | --- | --- |
| UR5Bowl (sim) | 96% | 96% |
| UR5Bowl (real) | 85% | 90% |
| UR5BreakfastSimple (sim) | 90% | 90% |
| UR5BreakfastSimple (real) | 80% | 80% |
| UR5BreakfastHard (sim) | 42% | 60% |
| UR5BreakfastHard (real) | 45% | 80% |
We evaluate UR5Bowl on the real robot as illustrated in Figure 4(a)(right). While the ordered skills solve the task in 17 out of 20 episodes, HRLBC succeeds in 18 out of 20 episodes. Whereas the ordered skills use a fixed number of timesteps to execute actions, the master policy can determine how close the cube is to the gripper and can recover from failures.
UR5Breakfast results.
We evaluate HRLBC on the challenging UR5Breakfast task both in simulation and on the real robot. Results are reported in Table III. In particular, we define two setups with different distances between the initial positions of the objects: UR5BreakfastSimple and UR5BreakfastHard. In the hard setup, the bottle and the cup are placed close to each other, and therefore the order of grasping becomes very important. Attempts to grasp the wrong object in this setup result in failures due to collisions of the robot with the other object. While training HRLBC, we use the same BC skills for both setups. When tested in simulation, HRLBC performs similarly to BCordered on UR5BreakfastSimple (90% of successful trials) but outperforms the manual ordering on UR5BreakfastHard (60% for HRLBC vs. 42% for BCordered). While the ordered skills always grasp the predefined object, HRLBC can choose which object to grasp first, which is especially important on UR5BreakfastHard. In the real-world evaluation, both HRLBC and the ordered skills succeeded in 16 out of 20 episodes on the simple setup. However, on the hard setup the performance of the ordered skills drops to 45% due to collisions, whereas HRLBC succeeds in 80% of trials (Table III) by choosing the appropriate object to avoid collisions. Some qualitative results for HRLBC are shown in Figure 4. The appendix presents more qualitative results. In particular, we demonstrate successful execution of HRLBC while the robot is facing the challenges of previously unseen objects, dynamic scene changes and occlusions.
VI Conclusion
This paper introduces an approach for hierarchical RL with skills learned from synthetic demonstrated trajectories. Our method is well adapted to composite problems and requires neither full-task demonstrations nor intermediate rewards. We show high success rates in simulation and on a real robot. Given our sample-efficient strategy for learning primitive skills, the proposed method should generalize to a variety of new tasks. Future work includes learning multiple tasks with shared skills, determining new skills automatically and addressing contact-rich tasks.
References
 [1] D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neural network,” in NIPS, 1989.
 [2] S. Ross and J. A. Bagnell, “Reinforcement and imitation learning via interactive no-regret learning,” in arXiv, 2014. [Online]. Available: http://arxiv.org/abs/1406.5979
 [3] L. Pinto, M. Andrychowicz, P. Welinder, W. Zaremba, and P. Abbeel, “Asymmetric Actor Critic for Image-Based Robot Learning,” RSS, 2018.
 [4] A. Y. Ng and S. J. Russell, “Algorithms for inverse reinforcement learning,” in ICML, 2000.
 [5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, 2015. [Online]. Available: http://www.nature.com/doifinder/10.1038/nature14236
 [6] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, 2016.
 [7] A. Das, G. Gkioxari, S. Lee, D. Parikh, and D. Batra, “Neural modular control for embodied question answering,” in CoRL, 2018.
 [8] H. M. Le, N. Jiang, A. Agarwal, M. Dudík, Y. Yue, and H. D. III, “Hierarchical imitation and reinforcement learning,” in ICML, 2018.
 [9] M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, V. Kumar, and W. Zaremba, “Multi-goal reinforcement learning: Challenging robotics environments and request for research,” arXiv, 2018. [Online]. Available: http://arxiv.org/abs/1802.09464
 [10] A. Pashevich, R. A. M. Strudel, I. Kalevatykh, I. Laptev, and C. Schmid, “Learning to augment synthetic images for sim2real policy transfer,” IROS, 2019.
 [11] T. Lampe and M. Riedmiller, “Acquiring visual servoing reaching and grasping skills using neural reinforcement learning,” IJCNN, 2013.
 [12] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation,” in ICML, 2016. [Online]. Available: http://arxiv.org/abs/1610.00633
 [13] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” JMLR, 2015. [Online]. Available: http://arxiv.org/abs/1504.00702
 [14] I. Popov, N. Heess, T. Lillicrap, R. Hafner, G. Barth-Maron, M. Vecerik, T. Lampe, Y. Tassa, T. Erez, and M. Riedmiller, “Data-efficient Deep Reinforcement Learning for Dexterous Manipulation,” arXiv, 2017. [Online]. Available: http://arxiv.org/abs/1704.03073
 [15] Y. Duan, M. Andrychowicz, B. C. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba, “One-shot imitation learning,” NIPS, 2017.
 [16] M. A. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. V. de Wiele, V. Mnih, N. Heess, and J. T. Springenberg, “Learning by playing  Solving sparse reward tasks from scratch,” MLR, 2018. [Online]. Available: http://arxiv.org/abs/1802.10567
 [17] J. Ho and S. Ermon, “Generative adversarial imitation learning,” in NIPS, 2016.
 [18] V. Kumar, A. Gupta, E. Todorov, and S. Levine, “Learning dexterous manipulation policies from experience and imitation,” arXiv, 2016. [Online]. Available: http://arxiv.org/abs/1611.05095
 [19] M. Laskey, J. Lee, R. Fox, A. D. Dragan, and K. Goldberg, “DART: Noise injection for robust imitation learning,” in CoRL, 2017. [Online]. Available: http://proceedings.mlr.press/v78/laskey17a.html
 [20] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” IJRR, vol. 32, no. 11, 2013.
 [21] R. S. Sutton, D. Precup, and S. Singh, “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning,” Artificial Intelligence, vol. 112, no. 1–2, pp. 181–211, 1999.
 [22] P. Dayan and G. Hinton, “Feudal Reinforcement Learning,” in NIPS, 1993.
 [23] K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman, “Meta learning shared hierarchies,” ICLR, 2018. [Online]. Available: http://arxiv.org/abs/1710.09767
 [24] Y. Lee, S.-H. Sun, S. Somasundaram, E. S. Hu, and J. J. Lim, “Composing complex skills by learning transition policies with proximity reward induction,” in ICLR, 2019.
 [25] P.-L. Bacon, J. Harb, and D. Precup, “The option-critic architecture,” in AAAI, 2017.
 [26] C. Florensa, Y. Duan, and P. Abbeel, “Stochastic neural networks for hierarchical reinforcement learning,” ICLR, 2017. [Online]. Available: http://arxiv.org/abs/1704.03012
 [27] T. Haarnoja, K. Hartikainen, P. Abbeel, and S. Levine, “Latent space policies for hierarchical reinforcement learning,” ICML, 2018. [Online]. Available: http://arxiv.org/abs/1804.02808
 [28] O. Nachum, S. Gu, H. Lee, and S. Levine, “Data-efficient hierarchical reinforcement learning,” NIPS, 2018. [Online]. Available: http://arxiv.org/abs/1805.08296
 [29] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu, “Feudal networks for hierarchical reinforcement learning,” in ICML, 2017. [Online]. Available: http://arxiv.org/abs/1703.01161
 [30] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, “Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation,” in NIPS, 2016.
 [31] K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller, “Learning an embedding space for transferable robot skills,” in ICLR, 2018. [Online]. Available: https://openreview.net/forum?id=rk07ZXZRb
 [32] Y. Gao, H. Xu, J. Lin, F. Yu, S. Levine, and T. Darrell, “Reinforcement learning from imperfect demonstrations,” ICLR workshop, 2018. [Online]. Available: https://openreview.net/forum?id=BJJ9bz0
 [33] C.-A. Cheng, X. Yan, N. Wagener, and B. Boots, “Fast policy learning through imitation and reinforcement,” UAI, 2018.
 [34] W. Sun, J. A. Bagnell, and B. Boots, “Truncated horizon policy search: Combining reinforcement learning & imitation learning,” in ICLR, 2018.
 [35] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, A. Sendonaris, G. Dulac-Arnold, I. Osband, J. Agapiou, J. Z. Leibo, and A. Gruslys, “Deep Q-learning from demonstrations,” AAAI, 2018.
 [36] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Overcoming exploration in reinforcement learning with demonstrations,” in ICRA, 2018. [Online]. Available: https://doi.org/10.1109/ICRA.2018.8463162
 [37] Y. Zhu, Z. Wang, J. Merel, A. A. Rusu, T. Erez, S. Cabi, S. Tunyasuvunakool, J. Kramár, R. Hadsell, N. de Freitas, and N. Heess, “Reinforcement and imitation learning for diverse visuomotor skills,” arXiv, 2018. [Online]. Available: http://arxiv.org/abs/1802.09564
 [38] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” ICLR, 2015.
 [39] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015. [Online]. Available: http://arxiv.org/abs/1502.03167
 [40] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347
 [41] I. Kostrikov, “Pytorch implementations of reinforcement learning algorithms,” https://github.com/ikostrikov/pytorch-a2c-ppo-acktr, 2018.
 [42] E. Coumans and Y. Bai, “PyBullet, a Python module for physics simulation, robotics and machine learning,” http://pybullet.org/, 2016–2017.
 [43] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “ShapeNet: An information-rich 3D model repository,” arXiv, 2015. [Online]. Available: http://arxiv.org/abs/1512.03012
 [44] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
 [45] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” CVPR, 2016.
 [46] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in IROS, 2012. [Online]. Available: http://ieeexplore.ieee.org/document/6386109/
 [47] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” in NIPS, 2017.
 [48] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” ICLR, 2016. [Online]. Available: http://arxiv.org/abs/1509.02971
VII Appendix
We present additional qualitative results for the HRLBC approach on the real robot with policies that have been trained in simulation, as in the main paper. Section VIIA describes and illustrates examples of UR5Bowl and UR5Breakfast policies. We demonstrate robot behavior while the robot is facing the challenges of previously unseen objects, dynamic changes of object locations and occlusions. We also illustrate feature map activations of the network providing better understanding of learned policies in Section VIIB.
VIIA Qualitative results
UR5Bowl: Multiple objects.
We experiment with the HRLBC policy trained in the UR5Bowl environment. Once the robot succeeds in placing a cube in the bowl, we put another cube on the table and let the policy continue, see Figure 5. Although the UR5Bowl policy has been trained to handle one cube only, it automatically generalizes to multiple cubes when run in a loop.
UR5Bowl: Previously unseen objects.
We further test the HRLBC UR5Bowl policy in the presence of previously unseen objects. While the policy has been trained to manipulate cubes of different sizes (see Section VA), we observe its robustness to other object shapes. As shown in Figure 6, the policy successfully grasps real objects such as apples, oranges, lemons and toys, and places them into a bowl. Notably, when it fails to grasp an object, the robot automatically recovers and completes the task. This behavior comes naturally from our HRLBC master policy, which has learned to adapt the sequence of skills given current observations of the scene.
UR5Breakfast: New object instances.
To enable generalization of learned policies to new object instances, our UR5Breakfast environment contains cups, bottles and bowls of different shapes from ShapeNet (see Section VA). During testing we run the learned HRLBC UR5Breakfast policy on a real robot and experiment with instances of bottles and cups unseen during training. Figure 7 demonstrates successful executions of the HRLBC UR5Breakfast policy in scenes with significant variations in object shapes, for example, using a wine glass instead of a cup.
UR5Breakfast: Dynamic changes of object location.
Our BC skills make decisions at every timestep and, hence, can instantly adapt to changing conditions of the scene. We verify this by varying object positions during grasping attempts of the HRLBC UR5Breakfast policy. Figure 8 illustrates the reactive behavior of the robot grasping a cup that is simultaneously being moved by a person. The cup is successfully grasped after multiple changes of its position.
UR5Breakfast: Occlusion.
Another example of instant replanning by our HRLBC policy is shown in Figure 9. While the robot approaches an object, we temporarily occlude the object and disrupt the executing BC skill. Given the hierarchical nature of our HRLBC approach, the RL master is able to recover from this failure by starting another skill that leads to the completion of the task. More precisely, when the bottle gets occluded in the example of Figure 9, the robot changes its strategy and decides to grasp and pour from the cup. Once the occlusion is removed, the robot automatically resumes and completes the task by grasping and pouring from the bottle.
UR5Breakfast: Failure case for BCordered.
Finally, we demonstrate the advantage of HRLBC compared to the "BCordered" baseline with a fixed order of skills. For the UR5Breakfast environment we define the order of skills for the BCordered policy as 1. "go to the cup", 2. "grasp the object and pour it to the bowl", 3. "release the object", 4. "go to the bottle", 5. "grasp the object and pour it to the bowl", 6. "release the object". While this policy succeeds in many cases, placing the objects close together is a source of problems. Figure 10 shows an example scene in which the cup and the bottle are near each other and the cup is placed in front of the bottle. As the BCordered policy above is preprogrammed to grasp the cup first, the execution of this policy results in a collision between the gripper and the bottle, followed by the failure of the task. Notably, our HRLBC policy learns to select the order in which objects are grasped so as to avoid failures of the task. Hence, HRLBC automatically learns to avoid collisions without the need for specific intermediate rewards. This advantage of HRLBC is supported by the quantitative results in Table III.
VIIB Feature map activations
The HRLBC policy uses no explicit representation of scenes, for example in terms of categories and locations of objects. Some interpretation of learned policies, however, can be obtained by examining spatial activations of the neural network at intermediate network layers. Figure 11 shows saliency maps of an HRLBC policy highlighting which parts of the image the agent concentrates on. The saliency maps are computed as activations of convolutional feature maps obtained from the last layers of $\mathrm{CNN}(\theta)$ and $\mathrm{CNN}(\eta^{i})$, $i=1,\dots,K$ (see Figure 1), averaged over all channels and layers. The resulting heatmaps are shown for different stages of the UR5Breakfast task. Interestingly, while grasping and moving the bottle, the network generates the highest activations around the bottle and the gripper, while ignoring other objects. When releasing the bottle, however, high activations are also observed around the cup. The attention to the cup might be explained by the need to avoid collisions when placing the bottle on the table. Note that we provide no intermediate rewards to RL. Nevertheless, RL learns to avoid collisions, since collisions imply failure of the final task and, hence, no final positive reward during training. Once the bottle is placed on the table, activations become low for the bottle, while the heatmap reaches its maximum values for the manipulated cup. Observations of such feature maps have been useful in our work to identify certain cases of failures. We believe feature map activations are a useful tool to interpret learned policies.
[Figure 11 panels: grasping bottle, moving bottle, pouring from bottle, releasing bottle, grasping cup, pouring from cup]
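The saliency computation described in Section VIIB can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the code used in the paper: the function name `saliency_map` is hypothetical, all feature maps are assumed to share the same spatial size, all channels from all layers are pooled with equal weight, and the final rescaling to [0, 1] is added only for display.

```python
def saliency_map(feature_maps):
    """Average activations over all channels and layers.

    feature_maps: list of layers; each layer is a list of channels;
    each channel is an H x W nested list of floats (same H, W everywhere).
    Returns an H x W nested list rescaled to the range [0, 1].
    """
    channels = [ch for layer in feature_maps for ch in layer]
    h, w = len(channels[0]), len(channels[0][0])
    mean = [
        [sum(ch[i][j] for ch in channels) / len(channels) for j in range(w)]
        for i in range(h)
    ]
    lo = min(min(row) for row in mean)
    hi = max(max(row) for row in mean)
    scale = (hi - lo) or 1.0  # avoid division by zero for constant maps
    return [[(v - lo) / scale for v in row] for row in mean]


# Toy input: one layer with two 2 x 2 channels.
toy = [[[[0.0, 2.0], [4.0, 6.0]], [[2.0, 4.0], [6.0, 8.0]]]]
heat = saliency_map(toy)
print(heat)
```

In practice the resulting low-resolution map would be upsampled to the input image size before being overlaid as a heatmap; that step is omitted here.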