Combining learned skills and reinforcement learning for robotic manipulations

  • 2019-08-02 07:04:17
  • Robin Strudel, Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Josef Sivic, Cordelia Schmid


Manipulation tasks such as preparing a meal or assembling furniture remain highly challenging for robotics and vision. The supervised approach of imitation learning can handle short tasks but suffers from compounding errors and the need for many demonstrations for longer and more complex tasks. Reinforcement learning (RL) can find solutions beyond demonstrations but requires tedious and task-specific reward engineering for multi-step problems. In this work we address the difficulties of both methods and explore their combination. To this end, we propose RL policies operating on pre-trained skills that can learn composite manipulations using no intermediate rewards and no demonstrations of full tasks. We also propose an efficient training of basic skills from few synthetic demonstrated trajectories by exploring recent CNN architectures and data augmentation. We show successful learning of policies for composite manipulation tasks such as making a simple breakfast. Notably, our method achieves high success rates on a real robot, while using synthetic training data only.



Robin Strudel*1, Alexander Pashevich*2, Igor Kalevatykh1,
Ivan Laptev1, Josef Sivic1, Cordelia Schmid2




1 Inria, École normale supérieure, CNRS, PSL Research University, 75005 Paris, France.
2 University Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France.
* Equal contribution.

I Introduction

In this work we consider visually guided robotic manipulation and aim to learn visuomotor control policies to solve particular tasks. One approach to address this problem is imitation learning (IL) [1, 2, 3, 4], which aims to mimic sequences of actions from expert demonstrations. Such a supervised approach is efficient for learning short and simple tasks of limited variability. One drawback of IL is its difficulty in handling new states that have not been observed during demonstrations. While increasing the number of demonstrations helps to alleviate this issue, an exhaustive sampling of action sequences and scenarios becomes impractical for long and complex tasks.

Reinforcement learning (RL) is a complementary approach to IL that requires little supervision and achieves excellent results for some challenging tasks [5, 6]. RL explores previously unseen scenarios and, hence, can generalize beyond expert demonstrations. As full exploration is exponentially hard and becomes impractical for problems with long horizons, RL often relies on careful engineering of intermediate reward functions designed for specific tasks.

Common tasks such as preparing food or assembling furniture require long sequences of steps composed of many different actions. Such tasks have long horizons and, hence, are difficult to solve with either RL or IL methods. To address this issue, we propose a hierarchy of RL and imitation-based skills. Our approach aims to simplify RL by reducing its exploration to sequences with a limited number of primitive actions, which we call skills.

Given a set of pre-trained skills such as "grasp a cube" or "pour from a cup", we train RL with sparse binary rewards corresponding to the correct/incorrect execution of the full task. While hierarchical policies have been proposed in the past [7, 8], our approach can learn composite manipulations using no intermediate rewards and no demonstrations of full tasks. Given such properties, the proposed method can be directly applied to learn new tasks. See Figure 1 for the overview of our approach.

Our skills are low-level visuomotor controllers learned from synthetic demonstrated trajectories with behavioral cloning (BC) [1]. Examples of skills include go to the bowl, grasp the object, pour from the held object, release the held object, etc.

We automatically generate expert synthetic demonstrations and learn corresponding skills in simulated environments. We also minimize the number of required demonstrations by choosing appropriate CNN architectures and data augmentation methods. Our approach is shown to compare favorably to the state of the art [3] on the FetchPickPlace test environment [9]. Moreover, using recent techniques for domain adaptation [10] we demonstrate high success rates for task execution on a real robot while training all policies in a simulator.

In summary, this work makes the following contributions. (i) We propose a new hierarchical combination of RL policies and IL skills to address composite tasks. (ii) We present sample-efficient training of BC skills and demonstrate an improvement compared to the state of the art. (iii) We demonstrate successful learning of relatively complex manipulation tasks with neither intermediate rewards nor full demonstrations. (iv) We execute policies learned in simulation to solve multi-step tasks successfully on a real robot. Our simulation environments together with the code and models used in this work will become publicly available.

Fig. 1: Illustration of our approach. (Left): Temporal hierarchy of master and skill policies. The master policy $\pi_m$ is executed at a coarse interval of $n$ time-steps to select among $K$ skill policies $\pi_s^1,\dots,\pi_s^K$. Each skill policy generates control for a primitive action such as grasping or pouring. (Right): CNN architecture used for the skill and master policies.

II Related work

Our work is related to robotics manipulation such as grasping [11], opening doors [12], screwing the cap of a bottle [13] and cube stacking [14]. Such tasks have been addressed by various methods including imitation learning (IL) [15] and reinforcement learning (RL) [16].

Imitation Learning learns to solve a task by observing demonstrations. Approaches include behavioral cloning (BC) [1] and inverse reinforcement learning [4]. BC learns a function that maps states to expert actions [2, 3], whereas inverse reinforcement learning learns a reward function from demonstrations in order to solve the task with RL [17, 18, 14]. BC typically requires a large number of demonstrations and has issues with solving long-horizon problems. While these problems might be solved with additional expert supervision [2] or noise injection in expert demonstrations [19], we address them by improving the standard BC framework. We use recent state-of-the-art CNN architectures and data augmentation for expert trajectories. Combining these allows us to significantly reduce the number of required demonstrations and to improve performance.

Reinforcement Learning learns to solve a task without demonstrations using exploration. Despite impressive results in several domains [6, 5, 20, 12], RL methods show limited capabilities when operating in long-horizon and sparse-reward environments common in robotics. Moreover, RL methods typically require prohibitively large amounts of interaction with the environment during training. Hierarchical RL (HRL) methods alleviate some of these problems by learning a high-level policy modulating low-level workers. HRL approaches are generally based either on options [21] or a feudal framework [22]. The option methods learn a master policy that switches between separate skill policies [23, 24, 25, 26]. The feudal approaches learn a master policy that modulates a low-level policy by a control signal [27, 28, 29, 30, 31]. Our approach is based on options but, in contrast to the cited methods, we pretrain the skills with IL. This allows us to solve long-horizon and sparse-reward problems using significantly fewer interactions with the environment during training.

Combinations of RL and IL have been introduced recently. Gao et al. [32] use demonstrations to initialize the RL agent. [33, 34] use RL to improve expert demonstrations, but do not learn hierarchical policies. Demonstrations have also been used to define RL objective functions [35, 36] and rewards [37]. Das et al. [7] combine IL and RL to learn a hierarchical policy. Unlike our method, however, [7] requires full task demonstrations and task-specific reward engineering. Moreover, the navigation problem addressed in [7] has a much shorter time horizon than our tasks. [7] also relies on pre-trained CNN representations, which limits its application domain. Le et al. [8] train low-level skills with RL, while using demonstrations to switch between skills. In a reverse manner, we use IL to learn low-level control and then deploy RL to find appropriate sequences of pre-trained skills. The advantage is that our method can learn a variety of long-horizon manipulations without full task demonstrations. Moreover, [7, 8] learn discrete actions and cannot be directly applied to robotic manipulations that require continuous control.

In summary, none of the methods [7, 8, 33, 34] is directly suitable for learning long-horizon robotic manipulations due to requirements of dense rewards [7, 34] and state inputs [33, 34], limitations to short horizons and discrete actions [7, 8], the requirement of full task demonstrations [7, 8, 33, 34] and the lack of learning of visual representations [7, 33, 34]. Moreover, our skills learned from synthetic demonstrated trajectories outperform RL based methods, see Section IV-F.

III Approach

Our HRL-BC approach aims to learn multi-step policies by combining hierarchical reinforcement learning (HRL) and pre-trained skills obtained with behavioral cloning (BC). We present BC and HRL-BC in Sections III-A and III-B. Implementation details are given in Section III-C.

III-A Skill learning with behavioral cloning

Our first goal is to learn basic skills that can be composed into more complex policies. Given observation-action pairs $\mathcal{D}=\{(o_t,a_t)\}$ along expert trajectories, we follow the behavioral cloning approach [1] and learn a function approximating the conditional distribution of the expert policy $\pi_E(a_t|o_t)$ controlling a robot arm. Our observations $o_t\in\mathcal{O}=\mathbb{R}^{H\times W\times M}$ are sequences of the last $M$ depth frames. Actions $a_t=(\mathbf{v}_t,\boldsymbol{\omega}_t,g_t)$, $a_t\in\mathcal{A}_{BC}$, are defined by the end-effector linear velocity $\mathbf{v}_t\in\mathbb{R}^3$ and angular velocity $\boldsymbol{\omega}_t\in\mathbb{R}^3$ as well as the gripper openness state $g_t\in\{0,1\}$.

We learn the deterministic skill policies $\pi_s:\mathcal{O}\to\mathcal{A}_{BC}$ approximating the expert policy $\pi_E$. Given observations $o_t$ with corresponding expert (ground truth) actions $a_t=(\mathbf{v}_t,\boldsymbol{\omega}_t,g_t)$, we represent $\pi_s$ with a convolutional neural network (CNN) and learn network parameters $\theta$ such that predicted actions $\pi_s(o_t)=(\hat{\mathbf{v}}_t,\hat{\boldsymbol{\omega}}_t,\hat{g}_t)$ minimize the loss

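$$
L_{BC}(\pi_s(o_t),a_t)=\lambda\,\big\|[\hat{\mathbf{v}}_t,\hat{\boldsymbol{\omega}}_t]-[\mathbf{v}_t,\boldsymbol{\omega}_t]\big\|_2^2-(1-\lambda)\,\big(g_t\log\hat{g}_t+(1-g_t)\log(1-\hat{g}_t)\big) \tag{1}
$$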

where $\lambda\in[0,1]$ is a scaling factor which we set to 0.9 (chosen empirically).
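The loss in Eq. (1) can be sketched for a single time-step in plain Python. This is only an illustration: the function name `bc_loss`, the argument layout and the clamping constant `eps` are assumptions, not details from the paper. The gripper term is written as a negative log-likelihood, the usual convention for a cross-entropy that is minimized.

```python
import math

def bc_loss(pred_vel, pred_grip, expert_vel, expert_grip, lam=0.9, eps=1e-7):
    """Behavioral cloning loss for one time-step (illustrative sketch).

    pred_vel, expert_vel: 6 floats (linear followed by angular velocity).
    pred_grip: predicted probability that the gripper is open, in (0, 1).
    expert_grip: expert gripper state, 0 or 1.
    lam: weight of the velocity term (0.9 in the text).
    """
    # Squared L2 distance between predicted and expert velocities.
    vel_term = sum((p - e) ** 2 for p, e in zip(pred_vel, expert_vel))
    # Clamp the probability so that log() stays finite (assumed detail).
    p = min(max(pred_grip, eps), 1.0 - eps)
    # Binary cross-entropy written as a negative log-likelihood.
    grip_term = -(expert_grip * math.log(p)
                  + (1 - expert_grip) * math.log(1.0 - p))
    return lam * vel_term + (1.0 - lam) * grip_term
```

With perfect velocities and an undecided gripper prediction of 0.5, only the cross-entropy term contributes, giving 0.1 · log 2.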

Our network architecture is presented in Figure 1(right). When training several skill policies $\pi_s^1,\dots,\pi_s^K$ such as reaching, grasping or pouring, we share parameters $\theta$ of the CNN($\theta$) and add a separate branch with convolutional layers CNN($\eta_i$) for each skill $i$. We use the same architecture to learn the master policy $\pi_m$ as explained in Section III-B.

III-B HRL-BC approach

We wish to solve composite manipulations without full expert demonstrations and with a single sparse reward. For this purpose we design high-level master policies $\pi_m$ controlling the pre-trained skill policies $\pi_s$ at a coarse timescale. To learn $\pi_m$, we follow the standard formulation of reinforcement learning and maximize the expected return $\mathbb{E}_{\pi}\big[\sum_{k=0}^{\infty}\gamma^k r_{t+k}\big]$ given rewards $r_t$. Our reward function is sparse and returns 1 upon successful termination of the task and 0 otherwise. The RL master policy $\pi_m:\mathcal{O}\times\mathcal{A}_{RL}\to[0,1]$ chooses one of the $K$ skill policies to execute the low-level control, i.e., the action space of $\pi_m$ is discrete: $\mathcal{A}_{RL}=\{1,\dots,K\}$. Note that our sparse reward function makes the learning of deep visual representations challenging. We therefore train $\pi_m$ using visual features obtained from the BC pre-trained CNN($\theta$) as illustrated in Figure 1(right).
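To make the sparse-reward return concrete, the following sketch computes the discounted sum for a finite episode. The discount value 0.99 and the episode length are illustrative assumptions; the text does not state them.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_{t+k} over a finite list of rewards."""
    ret = 0.0
    # Accumulate from the last reward backwards (Horner-style evaluation).
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

# Sparse reward: 0 at every step, 1 only on successful termination.
success_at_step_10 = [0.0] * 9 + [1.0]
failed_episode = [0.0] * 10

g_success = discounted_return(success_at_step_10, gamma=0.99)  # 0.99 ** 9
g_failure = discounted_return(failed_episode, gamma=0.99)      # 0.0
```

A failed episode yields a return of exactly zero, so it carries no signal about which intermediate decisions were good. This is why exploration over raw low-level actions is hard in this setting.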

To solve composite tasks with sparse rewards, we use a coarse timescale for the master policy. The selected skill policy controls the robot for n consecutive time-steps before the master policy is activated again to choose a new skill. This allows the master to focus on high-level decisions rather than low-level control. At the same time, we expect the master policy to recover from unexpected events, for example, if an object slips out of a gripper, by activating an appropriate skill policy. Our hierarchical combination of the master and skill policies is illustrated in Figure 1(left).

HRL-BC algorithm.

The pseudo-code of the proposed approach is shown in Algorithm 1. The algorithm can be divided into three main steps. First, we collect a dataset of expert trajectories $\mathcal{D}_k$ for each skill policy $\pi_s^k$. For each policy, we use an expert script that has access to the full state of the environment. Next, we train a set of skill policies $\{\pi_s^1,\dots,\pi_s^K\}$. We sample a batch of state-action pairs and update the parameters $\theta$ of the shared convolutional layers as well as the skill-specific parameters $\eta_i$ for skills $i=1,\dots,K$. Finally, we learn the master policy $\pi_m$ with parameters $\mu$ using the pretrained skill policies and the frozen parameters $\theta,\eta$. We collect episode rollouts by first choosing a skill policy with the master and then applying the selected skill to the environment for $n$ time-steps (the skill-selection and skill-execution steps of the inner loop). We update the master policy weights $\mu$ by maximizing the expected sum of rewards.


Algorithm 1: HRL-BC

    *** Collect expert data ***
    for $k\in\{1,\dots,K\}$ do
        Collect an expert dataset $\mathcal{D}_k$ for the skill policy $\pi_s^k$
    end for
    *** Train $\{\pi_s^1,\dots,\pi_s^K\}$ by solving: ***
    $\theta,\eta=\arg\min_{\theta,\eta}\sum_{k=1}^{K}\sum_{(o_t,a_t)\in\mathcal{D}_k}L_{BC}(\pi_s^k(o_t),a_t)$
    while task is not solved do
        *** Collect data for the master policy ***
        $\mathcal{E}=\{\}$    ▷ Empty storage for rollouts
        for episode_id $\in\{1,\dots,\text{ppo\_num\_episodes}\}$ do
            $o_0$ = new_episode_observation()
            $t=0$
            while episode is not terminated do
                $k_t\sim\pi_m(o_t)$    ▷ Choose the skill policy
                $o_{t+n},r_{t+n}$ = perform_skill($\pi_s^{k_t},o_t$)
                $t=t+n$
            end while
            $\mathcal{E}=\mathcal{E}\cup\{(o_0,k_1,r_1,o_1,k_2,r_2,o_2,\dots)\}$
        end for
        *** Make a PPO step for the master policy on $\mathcal{E}$ ***
        $\mu$ = ppo_update($\pi_m,\mathcal{E}$)
    end while
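The rollout-collection part of the algorithm can be sketched as follows. It is a minimal illustration under stated assumptions: `ToyEnv`, the two toy skills and the fixed master are hypothetical stand-ins for the simulator, the BC skill networks and the learned master policy, and the PPO update itself is omitted.

```python
def collect_rollout(env, master, skills, n, max_steps):
    """Run one episode in which the master picks a skill every n steps."""
    obs = env.reset()
    rollout, t, done = [], 0, False
    while not done and t < max_steps:
        k = master(obs)                  # master chooses a skill index
        reward = 0.0
        for _ in range(n):               # the chosen skill controls the robot
            obs, reward, done = env.step(skills[k](obs))
            t += 1
            if done or t >= max_steps:
                break
        # One master-level transition; with a sparse reward, the last
        # low-level reward is the only one that can be non-zero.
        rollout.append((k, reward, obs))
    return rollout


class ToyEnv:
    """Hypothetical 1-D task: succeed when the position reaches 5."""

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos += action
        done = self.pos >= 5
        return self.pos, (1.0 if done else 0.0), done


skills = [lambda obs: +1, lambda obs: -1]   # toy "move right" / "move left"
master = lambda obs: 0                      # fixed master: always skill 0
rollout = collect_rollout(ToyEnv(), master, skills, n=2, max_steps=20)
# rollout == [(0, 0.0, 2), (0, 0.0, 4), (0, 1.0, 5)]
```

The master sees only three decision points here although the environment advanced five low-level steps, which is the temporal abstraction the hierarchy relies on.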

III-C Implementation details

Skill learning with BC. While training the skills, we predict several actions in the future to improve the policy performance and to stabilize the training. We learn to predict actions at times $t,t+10,t+20,t+30$ during training and use the action at time $t$ to control the robot at test time. We normalize the ground truth of the expert actions to have zero mean and unit variance and normalize the depth values of input frames to be in $[-1,1]$. For the optimization, we use Adam [38] with a learning rate of $10^{-3}$ and a batch size of 64. In all cases we use Batch Normalization [39].
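The preprocessing described above can be sketched in plain Python. The helper names, the depth range passed to `normalize_depth` and the clamping of future indices at the end of a trajectory are assumptions made for illustration; the text does not specify these details.

```python
def normalize_depth(frame, d_min, d_max):
    """Map depth values from [d_min, d_max] to [-1, 1]."""
    scale = d_max - d_min
    return [[2.0 * (d - d_min) / scale - 1.0 for d in row] for row in frame]


def standardize(values):
    """Rescale one action dimension to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    std = std if std > 0 else 1.0        # guard against constant signals
    return [(v - mean) / std for v in values]


def future_targets(actions, t, offsets=(0, 10, 20, 30)):
    """Expert actions at t, t+10, t+20 and t+30 used as training targets.

    Indices past the end of the trajectory are clamped to the last
    action (an assumption of this sketch).
    """
    last = len(actions) - 1
    return [actions[min(t + k, last)] for k in offsets]
```

At test time only the first of the four predicted actions would be sent to the robot; the other three act as auxiliary targets during training.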

Task learning with RL. We use PPO [40] as an off-the-shelf RL optimization algorithm for the master policy. We use an open-source implementation from [41] where we set the entropy coefficient to 0.05 and the value loss coefficient to 1. For the two environments considered in Section V, we use 8 episode rollouts for the PPO update. The rest of the PPO hyperparameters are set to the defaults of the open-source implementation. For the HRL-BC method, the master and skill policies have CNN layers with parameters $\mu,\eta_1,\dots,\eta_K$ on top of the common CNN($\theta$). During pre-training of skill policies we update the parameters $\{\theta,\eta_1,\dots,\eta_K\}$. When training the master policy, we only update $\mu$ while keeping all other parameters fixed.

IV Evaluation of BC skill learning

IV-A Experimental setup

Robot and agent environment. Existing environments used for evaluating learning methods in a robotics scenario [9] are limited. Here, we design a set of tasks with the pybullet physics simulator [42]. Our environment models a 6-DoF UR5 robotic arm and a 3-finger Robotiq gripper. The agent observes the environment from a depth camera in front of the arm, takes as input the three last depth frames $o_t\in\mathbb{R}^{224\times224\times3}$ and commands the robot with an action $a_t\in\mathbb{R}^7$. The control is performed at a frequency of 10 Hz.

(a) UR5-Pick

(b) UR5-Pour

Fig. 2: UR5 environments used for skill learning. (Left) simulation, (right) real robot.

UR5 tasks. For the skill learning we rely on the tasks of picking up a cube (UR5-Pick) and pouring from a bottle into a bowl (UR5-Pour). The UR5-Pick task consists of picking up a cube with a size between 3.5 cm and 8.0 cm and lifting it above 7 cm, see Figure 2(a). The maximum episode length is 200 steps. The goal of the UR5-Pour task is to pour from a filled bottle into an empty bowl, see Figure 2(b). The maximum episode length is 400 steps. In each episode, the bottle and the bowl are randomly chosen from the ShapeNet dataset [43]. We use distinct object instances for the training and test sets.

Synthetic dataset. We use our simulator environment to create synthetic training and test sets. For all our experiments, we collect expert synthetic demonstrated trajectories with random initial configurations where the objects and the end-effector are placed within a workspace of 80×40×20 cm³. The synthetic demonstrations are collected using an expert script designed for each skill. The script has access to the full state of the system including the states of the robot and the objects. To generate synthetic demonstrations, we program end-effector trajectories and use inverse kinematics (IK) to generate corresponding trajectories in the robot joint space. For example, to pick up a cube, we draw a line from the robot initial position to a position above the cube, go down, close the gripper and finally go up. We collect up to 1000 training demonstrations in our simulator. Each demonstration consists of multiple pairs of the three last camera observations and the robot control command performed by the expert script. For the experiments with the UR5 tasks, we use a reference viewpoint located 140 cm from the robot base, with a 20° pitch angle.
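The cube-picking example can be sketched as a waypoint script that is then turned into velocity commands. This is a hedged illustration: the function names, the hover and lift heights and the end-effector speed are assumptions, and the inverse-kinematics step that maps these commands to joint space is omitted. The 0.1 s time-step matches the 10 Hz control frequency stated earlier.

```python
import math

def pick_waypoints(start, cube, hover=0.10, lift=0.10):
    """End-effector waypoints (x, y, z, gripper_open) for picking a cube."""
    cx, cy, cz = cube
    return [
        (start[0], start[1], start[2], 1),  # initial pose, gripper open
        (cx, cy, cz + hover, 1),            # straight line to above the cube
        (cx, cy, cz, 1),                    # go down
        (cx, cy, cz, 0),                    # close the gripper
        (cx, cy, cz + lift, 0),             # go up with the cube
    ]


def velocity_commands(waypoints, speed=0.05, dt=0.1):
    """Convert waypoints into per-step (linear velocity, gripper) commands."""
    commands = []
    for (x0, y0, z0, _), (x1, y1, z1, g1) in zip(waypoints, waypoints[1:]):
        delta = (x1 - x0, y1 - y0, z1 - z0)
        dist = math.sqrt(sum(d * d for d in delta))
        # Enough steps so that the speed limit is never exceeded;
        # a pure gripper change still takes one step.
        steps = max(1, math.ceil(dist / (speed * dt)))
        v = tuple(d / (steps * dt) for d in delta)
        commands.extend([(v, g1)] * steps)
    return commands
```

Pairing each command with the depth frames rendered at that step would give the observation-action pairs used for behavioral cloning.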

Training and evaluation. For each dataset, we train the policy for 200 epochs. During training, we evaluate the policy every 4 epochs by running it on a set of 100 validation configurations and measure its success rate. We pick the policy with the best success rate on the validation set. For testing we use 50 new configurations and report the success rate of completing the task correctly for these 50 configurations.

IV-B CNN architecture for BC skill learning

Given the synthetic training set of the UR5-Pick task, see Figure 2(a) (left), we train policies with different CNN architectures. Table I compares the success rates of learned policies for different CNN architectures and varying number of expert demonstrations on the synthetic test set. For BC skill policies, we here directly predict the action from the output of CNN(θ).

Policies based on the VGG CNN architecture [44] obtain success rate below 40% with 100 training demonstrations and reach 95% with 1000 demonstrations.

Demos  VGG16-BN  ResNet-18  ResNet-101
20     1%        1%         0%
50     9%        5%         5%
100    37%       65%        86%
1000   95%       100%       100%

TABLE I: Evaluation of BC skills trained with different CNN architectures and number of demonstrations on the UR5-Pick task in simulation.

ResNet [45] based policies have a success rate above 60% when trained on a dataset of 100 demonstrations and reach 100% with 1000 demonstrations. Overall ResNet-101 has the best performance closely followed by ResNet-18 and outperforms VGG significantly. To conclude, we find that the network architecture has a fundamental impact on the BC performance. In the following experiments we use ResNet-18 as it presents a good trade-off between performance and training time.

When examining why VGG-based BC has a lower success rate, we observe that it has higher validation errors compared to ResNet. This indicates that VGG performs worse at the level of individual steps and is hence expected to result in higher compounding errors. We find that architectures based on ResNet suffer from the same problem only when a small number of demonstrations is used (fewer than 100).

IV-C Evaluation of data augmentation

We evaluate the impact of different types of data augmentation in Table II. We compare training without data augmentation with 3 variants: (1) random translations, rotations and crops (the standard approach for object detection), (2) recording each expert synthetic demonstration from 10 varying viewpoints and (3) the combination of (1) and (2). We sample the camera positions on a section of a sphere centered on the robot and with a radius of 140 cm. We uniformly sample the yaw angle in [-15°,15°], the pitch angle in [15°,30°], and the distance to the robot base in [1.35,1.50] m.
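The viewpoint sampling can be sketched as follows, using the angle and distance ranges given above. The axis convention (x pointing from the robot base towards the reference camera, z up, pitch measured as elevation above the horizontal plane) and the function name are assumptions of this sketch.

```python
import math
import random

def sample_camera_position(rng):
    """Sample a camera position on a section of a sphere around the robot."""
    yaw = math.radians(rng.uniform(-15.0, 15.0))
    pitch = math.radians(rng.uniform(15.0, 30.0))
    dist = rng.uniform(1.35, 1.50)       # metres from the robot base
    # Spherical to Cartesian coordinates (assumed axis convention).
    x = dist * math.cos(pitch) * math.cos(yaw)
    y = dist * math.cos(pitch) * math.sin(yaw)
    z = dist * math.sin(pitch)
    return x, y, z


rng = random.Random(0)                   # fixed seed for reproducibility
# Ten viewpoints per demonstration, as in augmentation variant (2).
viewpoints = [sample_camera_position(rng) for _ in range(10)]
```

Each demonstration would then be rendered once per sampled position with the camera oriented towards the robot.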

Success rates for UR5-Pick on datasets with 20, 50 and 100 demonstrations are reported in Table II. We observe that data augmentation is particularly important when only a few demonstrations are available. For 20 demonstrations, the policy trained with no augmentation performs at 1% while the policy trained with standard and viewpoint augmentations together performs at 75%. The standard data augmentation brings more improvement than the viewpoint augmentation for small numbers of demonstrations (20 and 50). However, it does not reach 100% with 100 demonstrations, in contrast to the viewpoint augmentation. The policy trained with a combination of both augmentation types performs best and achieves 93% and 100% success rates for 50 and 100 demonstrations respectively. In summary, data augmentation allows a significant reduction in the number of expert trajectories required to solve the task.

Demos  None  Standard  Viewpoint  Standard & Viewpoint
20     1%    49%       39%        75%
50     5%    81%       79%        93%
100    65%   97%       100%       100%

TABLE II: Evaluation of ResNet-18 BC skills trained with different data augmentations on the UR5-Pick task in simulation.

IV-D Evaluation of instance generalization

We evaluate how BC skills generalize from specific instances to object categories on the UR5-Pour task shown in Figure 2(b) (left). In the following experiments we use ResNet-18 trained with 200 synthetic demonstrations and the combination of standard and viewpoint data augmentations. The evaluation in our simulation environment compares the use of 1, 10 or 50 bottle instances during training. We use 50 different bottles to test the learned policies. All bottles are obtained from the ShapeNet dataset [43].

When trained with a single bottle, the policy achieves a success rate of 46% and does not generalize well to the test bottles. When trained on 10 bottles, the success rate is 84%. Further increasing the training set size to 50 bottles does not improve the results. This can be explained by the fact that the additional bottles do not add information, as they are similar to the initial set.

IV-E Real robot experiments

We evaluate trained policies on a real robot, which has the same robotic arm and gripper as in simulation. We record depth images with a Microsoft Kinect 2 placed as in simulation. In order to apply the method on the real robot, we use a state-of-the-art sim2real technique based on data augmentation with domain randomization [10]. This method uses a proxy task of cube position prediction and a set of basic image transformations to learn a sim2real data augmentation function for depth images. We augment the depth frames from synthetic expert demonstrations with this method and then train skill policies. Once the skill policy is trained on these augmented simulation images, it is directly used on the real robot.

We evaluate our method on UR5-Pick and UR5-Pour described in Section IV-A, see Figure 2. We use 20 different initial configurations for the real-world evaluation. We show that our approach, when trained with sim2real data augmentation, transfers well to the real robot. The learned policy manages to pick up cubes of sizes 3.5, 4.7 and 8 cm correctly in 20 out of 20 trials. The pouring task is slightly more difficult, and 17 out of 20 trials are successful. For evaluation, we use 3 real bottles whose shapes were not present in the simulation training set.

IV-F Comparison with state-of-the-art methods

Fig. 3: Comparison of BC ResNet-18 with state of the art [3] on the FetchPickPlace task.

One of the few test-beds for robot manipulation is FetchPickPlace from OpenAI Gym [9] implemented in MuJoCo [46], see Figure 3(a). The goal for the agent is to pick up a cube and to move it to the red target. The agent observes the three last RGB-D images from a camera placed in front of the robot, $o_t\in\mathbb{R}^{100\times100\times4\times3}$. The positions of the cube and the target are set at random for each trial. The reward of the task is a single sparse reward given if the cube is within an $\epsilon$-ball of the target (depicted in red in Figure 3(a)). The maximum length of the task is set to 50 time-steps.

For a fair comparison with [3], we follow the same procedure for the dataset recording and do not use any data augmentation. Instead, we record each synthetic demonstration from a viewpoint picked at random using the same sampling space described in Section IV-C. We train a ResNet-18 policy using between 10 and $10^4$ expert demonstrations. The results are reported in Figure 3(b) where we test our approach on 200 different initial configurations. We follow [3] and plot the success rate of both RL and IL methods with respect to the number of episodes used (either trials or synthetic demonstrations).

The success rates indicate that we outperform policies trained with the imitation learning method DAgger [2] in terms of performance, and RL methods such as HER [47] and DDPG [48] in terms of data efficiency. According to the reported results, DAgger does not reach 100% even after $8\times10^4$ demonstrations and, unlike our method, requires an expert during training. HER reaches a success rate of 100% but requires about $4\times10^4$ episodes. Our approach achieves a 96% success rate using $10^4$ demonstrations.

Our policies differ from [3] mainly in the CNN architecture. Pinto et al. [3] use a simple CNN with 4 convolutional layers while we use ResNet-18. Results of this section confirm the large impact of the CNN architecture on the performance of visual BC policies, as was already observed in Table I.

V Evaluation of HRL-BC

This section evaluates the proposed HRL-BC approach of learning an RL master policy on top of pretrained BC skills. We evaluate the method on two tasks described in Section V-A: UR5-Bowl and UR5-Breakfast, illustrated in Figure 4. Both tasks are long and are trained with a single sparse reward of success. Section V-B compares our HRL-BC approach with a strong "BC-ordered" baseline where the skills are executed in a fixed and pre-defined order. We report results for simulated and real versions of both tasks. Note that our real robot experiments are performed with skills and master policies that have been trained exclusively in simulation.

V-A Experimental setup

HRL-BC tasks. To evaluate HRL-BC we consider two UR5 tasks. The goal of the UR5-Bowl task is to grasp the cube and to place it in the bowl as shown in Figure 4(a). The maximum episode length is 600. The UR5-Breakfast task contains a cup, a bottle and a bowl as shown in Figure 4(b). The agent needs to pour ingredients from the cup and the bottle into the bowl; the reward is positive if and only if all ingredients are in the bowl. The maximum episode length is 2000.

Skills and datasets. For each task we consider a set of skills defined by expert scripts. For the UR5-Bowl task, we define four skills: (a) go to the cube, (b) go down and grasp, (c) go up, and (d) go to the bowl and open the gripper. To train BC skill policies, we collect a dataset of 250 synthetic demonstrated trajectories for each skill, recorded from 5 viewpoints.

For the UR5-Breakfast task, we define four skills: (a) go to the bottle, (b) go to the cup, (c) grasp an object and pour it into the bowl, and (d) release the held object. We collect a dataset with 250 demonstrations containing cup pouring, bottle pouring and object placement in random locations. Synthetic demonstrations are rendered from 5 viewpoints and contain different bottles and cups from ShapeNet. We emphasize that the expert dataset does not contain full task demonstrations and that all our training is done in simulation.

Training. Similar to Section IV, we learn BC skills to mimic synthetic demonstrations generated by the expert script. Unlike BC skill learning in Section IV, we here train all skills of the same task simultaneously using the multi-head CNN architecture illustrated in Figure 1(right). We represent each head by a convolutional block from ResNet-18. Each head output is then averaged with an average pooling and used to predict the skill action using a linear layer. When training the RL master, we execute selected skills for 60 consecutive time-steps for the UR5-Bowl task and 220 time-steps for the UR5-Breakfast task. The master policy keeps switching the skills until either the maximum length of the episode or the success condition is reached. We train HRL-BC using 8 different random seeds in parallel. The rest of RL hyperparameters are described in Section III-C. During the RL and BC skill training we apply sim2real data augmentation [10] to enable successful execution of policies on a real robot.
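The effect of the coarse timescale on exploration can be illustrated with a short calculation based on the numbers above: four skills per task, skill durations of 60 and 220 time-steps, and maximum episode lengths of 600 and 2000. The helper names are illustrative, and the count is an upper bound that ignores early termination on success.

```python
import math

def master_decisions(max_episode_len, skill_len):
    """Maximum number of skill choices the master makes in one episode."""
    return math.ceil(max_episode_len / skill_len)


def num_skill_sequences(num_skills, decisions):
    """Number of distinct skill sequences of the given length."""
    return num_skills ** decisions


bowl_decisions = master_decisions(600, 60)         # 10
breakfast_decisions = master_decisions(2000, 220)  # 10 (the last one is cut short)
bowl_sequences = num_skill_sequences(4, bowl_decisions)  # 1048576
```

The master therefore searches over at most about a million discrete sequences per task, rather than over hundreds of continuous 7-dimensional low-level actions per episode.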

(a) UR5-Bowl
(b) UR5-Breakfast
Fig. 4: Two complex manipulation tasks with a single success reward used for the HRL-BC experiments: (a) task of bringing the cube to the bowl, (b) task of pouring the cup and the bottle into the bowl. (Left) simulation, (right) real.

V-B Results

UR5-Bowl results.

We evaluate HRL-BC and compare it to a strong baseline with a fixed sequence of BC skills following the manually pre-defined correct order (BC-ordered). Our results for the simulated "UR5-Bowl (sim)" and real "UR5-Bowl (real)" environments are reported in Table III. When tested in simulation, both HRL-BC and BC-ordered put the cube into the bowl with a success rate of 96%. HRL-BC manages to learn the sequence of skills given only sparse rewards.

We have also attempted to solve the UR5-Bowl task without skills by learning an RL policy performing low-level control. We initialized ResNet-18 on ImageNet, froze it and trained the same architecture used for the HRL-BC master policy (3 convolutional layers with 64 filters of size 3×3) on top of the visual representation with PPO. Whereas PPO did not solve the task a single time after $10^4$ episodes, HRL-BC reaches 96% after 400 episodes.

Task                         BC-ordered  HRL-BC
UR5-Bowl (sim)               96%         96%
UR5-Bowl (real)              85%         90%
UR5-Breakfast-Simple (sim)   90%         90%
UR5-Breakfast-Simple (real)  80%         80%
UR5-Breakfast-Hard (sim)     42%         60%
UR5-Breakfast-Hard (real)    45%         80%

TABLE III: Comparison of HRL-BC and manually ordered BC skills (BC-ordered) for the UR5-Bowl and UR5-Breakfast tasks.

We evaluate UR5-Bowl on the real robot as illustrated in Figure 4(a)(right). The ordered skills solve the task in 17 out of 20 episodes, while HRL-BC succeeds in 18 out of 20. Whereas the ordered skills use a fixed number of time-steps to execute actions, the master policy can determine how close the cube is to the gripper and can recover from failures.

UR5-Breakfast results.

We evaluate HRL-BC on the challenging UR5-Breakfast task both in simulation and on the real robot. Results are reported in Table III. We define two setups with different distances between the initial positions of the objects: UR5-Breakfast-Simple and UR5-Breakfast-Hard. In the hard setup, the bottle and the cup are placed close to each other, so the order of grasps becomes very important. Attempting to grasp the wrong object first in this setup results in failure due to a collision of the robot with the other object. While training HRL-BC, we use the same BC skills for both setups. When tested in simulation, HRL-BC performs similarly to BC-ordered on UR5-Breakfast-Simple (90% of successful trials) but outperforms the manual ordering on UR5-Breakfast-Hard (60% for HRL-BC vs. 42% for BC-ordered). While the ordered skills always grasp the predefined object, HRL-BC can choose which object to grasp first, which is especially important on UR5-Breakfast-Hard. In the real-world evaluation, both HRL-BC and the ordered skills succeeded in 16 out of 20 episodes on the simple setup. However, on the hard setup the performance of the ordered skills drops to 45% due to collisions, whereas HRL-BC succeeds in 80% of trials (Table III) by choosing the appropriate object to avoid collisions. Some qualitative results for HRL-BC are shown in Figure 4. The appendix presents more qualitative results. In particular, we demonstrate successful execution of HRL-BC while the robot is facing the challenges of previously unseen objects, dynamic scene changes and occlusions.

VI Conclusion

This paper introduces an approach for hierarchical RL with skills learned from synthetic demonstrated trajectories. Our method is well adapted to composite problems and requires neither full-task demonstrations nor intermediate rewards. We show high success rates in simulation and on a real robot. Given our sample-efficient strategy for learning primitive skills, the proposed method should generalize to a variety of new tasks. Future work includes learning multiple tasks with shared skills, determining new skills automatically and addressing contact-rich tasks.


References

  • [1] D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neural network,” NIPS, 1989.
  • [2] S. Ross and J. A. Bagnell, “Reinforcement and imitation learning via interactive no-regret learning,” arXiv, 2014.
  • [3] L. Pinto, M. Andrychowicz, P. Welinder, W. Zaremba, and P. Abbeel, “Asymmetric Actor Critic for Image-Based Robot Learning,” RSS, 2018.
  • [4] A. Y. Ng and S. J. Russell, “Algorithms for inverse reinforcement learning,” ICML, 2000.
  • [5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, 2015.
  • [6] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, 2016.
  • [7] A. Das, G. Gkioxari, S. Lee, D. Parikh, and D. Batra, “Neural modular control for embodied question answering,” CoRL, 2018.
  • [8] H. M. Le, N. Jiang, A. Agarwal, M. Dudík, Y. Yue, and H. Daumé III, “Hierarchical imitation and reinforcement learning,” ICML, 2018.
  • [9] M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, V. Kumar, and W. Zaremba, “Multi-goal reinforcement learning: Challenging robotics environments and request for research,” arXiv, 2018.
  • [10] A. Pashevich, R. A. M. Strudel, I. Kalevatykh, I. Laptev, and C. Schmid, “Learning to augment synthetic images for sim2real policy transfer,” IROS, 2019.
  • [11] T. Lampe and M. Riedmiller, “Acquiring visual servoing reaching and grasping skills using neural reinforcement learning,” IJCNN, 2013.
  • [12] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation,” ICML, 2016.
  • [13] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” JMLR, 2015.
  • [14] I. Popov, N. Heess, T. Lillicrap, R. Hafner, G. Barth-Maron, M. Vecerik, T. Lampe, Y. Tassa, T. Erez, and M. Riedmiller, “Data-efficient Deep Reinforcement Learning for Dexterous Manipulation,” arXiv, 2017.
  • [15] Y. Duan, M. Andrychowicz, B. C. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba, “One-shot imitation learning,” NIPS, 2017.
  • [16] M. A. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. V. de Wiele, V. Mnih, N. Heess, and J. T. Springenberg, “Learning by playing - Solving sparse reward tasks from scratch,” ICML, 2018.
  • [17] J. Ho and S. Ermon, “Generative adversarial imitation learning,” NIPS, 2016.
  • [18] V. Kumar, A. Gupta, E. Todorov, and S. Levine, “Learning dexterous manipulation policies from experience and imitation,” arXiv, 2016.
  • [19] M. Laskey, J. Lee, R. Fox, A. D. Dragan, and K. Goldberg, “DART: Noise injection for robust imitation learning,” CoRL, 2017.
  • [20] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” IJRR, vol. 32, no. 11, 2013.
  • [21] R. S. Sutton, D. Precup, and S. Singh, “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning,” Artificial Intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.
  • [22] P. Dayan and G. Hinton, “Feudal Reinforcement Learning,” NIPS, 1993.
  • [23] K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman, “Meta learning shared hierarchies,” ICLR, 2018.
  • [24] Y. Lee, S.-H. Sun, S. Somasundaram, E. S. Hu, and J. J. Lim, “Composing complex skills by learning transition policies with proximity reward induction,” ICLR, 2019.
  • [25] P.-L. Bacon, J. Harb, and D. Precup, “The option-critic architecture,” AAAI, 2017.
  • [26] C. Florensa, Y. Duan, and P. Abbeel, “Stochastic neural networks for hierarchical reinforcement learning,” ICLR, 2017.
  • [27] T. Haarnoja, K. Hartikainen, P. Abbeel, and S. Levine, “Latent space policies for hierarchical reinforcement learning,” ICML, 2018.
  • [28] O. Nachum, S. Gu, H. Lee, and S. Levine, “Data-efficient hierarchical reinforcement learning,” NIPS, 2018.
  • [29] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu, “Feudal networks for hierarchical reinforcement learning,” ICML, 2017.
  • [30] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, “Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation,” NIPS, 2016.
  • [31] K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller, “Learning an embedding space for transferable robot skills,” ICLR, 2018.
  • [32] Y. Gao, H. Xu, J. Lin, F. Yu, S. Levine, and T. Darrell, “Reinforcement learning from imperfect demonstrations,” ICLR workshop, 2018.
  • [33] C.-A. Cheng, X. Yan, N. Wagener, and B. Boots, “Fast policy learning through imitation and reinforcement,” UAI, 2018.
  • [34] W. Sun, J. A. Bagnell, and B. Boots, “Truncated horizon policy search: Combining reinforcement learning & imitation learning,” ICLR, 2018.
  • [35] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, A. Sendonaris, G. Dulac-Arnold, I. Osband, J. Agapiou, J. Z. Leibo, and A. Gruslys, “Deep Q-Learning from demonstrations,” AAAI, 2018.
  • [36] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Overcoming exploration in reinforcement learning with demonstrations,” ICRA, 2018.
  • [37] Y. Zhu, Z. Wang, J. Merel, A. A. Rusu, T. Erez, S. Cabi, S. Tunyasuvunakool, J. Kramár, R. Hadsell, N. de Freitas, and N. Heess, “Reinforcement and imitation learning for diverse visuomotor skills,” arXiv, 2018.
  • [38] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” ICLR, 2015.
  • [39] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” ICML, 2015.
  • [40] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv, 2017.
  • [41] I. Kostrikov, “PyTorch implementations of reinforcement learning algorithms,” 2018.
  • [42] E. Coumans and Y. Bai, “PyBullet, Python module for physics simulation, robotics and machine learning,” 2016-2017.
  • [43] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “ShapeNet: An information-rich 3D model repository,” arXiv, 2015.
  • [44] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv, 2014.
  • [45] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” CVPR, 2016.
  • [46] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” IROS, 2012.
  • [47] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” NIPS, 2017.
  • [48] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” ICLR, 2016.

VII Appendix

We present additional qualitative results for the HRL-BC approach on the real robot with policies that have been trained in simulation, as in the main paper. Section VII-A describes and illustrates examples of UR5-Bowl and UR5-Breakfast policies. We demonstrate robot behavior while the robot is facing the challenges of previously unseen objects, dynamic changes of object locations and occlusions. In Section VII-B we also illustrate feature map activations of the network, which provide a better understanding of the learned policies.

VII-A Qualitative results

UR5-Bowl: Multiple objects.

We experiment with the HRL-BC policy trained in the UR5-Bowl environment. Once the robot succeeds in placing a cube in the bowl, we put another cube on the table and let the policy continue, see Figure 5. Although the UR5-Bowl policy has been trained to handle one cube only, it automatically generalizes to multiple cubes when run in a loop.

Fig. 5: HRL-BC approach for UR5-Bowl with two cubes.

UR5-Bowl: Previously unseen objects.

We further test the HRL-BC UR5-Bowl policy in the presence of previously unseen objects. While the policy has been trained to manipulate cubes of different sizes (see Section V-A), we observe its robustness to other object shapes. As shown in Figure 6, the policy successfully grasps real objects such as apples, oranges, lemons and toys and places them into the bowl. Notably, when it fails to grasp an object, the robot automatically recovers and completes the task. This behavior comes naturally from our HRL-BC master policy, which has learned to adapt the sequence of skills given current observations of the scene.

Fig. 6: HRL-BC approach for UR5-Bowl with previously unseen objects.

UR5-Breakfast: New object instances.

To enable generalization of learned policies to new object instances, our UR5-Breakfast environment contains cups, bottles and bowls of different shapes from ShapeNet (see Section V-A). During testing we run the learned HRL-BC UR5-Breakfast policy on a real robot and experiment with instances of bottles and cups unseen during training. Figure 7 demonstrates successful executions of the HRL-BC UR5-Breakfast policy in scenes with significant variations in object shapes, for example, using a wine glass instead of a cup.

Fig. 7: HRL-BC approach for UR5-Breakfast with previously unseen object instances.

UR5-Breakfast: Dynamic changes of object location.

Our BC skills make decisions at every time-step and, hence, can instantly adapt to changing conditions of the scene. We verify this by varying object positions during grasping attempts of the HRL-BC UR5-Breakfast policy. Figure 8 illustrates the reactive behavior of the robot grasping a cup that is simultaneously being moved by a person. The cup is successfully grasped after multiple changes of its position.

Fig. 8: HRL-BC approach for UR5-Breakfast with dynamic changes of object locations.

UR5-Breakfast: Occlusion.

Another example of instant re-planning by our HRL-BC policy is shown in Figure 9. While the robot approaches an object, we temporarily occlude the object and disrupt the executing BC skill. Given the hierarchical nature of our HRL-BC approach, the RL master is able to recover from this failure by starting another skill that leads to the completion of the task. More precisely, when the bottle gets occluded in the example of Figure 9, the robot changes its strategy and decides to grasp and pour from the cup. Once the occlusion is removed, the robot automatically resumes and completes the task by grasping and pouring from the bottle.

Fig. 9: HRL-BC UR5-Breakfast policy executed in a robot scene with a temporarily occluded bottle.

UR5-Breakfast: Failure case for BC-ordered.

Finally, we demonstrate the advantage of HRL-BC compared to the "BC-ordered" baseline with a fixed order of skills. For the UR5-Breakfast environment we define the order of skills for the BC-ordered policy as 1. "go to the cup", 2. "grasp the object and pour it into the bowl", 3. "release the object", 4. "go to the bottle", 5. "grasp the object and pour it into the bowl", 6. "release the object". While this policy succeeds in many cases, placing objects close together is a source of problems. Figure 10 shows an example scene in which the cup and the bottle are near each other and the cup is placed in front of the bottle. As the BC-ordered policy is pre-programmed to grasp the cup first, executing it results in a collision between the gripper and the bottle, followed by failure of the task. Notably, our HRL-BC policy learns to select the order of objects for grasping so as to avoid such failures. Hence, HRL-BC automatically learns to avoid collisions without the need for specific intermediate rewards. This advantage of HRL-BC is confirmed by the quantitative results in Table III.
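To make the contrast concrete, the BC-ordered baseline can be viewed as a fixed schedule of skills, whereas HRL-BC decides which object to handle first. The sketch below is illustrative only: the skill names paraphrase the six-step list above, and `ordered_schedule`, `choose_first_object` and the `blocked` dictionary are hypothetical. In particular, the hand-written rule merely stands in for a choice that HRL-BC learns from the final task reward.

```python
# Skill order of the BC-ordered baseline, paraphrasing the list in the text.
BC_ORDERED = [
    "go to the cup",
    "grasp and pour",
    "release",
    "go to the bottle",
    "grasp and pour",
    "release",
]


def ordered_schedule(first_object):
    """Return the six-skill schedule that starts with the given object."""
    if first_object not in ("cup", "bottle"):
        raise ValueError("first_object must be 'cup' or 'bottle'")
    second = "bottle" if first_object == "cup" else "cup"
    schedule = []
    for obj in (first_object, second):
        schedule += [f"go to the {obj}", "grasp and pour", "release"]
    return schedule


def choose_first_object(blocked):
    """Toy stand-in for the learned choice: start with an object that is not blocked.

    `blocked` maps an object name to True when reaching for it first would
    cause a collision with the other object (an assumption of this sketch).
    """
    for obj in ("cup", "bottle"):
        if not blocked.get(obj, False):
            return obj
    return "cup"  # fall back to the default order if both are blocked


# BC-ordered always uses the cup-first schedule, whatever the scene looks like.
fixed = ordered_schedule("cup")
# An adaptive choice starts with the bottle when grasping the cup first would collide.
adaptive = ordered_schedule(choose_first_object({"cup": True}))
```

The point of the sketch is only that a fixed schedule cannot react to the scene layout, while any policy that conditions the first grasp on the observation can.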

Fig. 10: Illustration of a failure case for the BC-ordered approach for UR5-Breakfast with nearby objects.

VII-B Feature map activations

The HRL-BC policy uses no explicit representation of scenes, for example in terms of categories and locations of objects. Some interpretation of learned policies, however, can be obtained by examining spatial activations of the neural network at intermediate network layers. Figure 11 shows saliency maps of an HRL-BC policy highlighting which parts of the image the agent concentrates on. The saliency maps are computed as activations of convolutional feature maps obtained from the last layers of CNN(θ) and CNN(η_i), i = 1, …, K (see Figure 1), averaged over all channels and layers. The resulting heatmaps are shown for different stages of the UR5-Breakfast task. Interestingly, while grasping and moving the bottle, the network generates the highest activations around the bottle and the gripper, while ignoring other objects. When releasing the bottle, however, high activations are also observed around the cup. The attention to the cup might be explained by the need to avoid collisions when placing the bottle on the table. Note that we provide no intermediate rewards to RL; nevertheless, RL learns to avoid collisions, since collisions imply failure of the final task and, hence, no final positive reward during training. Once the bottle is placed on the table, activations become low for the bottle, while the heatmap attains its maximum values around the manipulated cup. Observations of such feature maps have been useful in our work to identify certain cases of failures. We believe feature map activations are a useful tool to interpret learned policies.
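The averaging described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `channel_mean` and `saliency` are hypothetical names, feature maps are plain nested lists indexed as [channel][row][column], all layers are assumed to share one spatial size, and the final rescaling to [0, 1] is an assumption made here for display purposes.

```python
def channel_mean(feature_map):
    """Average a feature map given as [channel][row][col] over its channels."""
    n_ch = len(feature_map)
    rows, cols = len(feature_map[0]), len(feature_map[0][0])
    return [
        [sum(feature_map[c][r][q] for c in range(n_ch)) / n_ch for q in range(cols)]
        for r in range(rows)
    ]


def saliency(feature_maps):
    """Average the channel means of several layers and rescale to [0, 1].

    Assumes all layers share one spatial size; maps of different resolution
    would first have to be resized to a common grid.
    """
    means = [channel_mean(fm) for fm in feature_maps]
    rows, cols = len(means[0]), len(means[0][0])
    avg = [
        [sum(m[r][q] for m in means) / len(means) for q in range(cols)]
        for r in range(rows)
    ]
    lo = min(min(row) for row in avg)
    hi = max(max(row) for row in avg)
    scale = (hi - lo) or 1.0  # avoid division by zero for a constant map
    return [[(v - lo) / scale for v in row] for row in avg]


# Two toy layers, each with 2 channels of size 2x2.
layer_a = [[[0, 2], [4, 6]], [[2, 4], [6, 8]]]
layer_b = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
heat = saliency([layer_a, layer_b])
```

To obtain an overlay like the one in Figure 11, such a heatmap would additionally be upsampled to the input resolution and blended with the depth image; those steps are omitted here.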

Fig. 11: Left: Feature map activations of the HRL-BC UR5-Breakfast policy are overlaid on the input depth images. Right: corresponding frames taken from a different viewpoint. Stages shown: grasping bottle, moving bottle, pouring from bottle, releasing bottle, grasping cup, pouring from cup.