A Mobile Manipulation System for One-Shot Teaching of Complex Tasks in Homes

  • 2019-09-30 22:03:07
  • Max Bajracharya, James Borders, Dan Helmick, Thomas Kollar, Michael Laskey, John Leichty, Jeremy Ma, Umashankar Nagarajan, Akiyoshi Ochiai, Josh Petersen, Krishna Shankar, Kevin Stone, Yutaka Takaoka


We describe a mobile manipulation hardware and software system capable of autonomously performing complex human-level tasks in real homes, after being taught the task with a single demonstration from a person in virtual reality. This is enabled by a highly capable mobile manipulation robot, whole-body task space hybrid position/force control, teaching of parameterized primitives linked to a robust learned dense visual embeddings representation of the scene, and a task graph of the taught behaviors. We demonstrate the robustness of the approach by presenting results for performing a variety of tasks, under different environmental conditions, in multiple real homes. Our approach achieves 85% overall success rate on three tasks that consist of an average of 45 behaviors each.



Max Bajracharya*, James Borders*, Dan Helmick*, Thomas Kollar*
Michael Laskey*, John Leichty*, Jeremy Ma*, Umashankar Nagarajan*, Akiyoshi Ochiai*
Josh Petersen*, Krishna Shankar*, Kevin Stone*, Yutaka Takaoka*
Toyota Research Institute, Los Altos, CA
*all authors contributed equally.





Robotic capabilities that assist people with tasks in their homes can play a critical role in enabling them to age in place longer and live a higher quality life. However, the tasks people perform in their homes vary widely, and home environments, objects, and tasks are highly unstructured and extremely diverse. But one advantage a robot operating in a home has is that it only needs to work well in that home, and its actions can be specialized to that environment.

Based on these observations, we have developed a unique solution to enabling a general purpose robot to perform human-level tasks in diverse and complex human environments. Rather than program or train a robot to recognize a fixed set of objects or perform pre-defined tasks, we enable the robot to be easily taught new objects and tasks, with inherently robust behaviors, from a single human demonstration, which can then be executed autonomously in naturally varying conditions. Our system uses no prior object models or maps, and can be taught to associate a given set of behaviors to arbitrary scenes, objects, and voice commands from one demonstration of the behavior. Because tasks are graphs of behaviors linked to dense visual features, the system is easy to understand and failure conditions are easy to diagnose and reproduce. The perception system is trained offline on large existing supervised and unsupervised datasets, but the rest of the system requires no additional training data.

Fig. 1: We have developed a highly capable general purpose mobile manipulation robot (top) capable of being taught human-level tasks by linking what it sees (bottom left) to robust parameterized behaviors, through dense learned visual features (whose first three dimensions are shown on the bottom right), from a single human demonstration of the behavior.

Our solution consists of several key contributions:

  1. We developed a mobile manipulation robot that is very physically capable, with high end-effector manipulability and a wide instantaneous visual field-of-view, which makes teaching from human demonstration easy.

  2. Rather than teach direct task space motions, we use virtual reality (VR) to teach a set of parameterized behaviors, which combine collision-free motion planning and whole-body hybrid (position and force) Cartesian end-effector control, minimizing the taught parameters and ensuring robustness during execution.

  3. We compute task specific learned dense visual pixelwise embeddings, which link the parameterized behaviors to the scene and enable it to compute a 6-DOF transform of taught behaviors at execution time. While robust to viewpoint change, lighting variations, and light clutter, we do not attempt to generalize beyond specific taught situations.

  4. The behaviors of a task are taught independently, with visual entry conditions, and success-based exit criteria, which enable behaviors to be chained together in a dynamic task graph, allowing the robot to reuse taught behaviors to perform task sequences.

Our approach applies to both manipulation and navigation tasks. For navigation, a taught trajectory is followed with a reactive path follower, which alleviates the need for an explicit global map. We have found the approach to be surprisingly robust: it achieves a success rate of 85% on a set of challenging tasks that have an average of 45 behaviors per task. The use of learned dense embeddings makes the system robust to expected daily changes in the environment, including lighting and clutter. The use of parameterized hybrid control behaviors makes the system robust to limited accuracy mechanisms (we do not model arm flex or use any hand-eye coordination).

Currently, each behavior must be taught explicitly, and is linked to the relevant parts of the scene’s appearance. Taught behaviors do not generalize beyond the taught scenario. However, we envision this as a stepping stone to using the taught tasks to produce the examples and data required to generalize to new situations and tasks. This would then allow a robot that is taught one task, in one home, to share its skills with all other robots, in any other home, resulting in the significant increase in capability required to make more general purpose home assistance practical.

I-A Related Work

There have been many research mobile manipulation robots, some of which are now becoming commercially available. These include those with a wheeled base, linear stage torso, and a single arm [1], [2], [3] or dual arms [4], an arm on a quadruped legged base [5], or a more humanoid form factor on wheels [6] or legs [7], among others.

The 2015 DARPA Robotics Challenge Finals [8] presented a snapshot of many fully integrated mobile manipulation systems being remotely operated with semi-autonomous capabilities. Prior to that, the DARPA Autonomous Robotic Manipulation Software (ARM-S) program demonstrated performing complex dual-arm manipulation tasks fully autonomously [9], [10], [11]. However, these approaches relied heavily on prior models of objects and took significant engineering effort to perform new tasks.

Some core elements of our approach are commonly used in industrial manipulation applications, such as parameterized behaviors [12], hybrid position-force control [13] [14], task graph formulations including Petri nets and hierarchical state machines, and kinematic teach and repeat, but have been limited to structured and controlled environments.

Visual teach and repeat has been used in less structured environments for outdoor ground navigation [15] and aerial localization [16]. For manipulation, a related approach is trajectory transfer [17], where pixel-level correspondences between a reference scene and the current environment are used to warp a taught sequence to a new initial state. Unlike this approach, we apply registration at the behavioral level as opposed to the whole trajectory. We extend the idea to associate behaviors using generic learned dense pixelwise visual embeddings for feature matching [18], rather than estimating constraints from semantic class specific sparse keypoints [19].

Imitation learning from VR demonstration can also learn a mapping of images to control, such as using behavioral cloning [20], but still requires a large amount of data. Approaches to single-shot learning from demonstration include variants of model-based and model-free inverse reinforcement learning, online reinforcement learning, and meta learning [21], [22], [23], [24], but these all still require significant amounts of data for training similar tasks, or a large number of policy executions, before being able to perform well on a new task. Some approaches also automatically generate the task graph [25]. Training behaviors in simulation and transferring to reality has shown some promise [26] but still requires some online refinement and is limited to what can be simulated.


We have developed a custom prototype mobile manipulation robot physically capable of performing a wide variety of household tasks, and specifically designed to make teaching of tasks easy. Making the system low cost was not a priority, but recent advances in sensors and actuators [27] [6] indicate that doing so is possible, and our approach is designed to be robust to low precision mechanisms. Our system consists of a combination of commercial-off-the-shelf (COTS) and custom hardware components. The current design is the result of several design iterations based on lessons-learned from performing a wide variety of tasks in real homes. We have experimentally found that a person in VR, seeing only what the robot sees and controlling its end-effectors, can do many household tasks, except highly dexterous or very high payload tasks.

II-1 Morphology and Actuation

The 100kg robot consists of a total of 31 degrees-of-freedom (DOFs) split into five subsystems: the chassis, torso, left arm, right arm, and head. The chassis consists of four driven and steerable wheels (eight total DOFs) that enable “pseudo-holonomic” mobility. The drive/steer actuator package is a custom modular design using brushless Maxon motors and planetary gearheads. The torso is five DOFs (yaw-pitch-pitch-pitch-yaw) built using a Motiv Robotics RoboMantis limb, which is derived from the JPL RoboSimian limb [28]. Each arm is a seven DOF Kinova Jaco2 arm. The two DOF pan/tilt head also uses Kinova actuators. The torso, arms, and head are all brushless motors with Harmonic Drive geartrains. Each arm also has a single DOF Sake Robotics gripper with under-actuated fingers. We modified the Sake gripper fingers to have 3D printed hooks that greatly improve the ability to pull handles or knobs with high force. We can also manually replace the gripper with custom tools, such as a sponge or a wiping pad, to enable different tasks.

II-2 Motor Control

For the chassis and torso actuators, we use Elmo EtherCAT motor controllers in DS-402 Cyclic Synchronous Position (CSP) mode to enable precisely coordinated motion between joints. The arms and neck are commanded over Ethernet and the grippers are commanded over an RS485 bus that runs through the slip rings of the arms.

II-3 Sensing

We use an ATI mini-45 force/torque sensor (running on the same RS485 bus as the grippers) at the wrist of each arm to measure interaction forces with the environment. Our perception sensors are consolidated on the pan/tilt head of the robot, with a very wide field-of-view, giving the robot and a person in VR significant context to perform the task. They consist of four Intel RealSense depth cameras, a pair of 5MP Basler cameras with a 7cm baseline, and a VectorNav IMU. The RealSense cameras are arranged in a 2x2 configuration to produce a depth image with a total field-of-view (FOV) of 110°x80°, and the Baslers each have an FOV of 146°x123°. All cameras run over USB 3.0 using multiple USB roots on the motherboard, and are hardware triggered and timestamped by the computer. The six cameras are calibrated using Kalibr [29], with a double-sphere fisheye model [30] for the stereo pair.

II-4 Compute

All compute is performed on-board the robot. The compute system consists of an 18 core Intel i9 CPU and an Nvidia TitanV GPU. We use a Linux kernel with the Preempt RT patch applied. This compute system allows us to use a single computer for all of our processes, including both real-time control and perception. All of our inference is done on the GPU using TensorRT models.

II-5 Power

We use six standard BB2590 Li-Ion batteries, each with 294Wh of capacity. While running and performing tasks, the system draws between 650W and 750W. We developed a custom power board that handles power distribution, on-board battery charging, and emergency stopping.
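As a back-of-the-envelope check on these figures, the snippet below estimates runtime from the stated battery capacity and power draw. It is a rough sketch only: it assumes the full rated capacity is usable and ignores conversion losses, and the `usable_fraction` parameter is a hypothetical derating knob that does not come from the system description.

```python
# Rough runtime estimate from the figures quoted above.
NUM_BATTERIES = 6
CAPACITY_WH = 294.0        # rated capacity per battery
DRAW_W = (650.0, 750.0)    # reported power draw range while performing tasks

total_wh = NUM_BATTERIES * CAPACITY_WH


def runtime_hours(draw_w: float, usable_fraction: float = 1.0) -> float:
    """Hours of operation; usable_fraction is a hypothetical derating factor."""
    return total_wh * usable_fraction / draw_w


best_case = runtime_hours(DRAW_W[0])   # lightest reported load
worst_case = runtime_hours(DRAW_W[1])  # heaviest reported load
print(f"{total_wh:.0f} Wh total, {worst_case:.2f} to {best_case:.2f} hours")
```

Under these idealized assumptions the pack holds 1764 Wh, which corresponds to somewhere between roughly 2.4 and 2.7 hours of task execution.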


Our software system (Figure 2) is designed to leverage our highly redundant and capable hardware safely and effectively. It is specifically architected to enable teaching of behaviors. It is also designed to enable fast iterative development and deployment, debugging, and visualization of the system.

Fig. 2: Our software architecture enables robust autonomous execution of taught tasks by processing visual and audio data, building up a world model, mapping visual inputs to taught behaviors, and executing sequences of behaviors.

III-A Software Architecture

III-A1 Infrastructure

The system is architected like many standard robotic systems [31] [32], with independent processes communicating via messages over a custom interprocess communication (IPC) implementation. The system is organized as a set of modules, one or more of which run in a system process, which handle a set of input messages and publish a set of output messages. All messages are logged, and modules or sets of modules can be replayed deterministically in a single process, or in parallel as if they were running on the robot.

III-A2 Visualization, Commanding, and Teaching

We developed a custom visualization tool that subscribes to messages and can display 2D, 3D, text, and temporal information. It can also be used to command the robot and inspect and modify the task sequences that the robot is capable of executing. The robot state and RGB-D data are also streamed live into a VR system, enabling a person to teach the robot parameters of behaviors. We define these parameters by using an HTC Vive VR headset with two hand controllers. This allows an operator to see the world from the perspective of the robot, annotate the 3D point cloud, and float detached robot end-effectors in space to define end-effector poses.

III-B Control Architecture

Our system provides several key levels of abstraction for controlling the robot, specifically making it easy to teach and execute robust task sequences.

III-B1 Real-time Control

The lowest levels provide real-time coordinated control of all of the robot’s DOFs. Real-time control consists of two processes working in coordination at 200Hz: Joint Control and Part Control. Joint Control implements low-level device communications and exposes the device commands and statuses in a generic way. It supports an arbitrary number of actuators, force sensors, and IMUs, and is configured at run-time, making supporting different robot variations convenient. It also provides the lowest level of safety checks. If an incoming command violates center of mass constraints, causes a self-collision, or violates joint state limits, a fault is triggered and the robot is brought to rest safely.

Part Control handles higher level coordination of the robot by dividing the robot into parts (right arm, head, etc.) and providing a set of parameterized controllers for each part. Commands from non-realtime processes set the desired controllers and parameters to be running at a given time. Arbitrary combinations of controllers are supported as long as their controlled parts do not overlap. It provides controllers for joint position and velocity, joint admittance, camera look-at, chassis position and velocity, and hybrid task space pose, velocity, and admittance control.
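The non-overlap rule for simultaneous controllers can be illustrated with a small check. This is a minimal sketch, not the system's interface: the `ControllerRequest` type, the controller labels, and the part names are hypothetical and chosen only to mirror the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControllerRequest:
    name: str         # hypothetical controller label
    parts: frozenset  # robot parts this controller would drive


def find_conflicts(requests):
    """Return pairs of controller names whose controlled parts overlap."""
    conflicts = []
    for i, a in enumerate(requests):
        for b in requests[i + 1:]:
            if a.parts & b.parts:
                conflicts.append((a.name, b.name))
    return conflicts


def can_run_together(requests):
    """A combination is allowed only if no two controllers share a part."""
    return not find_conflicts(requests)


look_at = ControllerRequest("camera_look_at", frozenset({"head"}))
arm_pose = ControllerRequest("task_space_pose", frozenset({"right_arm", "torso"}))
torso_joint = ControllerRequest("joint_position", frozenset({"torso"}))

print(can_run_together([look_at, arm_pose]))      # head vs. arm+torso
print(can_run_together([arm_pose, torso_joint]))  # both claim the torso
```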

III-B2 Whole-body Planning

The next level of abstraction for controlling the robot is commanding end-effector task space control and automatically solving for the robot posture to achieve these desired motions. Whole-body inverse kinematics (IK) for hybrid Cartesian control are formulated as a quadratic program (QP) [33] and solved in real-time at 200Hz. Parts are subject to linear constraints on joint position, velocity, acceleration, and torque due to gravity, center of mass, and self-collisions and quadratic costs on Cartesian tracking, regularization, and distance to preferred postures.
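One common way to write a velocity-level QP of this kind is sketched below. The symbols are assumptions introduced for illustration and are not taken from the system: $J(q)$ is the end-effector Jacobian, $\dot{x}_{\mathrm{des}}$ the desired Cartesian velocity, $W$ a tracking weight, $\lambda$ and $\mu$ scalar weights, $\Delta t$ the control period, $q_{\mathrm{pref}}$ a preferred posture, and $A(q)\,\dot{q} \le b(q)$ a stack of the linearized constraints listed above.

```latex
\begin{aligned}
\min_{\dot{q}} \quad
  & \left\| J(q)\,\dot{q} - \dot{x}_{\mathrm{des}} \right\|_{W}^{2}
    + \lambda \left\| \dot{q} \right\|^{2}
    + \mu \left\| q + \dot{q}\,\Delta t - q_{\mathrm{pref}} \right\|^{2} \\
\text{s.t.} \quad
  & A(q)\,\dot{q} \le b(q)
\end{aligned}
```

The three cost terms correspond, in order, to Cartesian tracking, regularization, and distance to a preferred posture.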

Whole-body IK is also used for non-realtime motion planning of Cartesian pose goals. Occupied environment voxels (Section III-C2) are fit with spheres and capsules and voxel collision constraints are added to the QP IK to prevent collisions between the robot and the world. Motion planning is performed using a rapidly-exploring random tree (RRT) [34], sampling in Cartesian space with the QP IK as the steering function between nodes. Planning in Cartesian space results in natural and direct motions, and using the QP IK as the steering function makes planning more reliable, as the same controller is used to plan and execute, reducing the possible discrepancies between the two. Similarly, non-realtime motion planning for joint position goals uses an RRT in combination with the Part Control joint position controller acting as the steering function.
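The role of a pluggable steering function in an RRT can be shown with a toy 2D planner. This is a simplified sketch under stated assumptions: the straight-line `straight_line_steer` function stands in for the QP IK controller, the world is a hypothetical 10x10 plane with one wall, and only the endpoint of each extension is collision checked.

```python
import math
import random


def straight_line_steer(start, target, step=0.5):
    """Stand-in steering function: move at most `step` toward `target`."""
    dx, dy = target[0] - start[0], target[1] - start[1]
    dist = math.hypot(dx, dy)
    if dist <= step:
        return target
    return (start[0] + step * dx / dist, start[1] + step * dy / dist)


def rrt(start, goal, is_free, steer=straight_line_steer, bounds=(0.0, 10.0),
        goal_tol=0.5, goal_bias=0.1, max_iters=5000, seed=0):
    """Grow a tree from `start`; return a list of points near `goal`, or None."""
    rng = random.Random(seed)
    parents = {start: None}
    for _ in range(max_iters):
        # Occasionally sample the goal itself to pull the tree toward it.
        if rng.random() < goal_bias:
            sample = goal
        else:
            sample = (rng.uniform(*bounds), rng.uniform(*bounds))
        nearest = min(parents, key=lambda n: math.dist(n, sample))
        new = steer(nearest, sample)
        # Simplification: only the new endpoint is collision checked.
        if new in parents or not is_free(new):
            continue
        parents[new] = nearest
        if math.dist(new, goal) <= goal_tol:
            path = [new]
            while parents[path[-1]] is not None:
                path.append(parents[path[-1]])
            return path[::-1]
    return None


def is_free(p):
    """Toy world: a wall spanning x in [4, 5] with a gap above y = 7."""
    return not (4.0 <= p[0] <= 5.0 and p[1] < 7.0)


path = rrt((1.0, 1.0), (9.0, 1.0), is_free)
```

Because the planner only sees the world through `steer` and `is_free`, swapping in a different controller as the steering function changes how the tree is extended without changing the search itself, which is the property the text relies on.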

III-B3 Parameterized Behaviors

The next level of abstraction defines parameterized behaviors, which are primitive actions that can be parameterized and sequenced together to accomplish a complex task. We have found that a small set of parameterized behaviors is sufficient to perform many tasks; however, the software architecture supports quick addition of new behaviors as they become necessary. Our behaviors include (1) manipulation actions such as grasp, lift, place, pull, retract, wipe, joint-move, direct-control; (2) navigation actions such as drive with velocity commands, drive-to with position commands and follow-path with active obstacle avoidance [35]; and (3) other auxiliary actions such as look at and stop.

Each behavior can have one or more actions of different types, such as joint or Cartesian moves for one or more parts of the robot. Each action can use different control strategies, such as position, velocity, or admittance control, and can choose whether or not to use motion planning to avoid external obstacles. All motions, whether they use motion planning or not, ensure that there is no self-collision and that all motion control constraints are satisfied. Each behavior is parameterized by its actions, which in turn have their own parameters. For example, a grasp behavior has four parameters: the gripper open angle and the 6D approach, grasp, and (optional) lift poses for the gripper. These four parameters define the following pre-defined sequence of actions: (1) open the gripper to the desired gripper angle, (2) plan and execute a collision-free path for the gripper to the 6D approach pose, (3) move the gripper to the 6D grasp pose and stop on contact, (4) close the gripper, and (5) move the gripper to the 6D lift pose, if provided.
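The grasp example can be written down as a small data structure that expands into its fixed action sequence. This is an illustrative sketch: the `GraspParams` type, the action labels, and the pose representation are hypothetical and only mirror the five steps described above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical pose representation: x, y, z, roll, pitch, yaw.
Pose6D = Tuple[float, float, float, float, float, float]


@dataclass
class GraspParams:
    gripper_open_angle: float
    approach_pose: Pose6D
    grasp_pose: Pose6D
    lift_pose: Optional[Pose6D] = None  # optional fourth parameter


def grasp_actions(p: GraspParams) -> List[tuple]:
    """Expand grasp parameters into the ordered action list from the text."""
    actions = [
        ("open_gripper", p.gripper_open_angle),
        ("plan_and_move", p.approach_pose),    # collision-free motion plan
        ("move_until_contact", p.grasp_pose),  # stop on contact
        ("close_gripper", None),
    ]
    if p.lift_pose is not None:
        actions.append(("move", p.lift_pose))
    return actions


approach = (0.5, 0.0, 0.9, 0.0, 0.0, 0.0)
grasp = (0.6, 0.0, 0.9, 0.0, 0.0, 0.0)
lift = (0.6, 0.0, 1.0, 0.0, 0.0, 0.0)

without_lift = grasp_actions(GraspParams(0.8, approach, grasp))
with_lift = grasp_actions(GraspParams(0.8, approach, grasp, lift))
```

Teaching then only has to supply the four parameters; the sequencing, planning, and contact handling stay inside the behavior.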

III-B4 Task Graphs

The final level of control abstraction is a task. A task is a sequence of sub-tasks made up of taught behaviors. A task graph (Figure 3) is a directed, cyclic or acyclic graph with different sub-tasks as nodes and different transition conditions as edges, including fault detection and fault recovery. Edge conditions include the status of each behavior execution, checking for objects in hand using force/torque sensors, voice commands, and keyframe matches to handle different objects and environments. The task graph is created at teach time by manually specifying nodes and transitions.
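A task graph with conditional edges and a recovery loop can be sketched in a few lines. The class, node names, and outcome fields below are hypothetical and serve only to illustrate the structure described above; the real system's edge conditions also include voice commands and keyframe matches.

```python
class TaskGraph:
    """Directed graph of sub-tasks with condition-guarded transitions."""

    def __init__(self, start):
        self.start = start
        self.edges = {}  # node -> list of (condition, next_node)

    def add_edge(self, src, condition, dst):
        self.edges.setdefault(src, []).append((condition, dst))

    def run(self, execute, max_steps=100):
        """Run from start; `execute(node)` returns an outcome dict."""
        node, visited = self.start, []
        for _ in range(max_steps):
            visited.append(node)
            outcome = execute(node)
            for condition, dst in self.edges.get(node, []):
                if condition(outcome):
                    node = dst
                    break
            else:
                return visited  # no edge matched: terminal node
        return visited


graph = TaskGraph("grasp")
graph.add_edge("grasp", lambda o: o["ok"] and o["object_in_hand"], "place")
graph.add_edge("grasp", lambda o: not (o["ok"] and o["object_in_hand"]), "retract")
graph.add_edge("retract", lambda o: o["ok"], "grasp")  # recovery loop

attempts = {"grasp": 0}


def fake_execute(node):
    """Simulated execution: the first grasp comes back with an empty hand."""
    if node == "grasp":
        attempts["grasp"] += 1
        return {"ok": True, "object_in_hand": attempts["grasp"] > 1}
    return {"ok": True, "object_in_hand": node == "place"}


trace = graph.run(fake_execute)
print(trace)
```

In this simulated run the failed first grasp routes through the retract node and back to grasp before the task continues to place.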

Fig. 3: Our task graphs sequence robust taught behaviors, with the ability to branch or loop based on visual or audio sensing or other conditions, enabling complex behaviors and fault recovery. The task graph shown here is for Task 3 (described in Section V-A).

III-C Perception Architecture

Our perception pipeline is designed to provide the robot with an understanding of the environment around it and to recognize what actions to take, given the task it has been taught. A single fused RGB-D image is created by projecting the four RealSense depth images into the single wide field-of-view left image of the high resolution color stereo pair. The system runs a set of deep neural networks to provide various pixel level classifications and feature vectors (or “embeddings”) which are then both accumulated into a temporal 3D voxel representation (Figure 2), as well as used to recall actions to perform, based on the visual features from a taught sequence.

III-C1 Learned Dense Pixel Embeddings

Based on experience testing in highly unstructured and diverse (“long tailed” [36]) environments, like homes, a key aspect of our system is that we do not pre-define object categories or assume any models of objects or the environment. Rather than explicitly detect and segment objects [37], and explicitly estimate 6-DOF object poses [38], we instead produce dense pixel level embeddings for object semantic classes and instances, and viewpoint invariant correspondences, and use the reference embeddings from a taught reference behavior to perform classification or pose estimation.

All of our learned models are fully convolutional, and map every pixel in the input RGB image to a point in an embedding space with a metric that is implicitly defined by a loss function and training procedure specific to each model. Common to all models is our feature extractor architecture, which consists of a ResNet [39] 101 encoder, and a variant of the Feature-Pyramid Network [40] decoder. Given an input RGB image of size height×width×3, the feature extractor produces an output of size (height/8)×(width/8)×2048. These output features are fed to a final 1×1 convolution with an output depth that is chosen depending on the output. We use models trained for:

  • Semantic class: We detect all objects of a semantic class given a single annotated example by comparing the embeddings on the annotation to the embeddings we see everywhere else. We train this model using a discriminative loss function as described in [41], on the MSCOCO data set [42].

  • Object instance: This model is necessary for identifying or counting individual objects. We train the model to predict a vector (2D embedding) at each pixel, pointing to the centroid of the object containing that pixel. At run-time, we group all pixels that point to the same centroid to segment the scene.

  • 3D correspondence: This model produces per pixel embeddings that are invariant to view and lighting, so that any view of a given 3D point in a scene will map to the same embedding. We train this model using the same approach and loss function described in [18], on the ScanNet data set [43].
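The grouping step of the object instance model, where pixels that point to the same centroid are assigned to one object, can be sketched as follows. This is a simplified illustration, not the system's implementation: votes are grouped by quantizing them onto a coarse grid, and the `cell` size is a hypothetical parameter. Votes that straddle a cell boundary would be split, which a real clustering step would need to handle.

```python
import numpy as np


def group_by_centroid(offsets, mask, cell=4.0):
    """Label foreground pixels by the quantized centroid they point to.

    offsets: (H, W, 2) predicted (dy, dx) from each pixel to its centroid.
    mask:    (H, W) boolean foreground mask.
    cell:    hypothetical quantization size in pixels for grouping votes.
    """
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    votes = np.stack([ys + offsets[..., 0], xs + offsets[..., 1]], axis=-1)
    keys = np.round(votes[mask] / cell).astype(int)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    labels = np.zeros((h, w), dtype=int)
    labels[mask] = inverse.reshape(-1) + 1  # 0 is reserved for background
    return labels


# Toy scene: two 4x4 objects with perfect centroid predictions.
mask = np.zeros((8, 16), dtype=bool)
mask[2:6, 1:5] = True    # object A, centroid (3.5, 2.5)
mask[2:6, 10:14] = True  # object B, centroid (3.5, 11.5)
ys, xs = np.mgrid[0:8, 0:16]
offsets = np.zeros((8, 16, 2))
offsets[..., 0] = 3.5 - ys
offsets[..., 1] = np.where(xs < 8, 2.5, 11.5) - xs

labels = group_by_centroid(offsets, mask)
```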

All of our models are written in TensorFlow, and converted and run on-board the robot with Nvidia TensorRT using 16-bit floating point precision on an Nvidia Titan-V GPU, with multiple processes coordinated using Nvidia’s Multi-Process Service (MPS). We are able to process about 90 megapixels per second.

III-C2 Voxel Mapping

The pixelwise embeddings (and depth data) for each RGB-D frame are then fused into a dynamic 3D voxel map [44]. Each voxel accumulates first and second order position, color, and embedding statistics. Expiration of dynamic objects is based on back projection of voxels into the depth image. The voxel map is segmented using standard graph segmentation based on the semantic and instance labels, and geometric proximity. The voxel map is also collapsed down into a 2.5D map with elevation and traversability classification statistics.
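Accumulating first and second order statistics per voxel can be done with running sums, from which a mean and variance are recovered at any time. The sketch below assumes a hypothetical 5 cm voxel size and a generic per-point feature vector (which could concatenate position, color, and embedding values); it does not model expiration or segmentation.

```python
import numpy as np


class VoxelStats:
    """Running count, sum, and sum of squares for one voxel."""

    def __init__(self, dim):
        self.n = 0
        self.s1 = np.zeros(dim)  # first order: sum of features
        self.s2 = np.zeros(dim)  # second order: sum of squared features

    def add(self, x):
        self.n += 1
        self.s1 += x
        self.s2 += x * x

    def mean(self):
        return self.s1 / self.n

    def variance(self):
        return self.s2 / self.n - self.mean() ** 2


def fuse(points, features, voxel_size=0.05):
    """points: (N, 3) in metres; features: (N, D). voxel_size is assumed."""
    grid = {}
    for p, f in zip(points, features):
        key = tuple(int(v) for v in np.floor(p / voxel_size))
        if key not in grid:
            grid[key] = VoxelStats(features.shape[1])
        grid[key].add(f)
    return grid


points = np.array([[0.01, 0.01, 0.01], [0.02, 0.02, 0.02], [0.31, 0.0, 0.0]])
features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
grid = fuse(points, features)
```

Keeping only sums makes fusion of each new frame a constant-time update per point, at the cost of some numerical precision in the variance for large counts.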

The voxel map is used for collision free whole-body motion planning, while the 2.5D map is used for collision free chassis motions. For efficient 3D collision checking, the voxels in the map are grouped into capsule-shaped collision bodies using a greedy approach. The segmented objects are used by the behaviors to attach objects to hands when they are grasped.

III-C3 Keypoint Pose Estimation

Central to our one-shot teaching approach is being able to recognize features in the scene (or of a specific manipulation object) that are highly correlated to features recorded from a previously taught task. When a task is demonstrated by the user, features are saved throughout the task in the form of a keyframe, a saved RGB image containing a multi-dimensional embedding with depth (if valid) per pixel. The embeddings act as a feature descriptor that is ideally unique enough to establish per pixel correspondences at run-time, assuming that the current image is similar enough to the reference that existed at teach time. Since depth exists at (mostly) each pixel, correspondences can be used to solve for a delta pose between the current and reference images. Our keyframe matcher detects inliers using Euclidean constraints [45] and applies the standard Levenberg-Marquardt least-squares algorithm with RANSAC to solve for a 6-DOF pose. This delta pose serves as a correction that can be applied to adapt the taught behavior sequence to the current scene. Because we have embeddings defined at each pixel, we can define keyframes including every pixel in the image or only using pixels in a user-defined mask (where we selectively annotate regions of the image to be relevant for the task) or on an object (Figure 4).
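Solving for a 6-DOF delta pose from matched 3D points in the presence of bad matches can be sketched as below. Note the substitution: instead of the Levenberg-Marquardt solver named above, this example uses the closed-form SVD (Kabsch) solution inside a basic RANSAC loop, which is a different but related technique. The tolerance, iteration count, and synthetic data are assumptions for illustration.

```python
import numpy as np


def rigid_transform(src, dst):
    """Least-squares R, t with dst ≈ src @ R.T + t (Kabsch / SVD)."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    # Force a proper rotation (determinant +1) rather than a reflection.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, cd - R @ cs


def ransac_pose(src, dst, iters=200, inlier_tol=0.02, seed=0):
    """Return (R, t, inlier_mask) for matched 3D points with outliers."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        R, t = rigid_transform(src[idx], dst[idx])
        err = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = err < inlier_tol
        if inliers.sum() > best.sum():
            best = inliers
    # Refit on all inliers of the best hypothesis.
    R, t = rigid_transform(src[best], dst[best])
    return R, t, best


# Synthetic check: 40 correct matches plus 10 gross outliers.
rng = np.random.default_rng(1)
src = rng.uniform(-1.0, 1.0, size=(50, 3))
c, s = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([0.10, -0.05, 0.02])
dst = src @ R_true.T + t_true
dst[40:] += rng.uniform(0.5, 1.5, size=(10, 3))

R_est, t_est, inliers = ransac_pose(src, dst)
```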

Fig. 4: We use dense learned embeddings and geometric constraints to match a current scene (top) or part of a scene (bottom) to a previously taught one. For behavior sequences or various entry conditions, the best keyframe is computed and selected from a set (top right).

III-C4 Audio Processing

In addition to visual sensing, we also collect and process audio input. Ultimately, the audio provides another set of embeddings as input for teaching the robot, but for now we only train the system to recognize specific spoken words. The robot acquires input by asking questions and understanding spoken language responses from a person.

Spoken questions are produced using the eSpeak synthesizer. Spoken responses are understood using a custom keyword-detection module. The robot can understand a custom wakeword, a set of objects (e.g., “mug” or “bottle”) and a set of locations (e.g., “cabinet” or “fridge”) using a fully-convolutional keyword-spotting model. The input to the model is single-channel 16-bit audio captured at 16 kHz, from which a spectrogram and MFCC features are extracted. The input audio clip duration is 1300 ms, the spectrogram window is 30 ms, and the number of MFCC bins is 40. The model is trained using cross-entropy loss, and consists of three layers of convolutions, with max pooling after each layer, and ReLU activation functions.

The model listens for the wakeword every 32 ms; when the wakeword is detected, it looks to detect an object or location keyword in the following 2500 ms. A keyword must have been detected at least three times with at least probability 0.5 in order to be recognized. During training, noise is artificially added to make recognition more robust. The offline accuracy at identifying individual keywords is 98%.
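The acceptance rule for keywords (at least three detections with probability of at least 0.5 inside the 2500 ms window) can be sketched as a small filter over the model's outputs. The detection tuple format is hypothetical, and the text does not say whether the three detections must be consecutive; this sketch assumes they need not be.

```python
from collections import Counter

STEP_MS = 32      # inference period quoted in the text
WINDOW_MS = 2500  # listening window after the wakeword
MIN_HITS = 3
MIN_PROB = 0.5


def recognize_keyword(detections, wake_time_ms):
    """detections: list of (time_ms, keyword, probability) model outputs.

    Returns the first keyword to reach MIN_HITS confident detections
    inside the window after the wakeword, or None.
    """
    hits = Counter()
    for t, word, prob in sorted(detections):
        in_window = wake_time_ms < t <= wake_time_ms + WINDOW_MS
        if in_window and prob >= MIN_PROB:
            hits[word] += 1
            if hits[word] >= MIN_HITS:
                return word
    return None


# Number of inference steps that fit in one listening window.
steps_in_window = WINDOW_MS // STEP_MS

stream = [
    (1032, "mug", 0.9),
    (1064, "mug", 0.4),     # below the probability threshold
    (1096, "bottle", 0.8),
    (1128, "mug", 0.7),
    (1160, "mug", 0.6),
    (4000, "bottle", 0.9),  # outside the window
]
result = recognize_keyword(stream, wake_time_ms=1000)
```

With a 32 ms step, the model produces 78 outputs within each 2500 ms window, so requiring three confident detections is a small fraction of the available evidence.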


Fig. 5: A person in virtual reality (upper left) can teach a variety of parameterized behaviors (shown in the menu on the lower left) by visualizing a robot model, what the robot is seeing, and flying tools around in 3D (right) to define the parameters of the behavior.

The ability to teach a robot how to perform a task in a given situation with a single demonstration is key to operating in diverse unstructured environments. We have found that matching learned keypoints in high resolution images to adapt parameterized behaviors is surprisingly robust to occlusions, dynamic clutter, and changing lighting. Furthermore, because the system is only trying to match embeddings associated with a specific behavior or object, it limits false positives (i.e. the robot does not need to understand everything around it, just the features that are specific to the behavior it is performing). However, we still rely on a human teacher to annotate at teach time what parts of the scene are important for the robot to use.

The keypoints are sufficiently accurate to produce the parameters of specific taught behaviors, which compensate for small errors with closed-loop hybrid position and force control. By using the combination of collision-free motions and contact behaviors, taught motions are robust to physical clutter or new objects in the scene. The robot’s end-effectors are also designed to be inherently robust to some error.

IV-A Teaching Process

To teach the robot a task, the operator uses a set of VR modes (Figure 5). Each behavior has a corresponding VR mode for setting and commanding the specific parameters of that behavior. Each behavior mode has customized visualizations to aid in setting each parameter, dependent on the type of parameter. For example, when setting the parameters for a pull door motion, the hinge axis is labeled and visualized as a line and candidate pull poses for the gripper are restricted to fall on the arc about the hinge. To aid in the teaching process, several utility VR modes are used, such as reverting behaviors, annotating the environment with relevant objects, and repositioning of the virtual robot, camera image, and menus in the VR world.

IV-B Execution of Taught Tasks

During execution, we expect the robot and parts of the environment to be different from what they were at teach time (e.g. the robot may be in a slightly different pose, the object or furniture may have moved, or the lighting may be different). We rely on feature matching to find features in the environment that are similar to what was taught and establish a correction pose delta from matched feature correspondences. User taught behaviors are then transformed by the computed pose delta. Since our taught tasks are not dense breadcrumbs of the user actions but rather discrete parameters defining each behavior (e.g. end-effector poses, annotated rotation-axes), the robot only needs to correct a handful of target reference frames and then will autonomously plan collision-free paths to reach the goal state. This is true whether the task is a navigation or manipulation behavior. Our approach allows for multiple keyframes to be passed to the matching problem, which chooses the best keyframe at run-time based on the number of correspondences found.
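The two run-time steps described here, choosing the keyframe with the most correspondences and transforming the taught poses by its pose delta, can be sketched with homogeneous transforms. The keyframe names, correspondence counts, and frame convention (delta and taught poses expressed in one common frame, with the delta applied on the left) are assumptions made for illustration.

```python
import numpy as np


def make_transform(yaw, translation):
    """4x4 homogeneous transform: rotation about z by `yaw`, plus translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T


def select_keyframe(matches):
    """matches: dict of name -> (num_correspondences, delta_transform)."""
    name = max(matches, key=lambda k: matches[k][0])
    return name, matches[name][1]


def correct_poses(taught_poses, delta):
    """Apply the pose delta to each taught pose (same reference frame)."""
    return [delta @ T for T in taught_poses]


taught_grasp = make_transform(0.0, [1.0, 0.0, 0.5])
matches = {
    "fridge_closed": (120, make_transform(0.0, [0.1, 0.0, 0.0])),
    "fridge_open": (35, make_transform(0.5, [0.0, 0.2, 0.0])),
}

best_name, delta = select_keyframe(matches)
corrected = correct_poses([taught_grasp], delta)[0]
```

Only the handful of taught target frames are transformed; the paths between them are replanned at execution time rather than replayed.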


To evaluate the robustness of our system and approach, experiments were performed with the mobile manipulation robot in multiple real homes. Here we present three tasks, performed ten times, in two homes for a total of 60 experiments in order to obtain a measurement of task robustness across natural variations (e.g. lighting conditions during different times of the day, minor variations in initial object poses, wheel slippage, etc.). No software or parameter value changes were made across any of the 60 experiments, and each task was taught only once for the 10 experiments done for each task in each home. The homes were not modified in any way, except for removing personally identifiable information from the scene. The robot operated entirely autonomously for each of the 60 experiments. Additionally, ad hoc experiments were performed with intentional variations of the home in order to test the bounds of the robustness of the system.

V-A Task Descriptions

The three tasks that we evaluated were:

V-A1 Task 1: Bottle from Refrigerator

The robot starts in a different room than the kitchen, drives to the kitchen, opens the refrigerator, grasps a bottle, closes the refrigerator, and then drives back to the original room with the bottle. The experiment is considered a success if the robot returns to the original location with the bottle.

V-A2 Task 2: Cup from Dishwasher

The robot opens the dishwasher, removes a cup, closes the dishwasher, and places the cup on the countertop. The experiment is considered a success if the cup ends on the countertop.

V-A3 Task 3: Moving Multiple Objects to Multiple Locations (Figure 3)

The robot asks the user which object to put away, grasps that object, asks the user where to put the object, then drives to the specified location and puts the object away. In these experiments, we used two objects (a cup and a bottle) and two locations (a table and a cabinet). Voice commands were used to specify the object and location. Additionally, the cabinet door could be in any of three states: open, closed, or partially open. The experiment is considered a success if the object ends in the specified location.

In addition to the 60 experiments performed for measurement of natural variation robustness, several more experiments for each task were performed to test the bounds of the robustness of the system using intentional variations. Task 1 variations included: putting the bottle on a different shelf than it was taught on, adding obstacles along the path, adding pictures/magnets to the refrigerator, and varying the lighting conditions by closing blinds and turning lights on/off. Task 2 variations included: varying the lighting conditions. Task 3 variations included: swapping the initial positions of the two objects, opening adjacent cabinet doors, re-arranging the items in the cabinet, adding obstacles along the path, and adding a placemat to the table.

V-B Task Results

The results from the 60 experiments performed in the two homes are shown in Table I (with examples of executing behaviors shown in Figure 6); the overall success rate across all three tasks and both homes is 85%. A significant contributor to the end-to-end success rate of our tasks was fault detection and recovery within the task graph (shown by the loops in Figure 3). On average, our three tasks consisted of 45 behaviors each, executed in series. Assuming each behavior succeeds independently, this means that behaviors result in success or recoverable failure 99.6% of the time (or irrecoverable failure 0.4% of the time). For the three tasks, the robot performs the task anywhere from 10x to 100x slower than a person performing the same task, with the average being 20x slower.
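The per-behavior figure can be recovered from the end-to-end rate with a short calculation. The sketch below assumes that all 45 behaviors in a task succeed independently and with equal probability, which is a simplification made here rather than something stated in the paper; the function name is illustrative.

```python
def per_behavior_success(task_success_rate, num_behaviors):
    """Return p such that p ** num_behaviors == task_success_rate.

    Assumes behaviors run in series and succeed independently with equal probability.
    """
    if not 0.0 < task_success_rate <= 1.0 or num_behaviors < 1:
        raise ValueError("rate must be in (0, 1] and num_behaviors >= 1")
    return task_success_rate ** (1.0 / num_behaviors)

p = per_behavior_success(0.85, 45)
print(f"per-behavior success: {p:.1%}, irrecoverable failure: {1 - p:.1%}")
# prints "per-behavior success: 99.6%, irrecoverable failure: 0.4%"
```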

The task failures all result from one of two failure modes. The first is that the pose estimate of the object or affordance is inaccurate, so that the behavior positions the end-effector in a way that causes it to fail (e.g. the gripper slips off the handle). The second is that the scene looks too much like a keyframe other than the one for the desired behavior (e.g. the partially open cabinet appears to be closed), and so the wrong behavior is performed. In these experiments, no task failure was catastrophic, so if the robot had had the error detection required for the failed cases, it could have tried again and potentially succeeded at the task.
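The retry idea mentioned above can be sketched as a loop over behaviors executed in series, where a detected recoverable failure triggers another attempt. This is a minimal illustration under assumed names (`Outcome`, `run_task`, `scripted`) and an assumed retry limit; it is not the system's task graph implementation.

```python
from enum import Enum

class Outcome(Enum):
    SUCCESS = "success"
    RECOVERABLE = "recoverable_failure"
    FAILED = "irrecoverable_failure"

def run_task(behaviors, max_attempts=3):
    """Run (name, execute) behaviors in series, retrying on recoverable failure.

    Returns (task_succeeded, log), where log holds (name, attempt, outcome) tuples.
    """
    log = []
    for name, execute in behaviors:
        for attempt in range(1, max_attempts + 1):
            outcome = execute()
            log.append((name, attempt, outcome))
            if outcome is Outcome.SUCCESS:
                break  # move on to the next behavior
            if outcome is Outcome.FAILED:
                return False, log  # cannot recover, abort the task
        else:
            return False, log  # retries exhausted without success
    return True, log

def scripted(outcomes):
    """Hypothetical stand-in for a behavior: returns the given outcomes in order."""
    it = iter(outcomes)
    return lambda: next(it)

ok, log = run_task([
    ("open_door", scripted([Outcome.SUCCESS])),
    ("grasp_handle", scripted([Outcome.RECOVERABLE, Outcome.SUCCESS])),
])
```

The retry only helps when the failure is actually detected, which is exactly what was missing in the failed cases described above.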

Fig. 6: We performed a variety of tasks in multiple homes to evaluate the robustness of our system. The images here show autonomous execution of parts of the tasks in different homes.
TABLE I: Success rate for three complex tasks in two homes.
Task   | Home 1 (successes/attempts) | Home 2 (successes/attempts) | Total success rate (%)
Task 1 | 8/10                        | 8/10                        | 80
Task 2 | 8/10                        | 10/10                       | 90
Task 3 | 9/10                        | 8/10                        | 85
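As a quick consistency check, the per-task totals and the overall rate can be recomputed from the success counts in Table I. The data structure and function name below are chosen for this example only.

```python
# (successes, attempts) per home, copied from Table I.
results = {
    "Task 1": [(8, 10), (8, 10)],
    "Task 2": [(8, 10), (10, 10)],
    "Task 3": [(9, 10), (8, 10)],
}

def success_rate(pairs):
    """Percentage of successes over all attempts in the given (successes, attempts) pairs."""
    successes = sum(s for s, _ in pairs)
    attempts = sum(a for _, a in pairs)
    return 100 * successes / attempts

per_task = {task: success_rate(pairs) for task, pairs in results.items()}
overall = success_rate([pair for pairs in results.values() for pair in pairs])
print(per_task, overall)
```

Pooling all 60 experiments gives 51 successes, which matches the 85% overall rate reported in the text.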

The system is quite robust to intentional variations of the scene and task. For example, the robot successfully grasped the bottle from a different shelf in the refrigerator than the one it was taught on and avoided obstacles placed along the taught paths, and the keyframe matcher showed robustness to significant lighting changes and to environmental changes such as opening cabinet doors and adding pictures to the refrigerator. These intentional variations did expose a few systematic failures, such as large rotations of the objects to be grasped, which will be addressed in future work.


The combination of a highly capable and manipulable mobile robot with the ability to teach robust parameterized behaviors linked to dense visual embeddings from human demonstration in VR has proven to be surprisingly effective and robust in performing a wide variety of human-level tasks in real homes. While the system is not able to generalize beyond the taught scenario, tasks are tolerant to the natural variation that occurs in home environments. Because perception and behaviors are cleanly decoupled, much of the system could be tested and evaluated (or even synthesized) in simulation, which is likely key to eventually scaling a system to real users.

A key limitation of the current approach is that it requires teaching every task in VR, including explicitly annotating relevant parts of the scene, such as objects or articulated regions, for all possible discrete states of the environment (e.g. cabinet door open versus closed). Incremental improvements, such as automatically determining the relevant parts of the scene (based on what the robot picks up or moves, for example), and teaching with multiple views of a scene or object, could help alleviate some of these limitations. Because the system does not rely on real-time feedback in VR, a remote operator could teach the system when necessary.

To make the system more practical, we could either enable a regular home user to teach a robot to perform new tasks with only high-level guidance, or we could enable the system to apply what it has been taught in one environment to a different but similar environment. This likely means that the system needs to evolve from using arbitrary view-invariant features to identifying important aspects of the scene, such as affordances, and linking behaviors to them. If this can be achieved, however, a behavior taught in one home could be shared with other, possibly different, robots in different homes, significantly increasing the capability of all robots.


We would like to sincerely thank Priscilla G. Ma and Bernice Borders for hosting and supporting us during weeks of robot testing in their homes.


  • [1] Robotnik. (2015) RB-1. [Online]. Available: https://www.robotnik.eu/manipulators/rb-one/
  • [2] Fetch Robotics. (2015) Fetch. [Online]. Available: https://fetchrobotics.com/robotics-platforms/fetch-mobile-manipulator/
  • [3] Toyota. (2012) Human Support Robot. [Online]. Available: https://www.toyota-global.com/innovation/partner_robot/robot/
  • [4] Willow Garage. (2010) PR2. [Online]. Available: http://www.willowgarage.com/pages/pr2/overview
  • [5] Boston Dynamics. (2019) Spot. [Online]. Available: https://www.bostondynamics.com/spot
  • [6] Halodi. (2019) EVE R3. [Online]. Available: https://www.halodi.com/ever3
  • [7] Boston Dynamics. (2017) Atlas. [Online]. Available: https://www.bostondynamics.com/atlas
  • [8] M. Spenko, et al., The DARPA Robotics Challenge Finals: Humanoid Robots To The Rescue, 1st ed., ser. Springer Tracts in Advanced Robotics.   Springer Int. Publishing, 2018, vol. 121.
  • [9] N. Hudson, et al., “Model-based autonomous system for performing dexterous, human-level manipulation tasks,” Autonomous Robots, vol. 36, pp. 31–49, 2013.
  • [10] L. Righetti, et al., “An autonomous manipulation system based on force control and optimization,” Autonomous Robots, vol. 36, pp. 11–30, 01 2014.
  • [11] J. A. D. Bagnell, et al., “An integrated system for autonomous robotics manipulation,” in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, October 2012, pp. 2955–2962.
  • [12] T. Kröger, et al., “Manipulation primitives — A universal interface between sensor-based motion control and robot programming,” in Robot Systems for Handling and Assembly, 1st ed., ser. Springer Tracts in Advanced Robotics.   Springer, 2010, vol. 67, pp. 293–313.
  • [13] P. G. Backes, “Generalized compliant motion with sensor fusion,” in Fifth Int. Conf. on Advanced Robotics ’Robots in Unstructured Environments, June 1991, pp. 1281–1286 vol.2.
  • [14] T. Kröger, “Hybrid switched-system control for robotic systems,” On-Line Trajectory Generation in Robotic Systems, 01 2010.
  • [15] P. Furgale and T. Barfoot, “Visual teach and repeat for long‐range rover autonomy,” Journal of Field Robotics, 2010.
  • [16] M. Fehr, et al., “Visual-inertial teach and repeat for aerial inspection,” CoRR, vol. abs/1803.09650, 2018.
  • [17] J. Schulman, et al., “Learning from demonstrations through the use of non-rigid registration,” in Robotics Research.   Springer, 2016, pp. 339–354.
  • [18] T. Schmidt, et al., “Self-supervised visual descriptor learning for dense correspondence,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 420–427, 2016.
  • [19] P. R. Florence, et al., “Dense object nets: Learning dense visual object descriptors by and for robotic manipulation,” Conf. on Robot Learning (CoRL), October 2018.
  • [20] T. Zhang, et al., “Deep imitation learning for complex manipulation tasks from virtual reality teleoperation,” in 2018 IEEE Int. Conf. on Robotics and Automation (ICRA).   IEEE, 2018, pp. 1–8.
  • [21] P. Englert and M. Toussaint, “Learning manipulation skills from a single demonstration,” The Int. Journal of Robotics Research, vol. 37, no. 1, pp. 137–154, 2018.
  • [22] J. Fu, et al., “One-shot learning of manipulation skills with online dynamics adaptation and neural network priors,” CoRR, vol. abs/1509.06841, 2015.
  • [23] C. Finn, et al., “One-shot visual imitation learning via meta-learning,” CoRR, vol. abs/1709.04905, 2017. [Online]. Available: http://arxiv.org/abs/1709.04905
  • [24] A. Zhou, et al., “Watch, try, learn: Meta-learning from demonstrations and reward,” CoRR, vol. abs/1906.03352, 2019. [Online]. Available: http://arxiv.org/abs/1906.03352
  • [25] D. Huang, et al., “Neural task graphs: Generalizing to unseen tasks from a single video demonstration,” CVPR, 2019.
  • [26] Y. Chebotar, et al., “Closing the sim-to-real loop: Adapting simulation randomization with real world experience,” CoRR, vol. abs/1810.05687, 2018. [Online]. Available: http://arxiv.org/abs/1810.05687
  • [27] D. V. Gealy, et al., “Quasi-direct drive for low-cost compliant robotic manipulation,” CoRR, vol. abs/1904.03815, 2019. [Online]. Available: http://arxiv.org/abs/1904.03815
  • [28] P. Hebert, et al., “Mobile manipulation and mobility as manipulation - design and algorithms of RoboSimian,” J. Field Robotics, vol. 32, no. 2, pp. 255–274, 2015. [Online]. Available: https://doi.org/10.1002/rob.21566
  • [29] P. Furgale, et al., “Unified temporal and spatial calibration for multi-sensor systems,” in 2013 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, Nov 2013, pp. 1280–1286.
  • [30] V. Usenko, et al., “The double sphere camera model,” Proc. of the Int. Conf. on 3D Vision (3DV), September 2018.
  • [31] G. Reeves and J. Snyder, “An overview of the mars exploration rovers’ flight software,” in IEEE Int. Conf. on Systems, Man and Cybernetics, vol. 1, 11 2005, pp. 1 – 7 Vol. 1.
  • [32] M. Quigley, et al., “ROS: an open-source robot operating system,” in ICRA Workshop on Open Source Software, 2009.
  • [33] K. Shankar, et al., “A quadratic programming approach to quasi-static whole-body manipulation,” in Algorithmic Foundations of Robotics XI, vol. 107, 2015, pp. 553–570.
  • [34] J. J. Kuffner and S. M. LaValle, “RRT-connect: An efficient approach to single-query path planning,” in 2000 IEEE Int. Conf. on Robotics and Automation, 2000, pp. 995–1001.
  • [35] D. M. Helmick, et al., “Slip-compensated path following for planetary exploration rovers,” Advanced Robotics, vol. 20, no. 11, pp. 1257–1280, 2006.
  • [36] G. V. Horn and P. Perona, “The devil is in the tails: Fine-grained classification in the wild,” ArXiv, vol. abs/1709.01450, 2017.
  • [37] K. He, et al., “Mask R-CNN,” 2017 IEEE Int. Conf. on Computer Vision (ICCV), pp. 2980–2988, 2017.
  • [38] J. Tremblay, et al., “Deep object pose estimation for semantic robotic grasping of household objects,” in Proceedings of The 2nd Conf. on Robot Learning, vol. 87, 29–31 Oct 2018, pp. 306–316.
  • [39] C. Szegedy, et al., “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Thirty-First AAAI Conf. on Artificial Intelligence, 2017.
  • [40] T.-Y. Lin, et al., “Feature pyramid networks for object detection,” in Proceedings of the IEEE conf. on computer vision and pattern recognition, 2017, pp. 2117–2125.
  • [41] B. De Brabandere, et al., “Semantic instance segmentation with a discriminative loss function,” arXiv preprint arXiv:1708.02551, 2017.
  • [42] T.-Y. Lin, et al., “Microsoft COCO: Common objects in context,” in European conf. on computer vision.   Springer, 2014, pp. 740–755.
  • [43] A. Dai, et al., “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition, 2017, pp. 5828–5839.
  • [44] M. Bajracharya, et al., “Real-time 3d stereo mapping in complex dynamic environments,” in Int. Conf. on Robotics and Automation - Semantic Mapping, Perception, and Exploration (SPME) Workshop, vol. 15.   IEEE, 2012.
  • [45] H. Hirschmuller, et al., “Fast, unconstrained camera motion estimation from stereo without tracking and robust statistics,” in Proceedings of the 7th Int. Conf. on Control, Automation, Robotics and Vision, ICARCV 2002, Jan 2003, pp. 1099 – 1104 vol.2.