Interactive Gibson: A Benchmark for Interactive Navigation in Cluttered Environments

  • 2019-10-30 01:04:37
  • Fei Xia, William B. Shen, Chengshu Li, Priya Kasimbeg, Micael Tchapmi, Alexander Toshev, Roberto Martín-Martín, Silvio Savarese





Fei Xia1, William B. Shen1, Chengshu Li1, Priya Kasimbeg1, Micael Tchapmi1
Alexander Toshev2, Roberto Martín-Martín1, Silvio Savarese1

We present Interactive Gibson, the first comprehensive benchmark for training and evaluating Interactive Navigation: robot navigation strategies where physical interaction with objects is allowed and even encouraged to accomplish a task. For example, the robot can move objects if needed in order to clear a path leading to the goal location. Our benchmark comprises two novel elements: 1) a new experimental setup, the Interactive Gibson Environment, which simulates high fidelity visuals of indoor scenes, and high fidelity physical dynamics of the robot and common objects found in these scenes; 2) a set of Interactive Navigation metrics which allows one to study the interplay between navigation and physical interaction. We present and evaluate multiple learning-based baselines in Interactive Gibson, and provide insights into regimes of navigation with different trade-offs between navigation path efficiency and disturbance of surrounding objects. We make our benchmark publicly available and encourage researchers from all disciplines in robotics (e.g. planning, learning, control) to propose, evaluate, and compare their Interactive Navigation solutions in Interactive Gibson.



1Stanford Artificial Intelligence Lab (SAIL), Stanford University. [feixia, bshen88, chengshu, kasimbeg, mtchapmi, robertom, ssilvio] 2Robotics at Google.

I Introduction

Classical robot navigation is concerned with reaching goals while avoiding collisions [Siciliano:2007:SHR:1209344, bonin2008visual]. This definition of navigation is motivated by a wide variety of robot applications in factories or outdoor settings. As robots are increasingly deployed in complex and cluttered environments, physical interactions while navigating become not only unavoidable, but necessary. For example, when operating a robot in a cluttered home, the robot might need to push objects aside or open doors in order to be able to reach its destination. This problem is referred to as Interactive Navigation and in this paper we propose a principled and systematic way to study it (see Fig. 1).

Fig. 1: We study Interactive Navigation: tasks where a mobile agent that has to navigate from its initial location (bottom-left, cyan) to a final goal (top-left, red cross) can use interactions with the environment as part of its strategy. We propose a simulator and a benchmark for this problem, called Interactive Gibson, where we evaluate different modes of behavior that balance path length (red: shortest path, but infeasible interaction effort) against effort due to interaction with the environment (yellow: no interaction, but longest path).

The “aversion to interaction” in mobile robot agents is easy to understand: real robots are expensive, and interacting with the environment presents safety risks. In Robotic Manipulation these challenges have been addressed by extensive use of physics simulation engines [todorov2012mujoco, coumans2013bullet, koenig2004design], which simulate object and robot dynamics with high precision and thus allow one to study manipulation in a safe manner. Further, these engines can be used to train models which are deployable in the real world.

Unfortunately, the above simulations are not suitable for navigation, because they lack the photorealism and complexity of the real-world spaces. As a result, in recent years we have seen a class of simulation environments [kolve2017ai2, dosovitskiy2017carla, xia2018gibson, savva2019habitat, chang2017matterport3d, juliani2018unity, armeni2017joint], which build upon renderers of real-world scans. These environments have the desired photo and layout realism, and provide sufficient scene complexity. They have enabled the development and benchmarking of learning-based navigation algorithms, and some have allowed relatively easy deployment of such algorithms on real robots [gupta2017cognitive, hirose2019deep].

Most of these established simulators, however, fall short of providing interactivity – scans of real worlds are static, and objects cannot be manipulated. Such interactivity is crucial for realistic cluttered environments. An agent might need to push objects away to be able to complete the navigation task. Further, in many situations there are various options for such interactivity (see Fig. 1) where the agent has to balance between longer paths and pushing larger objects.

In this work, we study Interactive Navigation in cluttered indoor environments at scale. We present a benchmark for this problem. To enable this benchmark, we endow the Gibson Navigation environment [xia2018gibson] (Gibson V1 in the following) with the ability to simulate interactions with objects. Our main contributions are as follows.

First, we introduce a novel simulation environment, called Interactive Gibson Env, which retains the photo-realism and scale of the original Gibson V1, but allows for interactions with objects. Compared to Gibson V1, Interactive Gibson Env has significantly faster rendering speed, which makes large-scale training possible. It also allows for more complicated interactions between the agent and the environment, such as picking/pushing objects, and opening doors. This environment opens up new avenues for jointly training base and arm policies, allowing researchers to explore the synergy between manipulation and navigation.

As the second contribution, we propose to study Interactive Navigation in a general form, as navigating in a cluttered environment where interacting with objects is allowed and even needed in order to reach the goal. All of the segmented objects can be pushed subject to their mass and friction. We believe this setup is more realistic, particularly in cluttered indoor environments. Further, this problem is the first step toward tying navigation with manipulation.

A further contribution of this work is the definition of a benchmark, called Interactive Gibson Benchmark, around Interactive Gibson for Interactive Navigation agents. We present a performance metric which unifies two criteria: 1) the navigation success and path quality, and 2) the effort associated with the degree of disturbance to the surroundings. The former is formalized via the recently proposed SPL metric [anderson2018evaluation] while the latter is proportional to the mass displaced and / or force applied to the objects in the scene. Thus, this unifying metric captures a trade-off between shorter path length and less disturbance to the environment.

Finally, we present a set of learning-based baselines on two different robot platforms using established Reinforcement Learning (RL) algorithms. We show, quantitatively and qualitatively, that we can elicit different navigation behaviors by varying the interaction penalty in the reward function.

II Related Work

II-A Benchmarking in Robotics

In many empirical disciplines (e.g. computer vision, natural language understanding, or machine learning), evaluation and benchmarking can be achieved by curating a dataset and providing a set of evaluation metrics. Due to robotics’ real-world component, benchmarking is less straightforward, as one has to deal with hardware and real environments [del2006benchmarks].

As a result, the community has proposed several formats for benchmarking. First, researchers have proposed to evaluate different components used in navigation. For example, Geiger et al. [geiger2013vision] provide datasets and metrics for evaluation of SLAM and related vision capabilities outside navigation, under the assumption that these are used in navigation systems. Such approaches, however, have less importance for the up-and-coming RL-based algorithms, which do not explicitly utilize vision capabilities.

A different, more integrated approach is to provide experimental specifications to be reproduced by researchers. Sprunk et al. [sprunk2016experimental] provide a thorough definition of physical space, conditions, and navigation tasks, which are reproduced in two physical locations. Despite the unquestionable realism and experimental reproducibility, the setup cost can be prohibitive. To overcome this barrier, Pickem et al. [pickem2017robotarium] have built such a physical setup and provided remote access for users to run experiments on physical robots.

A different attempt to benchmark navigation is to organize competitions [ozguner2007systems, balch2002ten, braunl1999research]. These have been organized as one-off events or on an annual basis. Despite the realism and fairness of such setups, their infrequency makes them less suitable for faster research development.

II-B Robot Simulation Environments

With improvements in realism of visual and physics simulation, as well as the increase of available assets, simulation engines are emerging as a scalable, realistic, and fair way to evaluate navigation algorithms.

From the perspective of visuals, environments use either game engine renderers or mesh renderers. In the former category there are AI2-THOR [kolve2017ai2] and VRKitchen [gao2019vrkitchen]. The benefit of using a game engine is that it has a clean workflow to generate and incorporate game assets and design customized indoor spaces. The downside is that game engines are usually optimized for speed instead of fidelity, so the physical simulation is not accurate and simulation of interactions resorts to magic spells. Further, game engine renderers are proprietary with expensive assets.

A different class of simulators uses 3D capturing methods to scan real environments. Examples are Gibson Environment [xia2018gibson] (Gibson V1), Habitat [savva2019habitat], and MINOS [savva2017minos]. These are quite scalable, and have real-world visuals and scene layouts. The proposed Interactive Gibson falls into this category. A downside is that there might be reconstruction artifacts due to the imperfection of 3D scanning, reconstruction, and re-texturing technology. Further, these are static environments, which limits manipulation capabilities. Specialized interactive environments have been proposed, however these are limited to very specific problems, e.g. door opening [urakami2019doorgym]. In this work, we endow the Gibson Environment with manipulation capabilities – by editing the scanned meshes and replacing objects with realistic-looking CAD models we achieve the missing interactivity at scale while improving the quality of the overall scenes.

II-C Evaluation of Robot Navigation

The overwhelming majority of navigation algorithms are evaluated on navigation success – getting successfully to the target. Failures are due to inability to find a path to target, or collisions [nowak2010benchmarks]. A more complex set of metrics are concerned with various aspects of safety and path quality [munoz2007evaluation]. More precisely, safety is quantified by clearance from obstacles and traversal of narrow spaces. Quality is often quantified by path length with respect to the optimal path and smoothness [ceballos2010quantitative].

In simulation, the most recent benchmark HabitatAI [savva2019habitat] for point-to-point navigation measures performance based only on path distance and success rate of reaching the goal [anderson2018evaluation].

We believe the above definition of safety and path quality is too limiting. Oftentimes manipulating objects (which would be labeled as collision by the above metrics) is needed and can be safely performed in order to accomplish a navigation task. Further, none of the above metrics is concerned with the energy trade-off of moving objects out of the way versus taking a longer path around them – a behavior quite natural for humans.

II-D Interactive Navigation

While the literature on autonomous robot navigation is vast and prolific, less attention has been paid to navigation problems that require interactions with the environment, what we call Interactive Navigation. In the robot control literature, several papers have approached the problem of opening doors with mobile manipulators [peterson2000high, schmid2008opening, petrovskaya2007probabilistic, jain2009behavior]. However, these approaches focus on this single phase and not on the entire Interactive Navigation task.

Stilman et al. [stilman2005navigation] study Interactive Navigation from a geometric motion planning perspective. In their problem setup, the agent has to reason about the geometry and arrangement of obstacles to decide on a sequence of pushing/pulling actions to rearrange them to allow navigation. This problem, named Navigation Among Movable Objects (NAMO) is studied in subsequent work [stilman2008planning, stilman2007manipulation, van2009path, levihn13]. Their solution requires knowledge of the geometry of the objects to plan, and the search problem is restricted to 2D space.

III Interactive Gibson Environment

Fig. 2: Simulator and output modalities. 3D view of the agent in the environment (a) and four of the visual streams provided by the Interactive Gibson Environment: RGB images (b), surface normals (c), semantic segmentation of interactable objects (d), and depth (e). In our experiments (Sec. V), only semantic segmentation and depth are used as inputs to our policy network.

The study of Interactive Navigation requires a reproducible and controllable environment where testing does not imply real risks for the robot. This advocates the use of simulation. Our previous work, the Gibson Environment [xia2018gibson], provided a simulation environment to train embodied agents on visual navigation tasks without interactions. The main advantage of Gibson V1 is that it generates photo-realistic virtual images for the agent. This enabled seamless sim2real transfer [hirose2019deep, kang2019generalization]. However, Gibson V1 cannot be used as a test bed for Interactive Navigation because neither the rendering nor the assets (hundreds of 3D photo-realistic models reconstructed from real-world environments) allow for changes in the state of the environment.

We present Interactive Gibson Environment, a new simulation environment built upon Gibson V1 with two main novelties. First, we present a new rendering engine that not only can render dynamical environments, but also runs much faster than that in Gibson V1, which results in faster training of RL agents. Second, we present a new set of assets which are objects of relevant classes for Interactive Navigation (e.g. doors, chairs, tables, …) that can be interacted with.

III-A Interactive Gibson Renderer

Gibson V1 performs image-based rendering (IBR) [shum2000review]. While achieving high photo-realism, IBR presents two main limitations. First, IBR is slow – Gibson V1 renders at only 25-40 fps on modern GPUs. In order to render the scene, the system must load images from all available viewpoints and process them on-the-fly. This process is computationally expensive and limits the rendering process on most systems to something close to real-time [hedman2016scalable]. For robot learning, especially sample inefficient methods such as model-free Reinforcement Learning [duan2016benchmarking], IBR-based simulation can be prohibitively slow.

Second, IBR cannot be used for dynamic environments (e.g. changes resulting from interactions) because these changes make the images taken from the initial environment configuration obsolete. Moving objects or adding new objects to the environment is thus not compatible with IBR, which impedes its usage for tasks like Interactive Navigation.

Fig. 3: Annotation Process of the Interactive Gibson Assets. In Gibson V1 each environment is composed of a single mesh (1); for Interactive Gibson we need to segment the instances of the classes of interest for Interactive Navigation (doors, chairs, tables, …) into separate interactable meshes. We use a combination of a Minkowski SegNet [choy20194d] (2) and a connected component analysis (3) to generate object proposals (4). The proposals are manually aligned in Amazon Mechanical Turk (5) to the most similar ShapeNet [chang2015shapenet] model. Annotated objects are separated from the mesh and holes are filled (6), and the original texture is transferred to the new object model (7) to obtain photo-consistent interactable objects.

To overcome these limitations, in Interactive Gibson we replace image-based rendering with mesh rendering. This allows us to quickly train visual agents for Interactive Navigation tasks, where the agent not only navigates in the environment but also interacts with objects.

Our high-speed renderer is compatible with modern deep learning frameworks because the entire pipeline is written in Python with PyOpenGL, PyOpenGL-accelerate, and pybind11 [pybind11] bindings to our custom C++ code. This results in lower overhead, reduced computational burden, and a significant speedup of the rendering process (up to 1000 fps at 256×256 resolution on common computers).

To further optimize processes that rely on the results of the renderer (such as vision-based training of RL agents), we enable a direct transfer of render images to tensors on GPU. Avoiding downloading to host memory reduces device-host memory copies and significantly improves rendering speed (7.9 times frame rate gain at 512 × 512 resolution and 28.7 times frame rate gain at 1024 × 1024 resolution).

III-B Interactive Gibson Assets

Gibson V1 [xia2018gibson] provides a massive dataset of high quality reconstructions of indoor scenes. However, each reconstruction consists of a single static mesh, which does not afford interaction or changes in the configuration of the environment (Fig. 3.1). For our Interactive Gibson Benchmark we augment the 106 scenes with 1984 interactable CAD model alignments of 5 different object categories: chairs, desks, doors, sofas, and tables. Our data annotation pipeline leverages both existing algorithms and human annotation to align CAD models to regions of the reconstructed mesh. To maintain the visual fidelity of replaced models, we transfer the texture from the original mesh to the CAD models.

Our assets annotation process (Fig. 3) is composed of the following combination of automatic and manual procedures (blue and pink blocks in Fig. 3): first, we automatically generate object region proposals using a state-of-the-art shape-based semantic segmentation approach (Fig. 3.2) with a further segmentation into instances (Fig. 3.3 and 4). These proposals are fed into a manual annotation tool [Avetisyan_2019_CVPR] where CAD models are aligned to the environment mesh. The resulting aligned CAD models (Fig. 3.5) are used to replace the corresponding segment of the mesh (Fig. 3.6) and the color of the original mesh is transferred to the CAD model to maintain visual consistency (Fig. 3.7) in the final interactable objects. Each stage of the pipeline is detailed below.

Object Region Proposal Generation: Since Gibson V1 contains over 211,000 square meters of indoor space, it is infeasible for human annotators to inspect the entire space. We thus rely on an automated algorithm to generate coarse object proposals. These are areas of the reconstructed mesh of the environments that have a high probability of containing one or more objects of interest, together with their corresponding class IDs. These proposals are then refined and further annotated by humans. We use a pretrained Minkowski indoor semantic segmentation model [choy20194d] to predict per-voxel semantic labels (Fig. 3.2). We then split the semantic labels into an instance segmentation (Fig. 3.3) through connected-component labeling [suzuki2003linear]. In areas with low reconstruction precision, the automatic instance segmentation results may contain duplicates as well as missing entries. These were manually corrected by in-house annotators (Fig. 3.4). In total, over 4,000 object proposals resulted from this stage.
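The instance-splitting step can be illustrated with a small sketch. It assumes per-voxel class IDs stored in a dictionary and a 6-connected neighbourhood, and it uses a plain breadth-first flood fill rather than the linear-time algorithm cited above; the function name and data layout are hypothetical and not taken from the Interactive Gibson code.

```python
from collections import deque

# 6-connected neighbourhood (an assumption; the connectivity is not stated in the text).
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def label_instances(semantic):
    """Split per-voxel class IDs into instances.

    semantic: dict mapping (x, y, z) voxel coordinates to a class ID.
    Returns a dict mapping each voxel to an instance ID. Two voxels share
    an instance ID only if they have the same class and are connected.
    """
    instance = {}
    next_id = 0
    for seed in sorted(semantic):
        if seed in instance:
            continue  # already absorbed by an earlier flood fill
        cls = semantic[seed]
        instance[seed] = next_id
        queue = deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBOURS:
                n = (x + dx, y + dy, z + dz)
                if n not in instance and semantic.get(n) == cls:
                    instance[n] = next_id
                    queue.append(n)
        next_id += 1
    return instance

# Two separate chairs and one table touching the first chair.
voxels = {
    (0, 0, 0): "chair", (1, 0, 0): "chair",
    (5, 0, 0): "chair",
    (1, 1, 0): "table",
}
instances = label_instances(voxels)
```

In this toy grid the two adjacent chair voxels form one proposal, the distant chair voxel forms a second, and the table voxel forms a third even though it touches a chair.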

Object Alignment: The goal of this stage is to 1) select the most similar CAD model from a set of possibilities [chang2015shapenet], and 2) obtain the scale and the pose to align the CAD model to the reconstructed mesh. To obtain the alignments we use a modification of the Scan2CAD [Avetisyan_2019_CVPR] annotation tool. We crowdsourced each object region proposal from the previous stage as HITs (Human Intelligence Tasks) on Amazon’s Mechanical Turk crowdsourcing market [doi:10.1177/1745691610393980].

The annotator is asked to retrieve the most similar CAD model from a list of possible shapes from ShapeNet [chang2015shapenet]. Then, the annotator has to mark at least 6 keypoint correspondences between the CAD model and the scan object (Fig. 3.4). The scale and pose alignment is solved by minimizing the point-to-point distance among correspondences over seven parameters of a transformation matrix: scale (three), position (three), and rotation (one). Pitch and roll rotation parameters are predefined since the objects of interest almost always stand upright on the floor.
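A minimal sketch of such a seven-parameter fit follows. It assumes the model scan = Rz(yaw) · diag(scale) · cad + translation, searches the yaw angle on a grid, and solves scale and translation per axis in closed form for each candidate yaw. The parameterization, the grid search, the positive-scale constraint, and all names are assumptions for illustration; the paper does not specify the solver.

```python
import math

def fit_axis(xs, ys):
    """Least-squares fit of y = scale * x + offset along one axis.

    Assumes the keypoints are not all equal along this axis (var > 0).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    scale = cov / var
    return scale, mean_y - scale * mean_x

def align(cad_pts, scan_pts, yaw_steps=360):
    """Return (squared error, yaw, scale, translation) of the best fit."""
    best = None
    for k in range(yaw_steps):
        yaw = 2.0 * math.pi * k / yaw_steps
        c, s = math.cos(yaw), math.sin(yaw)
        # Undo the candidate yaw so the remaining problem is axis-aligned.
        unrot = [(c * x + s * y, -s * x + c * y, z) for x, y, z in scan_pts]
        fits = [fit_axis([p[a] for p in cad_pts], [q[a] for q in unrot])
                for a in range(3)]
        if any(scale <= 0.0 for scale, _ in fits):
            continue  # reject mirrored solutions (yaw off by 180 degrees)
        err = sum((fits[a][0] * p[a] + fits[a][1] - q[a]) ** 2
                  for p, q in zip(cad_pts, unrot) for a in range(3))
        if best is None or err < best[0]:
            tx, ty, tz = (offset for _, offset in fits)
            # Rotate the fitted offset back into the scan frame.
            translation = (c * tx - s * ty, s * tx + c * ty, tz)
            best = (err, yaw, tuple(scale for scale, _ in fits), translation)
    return best

# Synthetic check: six CAD keypoints transformed by known parameters.
cad = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1)]
true_yaw = 2.0 * math.pi * 30 / 360
true_scale = (2.0, 0.5, 1.5)
true_t = (1.0, -2.0, 0.25)
c, s = math.cos(true_yaw), math.sin(true_yaw)
scan = [(c * true_scale[0] * x - s * true_scale[1] * y + true_t[0],
         s * true_scale[0] * x + c * true_scale[1] * y + true_t[1],
         true_scale[2] * z + true_t[2]) for x, y, z in cad]
err, yaw, scale, translation = align(cad, scan)
```

Fixing pitch and roll is what reduces the rotation to a single angle, so a one-dimensional search over yaw is enough; a gradient-based or closed-form solver could replace the grid when finer angular resolution is needed.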

Object Replacement and Re-texturing: Based on the alignment data, we process the corresponding region of the original mesh. We eliminate the vertices and triangular faces close to or inside the aligned CAD model. The resulting mesh contains discontinuities and holes. We fill them using a RANSAC [Fischler:1981:RSC:358669.358692] plane fitting procedure (Fig. 3.6).
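The plane-fitting part of the hole filling can be sketched as below. The sketch assumes the input is the set of vertices around a hole, most of which lie on a planar surface such as the floor; the threshold, iteration count, and function names are hypothetical, and how the fitted plane is triangulated into new faces is not covered.

```python
import random

def plane_from_points(a, b, c):
    """Plane through three points as (unit normal n, offset d) with n.p + d = 0.

    Returns None when the points are (nearly) collinear.
    """
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    norm = sum(x * x for x in n) ** 0.5
    if norm < 1e-12:
        return None
    n = [x / norm for x in n]
    return n, -sum(n[i] * a[i] for i in range(3))

def ransac_plane(points, threshold=0.02, iterations=200, seed=0):
    """Return the plane supported by the most points, and those inliers."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        inliers = [p for p in points
                   if abs(sum(n[i] * p[i] for i in range(3)) + d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers

def project(point, plane):
    """Project a point onto the plane, e.g. to place a new hole-filling vertex."""
    n, d = plane
    dist = sum(n[i] * point[i] for i in range(3)) + d
    return tuple(point[i] - dist * n[i] for i in range(3))

# Floor vertices around a hole plus a few leftover off-plane vertices.
floor = [(float(x), float(y), 0.0) for x in range(5) for y in range(5)]
outliers = [(0.5, 0.5, 0.5), (1.5, 2.5, 0.8), (3.2, 1.1, 0.3),
            (2.2, 3.3, 1.0), (0.3, 3.1, 0.6)]
plane, inliers = ransac_plane(floor + outliers)
filled_vertex = project((1.0, 1.0, 0.4), plane)
```

RANSAC is a natural fit here because vertices left over from the removed object act as outliers that would bias an ordinary least-squares plane.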

At this point we have replaced the parts of the reconstructed mesh by a CAD model. However, the models of ShapeNet are poorly textured. We improve visual fidelity and photo-realism by transferring the original texture from the images to the aligned CAD model [cignoni2008meshlab]. Finally, we correct for the small alignment noise in the CAD models’ positions by running quick physics simulations to ensure they do not intersect with the floors and walls. For the dynamic properties of the objects that are relevant to interactions such as weight and friction, we assume a common set of parameters: density and material friction. Although this approximation can deviate from the real values in the environment, it generates realistic interaction simulations.

The final result of the annotation is a new dataset with a similar number of 3D reconstructed environments as the original Gibson V1 dataset, where all objects of the classes of interest for Interactive Navigation have been replaced by separate CAD models that can be interacted with in simulation (Fig. 3.7).

III-C Interactive Gibson Agents

Benchmarking Interactive Navigation requires embodied agents. We provide as part of Interactive Gibson ten fully functional robotic agents, including eight models of real robot platforms: two widely used simulation agents (the Mujoco [todorov2012mujoco] humanoid and ant), four wheeled navigation agents (Freight, JackRabbot v1, Husky and TurtleBot v2), a legged robot (Minitaur), two mobile manipulators with an arm (Fetch and JackRabbot v2), and a quadrocopter (Quadrotor). The large variety of embodiment types allows for easy tests of different navigation and interaction capabilities in Interactive Gibson.

The Interactive Gibson Environment enables a variety of measurements for the navigation agents (Fig. 2). The agents can receive as observations from the environment: 1) information about their own configuration such as position in the floor plan (localization), velocity, collisions with the environment and objects, motion (odometry), and visual signals that include photo-realistic RGB images, semantic segmentation, surface normals, and depth maps, and/or 2) information about the navigation task such as position of the goal, and the ten closest next waypoints of the pre-computed ground-truth shortest path to the goal (separated by 0.2m). In Interactive Gibson Environment, the agents can control the position and velocity of each joint of their embodiments, including the wheels.

IV Interactive Gibson Evaluation Setup

The task of the Interactive Gibson Benchmark is to navigate from a random starting location to a random goal location on the same floor. Both locations are uniformly sampled on the same floor plane, and are at least 1m apart.

As a result of our annotation and refinement of Interactive Gibson assets, the environments include interactable objects in place of the original objects for the following five categories: chairs, desks, doors, sofas, and tables. In addition to these existing objects in the scenes with their original poses, we add ten additional objects that are frequently found in human environments. The objects we include are baskets, shoes, pots, and toys as shown in Fig. 4(c). The models are acquired by high resolution 3D scans of common objects. The objects have the same weights in simulation as in the real world. The objects are randomly placed on the floor to create obstacles for the agents.

For each episode, we randomly sample an environment, the locations to place the ten additional objects, and the starting and goal location of the agent. The episode terminates when the agent either converges to the goal location or runs out of time. The agent converges to the goal location when the distance between them is below the convergence threshold, which is defined to be the same as the agent’s body width. The agent has 1,000 time steps to achieve its goal (equal to 100s of wall time).

Fig. 4: Interactable Objects in Interactive Gibson. (a) Topdown view of ten 3D reconstructed scenes with objects annotated and replaced by high resolution CAD models highlighted in green. (b) Retextured ShapeNet [chang2015shapenet] models obtained from our assisted annotation process (Sec. III-B). (c) Additional common objects randomly placed in the environment.

Interactive Navigation Score

To measure Interactive Navigation performance, we propose a novel metric that captures the following two aspects:

Path Efficiency: how efficient the path taken by the agent is to achieve its goal. The most efficient path is the shortest path assuming no interactable obstacles are in the way. A path is considered to have zero efficiency if it does not lead to the goal at all.

Fig. 5: Interactive Navigation Score (INS) at different α levels for Turtlebot. With α=0 (score based only on Effort Efficiency), the best performing agents are those that minimize interactions (blue). For α=1 (score based only on Path Efficiency, INS1=SPL) some of these agents are overly conservative and fail to achieve the goal at all (lower INS). One of the best performing agents (SAC with kint=0.1) strikes a healthy balance between navigation and interaction. With α=0.5, SAC has the best performance overall except when the interaction penalty is too large (kint=1). Markers indicate the mean of three random seeds per algorithm and interaction penalty coefficient evaluated in the two test environments.

Effort Efficiency: how efficiently the agent spends its effort to achieve its goal. The most efficient way is to achieve the goal without disturbing the environment or interacting with the objects. The total effort of the agent is positively correlated with the amount of energy spent moving its own body and/or pushing/manipulating objects out of its way.

Path and Effort Efficiency are measured by scores $P_{\mathit{eff}}$ and $E_{\mathit{eff}}$, respectively, each in the interval $[0, 1]$. The final metric, called Interactive Navigation Score or INS, captures both aspects in a soft manner through a convex combination of the Path and Effort Efficiency Scores:

$$\mathrm{INS}_{\alpha} = \alpha \, P_{\mathit{eff}} + (1 - \alpha) \, E_{\mathit{eff}}$$

INS captures the tension between taking a short path and minimizing effort – the robot can potentially take the shortest path (high Path Efficiency score) while pushing as many objects as needed (low Effort Efficiency score); or the robot can try to minimize effort (high Effort Efficiency score) by going around all objects and taking a longer path (low Path Efficiency score). In the evaluation we control the importance of the above trade-off by varying $\alpha \in [0, 1]$, where $\alpha = 1$ corresponds to the classical pure navigation SPL score.

In order to define the above Path and Effort scores, we assume there are $K$ movable objects in the scene indexed by $i \in \{1, \dots, K\}$. For simplicity, we consider the robot as another object in the scene with index $i = 0$. During a navigation run we denote by $l_i$ the length of the path taken by the $i$-th object. We denote navigation success by an indicator function $\mathbb{1}_{\mathit{suc}}$ that takes value 1 if the robot converges to the goal and 0 otherwise.

Then the Path Efficiency Score is defined as the ratio between the ideal shortest path length $L^*$, computed without any movable object in the environment, and the path length of the robot, masked by the success indicator function:

$$P_{\mathit{eff}} = \mathbb{1}_{\mathit{suc}} \, \frac{L^*}{l_0}$$

The most path-efficient navigation would mean the robot takes the shortest path, $l_0 = L^*$, and thus $P_{\mathit{eff}}^* = 1$. Note that because $L^*$ is computed without any object in the environment, $P_{\mathit{eff}}^* = 1$ may not be achievable in practice when objects that the robot cannot move away are sampled at locations that intersect the shortest path. This definition of Path Efficiency is equivalent to the recent metric Success weighted by Path Length (SPL) [anderson2018evaluation] for the evaluation of pure navigation agents.

To define the Effort Efficiency Score, we denote by $m_i$ the masses of the robot ($i = 0$) and the objects. Further, $G = m_0 g$ stands for the gravity force on the robot and $F_t$ stands for the amount of force applied by the robot on the environment at time $t \in [0, T]$, excluding the forces applied to the floor for locomotion. The Effort Efficiency Score captures both the excess of displaced mass (kinematic effort) and applied force (dynamic effort) for interactions:

$$E_{\mathit{eff}} = 0.5 \left( \frac{m_0 l_0}{\sum_{i=0}^{K} m_i l_i} + \frac{T G}{T G + \sum_{t=0}^{T} F_t} \right)$$

The most effort-efficient navigation policy is to not perturb any object except the robot itself: $l_i = 0$ for $i \in \{1, \dots, K\}$ and $F_t = 0$ for $t \in [0, T]$. In this case, $E_{\mathit{eff}}^* = 1$.
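As a rough illustration, the three scores could be computed as follows. The Effort Efficiency form used here, an equal-weight average of a displaced-mass ratio and an applied-force ratio, is an assumption chosen to be consistent with the definitions above (it equals 1 when no object moves and no force is applied). Function names, argument layout, and the value of g are illustrative only.

```python
def path_efficiency(success, shortest_len, robot_len):
    """P_eff: L* / l_0 if the episode succeeded, otherwise 0."""
    if not success or robot_len <= 0:
        return 0.0
    return shortest_len / robot_len

def effort_efficiency(masses, path_lengths, forces, g=9.8):
    """E_eff under the assumed equal-weight form.

    masses[0] and path_lengths[0] belong to the robot; the remaining
    entries belong to the K movable objects. forces holds one interaction
    force magnitude per time step (locomotion forces excluded).
    """
    displaced = sum(m * l for m, l in zip(masses, path_lengths))
    kinematic = masses[0] * path_lengths[0] / displaced if displaced > 0 else 1.0
    steps = len(forces)
    gravity = masses[0] * g
    dynamic = steps * gravity / (steps * gravity + sum(forces)) if steps > 0 else 1.0
    return 0.5 * (kinematic + dynamic)

def ins(alpha, p_eff, e_eff):
    """Interactive Navigation Score: convex combination of both scores."""
    return alpha * p_eff + (1.0 - alpha) * e_eff

# A 10 kg robot travels 4 m along a 3 m shortest path and pushes a 5 kg object 2 m.
p = path_efficiency(True, 3.0, 4.0)
e = effort_efficiency([10.0, 5.0], [4.0, 2.0], [0.0, 98.0, 98.0, 0.0])
score = ins(0.5, p, e)
```

Sweeping alpha from 0 to 1 moves the score from pure Effort Efficiency to pure Path Efficiency (SPL), which is how the trade-off is examined in the evaluation.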

Fig. 6: Trade-off between Path and Effort Efficiency for Fetch. With high interaction penalty (kint=1), the agents achieve higher Effort Efficiency, but at the cost of a much lower Path Efficiency. With low interaction penalty (kint=0.1), the agents achieve almost the same Path Efficiency as those trained with no interaction penalty (kint=0), and higher Effort Efficiency (e.g. by avoiding unnecessary interactions). Markers indicate the mean of three random seeds per algorithm and interaction penalty coefficient evaluated in the two test environments.

V Evaluating Baselines on Interactive Gibson

Our goal for the evaluation is to find a unified solution that can be controlled to balance path efficiency and effort efficiency. To find this solution we use our proposed Interactive Gibson benchmark to evaluate and compare three widely used reinforcement learning algorithms on the Interactive Navigation task: PPO [schulman2017proximal], DDPG [lillicrap2015continuous], and SAC [haarnoja2018soft] (implementations adopted from tf-agents [TFAgents] and modified to accommodate our environments). We randomly select eight Gibson scenes as our training environments and test our baseline agents in these (seen) environments and in two other (unseen) environments from the Interactive Gibson assets (scenes shown in Fig. 4(a)).

To train and evaluate our baseline agents we use two robotic platforms in simulation: TurtleBot v2 and Fetch. Due to their significantly different sizes and weights, these robots interact differently with the environment.

From the set of observations available in the Interactive Gibson Environment, our baseline solutions use the following: 1) goal location, 2) angular and linear velocity, and 3) the next ten waypoints of the pre-computed ground-truth shortest path, all in the agent’s local frame. The observation vector also includes a depth map and a semantic segmentation mask of reduced resolution (68×68). The action space for our baselines is the joint velocity of the wheels.

V-A Reward Function

Our hypothesis is that the balance between path efficiency and effort efficiency (amount of interaction with the objects in the environment) can be controlled through the reward received by the RL agents. With this goal in mind we propose the following reward function:

$$R = R_{suc} + R_{pot} + R_{int}$$

Rsuc (suc from success) is a one-time sparse reward of value 10 that the agent receives if it succeeds in the navigation (i.e. converges to the goal). Rpot (pot from potential) is the decrease in geodesic distance between the agent and the goal from the previous time step to the current one, Rpot=𝐺𝐷t-1-𝐺𝐷t (Rpot is positive when the distance between the agent and the goal decreases and negative when it increases). Rint (int from interaction) is the penalty for interacting with the environment: Rint=-kint𝟙int, where 𝟙int is an indicator function for interaction with objects and kint is the (positive) interaction penalty coefficient, a hyper-parameter that determines how much the agent is penalized for interactions.
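The three reward terms can be sketched per time step as follows, assuming they are simply summed. The function and argument names are hypothetical, and the success and interaction flags are assumed to be supplied by the simulator.

```python
def interactive_nav_reward(prev_geodesic_dist: float,
                           curr_geodesic_dist: float,
                           reached_goal: bool,
                           interacted: bool,
                           k_int: float = 0.1,
                           success_reward: float = 10.0) -> float:
    """Per-step reward sketch: R = R_suc + R_pot + R_int (assumed sum)."""
    # One-time sparse success reward.
    r_suc = success_reward if reached_goal else 0.0
    # Potential term: positive when the agent gets closer to the goal.
    r_pot = prev_geodesic_dist - curr_geodesic_dist
    # Interaction penalty: -k_int whenever the robot touches an object.
    r_int = -k_int if interacted else 0.0
    return r_suc + r_pot + r_int
```

Setting k_int to 0, 0.1, or 1.0 leaves the success and potential terms unchanged and only scales how strongly each step with an interaction is penalized.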

We experiment with three interaction penalty coefficients, kint ∈ {0, 0.1, 1.0}. We aim to investigate how different algorithms, robots, and (controllable) reward functions affect the navigation behavior in cluttered environments using our novel Interactive Gibson benchmark. We train the agents in 8 environments and report the test results on 2 unseen environments. The split can be found on the project website.

V-B Evaluation

Fig. 7: Qualitative results of the trade-off between Path and Effort Efficiency. With no interaction penalty (kint=0, first row), the agent follows the shortest path computed without movable objects and interacts with the objects in its way. With a high interaction penalty (kint=1, second row), the agent avoids collisions and deviates from the shortest path (c). It sometimes fails to reach the goal at all when it is blocked (d).

Fig. 5 depicts the Interactive Navigation Score, INSα, for the evaluated agents using the TurtleBot embodiment. Overall, SAC obtains the best scores regardless of the relative weight between the Path and Effort Efficiency Scores. Based only on Effort Efficiency (α=0), the best performing solutions are the ones trained to reduce interactions (kint=1, blue). Interestingly, SAC trained to moderately reduce interactions (kint=0.1, green) is the best performing agent for every balance between Path and Effort Efficiency except α=0. The results with Fetch (on our project page) follow the same pattern.
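To make the role of α concrete, the sketch below assumes INSα is a convex combination of the two scores, α·P𝑒𝑓𝑓 + (1-α)·E𝑒𝑓𝑓, which is consistent with α=0 reducing to Effort Efficiency alone. The function and argument names are hypothetical.

```python
def interactive_navigation_score(path_eff: float,
                                 effort_eff: float,
                                 alpha: float) -> float:
    """INS sketch: weighted combination of Path and Effort Efficiency.

    Assumption: alpha = 1 scores Path Efficiency only and
    alpha = 0 scores Effort Efficiency only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * path_eff + (1.0 - alpha) * effort_eff
```

Sweeping alpha from 0 to 1 for a fixed pair of scores traces a curve like the ones plotted per agent, moving from pure Effort Efficiency to pure Path Efficiency.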

Fig. 6 shows the trade-off between Path and Effort Efficiency for the evaluated agents using the Fetch embodiment. As expected, agents penalized for interactions (kint=1, blue) obtain the best Effort Efficiency Scores but at the cost of a large Path Efficiency loss: reducing interactions causes these agents to deviate more from the shortest path and even to completely fail in the navigation task, as observed in Fig. 7, bottom row.

Fig. 8 shows the difference in navigation strategy of TurtleBot agents trained with different interaction penalties. When the penalty is high (kint=1, yellow), the agents avoid any contact with the environment at the cost of less efficient path execution. When interactions are penalized less (kint=0.1, orange) or not at all (kint=0, red), the agents sacrifice effort efficiency to increase path efficiency by interacting with movable objects. Note that even without an interaction penalty, the agents learn to avoid very large objects (e.g. sofas, tables), since TurtleBots cannot push them away. The agents learn this object class-specific behavior from the semantic segmentation mask generated by the Interactive Gibson Environment. Agents embodied in the larger and more powerful Fetch robot can also move small tables and sofas and therefore learn to interact more.

Fig. 8: Navigation behaviors for different interaction penalties. Top-down view of the trajectories generated by agents trained with DDPG using different interaction penalties and the TurtleBot embodiment. Depending on the penalty, the agent learns to deviate from the optimal path (blue) to avoid collisions with large objects (sofas) (kint=0), medium ones (baskets) (kint=0.1), or small ones (cups) (kint=1). The object class information is encoded in the semantic segmentation mask.

Our baselines generalize well to unseen environments: the difference in performance between the seen and unseen environments is not statistically significant. We perform a one-sample t-test on the evaluation results for training and test scenes, measured by INS0.5. The p-value is 0.171, above the conventional 0.05 threshold, so we find no evidence that the interactive navigation solutions perform worse on unseen environments. We believe this is because, even for environments not seen during training, the robot has indirect access to the map of the environment via the shortest path input given as part of its observation (Sec. III-C). Additionally, even though the environments are different, the movable objects are of the same classes, which allows the robot to generalize how to interact or avoid collisions with them.

VI Conclusion and Future Work

We presented Interactive Gibson, a novel benchmark for training and evaluating Interactive Navigation agents. We developed a new photo-realistic simulation environment, the Interactive Gibson Environment, which includes a new renderer and more than one hundred 3D reconstructed real-world environments where all instances of object classes of relevance have been annotated and replaced by high resolution CAD models. We also proposed a metric, the Interactive Navigation Score (INS), to evaluate Interactive Navigation policies. INS reflects the trade-off between path and effort efficiency, which is shown quantitatively and qualitatively with a set of baseline solutions. We plan to continue annotating other object classes to extend our benchmark to other types of interactive tasks such as interactive search and retrieval of objects. Interactive Gibson is publicly available for other researchers to test and compare their solutions for Interactive Navigation under equal conditions.


The authors would like to thank Junyoung Gwak and Lyne P. Tchapmi for helpful discussions. We thank Google for providing funding for labeling the scenes and training the models. Fei Xia would like to thank Stanford Graduate Fellowship for the support.


Appendix A Additional Experimental Results

In this section, we provide additional experimental results that offer insight into how different interactive navigation models perform on different robotic platforms.

Fig. 9 compares the interactive navigation metrics of TurtleBot and Fetch. As the scatter plot shows, Fetch achieves the same path efficiency with higher effort efficiency. This is because Fetch is much heavier than TurtleBot, and effort efficiency takes the weight of the robot into account: a heavier robot achieves higher effort efficiency because the objects it moves are much lighter than the robot itself.

Fig. 9: Comparison of TurtleBot and Fetch. Because Fetch is heavier, it achieves successful navigation with higher effort efficiency.

Fig. 10 shows that the two terms in effort efficiency, kinematic and dynamic disturbance, are highly correlated. This serves as a sanity check that each term of the effort efficiency is meaningful.

Fig. 10: Dynamic disturbances and Kinematic disturbances are correlated.

In the main text, we showed the trade-off between path and effort efficiency for Fetch, so here we include the trade-off plot for TurtleBot in Fig. 11. The results are similar overall. Likewise, the main text showed the Interactive Navigation Score at different α for TurtleBot, so we include the Interactive Navigation Score for Fetch in Fig. 12.

Fig. 11: Trade-off between Path and Effort Efficiency for TurtleBot.
Fig. 12: Interactive Navigation Score (INS) at different α levels for Fetch.

Finally, we examined INS0.5 on the training set and the test set. There is no statistically significant difference between performance on the two sets, as can be seen in Fig. 13.

Fig. 13: A statistical test shows no significant performance drop in terms of INS on the test set compared with the training set.