DISCOMAN: Dataset of Indoor SCenes for Odometry, Mapping And Navigation

  • 2019-09-26 14:33:31
  • Pavel Kirsanov, Airat Gaskarov, Filipp Konokhov, Konstantin Sofiiuk, Anna Vorontsova, Igor Slinko, Dmitry Zhukov, Sergey Bykov, Olga Barinova, Anton Konushin


We present a novel dataset for training and benchmarking semantic SLAM methods. The dataset consists of 200 long sequences, each one containing 3000-5000 data frames. We generate the sequences using realistic home layouts. For that we sample trajectories that simulate motions of a simple home robot, and then render the frames along the trajectories. Each data frame contains a) RGB images generated using physically-based rendering, b) simulated depth measurements, c) simulated IMU readings and d) ground truth occupancy grid of a house. Our dataset serves a wider range of purposes compared to existing datasets and is the first large-scale benchmark focused on the mapping component of SLAM. The dataset is split into train/validation/test parts sampled from different sets of virtual houses. We present benchmarking results for both classical geometry-based and recent learning-based SLAM algorithms, a baseline mapping method, semantic segmentation and panoptic segmentation.



DISCOMAN: Dataset of Indoor SCenes for Odometry,
Mapping And Navigation

Pavel Kirsanov, Airat Gaskarov, Filipp Konokhov, Konstantin Sofiiuk, Anna Vorontsova,
Igor Slinko, Dmitry Zhukov, Sergey Bykov, Olga Barinova, Anton Konushin
Samsung AI Center

We present a novel dataset for training and benchmarking semantic SLAM methods. The dataset consists of 200 long sequences, each one containing 3000-5000 data frames. We generate the sequences using realistic home layouts. For that we sample trajectories that simulate motions of a simple home robot, and then render the frames along the trajectories. Each data frame contains a) RGB images generated using physically-based rendering, b) simulated depth measurements, c) simulated IMU readings and d) ground truth occupancy grid of a house. Our dataset serves a wider range of purposes compared to existing datasets and is the first large-scale benchmark focused on the mapping component of SLAM. The dataset is split into train/validation/test parts sampled from different sets of virtual houses. We present benchmarking results for both classical geometry-based [25, 9] and recent learning-based [6] SLAM algorithms, a baseline mapping method [35], semantic segmentation [4] and panoptic segmentation [29]. The dataset and source code for reproducing our experiments will be publicly available at the time of publication.

I Introduction

Simultaneous localization and mapping (SLAM) is an important component of robotic systems. Recently, the task of semantic SLAM has gained the attention of the research community. It involves several components: trajectory estimation, mapping and semantic scene understanding. However, most existing datasets and benchmarks target only one aspect of this complex task. Several benchmarks focus on trajectory estimation [31, 12, 13, 2, 27]. Others target semantic understanding [30, 7, 16]. Existing benchmarks for the mapping component of SLAM, e.g. Intel Lab Data [19], are quite small and lack diversity. Evaluation of SLAM methods requires information about the poses of the camera. However, obtaining camera poses in indoor environments requires special equipment, e.g. motion capture systems. For this reason real-world benchmarks for SLAM usually contain rather short trajectories across a small area (for instance, one room only).

Recently, computer graphics-generated datasets have become popular for benchmarking computer vision models [16]. It was shown that physically-based rendering can be successfully used for training computer vision models [34]. Among the advantages of synthetic data are perfect ground truth annotation, control over the difficulty and diversity of the data, and the opportunity to obtain a virtually unlimited number of samples. Over the last decade millions of designers have created an abundance of detailed and realistic 3d models of indoor environments. This wealth of data has great potential for benchmarking semantic SLAM systems and improving the algorithms.

Fig. 1: DISCOMAN dataset provides realistic indoor sequences with ground truth annotation for odometry, mapping and semantic segmentation.

In this work we present a new synthetic dataset called DISCOMAN (Dataset of Indoor SCenes for Odometry, Mapping And Navigation). It is generated using physically based image rendering with realistic lighting models. The data is obtained from original home layouts created for the refurbishment of real houses. We synthesize realistic trajectories as ground truth and render image sequences along them at video frame rate. In contrast to the existing datasets for SLAM that contain short sequences [24, 21], we generate long trajectories simulating the behaviour of a smart robot exploring a new home. The trajectories are more complex and diverse than in KITTI [12], but not as sophisticated as in hand-held datasets like TUM RGB-D [31] (see Figure 2). Aside from rendering RGB images, we generate perfect and noisy depth images and a pixel-accurate semantic annotation of object classes. We also generate a ground truth occupancy grid for the visited part of a house. This can be used for training and benchmarking the mapping component of SLAM. Compared to existing benchmarks, ours is an order of magnitude larger and much more diverse. It contains 200 long sequences, each containing about 3000-5000 data frames. This amount of data is enough for training and comprehensive evaluation of the models and at the same time is feasible to download and process. See Table I for a comparison with existing datasets.

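| Dataset | Data | Camera poses | Trajectories | Frames per sequence |
|---|---|---|---|---|
| TUM RGB-D [31] | real | | | |
| TUM VI [27] | real | | hand-held | 2000 |
| EuRoC [2] | real | | MAV | 3000 |
| ScanNet [7] | real | from motion | 3d scanning | |
| ICL-NUIM [13] | render | ground truth | random | 1000 |
| SceneNet RGB-D [24] | render | ground truth | random | 300 |
| InteriorNet [21] | render | ground truth | random | 1000 |
| DISCOMAN | render | ground truth | robot | 3000-5000 |

(Columns Depth, Stereo, IMU and 2d map: entries not recoverable.)
TABLE I: Comparison of indoor datasets with camera poses.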
Fig. 2: Sample trajectories from outdoor KITTI [12] (top row), and indoor DISCOMAN (middle row) and TUM RGB-D [31] (bottom row) benchmarks. The trajectories in DISCOMAN are slightly more difficult compared to KITTI, but less complex compared to TUM RGB-D.

Multiple algorithms for constructing maps have been proposed [8, 23, 18], some of them based on deep learning [22, 10, 33]. However, benchmarking these methods is currently complicated by the lack of a suitable dataset. Since there are no conventional metrics for evaluating the accuracy of mapping algorithms, in this work we introduce and describe a new set of metrics for the evaluation of mapping accuracy. We believe that our benchmark can bring new insights and facilitate the development of more accurate and robust methods for mapping.

Using the generated dataset we perform a comprehensive evaluation of current state-of-the-art methods. Our evaluation includes visual SLAM/odometry methods, namely the classical ORBSLAM [25] and a more recent learning-based method [6], an Open3D-based method for mapping [35] and a state-of-the-art semantic segmentation method [4]. These results can be used as a baseline for further research.

The rest of the paper is organized as follows. In section II we discuss related work. In section III we describe in detail the process of data generation, which involves trajectory sampling and rendering. Section IV describes the dataset, the tasks and the metrics. Section V is devoted to experiments, and section VI is left for conclusions.

II Related work

The closest work to ours is InteriorNet [21], which presents a mega-scale indoor dataset containing a large number of short synthetic sequences. Similarly to our work, they used physically based rendering for data synthesis. However, it is worth noting that only a small number of InteriorNet sequences are now available for public use and all of them are based on randomized motions. Compared to InteriorNet we focus more on the mapping component of SLAM. Thus, we generate longer sequences (about 3000-5000 frames compared to 1000 frames in InteriorNet) and provide ground truth maps along with the sequences.

Fig. 3: Samples of generated trajectories. Color coding: red - sampled keypoints, blue - final trajectory after smoothing, black - occupied areas, white - free area, grey - the area of an image where keypoints cannot be sampled. One can see the effect of choosing different number of keypoints per trajectory: (a) 10 keypoints, (b) 30 keypoints, (c) 100 keypoints per trajectory. One can see that the more keypoints we add, the more curved the trajectory gets.

Another great example of a multi-purpose dataset is the renowned KITTI benchmark suite [12]. It provides real data with different types of annotation, including camera poses for evaluation of SLAM/odometry methods, semantic/panoptic segmentation and object bounding boxes in 2d and 3d. However, this dataset is highly specialized for self-driving, e.g. the trajectories are composed mainly of straight lines and contain very few turns. Another problem with KITTI is the low diversity of the sequences. It contains only 22 sequences taken in very similar conditions. In this work we provide an order of magnitude more sequences sampled from diverse indoor environments.

The most popular real-world indoor datasets for evaluation of trajectory estimation are the TUM RGB-D benchmark [31], containing RGB-D sequences, and EuRoC [2], containing stereo+IMU sequences. TUM RGB-D contains both hand-held trajectories and trajectories taken from a robotic platform. The recent TUM VI [27] dataset is designed for benchmarking visual-inertial odometry. The popular synthetic ICL-NUIM dataset [13] has a few RGB-D sequences with modelled depth sensor noise. All sequences in ICL-NUIM are sampled from randomized trajectories across two 3d models. Each of the mentioned real-world datasets contains about a dozen sequences. The small scale of the datasets and their lack of diversity make it difficult to reason about the robustness of SLAM methods.

The ScanNet dataset [7] is a great effort to collect 3d models of real houses using a structure-from-motion technique. This benchmark targets semantic and instance segmentation in 2d and 3d. But the trajectories in this dataset are highly specific to 3d scanning applications, with an abundance of loopy motion patterns that are not relevant to robotic applications. Compared to ScanNet we focus more on robotic applications and create trajectories accordingly. Our dataset contains longer sequences with more robot-like motion patterns.

Matterport3D [3] and 2D-3D-S [1] are other examples of datasets collected with a 3d scanner, i.e. a Matterport camera. They provide 3D real-world scenes with raw 3D point clouds, segmentation and reconstructed meshes. However, the 3d scanning process with Matterport cameras does not produce smooth trajectories, and the quality of the 3d models does not allow for interpolation between the frames.

A few synthetic datasets relevant to this work have been proposed. SceneNet RGB-D [24] contains millions of frames organized in sequences corresponding to complex camera trajectories. This dataset was generated using randomly cluttered furniture, thus the main drawback of SceneNet RGB-D is low realism. Virtual KITTI [11] is a synthetic dataset of outdoor scenes labeled with accurate ground truth for object detection, tracking, scene and instance segmentation, depth and optical flow. It is also worth mentioning that a number of simulators for reinforcement learning have appeared recently.

Fig. 4: Example frames from DISCOMAN dataset. From top to bottom: RGB image, depth with emulated sensor noise, pixel-wise semantic annotation. Notice holes in depth maps for reflecting and black surfaces.

III Dataset generation

Trajectory generation algorithm. Our goal is to realistically model the motions of a robot within a given scene. The algorithm that we use for trajectory generation includes the following steps. First, we compute a 3D occupancy grid within the scene bounding box with a constant grid cell size (5 cm in our work). Then we find traversable grid cells, i.e. the ones that lie no closer than a given distance to the obstacles (20 cm in our work). Next, we uniformly sample N random points from the set of traversable cells. The number of points depends on the accessible area of the scene. We have noticed that the point density (point count per square meter) is highly correlated with trajectory complexity, i.e. linear and angular acceleration/deceleration. Then we apply a travelling salesman problem (TSP) solver to find the order for visiting the points, so that each point is visited only once. After that we compute the weighted shortest path passing through the sampled points. The weights are inversely proportional to the distance between the agent and the closest obstacle. Finally, we generate the path between the points using a full state planning algorithm, which takes into account the linear/angular velocity/acceleration limits of a given robot. Each trajectory can be sampled with the desired time resolution between frames. We choose a 150 Hz sampling rate for IMU data and 30 Hz for image sensor data.
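As an illustration, the first steps of this procedure (finding traversable cells, sampling keypoints and ordering them) can be sketched as follows. This is a minimal sketch under stated assumptions, not the generator used for the dataset: it works on a 2D grid, measures clearance with a 4-connected breadth-first search instead of a Euclidean distance, and replaces the TSP solver with a greedy nearest-neighbour heuristic. All function and parameter names are hypothetical; the default clearance of 4 cells corresponds to 20 cm at a 5 cm cell size.

```python
import random
from collections import deque


def clearance(grid):
    """4-connected distance (in cells) from every cell to the nearest obstacle.

    grid[r][c] is 1 for an occupied cell and 0 for a free one.
    Cells stay None if the grid contains no obstacle at all.
    """
    h, w = len(grid), len(grid[0])
    dist = [[None] * w for _ in range(h)]
    queue = deque()
    for r in range(h):
        for c in range(w):
            if grid[r][c]:
                dist[r][c] = 0
                queue.append((r, c))
    while queue:  # multi-source breadth-first search
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and dist[nr][nc] is None:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return dist


def sample_keypoints(grid, n, min_clearance_cells=4, seed=0):
    """Uniformly sample up to n traversable cells (far enough from obstacles)."""
    dist = clearance(grid)
    traversable = [
        (r, c)
        for r, row in enumerate(dist)
        for c, d in enumerate(row)
        if d is None or d >= min_clearance_cells
    ]
    rng = random.Random(seed)
    return rng.sample(traversable, min(n, len(traversable)))


def greedy_order(points):
    """Visit every point exactly once, always moving to the nearest unvisited one.

    Assumption: a cheap stand-in for the TSP solver mentioned in the text.
    """
    if not points:
        return []
    remaining = list(points)
    tour = [remaining.pop(0)]
    while remaining:
        last = tour[-1]
        nxt = min(remaining, key=lambda p: (p[0] - last[0]) ** 2 + (p[1] - last[1]) ** 2)
        remaining.remove(nxt)
        tour.append(nxt)
    return tour
```

Increasing `n` for a fixed free area raises the keypoint density, which is the quantity the text relates to trajectory complexity.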

Rendering. We have developed a custom visualization engine named Renderbox. It is capable of producing various robotics-specific data as well as generating physically-based shaded images. Renderbox consists of two image generation back-ends: a multi-threaded CPU raytracing renderer adapted for cluster infrastructure and a GPU-accelerated rasterization renderer. Both of them use the same scene graph, which allows data to pass smoothly and without conversion through the whole rendering pipeline.

RGB images are generated using a raytracing algorithm. We chose bidirectional path tracing with pre-gathered and pre-filtered photon maps as a good compromise between performance and visually pleasing results. For solving the ray-triangle intersection problem we use the Intel Embree library. Our physically-based rendering model allows us to vary the visual appearance of a scene by applying a number of effects. Currently, it uses an approximation of the ambient occlusion effect, which provides realistic-looking images. While the raytracing back-end is used for rendering RGB images, depth and segmentation maps are generated in real time using the OpenGL API. Examples of rendered data are shown in Figure 4.

IV Dataset description

The dataset is split into train, validation and test parts and is designed for the following tasks: trajectory estimation, mapping and semantic segmentation.

IV-A Trajectory estimation

We formulate this task as follows. Given an input sequence one needs to estimate corresponding positions and orientations of a robot.

Metrics. To compute the metrics, the estimated and ground truth trajectories first need to be aligned. We use Horn's method [17], which finds the rigid-body transformation S that maps the estimated trajectory onto the ground truth one. Then we compute the standard ATE (Absolute Trajectory Error) and RPE (Relative Pose Error) metrics. Below we formally define those metrics to avoid ambiguity.

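Following the standard definitions [31], let $P_1, \dots, P_n \in SE(3)$ denote the estimated poses and $Q_1, \dots, Q_n \in SE(3)$ the ground truth poses. Let us define the absolute trajectory error matrix at time step $i$ as:

$$F_i = Q_i^{-1} S P_i$$

The ATE is defined as the root mean square error over the translational components of the error matrices:

$$\mathrm{ATE} = \left( \frac{1}{n} \sum_{i=1}^{n} \lVert \mathrm{trans}(F_i) \rVert^2 \right)^{1/2}$$

In other words, the absolute trajectory error is the average deviation from the ground truth trajectory per frame.

The relative pose error measures the local accuracy of the trajectory over a fixed time interval $\Delta$. Therefore, the relative pose error corresponds to the drift of the trajectory, which is particularly useful for the evaluation of visual odometry systems. Let us define the relative pose error matrix at time step $i$ as:

$$E_i = \left( Q_i^{-1} Q_{i+\Delta} \right)^{-1} \left( P_i^{-1} P_{i+\Delta} \right)$$

From a sequence of $n$ camera poses we obtain $m = n - \Delta$ individual relative pose error matrices along the sequence. The RPE is usually divided into translation and rotation components. Similar to the absolute trajectory error, we propose to evaluate the root mean squared error over all time indices for the RPE translation error:

$$\mathrm{RPE}_{\mathrm{trans}} = \left( \frac{1}{m} \sum_{i=1}^{m} \lVert \mathrm{trans}(E_i) \rVert^2 \right)^{1/2}$$

For the rotation component we use the mean error:

$$\mathrm{RPE}_{\mathrm{rot}} = \frac{1}{m} \sum_{i=1}^{m} \angle\left( \mathrm{rot}(E_i) \right)$$

Here $\mathrm{trans}(\cdot)$ and $\mathrm{rot}(\cdot)$ denote the translational and rotational parts of a pose, and $\angle(\cdot)$ is the rotation angle. We average over all possible pairs in both the translation and rotation components.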
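The ATE and RPE described above can be illustrated with a short sketch. To keep it self-contained it makes two simplifying assumptions that are not part of the benchmark definition: poses are planar `(x, y, theta)` tuples rather than full 3D poses, and the estimated trajectory has already been aligned to the ground truth (so the alignment transform is the identity). The function names are hypothetical.

```python
import math


def relative(a, b):
    """Pose b expressed in the frame of pose a, i.e. a^{-1} * b for planar poses."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    c, s = math.cos(a[2]), math.sin(a[2])
    # Wrap the heading difference into (-pi, pi].
    dtheta = math.atan2(math.sin(b[2] - a[2]), math.cos(b[2] - a[2]))
    return (c * dx + s * dy, -s * dx + c * dy, dtheta)


def ate_rmse(est, gt):
    """Root mean square of the per-frame position error of aligned trajectories."""
    sq = [(e[0] - g[0]) ** 2 + (e[1] - g[1]) ** 2 for e, g in zip(est, gt)]
    return math.sqrt(sum(sq) / len(sq))


def rpe(est, gt, delta=1):
    """Return (translation RMSE, mean rotation error in degrees) over interval delta."""
    m = len(gt) - delta
    trans_sq, rot = [], []
    for i in range(m):
        d_est = relative(est[i], est[i + delta])  # estimated motion over delta
        d_gt = relative(gt[i], gt[i + delta])     # ground truth motion over delta
        err = relative(d_gt, d_est)               # discrepancy between the two motions
        trans_sq.append(err[0] ** 2 + err[1] ** 2)
        rot.append(abs(err[2]))
    return math.sqrt(sum(trans_sq) / m), math.degrees(sum(rot) / m)
```

For a trajectory whose scale is overestimated by 10%, the ATE grows with the length of the path, while the RPE stays constant, which is why the RPE is read as a measure of drift.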

IV-B Mapping

In this work we focus on the estimation of 2D occupancy maps, as they are commonly used for motion planning and navigation. We consider maps with two cell states: “empty” and “occupied”. The task is formulated as follows. Given a sequence of inaccurate camera poses and raw RGB-D frames with noisy depth, the goal is to reconstruct a 2D occupancy map of the visited part of an indoor scene.

Map scaling and alignment. To evaluate the quality of a mapping result we need to align, scale and offset the predicted map with respect to the ground truth map, so that both maps are compared in the same coordinate system with appropriate scales and offsets.

Using the initial position and orientation of the agent in the world coordinate system, which are provided in the DISCOMAN dataset, we transform all the frames from the camera coordinate system to the world coordinate system. This transformation ensures that all point clouds extracted from the frames are in the same coordinate system as the ground truth.

To compare mapping results with the ground truth map we further need to transform them to the grid coordinate system. We then project the point cloud onto the 2D coordinate system representing the predicted map. We provide the transformation from the world coordinate system to the grid coordinate system as a 4x4 matrix in the GRD file that is part of the DISCOMAN dataset. This matrix also contains the appropriate scale and offset for matching and centering a predicted map on the ground truth map.

The described sequence of transformations ensures that the resulting map is aligned with the ground truth map. This removes potential artifacts that could arise from manipulating images in 2D space, such as map rotation and welding. This simplifies map quality evaluation and makes it more accurate.
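The projection step can be sketched as follows. This is only an illustration under stated assumptions: the 4x4 matrix is passed in as a nested list (the layout of the GRD file is not described here, so no parser is shown), the first two transformed coordinates are assumed to index the grid, and the example matrix values are made up.

```python
import math


def world_to_grid(points, world_to_grid_matrix):
    """Map 3D world points to the set of 2D grid cells they fall into.

    world_to_grid_matrix is a 4x4 homogeneous transform given as nested lists.
    Assumption: after the transform, the first two coordinates are grid
    indices and the third one (height) is dropped by the projection.
    """
    t = world_to_grid_matrix
    cells = set()
    for x, y, z in points:
        gx = t[0][0] * x + t[0][1] * y + t[0][2] * z + t[0][3]
        gy = t[1][0] * x + t[1][1] * y + t[1][2] * z + t[1][3]
        cells.add((math.floor(gx), math.floor(gy)))
    return cells


# Hypothetical transform: 20 cells per metre (5 cm cells), origin shifted by 10 cells.
example_matrix = [
    [20.0, 0.0, 0.0, 10.0],
    [0.0, 20.0, 0.0, 10.0],
    [0.0, 0.0, 20.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
occupied = world_to_grid([(0.0, 0.0, 1.0), (0.26, -0.5, 0.3)], example_matrix)
```

Because scale and offset are folded into a single matrix, the predicted map lands directly on the ground truth grid and no 2D image registration is needed afterwards.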

Metrics. For the evaluation of mapping results we use a modified version of the Map Score metric introduced in [5]. Map Score gives a positive value representing the difference between two maps (generally the ground truth map of the environment and the generated map that we are evaluating), so the lower the number, the more alike the two maps are. To normalise the score, we compute the worst possible map that could be compared to the ground truth map among three variants: a map with inverted occupancy values in the dilated occupied regions [5], an empty map, and a fully occupied map. The value of Map Score for the evaluated map is then divided by the maximum of the Map Scores for these three maps.
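A possible reading of this normalisation is sketched below. It is an illustration, not the benchmark's evaluation code, and rests on several assumptions: maps are binary grids, Map Score is taken to be the sum of squared cell differences, and the dilation uses a square neighbourhood whose `radius` is a hypothetical parameter.

```python
def map_score(a, b):
    """Sum of squared per-cell differences between two equally sized grids."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def dilate(grid, radius=1):
    """Mark every cell within `radius` cells (square window) of an occupied cell."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if grid[r][c]:
                for rr in range(max(0, r - radius), min(h, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                        out[rr][cc] = 1
    return out


def normalised_map_score(pred, gt, radius=1):
    """Map Score of `pred` divided by the worst of three reference maps."""
    region = dilate(gt, radius)
    # Variant 1: ground truth with values flipped inside the dilated occupied region.
    inverted = [[1 - g if m else g for g, m in zip(rg, rm)] for rg, rm in zip(gt, region)]
    # Variants 2 and 3: an empty map and a fully occupied map.
    empty = [[0] * len(gt[0]) for _ in gt]
    full = [[1] * len(gt[0]) for _ in gt]
    worst = max(map_score(m, gt) for m in (inverted, empty, full))
    return map_score(pred, gt) / worst if worst else 0.0
```

With this reading a perfect prediction scores 0 and the worst of the three reference maps scores 1.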

IV-C Semantic/panoptic segmentation

We formulate the task as follows. Given an input sequence of frames one needs to predict the pixel-wise semantic/panoptic segmentation labelling for each frame. We perform evaluation across sequences, therefore the previous frames in the sequence can be utilized to achieve higher accuracy.

Ground truth annotation. Each RGB image in DISCOMAN comes with a corresponding pixel-wise semantic annotation. We used an ontology very similar to the one suggested in the NYUv2 dataset [26]. The dataset provides annotation for both semantic and instance segmentation tasks. We also split the classes into 'things' and 'stuff' and provide annotation for panoptic segmentation [20].

Metrics. In order to provide more diverse data for training semantic segmentation models we have generated an additional dataset consisting of 60000 images taken from 12000 different scenes. For testing we use every 10th frame in the test sequences. For semantic segmentation we compute standard metrics, i.e. mIoU and pixel accuracy. For panoptic segmentation we compute the PQ, SQ and RQ metrics as suggested in [20].
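For reference, mIoU and pixel accuracy can be computed as in the sketch below. It assumes that labels are given as flat sequences of integer class ids and that classes absent from both prediction and ground truth are skipped when averaging; the function name is hypothetical and the benchmark's own evaluation script may differ in such details.

```python
def segmentation_metrics(pred, gt, num_classes):
    """Return (mIoU, pixel accuracy) for flat sequences of class ids."""
    inter = [0] * num_classes       # correctly labelled pixels per class
    pred_count = [0] * num_classes  # pixels predicted as each class
    gt_count = [0] * num_classes    # pixels annotated as each class
    correct = 0
    for p, g in zip(pred, gt):
        pred_count[p] += 1
        gt_count[g] += 1
        if p == g:
            inter[g] += 1
            correct += 1
    ious = []
    for k in range(num_classes):
        union = pred_count[k] + gt_count[k] - inter[k]
        if union > 0:  # skip classes that never occur
            ious.append(inter[k] / union)
    return sum(ious) / len(ious), correct / len(gt)


miou, accuracy = segmentation_metrics(
    pred=[0, 1, 1, 1, 2, 0], gt=[0, 0, 1, 1, 2, 2], num_classes=3
)
```

In this toy example four of six pixels are correct, and the per-class IoU values are 1/3, 2/3 and 1/2.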

V Experiments

V-A Trajectory estimation

| Method | Success rate | ATE | RPE-t | RPE-r (deg) |
|---|---|---|---|---|
| DSO (mono) | 72% | 2.59 | 8.11 | 93.84 |
| LS-VO (mono) | 100% | 1.11 | 1.67 | 18.43 |
| ORBSLAM2 (RGB-D) | 11% | 0.69 | 1.13 | 11.20 |
| Motion Maps (RGB-D) | 100% | 0.82 | 1.17 | 14.25 |
| Motion Maps - ORBSLAM2 sequences | 100% | 0.32 | 0.48 | 4.19 |
| Motion Maps - DSO sequences | 100% | 0.42 | 0.52 | 5.9 |

TABLE II: Comparison of ORBSLAM2 [25], DSO [9], LS-VO [6] and Motion Maps [28] methods. We compute ATE, RPE rotation and RPE translation. DSO and ORBSLAM2 fail due to tracking loss on several sequences, therefore we report success rate for every method. We additionally report accuracy of Motion Maps method for the sequences where DSO or ORBSLAM2 succeeded.

Evaluated methods. We compute results for DSO (monocular) [9], ORBSLAM2 (RGB-D) [25] and the recent learning-based LS-VO (monocular) [6] and Motion Maps (RGB-D) [28] methods. We used the authors' implementations of the evaluated methods in our experiments. We used PWC-Net [32] for optical flow estimation in LS-VO and Motion Maps.

Fig. 5: Qualitative results of trajectory estimation. One can see that DISCOMAN dataset is difficult for sparse SLAM methods like DSO (monocular) and ORBSLAM2 (RGB-D). The main reasons for that are the abundance of fast rotations and low-textured surfaces, e.g. white walls. Learning-based methods LS-VO (monocular) and Motion Maps (RGB-D) show higher robustness, but in most cases lower accuracy.

Details and results. Since ORBSLAM2 is randomized, each test sequence in the dataset is processed 10 times to find the median value for each metric. We trained LS-VO and Motion Maps on the train part of the data with an initial learning rate of 0.001 using Adam with default parameters (beta1 = 0.9, beta2 = 0.99). We used two separate L2 losses for the translation and rotation components (Euler angles) of the motion, with the rotation loss multiplied by 50. The following LR scheduling was used: the learning rate was multiplied by 0.5 if the validation loss did not decrease for 10 epochs. We trained the models for 100 epochs on 1 GPU with batch size 128.
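The loss weighting and the learning rate schedule described above can be sketched in plain Python. This is an illustration with hypothetical names, not the training code: the L2 losses are taken to be mean squared errors, and the schedule halves the rate as soon as 10 consecutive epochs pass without a new best validation loss, which is one possible reading of the rule.

```python
def mean_squared_error(pred, target):
    """Mean of squared element-wise differences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def motion_loss(t_pred, t_true, r_pred, r_true, rot_weight=50.0):
    """Translation loss plus rotation loss (Euler angles) weighted by rot_weight."""
    return mean_squared_error(t_pred, t_true) + rot_weight * mean_squared_error(r_pred, r_true)


class HalveOnPlateau:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=0.001, factor=0.5, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Register the validation loss of one epoch and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

The factor of 50 compensates for rotation errors in radians being numerically much smaller than translation errors, so that both terms contribute to the gradient.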

Results of the evaluation are shown in Table II. As DSO provides camera poses only for every 3rd frame, we compute its metrics using these frames only. Qualitative results are shown in Figure 5. Overall, both ORBSLAM2 and DSO are prone to tracking loss and demonstrate a high failure rate. In many cases this problem arises in low-texture scenes, e.g. environments with white walls. But for the sequences where ORBSLAM2 succeeds, it demonstrates very good accuracy. In our experiments DSO showed high scale drift and often lost tracking. Learning-based methods are more robust and accurate.

V-B Mapping

Evaluated method. We chose Open3D [35] as a baseline algorithm. Open3D produces voxel maps from sequences of RGB-D frames. Given the color and depth images and the camera extrinsic and intrinsic matrices, Open3D extracts a point cloud from each frame, transforms it from the camera coordinate system to the world coordinate system and adds it to a point cloud accumulator. We use the Open3D truncated signed distance function (TSDF) volume as the data accumulator. TSDF speeds up point cloud aggregation and makes the result much more uniform. It also allows the scene to be represented with an adjustable level of detail. We select the scalable TSDF as the more memory-efficient accumulator, with a resolution of 0.03125 m and a truncation threshold of 0.25 m.

At the last step Open3D transforms the TSDF back to a point cloud and we project it onto the ground plane as described earlier. As a result we get a 2D predicted map in the grid coordinate system, which we can compare with the ground truth map.
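The per-frame step of turning a depth image into world-space points follows the usual pinhole camera model and can be sketched without any library. The code below is an illustration of that geometry only, not Open3D's implementation or API; the intrinsics (`fx`, `fy`, `cx`, `cy`), the convention that a depth of 0 marks a hole, and the function name are assumptions.

```python
def backproject(depth, fx, fy, cx, cy, cam_to_world):
    """Convert a depth image into a list of 3D points in world coordinates.

    depth is a list of rows with metric depth values; non-positive values are
    treated as holes and skipped. cam_to_world is a 4x4 homogeneous matrix.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # no measurement for this pixel
            # Pinhole model: pixel (u, v) at depth z in camera coordinates.
            p = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
            # Apply the camera pose to move the point into the world frame.
            points.append(tuple(
                sum(cam_to_world[r][k] * p[k] for k in range(4)) for r in range(3)
            ))
    return points


# Hypothetical 1x2 depth image and a camera shifted by 1 m along the world x axis.
pose = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
cloud = backproject([[0.0, 2.0]], fx=1.0, fy=1.0, cx=1.0, cy=0.0, cam_to_world=pose)
```

Errors in the estimated pose matrix move every point of a frame at once, which is why pose inaccuracies have a strong effect on the accumulated map.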

For trajectory estimation we used the Motion Maps method [28], as it showed the best performance. An example of a map produced by Open3D is shown in Figure 6. We believe that these results can be improved by additional map optimization, e.g. with the use of ICP, pose graph optimization or bundle adjustment. The quantitative results are shown in Table III.

Fig. 6: Example of mapping result obtained by Open3D using camera poses from Motion Maps method. (a) - occupancy grid of a 3d scene, (b) - occupancy grid obtained using Open3D with ground truth camera poses and ground truth depth, which we take for ground truth map, (c) - map from ground truth camera poses and noisy depth, (d) - map from camera poses provided by Motion Maps [28] and ground truth depth, (e) - map from poses from Motion Maps and noisy depth.

| Method | Success rate | Map Score |
|---|---|---|
| Open3D (ground truth poses, noisy depth) | 100% | 87.1% |
| Open3D (est. poses, noisy depth) | 50% | 50.7% |

TABLE III: Evaluation results for mapping. We present results for ground truth camera poses and for the camera poses estimated by Motion Maps, which showed highest accuracy in terms of trajectory estimation.

Details and results. To investigate the impact of different sources of errors on the accuracy of mapping we performed the following experiments. To evaluate the impact of depth noise on mapping we run our mapping evaluation pipeline on ground truth depth and on depth with emulated sensor noise. To evaluate the impact of inaccuracies in pose estimation we run experiments with ground truth camera positions/orientations and predicted ones. The results of the evaluation on DISCOMAN dataset are shown in Table III. One can see that inaccurate pose estimation and noisy depth measurements lead to degradation of accuracy.

V-C Semantic/panoptic segmentation

Evaluated methods. We perform experiments for both RGB and RGB-D semantic segmentation methods. For RGB segmentation we have reimplemented the state-of-the-art DeepLabV3+ architecture [4]. To enable RGB-D segmentation we trained the same architecture with an added FuseNet-like [14] branch. For panoptic segmentation we used the authors' implementation of AdaptIS [29].

Details and results. For semantic segmentation we trained the networks for 16 epochs using SGD with momentum 0.9, weight decay $10^{-4}$ and a linear learning rate scheduler starting with LR = 0.01. We used ResNet101 [15] as a backbone and fine-tuned it with one tenth of the learning rate. The crop size was set to 440. To reduce overfitting we used the following augmentations: random flip, random scale up to 30% of the crop size, random crop and random blur. We trained the models on 2 GPUs with batch size 8 in the experiments with RGB images and batch size 6 in the experiments with RGB-D images. For panoptic segmentation we used ResNet-50 as a backbone and trained the network for 180 epochs without point proposals and then for 20 more epochs with point proposals. We present the evaluation results in Table V.

Fig. 7: Failure cases for semantic segmentation. First row – input image, second row — ground truth semantic labelling, third row — result of DeepLabV3+ RGB segmentation, fourth row — result of DeepLabV3+ RGB-D segmentation. One can see that in some cases adding depth information helps to deal with ambiguities, but overall the effect of using depth for semantic segmentation is not dramatic.

| Method | mIoU | Pixel accuracy |
|---|---|---|
| DeepLabV3+ (RGB only) | 77.41% | 95.73% |
| DeepLabV3+ with FuseNet (RGB-D) | 79.88% | 96.11% |

TABLE IV: Evaluation results for DeepLabV3+ [4] on DISCOMAN dataset. For RGB-D segmentation we added FuseNet-like branch [14] to DeepLabV3+ architecture.

| Category | PQ | SQ | RQ |
|---|---|---|---|
| All | 50.22 | 83.27 | 57.18 |
| Things | 46.61 | 81.87 | 53.41 |
| Stuff | 62.59 | 88.05 | 70.10 |

TABLE V: Evaluation results for AdaptIS [29] on DISCOMAN dataset.

The results of our experiments are shown in Table IV. Qualitative results and failure cases are shown in Figure 7. One can notice that adding information about depth leads to slightly improved accuracy of semantic segmentation.

VI Conclusion

We have presented a new dataset and benchmark suite for training and evaluation of semantic SLAM models. This is the first large-scale dataset that provides ground truth annotation for environment maps in the form of occupancy grids. We present benchmarking results for RGB/RGB-D SLAM, mapping and semantic/panoptic segmentation methods across conventional metrics to establish baselines for further research.



  • [1] I. Armeni, A. Sax, A. R. Zamir, and S. Savarese (2017-02) Joint 2D-3D-Semantic Data for Indoor Scene Understanding. ArXiv e-prints. External Links: 1702.01105 Cited by: §II.
  • [2] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart (2016) The euroc micro aerial vehicle datasets. The International Journal of Robotics Research 35 (10), pp. 1157–1163. Cited by: TABLE I, §I, §II.
  • [3] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang (2017) Matterport3D: learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV). Cited by: §II.
  • [4] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818. Cited by: DISCOMAN: Dataset of Indoor SCenes for Odometry, Mapping And Navigation, §I, §V-C, TABLE IV.
  • [5] T. Colleens and J. Colleens (2007) Occupancy grid mapping: an empirical evaluation. In 2007 Mediterranean Conference on Control & Automation, pp. 1–6. Cited by: §IV-B.
  • [6] G. Costante and T. A. Ciarfuglia (2018) Ls-vo: learning dense optical subspace for robust visual odometry estimation. IEEE Robotics and Automation Letters 3 (3), pp. 1735–1742. Cited by: DISCOMAN: Dataset of Indoor SCenes for Odometry, Mapping And Navigation, §I, §V-A, TABLE II.
  • [7] A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner (2017) ScanNet: richly-annotated 3d reconstructions of indoor scenes.. In CVPR, Vol. 2, pp. 10. Cited by: TABLE I, §I, §II.
  • [8] D. De Gregorio and L. Di Stefano (2017) Skimap: an efficient mapping framework for robot navigation. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 2569–2576. Cited by: §I.
  • [9] J. Engel, V. Koltun, and D. Cremers (2017) Direct sparse odometry. IEEE transactions on pattern analysis and machine intelligence 40 (3), pp. 611–625. Cited by: DISCOMAN: Dataset of Indoor SCenes for Odometry, Mapping And Navigation, §V-A, TABLE II.
  • [10] Ö. Erkent, C. Wolf, C. Laugier, D. S. González, and V. R. Cano (2018) Semantic grid estimation with a hybrid bayesian and deep neural network approach. In IROS, Cited by: §I.
  • [11] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig (2016) Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4340–4349. Cited by: §II.
  • [12] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: Fig. 2, §I, §I, §II.
  • [13] A. Handa, T. Whelan, J. McDonald, and A. J. Davison (2014) A benchmark for rgb-d visual odometry, 3d reconstruction and slam. In Robotics and automation (ICRA), 2014 IEEE international conference on, pp. 1524–1531. Cited by: TABLE I, §I, §II.
  • [14] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers (2016) Fusenet: incorporating depth into semantic segmentation via fusion-based cnn architecture. In Asian conference on computer vision, pp. 213–228. Cited by: §V-C, TABLE IV.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §V-C.
  • [16] D. Hernandez-Juarez, L. Schneider, A. Espinosa, D. Vázquez, A. M. López, U. Franke, M. Pollefeys, and J. C. Moure (2017) Slanted stixels: representing san francisco’s steepest streets. arXiv preprint arXiv:1707.05397. Cited by: §I, §I.
  • [17] B. K. Horn (1987) Closed-form solution of absolute orientation using unit quaternions. JOSA A 4 (4), pp. 629–642. Cited by: §IV-A.
  • [18] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard (2013) OctoMap: an efficient probabilistic 3d mapping framework based on octrees. Autonomous Robots 34 (3), pp. 189–206. Cited by: §I.
  • [19] Intel lab data. Note: Accessed: 2019-01-15 External Links: Link Cited by: §I.
  • [20] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár (2019) Panoptic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9404–9413. Cited by: §IV-C, §IV-C.
  • [21] W. Li, S. Saeedi, J. McCormac, R. Clark, D. Tzoumanikas, Q. Ye, Y. Huang, R. Tang, and S. Leutenegger (2018) InteriorNet: mega-scale multi-sensor photo-realistic indoor scenes dataset. arXiv preprint arXiv:1809.00716. Cited by: TABLE I, §I, §II.
  • [22] C. Lu, G. Dubbelman, and M. J. G. van de Molengraft (2018) Monocular semantic occupancy grid mapping with convolutional variational auto-encoders. CoRR abs/1804.02176. External Links: Link, 1804.02176 Cited by: §I.
  • [23] D. Maier, A. Hornung, and M. Bennewitz (2012) Real-time navigation in 3d environments based on depth camera data. In Humanoid Robots (Humanoids), 2012 12th IEEE-RAS International Conference on, pp. 692–697. Cited by: §I.
  • [24] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison (2017) Scenenet rgb-d: can 5m synthetic images beat generic imagenet pre-training on indoor segmentation. In Proceedings of the International Conference on Computer Vision (ICCV), Vol. 4. Cited by: TABLE I, §I, §II.
  • [25] R. Mur-Artal and J. D. Tardós (2017) Orb-slam2: an open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262. Cited by: DISCOMAN: Dataset of Indoor SCenes for Odometry, Mapping And Navigation, §I, §V-A, TABLE II.
  • [26] P. K. Nathan Silberman and R. Fergus (2012) Indoor segmentation and support inference from rgbd images. In ECCV, Cited by: §IV-C.
  • [27] D. Schubert, T. Goll, N. Demmel, V. Usenko, J. Stueckler, and D. Cremers (2018-10) The tum vi benchmark for evaluating visual-inertial odometry. In International Conference on Intelligent Robots and Systems (IROS), External Links: Link Cited by: TABLE I, §I, §II.
  • [28] I. Slinko, A. Vorontsova, F. Konokhov, O. Barinova, and A. Konushin (2019) Scene motion decomposition for learnable visual odometry. arXiv preprint arXiv:1907.07227. Cited by: Fig. 6, §V-A, §V-B, TABLE II.
  • [29] K. Sofiiuk, O. Barinova, and A. Konushin (2019) AdaptIS: adaptive instance selection network. In Proceedings of the International Conference on Computer Vision, Cited by: DISCOMAN: Dataset of Indoor SCenes for Odometry, Mapping And Navigation, §V-C, TABLE V.
  • [30] S. Song, S. P. Lichtenberg, and J. Xiao (2015) Sun rgb-d: a rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 567–576. Cited by: §I.
  • [31] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers (2012-Oct.) A benchmark for the evaluation of rgb-d slam systems. In Proc. of the International Conference on Intelligent Robot Systems (IROS), Cited by: Fig. 2, TABLE I, §I, §I, §II.
  • [32] D. Sun, X. Yang, M. Liu, and J. Kautz (2018) PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, Cited by: §V-A.
  • [33] M. Zhang, K. T. Ma, S. Yen, J. Lim, Q. Zhao, and J. Feng (2018) Egocentric spatial memory. CoRR abs/1807.11929. External Links: Link, 1807.11929 Cited by: §I.
  • [34] Y. Zhang, S. Song, E. Yumer, M. Savva, J. Lee, H. Jin, and T. Funkhouser (2017) Physically-based rendering for indoor scene understanding using convolutional neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5057–5065. Cited by: §I.
  • [35] Q. Zhou, J. Park, and V. Koltun (2018) Open3D: a modern library for 3d data processing. arXiv preprint arXiv:1801.09847. Cited by: DISCOMAN: Dataset of Indoor SCenes for Odometry, Mapping And Navigation, §I, §V-B.