Abstract
Although cluttered indoor scenes contain a lot of useful high-level semantic information which can be used for mapping and localization, most Visual Odometry (VO) algorithms rely on geometric features such as points, lines and planes. Lately, driven by this idea, the joint optimization of semantic labels and odometry has gained popularity in the robotics community. Joint optimization yields accurate results but is generally very slow. At the same time, in the vision community, direct and sparse approaches for VO have struck the right balance between speed and accuracy. We merge the successes of these two communities and present a way to incorporate semantic information in the form of visual saliency into Direct Sparse Odometry (DSO), a highly successful direct sparse VO algorithm. We also present a framework to filter the visual saliency based on scene parsing. Our framework, SalientDSO, relies on widely successful deep learning based approaches for visual saliency and scene parsing, which drive the feature selection for obtaining highly accurate and robust VO even in the presence of as few as 40 point features per frame. We provide extensive quantitative evaluation of SalientDSO on the ICL-NUIM and TUM monoVO datasets and show that we outperform DSO and ORB-SLAM, two very popular state-of-the-art approaches in the literature. We also collect and publicly release the CVL-UMD dataset, which contains two cluttered indoor sequences on which we show qualitative evaluations. To our knowledge, this is the first paper to use visual saliency and scene parsing to drive the feature selection in direct VO.