Learning Effective RGB-D Representations for Scene Recognition

Abstract

Deep convolutional neural networks (CNNs) achieve impressive results on RGB scene recognition thanks to large datasets such as Places. RGB-D scene recognition, in contrast, remains underdeveloped because of two limitations of RGB-D data that we address in this paper. The first limitation is the lack of depth data for training deep learning models. Rather than fine-tuning or transferring RGB-specific features, we address this limitation by proposing an architecture and a two-step training approach that directly learns effective depth-specific features using weak supervision via patches. The resulting RGB-D model also benefits from more complementary multimodal features. The second limitation is the short range of depth sensors (typically 0.5 m to 5.5 m), so depth images fail to capture distant objects in the scene that RGB images can. We show that this limitation can be addressed by using RGB-D videos, in which more comprehensive depth information is accumulated as the camera travels across the scene. Focusing on this scenario, we introduce the ISIA RGB-D video dataset to evaluate RGB-D scene recognition with videos. Our video recognition architecture combines convolutional and recurrent neural networks (RNNs), which are trained in three steps with increasingly complex data (i.e., patches, frames, and sequences) to learn effective features. Our approach obtains state-of-the-art performance on RGB-D image (NYUD2 and SUN RGB-D) and video (ISIA RGB-D) scene recognition.
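
To make the idea of weak supervision via patches concrete, the following is a minimal sketch and not the paper's implementation. It assumes that each depth patch simply inherits the scene label of the image it was cut from, and it uses the 0.5 m to 5.5 m sensor range quoted above to measure how much of a depth map is usable. The function names, patch size, and stride are hypothetical choices made for illustration only.

```python
from typing import List, Tuple

Depth = List[List[float]]  # depth map in metres, row-major (assumed layout)


def extract_labeled_patches(
    depth: Depth, scene_label: str, patch: int = 2, stride: int = 2
) -> List[Tuple[Depth, str]]:
    """Cut a depth map into square patches.

    Each patch is paired with the image-level scene label, which is the
    weak-supervision assumption of this sketch.
    """
    rows, cols = len(depth), len(depth[0])
    patches = []
    for r in range(0, rows - patch + 1, stride):
        for c in range(0, cols - patch + 1, stride):
            window = [row[c:c + patch] for row in depth[r:r + patch]]
            patches.append((window, scene_label))
    return patches


def valid_fraction(depth: Depth, near: float = 0.5, far: float = 5.5) -> float:
    """Fraction of pixels whose depth lies inside the sensor range [near, far]."""
    values = [d for row in depth for d in row]
    return sum(near <= d <= far for d in values) / len(values)


# Toy 4x4 depth map; 0.0 stands in for a missing reading (an assumption).
toy_depth = [
    [1.0, 1.5, 0.0, 0.0],
    [2.0, 2.5, 0.0, 6.0],
    [3.0, 3.5, 4.0, 4.5],
    [5.0, 5.5, 0.2, 1.0],
]
toy_patches = extract_labeled_patches(toy_depth, "bedroom")
toy_valid = valid_fraction(toy_depth)
```

In a real pipeline the patches would feed a depth-specific CNN before training on full frames. The sketch covers only the labeling step and the range check.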
