Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation

Abstract

In the Vision-and-Language Navigation task, the embodied agent follows linguistic instructions and navigates to a specific goal. It is important in many practical scenarios and has attracted extensive attention from both the computer vision and robotics communities. However, most existing works use only RGB images and neglect the 3D semantic information of the scene. To this end, we develop a novel self-supervised training framework to encode the voxel-level 3D semantic reconstruction into a 3D semantic representation. Specifically, a region query task is designed as the pretext task, which predicts the presence or absence of objects of a particular class in a specific 3D region. Then, we construct an LSTM-based navigation model and train it with the proposed 3D semantic representations and BERT language features on vision-language pairs. Experiments show that the proposed approach achieves success rates of 68% and 66% on the validation unseen and test unseen splits of the R2R dataset, respectively, which are superior to most RGB-based methods utilizing vision-language transformers.
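
To make the region query pretext task concrete, the following is a minimal sketch of how a ground-truth label for one query might be derived from a voxel-level semantic grid. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how regions are parameterized, so the axis-aligned half-open box, the integer class ids (0 meaning empty), and the names `region_query_label` and `make_empty_grid` are all hypothetical.

```python
from typing import List, Tuple

# Assumed layout: grid[x][y][z] holds an integer class id, with 0 meaning "empty".
VoxelGrid = List[List[List[int]]]
# Assumed region format: one half-open (start, stop) index range per axis.
Region = Tuple[Tuple[int, int], Tuple[int, int], Tuple[int, int]]


def make_empty_grid(nx: int, ny: int, nz: int) -> VoxelGrid:
    """Build an nx * ny * nz grid filled with the empty class (0)."""
    return [[[0] * nz for _ in range(ny)] for _ in range(nx)]


def region_query_label(grid: VoxelGrid, region: Region, class_id: int) -> int:
    """Return 1 if any voxel inside the region carries class_id, else 0."""
    (x0, x1), (y0, y1), (z0, z1) = region
    for x in range(x0, x1):
        for y in range(y0, y1):
            for z in range(z0, z1):
                if grid[x][y][z] == class_id:
                    return 1
    return 0


# Toy scene: a single voxel of hypothetical class 3 at index (1, 2, 0).
grid = make_empty_grid(4, 4, 2)
grid[1][2][0] = 3

present = region_query_label(grid, ((0, 2), (2, 4), (0, 1)), 3)  # box contains the voxel
absent = region_query_label(grid, ((2, 4), (0, 4), (0, 2)), 3)   # box misses the voxel
```

In a self-supervised setup of this kind, such binary labels come for free from the reconstruction itself, so an encoder can be trained to answer (region, class) queries without manual annotation.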
