Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: cross-modal grounding, ill-posed feedback, and generalization. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). In particular, a matching critic provides an intrinsic reward that encourages global matching between instructions and trajectories, and a reasoning navigator performs cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms existing methods by 10% on SPL and achieves new state-of-the-art performance. To improve the generalizability of the learned policy, we further introduce a Self-Supervised Imitation Learning (SIL) method that explores unseen environments by imitating the agent's own past good decisions. We demonstrate that SIL can approximate a better and more efficient policy, which substantially narrows the success-rate gap between seen and unseen environments (from 30.7% to 11.7%).
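
For readers unfamiliar with the SPL metric cited above, the following is a minimal sketch of how it is commonly computed, assuming the usual definition of SPL as Success weighted by Path Length: each episode contributes its success indicator scaled by the ratio of the shortest-path length to the longer of the shortest-path length and the path actually taken. The function name `spl` and its parameter names are illustrative assumptions, not code from this work.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Hypothetical helper for illustration only. Assumes every episode has a
    strictly positive shortest-path length from start to goal.
    """
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same number of episodes")
    if len(successes) == 0:
        raise ValueError("at least one episode is required")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if shortest <= 0:
            raise ValueError("shortest-path length must be positive")
        if success:
            # A successful episode scores 1.0 only if the agent took a
            # shortest path; longer paths are penalized proportionally.
            total += shortest / max(taken, shortest)
        # Failed episodes contribute 0 regardless of path length.
    return total / len(successes)


if __name__ == "__main__":
    # Three toy episodes: optimal success, success with a 2x detour, failure.
    print(spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 5.0]))
```

Under this definition a policy cannot raise its score by wandering until it stumbles on the goal, which is why SPL is usually reported alongside plain success rate.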