Abstract
Vision-and-Language Navigation (VLN) presents a complex challenge in embodied AI, requiring agents to interpret natural language instructions and navigate through visually rich, unfamiliar environments. Recent advances in large vision-language models (LVLMs), such as CLIP and Flamingo, have significantly improved multimodal understanding but introduced new challenges related to computational cost and real-time deployment. In this project, we propose a modular, plug-and-play navigation framework that decouples vision-language understanding from action planning. By integrating a frozen vision-language model, Qwen2.5-VL-7B-Instruct, with lightweight planning logic, we aim to achieve flexible, fast, and adaptable navigation without extensive model fine-tuning. Our framework leverages prompt engineering, structured history management, and a two-frame visual input strategy to enhance decision-making continuity across navigation steps. We evaluate our system on the Room-to-Room benchmark within the VLN-CE setting using the Matterport3D dataset and Habitat-Lab simulation environment. Although our initial results reveal challenges in generalizing to unseen environments under strict evaluation settings, our modular approach lays a foundation for scalable and efficient navigation systems, highlighting promising directions for future improvement through enhanced environmental priors and expanded multimodal input integration.