SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models

Abstract

Recent advances in vision-language navigation (VLN) were mainly attributed to emerging large language models (LLMs). These methods exhibited excellent generalization capabilities in instruction understanding and task reasoning. However, they were constrained by the fixed knowledge bases and reasoning abilities of LLMs, which prevented them from fully incorporating experiential knowledge and thus resulted in a lack of efficient evolutionary capacity. To address this, we drew inspiration from the evolution capabilities of natural agents and proposed a self-evolving VLN framework (SE-VLN) to endow VLN agents with the ability to continuously evolve during testing. To the best of our knowledge, this was the first time that a multimodal LLM-powered self-evolving VLN framework was proposed. Specifically, SE-VLN comprised three core modules: a hierarchical memory module to transform successful and failed cases into reusable knowledge, a retrieval-augmented thought-based reasoning module to retrieve experience and enable multi-step decision-making, and a reflection module to realize continual evolution. Comprehensive tests showed that SE-VLN achieved navigation success rates of 57% and 35.2% in unseen environments, representing absolute performance improvements of 23.9% and 15.0% over current state-of-the-art methods on the R2R and REVERSE datasets, respectively. Moreover, SE-VLN showed performance improvement with an increasing experience repository, demonstrating its great potential as a self-evolving agent framework for VLN.
