Improving Vision-Language-Action Model with Online Reinforcement Learning

Abstract

Recent studies have successfully integrated large vision-language models (VLMs) into low-level robotic control by supervised fine-tuning (SFT) with expert robotic datasets, resulting in what we term vision-language-action (VLA) models. Although the VLA models are powerful, how to improve these large models during interaction with environments remains an open question. In this paper, we explore how to further improve these VLA models via Reinforcement Learning (RL), a commonly used fine-tuning technique for large models. However, we find that directly applying online RL to large VLA models presents significant challenges, including training instability that severely impacts the performance of large models, and computing burdens that exceed the capabilities of most local machines. To address these challenges, we propose the iRe-VLA framework, which iterates between Reinforcement Learning and Supervised Learning to effectively improve VLA models, leveraging the exploratory benefits of RL while maintaining the stability of supervised learning. Experiments in two simulated benchmarks and a real-world manipulation suite validate the effectiveness of our method.
