Interactive Post-Training for Vision-Language-Action Models

Abstract

We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based interactive post-training paradigm that fine-tunes pretrained Vision-Language-Action (VLA) models using only sparse binary success rewards. Existing VLA training pipelines rely heavily on offline expert demonstration data and supervised imitation, limiting their ability to adapt to new tasks and environments under low-data regimes. RIPT-VLA addresses this by enabling interactive post-training with a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. RIPT-VLA has the following characteristics. First, it applies to various VLA models: it improves the lightweight QueST model by 21.2% and brings the 7B OpenVLA-OFT model to an unprecedented 97.5% success rate. Second, it is computationally efficient and data-efficient: with only one demonstration, RIPT-VLA takes an unworkable SFT model (4% success rate) to a 97% success rate within 15 iterations. Furthermore, we demonstrate that the policy learned by RIPT-VLA generalizes across different tasks and scenarios and is robust to the initial state context. These results highlight RIPT-VLA as a practical and effective paradigm for post-training VLA models through minimal supervision.
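
To make the leave-one-out advantage estimation named in the abstract concrete, here is a minimal Python sketch. It uses the standard leave-one-out baseline: each rollout's advantage is its reward minus the mean reward of the other rollouts from the same context. The function names are hypothetical. The skip rule for contexts whose rollouts all share the same outcome is only one plausible reading of "dynamic rollout sampling" and is an assumption here, not the paper's confirmed procedure.

```python
from typing import List, Optional


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for K rollouts of a single context.

    Each rollout is compared against the mean reward of the other K - 1
    rollouts, so no learned value function is needed.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two rollouts per context")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def advantages_or_skip(rewards: List[float]) -> Optional[List[float]]:
    """Return None when every rollout has the same reward.

    Assumption: such contexts are skipped, because all leave-one-out
    advantages are zero and contribute no learning signal.
    """
    if len(set(rewards)) == 1:
        return None
    return leave_one_out_advantages(rewards)


if __name__ == "__main__":
    # Binary success rewards for four rollouts of one task context.
    print(advantages_or_skip([1.0, 0.0, 0.0, 1.0]))
    # All rollouts succeed: no signal, so the context is skipped.
    print(advantages_or_skip([1.0, 1.0, 1.0, 1.0]))
```

With binary success rewards, successful rollouts receive positive advantages and failed ones negative advantages, and the advantages within a context always sum to zero.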
