Abstract
Vision-Language-Action (VLA) models demonstrate significant potential for developing generalized policies in real-world robotic control. This progress inspires researchers to explore fine-tuning these models with Reinforcement Learning (RL). However, fine-tuning VLA models with RL still faces challenges related to sample efficiency, compatibility with action chunking, and training stability. To address these challenges, we explore the fine-tuning of VLA models through offline reinforcement learning incorporating action chunking. In this work, we propose Chunked RL, a novel reinforcement learning framework specifically designed for VLA models. Within this framework, we extend temporal difference (TD) learning to incorporate action chunking, a prominent characteristic of VLA models. Building upon this framework, we propose CO-RFT, an algorithm aimed at fine-tuning VLA models using a limited set of demonstrations (30 to 60 samples). Specifically, we first conduct imitation learning (IL) with full parameter fine-tuning to initialize both the backbone and the policy. Subsequently, we implement offline RL with action chunking to optimize the pretrained policy. Our empirical results in real-world environments demonstrate that CO-RFT outperforms previous supervised methods, achieving a 57% improvement in success rate and a 22.3% reduction in cycle time. Moreover, our method exhibits robust positional generalization capabilities, attaining a success rate of 44.3% in previously unseen positions.
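
As an illustration of what extending TD learning to action chunks can look like, the sketch below computes a bootstrapped target over a chunk of h actions: the discounted sum of the h rewards collected while the chunk executes, plus a discounted value estimate for the next chunk. This is a generic h-step formulation given here as an assumption, not necessarily the exact objective used by CO-RFT; the function name `chunked_td_target` and its parameters are hypothetical.

```python
from typing import Sequence


def chunked_td_target(
    rewards: Sequence[float],
    gamma: float,
    next_q: float,
    done: bool = False,
) -> float:
    """Illustrative TD target for one action chunk (assumed formulation).

    rewards: per-step rewards observed while executing the chunk (length h).
    gamma:   per-step discount factor.
    next_q:  critic estimate for the state and chunk that follow this one.
    done:    True if the episode ended inside the chunk (no bootstrapping).
    """
    h = len(rewards)
    # Discounted sum of the rewards collected inside the chunk.
    discounted = sum((gamma ** i) * r for i, r in enumerate(rewards))
    # Bootstrap from the next chunk's value, discounted by h steps.
    bootstrap = 0.0 if done else (gamma ** h) * next_q
    return discounted + bootstrap
```

For example, with two rewards of 1.0, a discount of 0.5, and a next-chunk estimate of 4.0, the target is 1.0 + 0.5 + 0.25 × 4.0 = 2.5. Under this assumed form, the critic is updated once per chunk rather than once per primitive action.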