Abstract
Vision-Language-Action (VLA) models have recently made significant advances in multi-task, end-to-end robotic control, owing to the strong generalization capabilities of Vision-Language Models (VLMs). A fundamental challenge in developing such models is effectively aligning the vision-language space with the robotic action space. Existing approaches typically rely on directly fine-tuning VLMs using expert demonstrations. However, this strategy suffers from a spatio-temporal gap, resulting in considerable data inefficiency and heavy reliance on human labor. Spatially, VLMs operate within a high-level semantic space, whereas robotic actions are grounded in low-level 3D physical space; temporally, VLMs primarily interpret the present, while VLA models anticipate future actions. To overcome these challenges, we propose a novel training paradigm, ROSA, which leverages robot state estimation to improve alignment between vision-language and action spaces. By integrating robot state estimation data obtained via an automated process, ROSA enables the VLA model to gain enhanced spatial understanding and self-awareness, thereby boosting performance and generalization. Extensive experiments in both simulated and real-world environments demonstrate the effectiveness of ROSA, particularly in low-data regimes.