From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models

Abstract

One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable due to the lack of language instructions. Emerging benchmarks for VLAs that incorporate language often come with limited evaluation tasks and do not intend to investigate how much VLM pretraining truly contributes to the generalization capabilities of the downstream robotic policy. Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates a barrier for reproducibility and accessibility. To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects. We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability. Our results show that while VLM backbones endow VLAs with robust perceptual understanding and high-level planning, which we refer to as good intentions, this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions, but falter in action execution. Moreover, finetuning on action data can erode the original VLM's generalist reasoning abilities. We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap. More information, including the source code, can be found at https://ai4ce.github.io/INT-ACT/
