Abstract
Building on the advancements of Large Language Models (LLMs) and Vision-Language Models (VLMs), recent research has introduced Vision-Language-Action (VLA) models as an integrated solution for robotic manipulation tasks. These models take camera images and natural language task instructions as input and directly generate control actions for robots to perform specified tasks, greatly improving both decision-making capabilities and interaction with human users. However, the data-driven nature of VLA models, combined with their lack of interpretability, makes the assurance of their effectiveness and robustness a challenging task. This highlights the need for a reliable testing and evaluation platform. For this purpose, in this work, we propose LADEV, a comprehensive and efficient platform specifically designed for evaluating VLA models. We first present a language-driven approach that automatically generates simulation environments from natural language inputs, mitigating the need for manual adjustments and significantly improving testing efficiency. Then, to further assess the influence of language input on the VLA models, we implement a paraphrase mechanism that produces diverse natural language task instructions for testing. Finally, to expedite the evaluation process, we introduce a batch-style method for conducting large-scale testing of VLA models. Using LADEV, we conducted experiments on several state-of-the-art VLA models, demonstrating its effectiveness as a tool for evaluating these models. Our results showed that LADEV not only enhances testing efficiency but also establishes a solid baseline for evaluating VLA models, paving the way for the development of more intelligent and advanced robotic systems.