Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment

Abstract

Vision-Language-Action (VLA) models have emerged as a powerful framework that unifies perception, language, and control, enabling robots to perform diverse tasks through multimodal understanding. However, current VLA models typically contain a massive number of parameters and rely heavily on large-scale robot data pretraining, leading to high computational costs during training, as well as limited deployability for real-time inference. Moreover, most training paradigms degrade the perceptual representations of the vision-language backbone, resulting in overfitting and poor generalization to downstream tasks. In this work, we present Evo-1, a lightweight VLA model that reduces computation and improves deployment efficiency, while maintaining strong performance without pretraining on robot data. Evo-1 builds on a native multimodal Vision-Language model (VLM), incorporating a novel cross-modulated diffusion transformer along with an optimized integration module, together forming an effective architecture. We further introduce a two-stage training paradigm that progressively aligns action with perception, preserving the representations of the VLM. Notably, with only 0.77 billion parameters, Evo-1 achieves state-of-the-art results on the Meta-World and RoboTwin suites, surpassing the previous best models by 12.4% and 6.9%, respectively, and also attains a competitive result of 94.8% on LIBERO. In real-world evaluations, Evo-1 attains a 78% success rate with high inference frequency and low memory overhead, outperforming all baseline methods. We release code, data, and model weights to facilitate future research on lightweight and efficient VLA models.
