Abstract
Vision-Language-Action (VLA) models have emerged as a powerful framework that unifies perception, language, and control, enabling robots to perform diverse tasks through multimodal understanding. However, current VLA models typically contain a massive number of parameters and rely heavily on large-scale robot data pretraining, leading to high computational costs during training as well as limited deployability for real-time inference. Moreover, most training paradigms degrade the perceptual representations of the vision-language backbone, resulting in overfitting and poor generalization to downstream tasks. In this work, we present Evo-1, a lightweight VLA model that reduces computation and improves deployment efficiency while maintaining strong performance without pretraining on robot data. Evo-1 builds on a native multimodal Vision-Language model (VLM), incorporating a novel cross-modulated diffusion transformer along with an optimized integration module, together forming an effective architecture. We further introduce a two-stage training paradigm that progressively aligns action with perception, preserving the representations of the VLM. Notably, with only 0.77 billion parameters, Evo-1 achieves state-of-the-art results on the Meta-World and RoboTwin suites, surpassing the previous best models by 12.4% and 6.9%, respectively, and also attains a competitive result of 94.8% on LIBERO. In real-world evaluations, Evo-1 attains a 78% success rate with high inference frequency and low memory overhead, outperforming all baseline methods. We release code, data, and model weights to facilitate future research on lightweight and efficient VLA models.