Abstract
This paper addresses the limitations of current humanoid robot controlframeworks, which primarily rely on reactive mechanisms and lack autonomousinteraction capabilities due to data scarcity. We propose Humanoid-VLA, a novelframework that integrates language understanding, egocentric scene perception,and motion control, enabling universal humanoid control. Humanoid-VLA beginswith language-motion pre-alignment using non-egocentric human motion datasetspaired with textual descriptions, allowing the model to learn universal motionpatterns and action semantics. We then incorporate egocentric visual contextthrough a parameter efficient video-conditioned fine-tuning, enablingcontext-aware motion generation. Furthermore, we introduce a self-superviseddata augmentation strategy that automatically generates pseudoannotationsdirectly derived from motion data. This process converts raw motion sequencesinto informative question-answer pairs, facilitating the effective use oflarge-scale unlabeled video data. Built upon whole-body control architectures,extensive experiments show that Humanoid-VLA achieves object interaction andenvironment exploration tasks with enhanced contextual awareness, demonstratinga more human-like capacity for adaptive and intelligent engagement.