Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration

Abstract

This paper addresses the limitations of current humanoid robot control frameworks, which primarily rely on reactive mechanisms and lack autonomous interaction capabilities due to data scarcity. We propose Humanoid-VLA, a novel framework that integrates language understanding, egocentric scene perception, and motion control, enabling universal humanoid control. Humanoid-VLA begins with language-motion pre-alignment using non-egocentric human motion datasets paired with textual descriptions, allowing the model to learn universal motion patterns and action semantics. We then incorporate egocentric visual context through parameter-efficient video-conditioned fine-tuning, enabling context-aware motion generation. Furthermore, we introduce a self-supervised data augmentation strategy that automatically generates pseudo-annotations derived directly from motion data. This process converts raw motion sequences into informative question-answer pairs, facilitating the effective use of large-scale unlabeled video data. Humanoid-VLA is built upon whole-body control architectures, and extensive experiments show that it accomplishes object interaction and environment exploration tasks with enhanced contextual awareness, demonstrating a more human-like capacity for adaptive and intelligent engagement.
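The abstract does not specify what the generated question-answer pairs look like. As a minimal sketch of the idea of deriving supervision from motion data alone, the example below assumes that motion has already been discretized into integer tokens and that one pseudo-annotation takes the form of masked infilling: a contiguous span is hidden, the question shows the remaining tokens, and the answer is the hidden span. The function name `make_infill_qa`, the `<mask>` placeholder, and the prompt wording are all hypothetical and are not taken from the paper.

```python
import random
from typing import Dict, List, Optional

MASK = "<mask>"  # hypothetical placeholder token, not from the paper


def make_infill_qa(
    motion_tokens: List[int],
    span: int,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Hide a contiguous span of motion tokens and ask for it back.

    Assumes motion is already a sequence of discrete token ids; no
    text label is needed, so unlabeled motion can be used directly.
    """
    if not 0 < span < len(motion_tokens):
        raise ValueError("span must be positive and shorter than the sequence")
    rng = rng or random.Random()
    # Pick a start index so that the masked span fits inside the sequence.
    start = rng.randrange(len(motion_tokens) - span + 1)
    hidden = motion_tokens[start:start + span]
    visible = (
        [str(t) for t in motion_tokens[:start]]
        + [MASK] * span
        + [str(t) for t in motion_tokens[start + span:]]
    )
    return {
        "question": "Fill in the masked motion tokens: " + " ".join(visible),
        "answer": " ".join(str(t) for t in hidden),
    }


if __name__ == "__main__":
    qa = make_infill_qa([11, 42, 7, 7, 93, 5], span=2, rng=random.Random(0))
    print(qa["question"])
    print(qa["answer"])
```

Because the answer is cut from the sequence itself, each pair is consistent by construction; other question templates could be built the same way under the same assumption of tokenized motion.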
