Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos

Abstract

This paper presents a novel approach for pretraining robotic manipulation Vision-Language-Action (VLA) models using a large corpus of unscripted real-life video recordings of human hand activities. Treating the human hand as a dexterous robot end-effector, we show that "in-the-wild" egocentric human videos without any annotations can be transformed into data formats fully aligned with existing robotic V-L-A training data in terms of task granularity and labels. This is achieved by developing a fully automated, holistic human activity analysis approach for arbitrary human hand videos. This approach can generate atomic-level hand activity segments and their language descriptions, each accompanied by framewise 3D hand motion and camera motion. We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames. This training data covers a wide range of objects and concepts, dexterous manipulation tasks, and environment variations in real life, vastly exceeding the coverage of existing robot data. We design a dexterous hand VLA model architecture and pretrain the model on this dataset. The model exhibits strong zero-shot capabilities on completely unseen real-world observations. Additionally, fine-tuning it on a small amount of real robot action data significantly improves task success rates and generalization to novel objects in real robotic experiments. We also demonstrate the appealing scaling behavior of the model's task performance with respect to pretraining data scale. We believe this work lays a solid foundation for scalable VLA pretraining, advancing robots toward truly generalizable embodied intelligence.
