Abstract
We report our recent progress towards building generalist robot policies, the development of GR-3. GR-3 is a large-scale vision-language-action (VLA) model. It showcases exceptional capabilities in generalizing to novel objects, environments, and instructions involving abstract concepts. Furthermore, it can be efficiently fine-tuned with minimal human trajectory data, enabling rapid and cost-effective adaptation to new settings. GR-3 also excels in handling long-horizon and dexterous tasks, including those requiring bi-manual manipulation and mobile movement, showcasing robust and reliable performance. These capabilities are achieved through a multi-faceted training recipe that includes co-training with web-scale vision-language data, efficient fine-tuning from human trajectory data collected via VR devices, and effective imitation learning with robot trajectory data. In addition, we introduce ByteMini, a versatile bi-manual mobile robot designed with exceptional flexibility and reliability, capable of accomplishing a wide range of tasks when integrated with GR-3. Through extensive real-world experiments, we show GR-3 surpasses the state-of-the-art baseline method, $\pi_0$, on a wide variety of challenging tasks. We hope GR-3 can serve as a step towards building generalist robots capable of assisting humans in daily life.