Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers

Abstract

Human vision is a highly active process driven by gaze, which directs attention and fixation to task-relevant regions and dramatically reduces visual processing. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance both efficiency and performance. We build on recent advances in foveated image processing and apply them to an Active Vision robot system that emulates both human head movement and eye tracking. Extending prior work on the AV-ALOHA robot simulation platform, we introduce a framework for simultaneously collecting eye-tracking data and robot demonstrations from a human operator, as well as a simulation benchmark and dataset for training robot policies that incorporate human gaze. Given the widespread use of Vision Transformers (ViTs) in robot learning, we integrate gaze information into ViTs using a foveated patch tokenization scheme inspired by recent work in image segmentation. Compared to uniform patch tokenization, this significantly reduces the number of tokens, and thus computation, without sacrificing visual fidelity near regions of interest. We also explore two approaches to gaze imitation and prediction from human data. The first is a two-stage model that predicts gaze to guide foveation and action; the second integrates gaze into the action space, allowing the policy to jointly predict gaze and actions end-to-end. Our results show that our method for foveated robot vision not only drastically reduces computational overhead, but also improves performance on high-precision tasks and robustness to unseen distractors. Together, these findings suggest that human-inspired visual processing offers a useful inductive bias for robotic vision systems. https://ian-chuang.github.io/gaze-av-aloha/
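
To give a rough sense of why foveated tokenization can cut the token count, the sketch below compares uniform patch tokenization with a simple multi-scale crop scheme centered on a gaze point. This is an illustrative assumption, not the tokenization used in the paper: the function names, crop sizes, output resolution, and average-pooling downsampler are all hypothetical choices made for this example.

```python
import numpy as np


def uniform_tokens(image, patch=16):
    """Split an (H, W, C) image into non-overlapping patch tokens."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    # Result shape: (num_patches, patch * patch * C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)


def foveated_tokens(image, gaze_xy, crop_sizes=(64, 128, 256), out=64, patch=16):
    """Hypothetical foveation: concentric square crops around the gaze point,
    each average-pooled to `out` x `out` pixels and then patchified.

    Small crops keep full resolution near the gaze; large crops cover the
    periphery at reduced resolution. Assumes every crop size is a multiple
    of `out` and no larger than the image.
    """
    h, w, c = image.shape
    tokens = []
    for size in crop_sizes:
        # Clamp the crop origin so the crop stays inside the image.
        x0 = int(np.clip(gaze_xy[0] - size // 2, 0, w - size))
        y0 = int(np.clip(gaze_xy[1] - size // 2, 0, h - size))
        crop = image[y0 : y0 + size, x0 : x0 + size]
        f = size // out  # integer downsampling factor
        pooled = crop.reshape(out, f, out, f, c).mean(axis=(1, 3))
        tokens.append(uniform_tokens(pooled, patch))
    return np.concatenate(tokens, axis=0)


image = np.random.rand(256, 256, 3)
uniform = uniform_tokens(image)
foveated = foveated_tokens(image, gaze_xy=(100, 140))
print(uniform.shape[0], foveated.shape[0])
```

Under these assumed settings, a 256 x 256 image yields 256 uniform tokens but only 48 foveated tokens (three crop levels of 16 tokens each), while the innermost 64 x 64 crop around the gaze point is kept at full resolution.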
