Abstract
Training robots directly from human videos is an emerging area in robotics and computer vision. While there has been notable progress with two-fingered grippers, learning autonomous tasks for multi-fingered robot hands in this way remains challenging. A key reason for this difficulty is that a policy trained on human hands may not directly transfer to a robot hand due to morphology differences. In this work, we present HuDOR, a technique that enables online fine-tuning of policies by directly computing rewards from human videos. Importantly, this reward function is built using object-oriented trajectories derived from off-the-shelf point trackers, providing meaningful learning signals despite the morphology gap and visual differences between human and robot hands. Given a single video of a human solving a task, such as gently opening a music box, HuDOR enables our four-fingered Allegro hand to learn the task with just an hour of online interaction. Our experiments across four tasks show that HuDOR achieves a 4x improvement over baselines. Code and videos are available on our website, https://object-rewards.github.io.
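
To make the idea of an object-oriented trajectory reward concrete, the following is a minimal sketch and not the HuDOR implementation. It assumes a point tracker has already produced, for each video frame, a list of 2D points on the object, and it scores a robot rollout by how closely the motion of the object's centroid matches the motion in the human video. The function names, the centroid summary, and the negative-RMSE form are illustrative assumptions.

```python
from math import sqrt


def centroid_trajectory(tracks):
    """Summarize object motion as centroid displacement from the first frame.

    tracks: list over frames; each frame is a list of (x, y) tracked points
    on the object (assumed to come from an off-the-shelf point tracker).
    """
    centroids = []
    for points in tracks:
        n = len(points)
        cx = sum(p[0] for p in points) / n
        cy = sum(p[1] for p in points) / n
        centroids.append((cx, cy))
    x0, y0 = centroids[0]
    # Displacements are relative to frame 0, so the object's absolute
    # position in the image does not affect the comparison.
    return [(cx - x0, cy - y0) for cx, cy in centroids]


def trajectory_reward(human_tracks, robot_tracks):
    """Hypothetical reward: negative RMSE between the two object motions."""
    human = centroid_trajectory(human_tracks)
    robot = centroid_trajectory(robot_tracks)
    if len(human) != len(robot):
        raise ValueError("trajectories must have the same number of frames")
    squared_errors = [
        (hx - rx) ** 2 + (hy - ry) ** 2
        for (hx, hy), (rx, ry) in zip(human, robot)
    ]
    return -sqrt(sum(squared_errors) / len(squared_errors))


# Toy example: the object moves one pixel to the right in the human video.
human = [[(0.0, 0.0), (2.0, 0.0)], [(1.0, 0.0), (3.0, 0.0)]]
# Robot rollout with the same object motion at a different image location.
robot_good = [[(10.0, 5.0), (12.0, 5.0)], [(11.0, 5.0), (13.0, 5.0)]]
# Robot rollout in which the object never moves.
robot_static = [[(10.0, 5.0), (12.0, 5.0)], [(10.0, 5.0), (12.0, 5.0)]]

reward_good = trajectory_reward(human, robot_good)
reward_static = trajectory_reward(human, robot_static)
```

Because the sketch compares only what happens to the object, it never looks at the hand itself, which is why a reward of this kind can remain informative despite the morphology gap. A rollout that reproduces the object's motion scores higher than one that leaves the object untouched.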