Research on Inverse Reinforcement Learning (IRL) from third-person videos has shown encouraging results on removing the need for manual reward design for robotic tasks. However, most prior works are still limited by training from a relatively restricted domain of videos. In this paper, we argue that the true potential of third-person IRL lies in increasing the diversity of videos for better scaling. To learn a reward function from diverse videos, we propose to perform graph abstraction on the videos followed by temporal matching in the graph space to measure the task progress. Our insight is that a task can be described by entity interactions that form a graph, and this graph abstraction can help remove irrelevant information such as textures, resulting in more robust reward functions. We evaluate our approach, GraphIRL, on cross-embodiment learning in X-MAGICAL and learning from human demonstrations for real-robot manipulation. We show significant improvements in robustness to diverse video demonstrations over previous approaches, and even achieve better results than manual reward design on a real robot pushing task. Videos are available at https://sateeshkumar21.github.io/GraphIRL.