Abstract
A range of video modeling tasks, from optical flow to multiple object tracking, share the same fundamental challenge: establishing space-time correspondence. Yet, approaches that dominate each space differ. We take a step towards bridging this gap by extending the recent contrastive random walk formulation to much denser, pixel-level space-time graphs. The main contribution is introducing hierarchy into the search problem by computing the transition matrix between two frames in a coarse-to-fine manner, forming a multiscale contrastive random walk when extended in time. This establishes a unified technique for self-supervised learning of optical flow, keypoint tracking, and video object segmentation. Experiments demonstrate that, for each of these tasks, the unified model achieves performance competitive with strong self-supervised approaches specific to that task. Project site: https://jasonbian97.github.io/flowwalk
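
For readers unfamiliar with the contrastive random walk that this work builds on, the following is a minimal single-scale sketch in NumPy. It is illustrative only and is not the authors' implementation: the function names `transition_matrix` and `cycle_loss`, the temperature value, and the use of cosine similarity are assumptions, and the coarse-to-fine hierarchy described above is not shown. Each frame is treated as a set of node features, a row-stochastic transition matrix is formed between consecutive frames, and a walk along a palindrome of frames is penalised when it fails to return to its starting node.

```python
import numpy as np


def transition_matrix(feat_a, feat_b, temperature=0.07):
    """Row-stochastic transition matrix from nodes of frame A to nodes of frame B.

    feat_a, feat_b: arrays of shape (num_nodes, feature_dim).
    The temperature value is an assumed hyperparameter, not taken from the paper.
    """
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def cycle_loss(feats, temperature=0.07):
    """Cycle-consistency loss for a walk forward through the frames and back.

    feats: list of per-frame feature arrays, all with the same number of nodes.
    """
    # Palindrome of frames, e.g. [f0, f1, f2] -> [f0, f1, f2, f1, f0].
    palindrome = feats + feats[-2::-1]
    walk = np.eye(feats[0].shape[0])
    for src, dst in zip(palindrome[:-1], palindrome[1:]):
        walk = walk @ transition_matrix(src, dst, temperature)
    # Each node should return to itself: cross-entropy against the identity.
    return -np.mean(np.log(np.diag(walk) + 1e-12))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = [rng.normal(size=(16, 8)) for _ in range(3)]
    print("cycle loss on random features:", cycle_loss(frames))
```

In this sketch the transition matrix is dense over all node pairs, which becomes expensive at pixel-level resolution; that cost is the motivation for computing it in a coarse-to-fine manner.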