Visual Pre-Training on Unlabeled Images using Reinforcement Learning

Abstract

In reinforcement learning (RL), value-based algorithms learn to associate each observation with the states and rewards that are likely to be reached from it. We observe that many self-supervised image pre-training methods bear similarity to this formulation: learning features that associate crops of images with those of nearby views, e.g., by taking a different crop or color augmentation. In this paper, we complete this analogy and explore a method that directly casts pre-training on unlabeled image data like web crawls and video frames as an RL problem. We train a general value function in a dynamical system where an agent transforms an image by changing the view or adding image augmentations. Learning in this way resembles crop-consistency self-supervision, but through the reward function, offers a simple lever to shape feature learning using curated images or weakly labeled captions when they exist. Our experiments demonstrate improved representations when training on unlabeled images in the wild, including video data like EpicKitchens, scene data like COCO, and web-crawl data like CC12M.
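To make the analogy concrete, the following is a toy sketch of value learning over image views, not the method used in the paper. It assumes a tiny grayscale image stored as a nested list, treats a random crop as the agent's action, and runs a plain TD(0) update on a linear value function. The names `random_crop`, `features`, `reward`, `td_update`, and `train` are all hypothetical, and the brightness-based reward is only a stand-in for a signal derived from curated images or weak captions.

```python
import random

def random_crop(image, size, rng):
    """Agent 'action': choose a new view by cropping a size x size window."""
    h, w = len(image), len(image[0])
    top = rng.randrange(h - size + 1)
    left = rng.randrange(w - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]

def features(crop):
    """Hypothetical stand-in for a learned encoder: bias plus mean intensity."""
    n = sum(len(row) for row in crop)
    mean = sum(sum(row) for row in crop) / n
    return [1.0, mean]

def value(w, phi):
    """Linear value estimate V(s) = w . phi(s)."""
    return sum(wi * pi for wi, pi in zip(w, phi))

def reward(crop, threshold=0.5):
    """Illustrative reward (an assumption): 1 if the view is bright, else 0."""
    return 1.0 if features(crop)[1] > threshold else 0.0

def td_update(w, phi_s, r, phi_next, gamma=0.9, lr=0.5):
    """One TD(0) step: move V(s) toward r + gamma * V(s')."""
    delta = r + gamma * value(w, phi_next) - value(w, phi_s)
    new_w = [wi + lr * delta * pi for wi, pi in zip(w, phi_s)]
    return new_w, delta

def train(image, steps=100, size=2, seed=0):
    """Roll out random view changes on one image and fit the value weights."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    view = random_crop(image, size, rng)
    for _ in range(steps):
        next_view = random_crop(image, size, rng)
        r = reward(next_view)
        w, _ = td_update(w, features(view), r, features(next_view))
        view = next_view
    return w
```

In this sketch the reward function is the only place where outside supervision enters, which mirrors the "simple lever" the abstract describes: swapping `reward` changes what the value function, and hence the features behind it, is pushed to predict.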
