Abstract
Given new tasks with very little data, such as new classes in a classification problem or a domain shift in the input, the performance of modern vision systems degrades remarkably quickly. In this work, we illustrate how the neural network representations which underpin modern vision systems are subject to supervision collapse, whereby they lose any information that is not necessary for performing the training task, including information that may be necessary for transfer to new tasks or domains. We then propose two methods to mitigate this problem. First, we employ self-supervised learning to encourage general-purpose features that transfer better. Second, we propose a novel Transformer-based neural network architecture called CrossTransformers, which can take a small number of labeled images and an unlabeled query, find coarse spatial correspondence between the query and the labeled images, and then infer class membership by computing distances between spatially corresponding features. The result is a classifier that is more robust to task and domain shift, which we demonstrate via state-of-the-art performance on Meta-Dataset, a recent dataset for evaluating transfer from ImageNet to many other vision datasets.
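The classification step described above (attend from each query location to the labeled images, then compare spatially corresponding features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes pre-extracted feature maps flattened to a list of spatial positions, uses the raw features as keys, queries, and values in place of any learned projections, and averages squared Euclidean distances over positions. The function names `softmax`, `class_distances`, and `classify` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def class_distances(query_feats, support_feats):
    """Distance from a query to each class via soft spatial correspondence.

    query_feats:   array of shape (P, D), one feature per spatial position.
    support_feats: dict mapping class label -> array of shape (N, P, D),
                   the features of that class's N labeled images.
    Assumption: no learned projections; raw features act as keys and values.
    """
    d = query_feats.shape[-1]
    dists = {}
    for label, feats in support_feats.items():
        flat = feats.reshape(-1, d)                   # (N*P, D) all positions of the class
        scores = query_feats @ flat.T / np.sqrt(d)    # (P, N*P) scaled dot products
        attn = softmax(scores, axis=-1)               # soft correspondence per query position
        aligned = attn @ flat                         # (P, D) query-aligned class features
        sq = np.sum((query_feats - aligned) ** 2, axis=-1)
        dists[label] = float(sq.mean())               # average over spatial positions
    return dists

def classify(query_feats, support_feats):
    # Predict the class whose aligned features lie closest to the query.
    dists = class_distances(query_feats, support_feats)
    return min(dists, key=dists.get)

# Toy usage with synthetic features (hypothetical data, 2 images per class).
rng = np.random.default_rng(0)
support = {
    "a": rng.normal(0.0, 1.0, size=(2, 4, 8)),
    "b": rng.normal(5.0, 0.1, size=(2, 4, 8)),
}
query = support["a"][0] + 0.01 * rng.normal(size=(4, 8))
prediction = classify(query, support)
```

Because the aligned features are built separately for each query, the comparison is made between matching image regions rather than between whole-image embeddings, which is the property the abstract attributes to CrossTransformers.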