Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

  • 2023-01-19 18:59:01
  • Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas
  • 33


This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) predict several target blocks in the image, (b) sample target blocks with sufficiently large scale (occupying 15%-20% of the image), and (c) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/16 on ImageNet using 32 A100 GPUs in under 38 hours to achieve strong downstream performance across a wide range of tasks requiring various levels of abstraction, from linear classification to object counting and depth prediction.
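
The masking strategy described above can be illustrated with a minimal sketch at the level of patch indices. This is not the authors' implementation. The 15%-20% target scale comes from the abstract; everything else is an assumption made for illustration: the 14x14 patch grid (a 224-pixel image with 16-pixel patches), the choice of four target blocks, the aspect-ratio range, the 85%-100% context scale, and the step that removes target patches from the context so that prediction is not trivial. The function names `sample_block` and `sample_masks` are hypothetical.

```python
import math
import random


def sample_block(grid, scale_range, aspect_range, rng):
    """Sample one rectangular block of patches as a set of (row, col) indices."""
    scale = rng.uniform(*scale_range)    # fraction of the image the block covers
    aspect = rng.uniform(*aspect_range)  # block height divided by block width
    area = scale * grid * grid           # desired block area, in patches
    height = min(grid, max(1, round(math.sqrt(area * aspect))))
    width = min(grid, max(1, round(math.sqrt(area / aspect))))
    top = rng.randint(0, grid - height)  # randint is inclusive on both ends
    left = rng.randint(0, grid - width)
    return {(r, c)
            for r in range(top, top + height)
            for c in range(left, left + width)}


def sample_masks(grid=14, num_targets=4, seed=0):
    """Return (context, targets) patch-index sets for one image.

    Assumed values: grid size, number of targets, aspect ratios, context scale.
    Only the 15%-20% target scale is taken from the abstract.
    """
    rng = random.Random(seed)
    # (a) several target blocks, (b) each covering roughly 15%-20% of the image
    targets = [sample_block(grid, (0.15, 0.20), (0.75, 1.5), rng)
               for _ in range(num_targets)]
    # (c) one large, spatially spread context block (assumed 85%-100% scale)
    context = sample_block(grid, (0.85, 1.0), (1.0, 1.0), rng)
    # Assumption: drop target patches from the context so they must be predicted
    for target in targets:
        context -= target
    return context, targets


if __name__ == "__main__":
    context, targets = sample_masks()
    print("context patches:", len(context))
    print("target block sizes:", [len(t) for t in targets])
```

In a full training loop, the context patches would be fed to a context encoder and a predictor would regress the representations of each target block; that part is outside the scope of this sketch.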

