Abstract
This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) predict several target blocks in the image, (b) sample target blocks with sufficiently large scale (occupying 15%-20% of the image), and (c) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/16 on ImageNet using 32 A100 GPUs in under 38 hours to achieve strong downstream performance across a wide range of tasks requiring various levels of abstraction, from linear classification to object counting and depth prediction.
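The masking strategy described above can be illustrated with a minimal sketch of block sampling on a patch grid. Only the 15%-20% target scale and the use of several target blocks come from the text. The grid size (14x14, which assumes 224x224 inputs with 16x16 patches), the number of targets, the aspect-ratio range, the context scale, the removal of target patches from the context, and the function names `sample_block` and `sample_masks` are all assumptions made for illustration, not a description of the authors' implementation.

```python
import math
import random


def sample_block(grid, scale_range, aspect_range, rng):
    """Sample one rectangular block of patches on a grid x grid layout.

    Returns a set of (row, col) patch indices. The block covers roughly
    `scale` of the grid area, with height/width ratio `aspect`.
    """
    scale = rng.uniform(*scale_range)
    aspect = rng.uniform(*aspect_range)
    area = scale * grid * grid
    # Clamp height and width to the grid so the block always fits.
    h = max(1, min(grid, round(math.sqrt(area * aspect))))
    w = max(1, min(grid, round(math.sqrt(area / aspect))))
    top = rng.randint(0, grid - h)
    left = rng.randint(0, grid - w)
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}


def sample_masks(grid=14, num_targets=4, target_scale=(0.15, 0.20),
                 target_aspect=(0.75, 1.5), context_scale=(0.85, 1.0), seed=0):
    """Sample several target blocks and one large context block.

    All defaults except target_scale are illustrative assumptions.
    """
    rng = random.Random(seed)
    # (a), (b): several target blocks, each about 15%-20% of the image.
    targets = [sample_block(grid, target_scale, target_aspect, rng)
               for _ in range(num_targets)]
    # (c): one large square context block spanning most of the image.
    context = sample_block(grid, context_scale, (1.0, 1.0), rng)
    # Assumption: drop target patches from the context so the prediction
    # task cannot be solved by copying visible patches.
    for target in targets:
        context -= target
    return context, targets


if __name__ == "__main__":
    context, targets = sample_masks()
    print("context patches:", len(context))
    print("target patches:", [len(t) for t in targets])
```

In this sketch the context encoder would see only the patches in `context`, and a predictor would be asked to output representations for the patches in each entry of `targets`.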