Extreme Masking for Learning Instance and Distributed Visual Representations

Abstract

The paper presents a scalable approach for learning distributed representations over individual tokens and a holistic instance representation simultaneously. We use self-attention blocks to represent distributed tokens, followed by cross-attention blocks to aggregate the holistic instance. The core of the approach is the use of extremely large token masking (75%-90%) as the data augmentation for supervision. Our model, named ExtreMA, follows the plain BYOL approach where the instance representation from the unmasked subset is trained to predict that from the intact input. Learning requires the model to capture informative variations in an instance, instead of encouraging invariances. The paper makes three contributions: 1) Random masking is a strong and computationally efficient data augmentation for learning generalizable attention representations. 2) With multiple sampling per instance, extreme masking greatly speeds up learning and hungers for more data. 3) Distributed representations can be learned from the instance supervision alone, unlike per-token supervisions in masked modeling.
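
The two mechanical ingredients named in the abstract, random token masking at an extreme ratio with several samples per instance, and a BYOL-style objective between the masked-view and intact-input instance representations, can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names, the 196-token input (a 14x14 patch grid), the 0.8 mask ratio, and the 4 samples per instance are all assumptions chosen for the example, and the self-attention encoder, cross-attention aggregation, and target network are omitted entirely. The loss is the standard BYOL form, 2 - 2 * cosine similarity.

```python
import math
import random


def random_mask_indices(num_tokens, mask_ratio, rng):
    """Return the sorted indices of tokens left visible after random masking.

    Hypothetical helper: mask_ratio is the fraction of tokens dropped
    (the abstract reports ratios of 0.75 to 0.90).
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    num_keep = max(1, round(num_tokens * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_tokens), num_keep))


def multi_sample_masks(num_tokens, mask_ratio, num_samples, rng):
    """Draw several independent masks for one instance (multiple sampling)."""
    return [random_mask_indices(num_tokens, mask_ratio, rng)
            for _ in range(num_samples)]


def byol_loss(prediction, target):
    """BYOL-style loss between two vectors: 2 - 2 * cosine similarity."""
    dot = sum(p * t for p, t in zip(prediction, target))
    norm_p = math.sqrt(sum(p * p for p in prediction))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 2.0 - 2.0 * dot / (norm_p * norm_t)


# Assumed setup: 196 tokens, 80% masked, 4 masked views per instance.
rng = random.Random(0)
views = multi_sample_masks(num_tokens=196, mask_ratio=0.8, num_samples=4, rng=rng)
```

In a full pipeline, each entry of `views` would select the visible tokens fed to the online encoder, and `byol_loss` would compare the resulting instance representation with the one computed from all tokens. Because only a small fraction of tokens is encoded per view, several views of the same instance can be processed for roughly the cost of one intact input, which is the efficiency argument the abstract makes.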
