Efficient Vision-Language Pre-training by Cluster Masking

Abstract

We propose a simple strategy for masking image patches during visual-language contrastive learning that improves the quality of the learned representations and the training speed. During each iteration of training, we randomly mask clusters of visually similar image patches, as measured by their raw pixel intensities. This provides an extra learning signal, beyond the contrastive training itself, since it forces a model to predict words for masked visual structures solely from context. It also speeds up training by reducing the amount of data used in each image. We evaluate the effectiveness of our model by pre-training on a number of benchmarks, finding that it outperforms other masking strategies, such as FLIP, on the quality of the learned representation.
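To make the masking step concrete, the following is a minimal sketch of one way cluster masking could be implemented, not the authors' implementation. It assumes that similarity is the cosine similarity between flattened raw-pixel patch vectors, that clusters are grown around randomly chosen anchor patches, and that a patch joins a cluster when its similarity to an anchor exceeds a fixed threshold. The names `patchify`, `cluster_mask`, `anchor_ratio`, and `threshold`, and their default values, are hypothetical choices made for illustration only.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    # Crop to a multiple of the patch size, then regroup pixels by patch.
    x = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)


def cluster_mask(image, patch=16, anchor_ratio=0.05, threshold=0.75, rng=None):
    """Return a boolean vector over patches; True means the patch is masked.

    Assumed procedure (illustrative only): pick random anchor patches, then
    mask every patch whose raw-pixel cosine similarity to any anchor is at
    least `threshold`.
    """
    rng = np.random.default_rng() if rng is None else rng
    patches = patchify(np.asarray(image, dtype=np.float64), patch)

    # L2-normalize raw pixel vectors so a dot product is a cosine similarity.
    norms = np.linalg.norm(patches, axis=1, keepdims=True)
    feats = patches / np.maximum(norms, 1e-8)

    n = feats.shape[0]
    n_anchors = max(1, int(round(anchor_ratio * n)))
    anchors = rng.choice(n, size=n_anchors, replace=False)

    # Similarity of every anchor to every patch: shape (n_anchors, n).
    sim = feats[anchors] @ feats.T
    mask = (sim >= threshold).any(axis=0)
    mask[anchors] = True  # anchors are always part of their own cluster
    return mask
```

In a training loop under these assumptions, only the unmasked patches, for example `patchify(image, 16)[~mask]`, would be passed to the image encoder, which is where the reduction in per-image data would come from. Because cluster sizes depend on image content, the number of kept patches varies per image, so a batched implementation would need padding or a similar mechanism.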
