PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

Abstract

This paper explores a better codebook for BERT pre-training of vision transformers. The recent work BEiT successfully transfers BERT pre-training from NLP to the vision field. It directly adopts a simple discrete VAE (dVAE) as the visual tokenizer, but does not consider the semantic level of the resulting visual tokens. By contrast, the discrete tokens in NLP are naturally highly semantic. This difference motivates us to learn a perceptual codebook, and we find one simple yet surprisingly effective idea: enforcing perceptual similarity during dVAE training. We demonstrate that the visual tokens generated by the proposed perceptual codebook exhibit better semantic meaning, and subsequently help pre-training achieve superior transfer performance on various downstream tasks. For example, we achieve 84.5% Top-1 accuracy on ImageNet-1K with a ViT-B backbone, outperforming the competitive method BEiT by +1.3 with the same number of pre-training epochs. PeCo also improves object detection and segmentation on COCO val by +1.3 box AP and +1.0 mask AP, and semantic segmentation on ADE20k by +1.0 mIoU. The code and models will be available at https://github.com/microsoft/PeCo.
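
As a rough illustration of what "enforcing perceptual similarity" during tokenizer training could look like, the sketch below adds a feature-space distance to a pixel reconstruction loss. The abstract does not specify the feature network, the distance, or the weighting, so everything here is an assumption: `make_feature_extractor` is a hypothetical stand-in (fixed random linear layers) for a frozen pretrained network, `perceptual_weight` is an illustrative parameter, and the dVAE's own quantization terms are omitted.

```python
import numpy as np


def make_feature_extractor(in_dim, feat_dims, seed=0):
    """Hypothetical stand-in for a frozen pretrained network.

    Builds fixed random linear layers with ReLU and returns a function
    that maps a batch of flattened images to a list of per-layer features.
    """
    rng = np.random.default_rng(seed)
    weights = []
    d = in_dim
    for out_dim in feat_dims:
        weights.append(rng.standard_normal((d, out_dim)) / np.sqrt(d))
        d = out_dim

    def extract(x):
        feats = []
        h = x
        for w in weights:
            h = np.maximum(h @ w, 0.0)  # linear layer followed by ReLU
            feats.append(h)
        return feats

    return extract


def perceptual_loss(extract, x, x_rec):
    """Sum over layers of the mean squared feature difference (assumed form)."""
    return sum(
        float(np.mean((fa - fb) ** 2))
        for fa, fb in zip(extract(x), extract(x_rec))
    )


def total_loss(extract, x, x_rec, perceptual_weight=1.0):
    """Pixel-level MSE plus a weighted perceptual term (assumed combination)."""
    pixel = float(np.mean((x - x_rec) ** 2))
    return pixel + perceptual_weight * perceptual_loss(extract, x, x_rec)
```

In this sketch, a reconstruction that matches the input exactly has zero perceptual loss, and setting `perceptual_weight=0.0` recovers a plain pixel-level objective, which makes the contribution of the added term easy to isolate.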
