Abstract
This paper presents a new perspective on self-supervised learning based on extending the heat equation into a high-dimensional feature space. In particular, we remove time dependence by imposing the steady-state condition, and extend the remaining 2D Laplacian from isotropic in x and y to linearly correlated. Furthermore, we simplify it by splitting the x and y axes into two first-order linear differential equations. This simplification explicitly models the spatial invariance along the horizontal and vertical directions separately, supporting prediction across image blocks. This introduces a very simple masked image modeling (MIM) method, named QB-Heat. QB-Heat leaves a single block, one quarter of the image in size, unmasked and linearly extrapolates the other three masked quarters. It brings MIM to CNNs without bells and whistles, and even works well for pre-training light-weight networks that are suitable for both image classification and object detection without fine-tuning. Compared with MoCo-v2 on pre-training a Mobile-Former with 5.8M parameters and 285M FLOPs, QB-Heat is on par in linear probing on ImageNet, but clearly outperforms it in non-linear probing, which adds a transformer block before the linear classifier (65.6% vs. 52.9%). When transferring to object detection with a frozen backbone, QB-Heat outperforms MoCo-v2 and supervised pre-training on ImageNet by 7.9 and 4.5 AP, respectively. This work provides an insightful hypothesis on the invariance within visual representation over different shapes and textures: the linear relationship between horizontal and vertical derivatives. The code will be publicly released.
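
To make the two ingredients named above concrete, the following is a minimal illustrative sketch, not the released implementation. It assumes a patch grid (for example 14 x 14), a visible block placed fully inside the grid without wrap-around, and a first-order finite-difference step for an equation of the form dz/dx = A z. The function names `quarter_block_mask` and `extrapolate_step`, the matrix `A`, and the step size `delta` are hypothetical names chosen for illustration; the abstract does not specify them.

```python
import numpy as np


def quarter_block_mask(grid_h, grid_w, top, left):
    """Boolean mask over a patch grid: True = masked, False = visible.

    Assumption: the visible block spans half the grid in each direction
    (one quarter of the area) and lies fully inside the grid.
    """
    block_h, block_w = grid_h // 2, grid_w // 2
    if not (0 <= top <= grid_h - block_h and 0 <= left <= grid_w - block_w):
        raise ValueError("visible block must lie inside the grid")
    mask = np.ones((grid_h, grid_w), dtype=bool)
    mask[top:top + block_h, left:left + block_w] = False
    return mask


def extrapolate_step(z, A, delta):
    """One first-order step of dz/dx = A z: z(x + delta) ~ (I + delta * A) z(x).

    Assumption: A is a square matrix acting on the feature vector z.
    """
    return z + delta * (A @ z)


# Usage: keep the top-right quarter of a 14 x 14 patch grid visible.
mask = quarter_block_mask(14, 14, top=0, left=7)

# Usage: push a 2-dimensional feature one half-step along x.
z_next = extrapolate_step(np.array([1.0, 2.0]), np.eye(2), delta=0.5)
```

The same step applied along y with a second matrix would give the vertical counterpart; how the actual method parameterizes and learns these linear maps is not stated in the abstract.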