Self-supervised learning has shown superior performance over supervised methods on various vision benchmarks. The siamese network, which encourages embeddings to be invariant to distortions, is one of the most successful self-supervised visual representation learning approaches. Among all the augmentation methods, masking is the most general and straightforward: it has the potential to be applied to all kinds of input and requires the least amount of domain knowledge. However, masked siamese networks require a particular inductive bias and, in practice, work well only with Vision Transformers. This work empirically studies the problems behind masked siamese networks with ConvNets. We propose several empirical designs to overcome these problems gradually. Our method performs competitively on low-shot image classification and outperforms previous methods on object detection benchmarks. We discuss several remaining issues and hope this work can provide useful data points for future general-purpose self-supervised learning.