The objective of this paper is self-supervised representation learning, with the goal of solving semi-supervised video object segmentation (a.k.a. dense tracking). We make the following contributions: (i) we propose to improve the existing self-supervised approach with a simple yet more effective memory mechanism for long-term correspondence matching, which resolves the challenge caused by the disappearance and reappearance of objects; (ii) by augmenting the self-supervised approach with an online adaptation module, our method successfully alleviates tracker drift caused by spatial-temporal discontinuity, e.g. occlusions or dis-occlusions, and fast motions; (iii) we explore the data efficiency of self-supervised representation learning for dense tracking; surprisingly, we show that a powerful tracking model can be trained with as few as 100 raw video clips (equivalent to a duration of 11 minutes), indicating that low-level statistics are already effective for tracking tasks; (iv) we demonstrate state-of-the-art results among self-supervised approaches on DAVIS-2017 and YouTube-VOS, and surpass most methods trained with millions of manual segmentation annotations, further bridging the gap between self-supervised and supervised learning. Code is released to foster further research (https://github.com/fangruizhu/self_sup_semiVOS).