Contrastive self-supervised learning has shown impressive results in learning visual representations from unlabeled images by enforcing invariance to different data augmentations. However, the learned representations are often contextually biased toward spurious scene correlations between different objects or between object and background, which may harm their generalization to downstream tasks. To tackle this issue, we develop a novel object-aware contrastive learning framework that first (a) localizes objects in a self-supervised manner and then (b) debiases scene correlations via appropriate data augmentations that take the inferred object locations into account. For (a), we propose the contrastive class activation map (ContraCAM), which uses contrastively trained models to find the regions of an image (e.g., objects) that are most discriminative relative to other images. We further improve ContraCAM to detect multiple objects and entire shapes via an iterative refinement procedure. For (b), we introduce two data augmentations based on ContraCAM, object-aware random crop and background mixup, which reduce contextual and background biases, respectively, during contrastive self-supervised learning. Our experiments demonstrate the effectiveness of our representation learning framework, particularly when trained on multi-object images or evaluated on background-shifted (and distribution-shifted) images.
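As a rough illustration of the two augmentations named above, the following minimal NumPy sketch shows one plausible reading of each. The abstract does not give the exact formulas, so everything here is an assumption: the function names, the soft mask `mask` standing in for a ContraCAM output, the linear blend used for background mixup, the bounding box `box`, and the `min_scale` parameter are all hypothetical.

```python
import numpy as np


def background_mixup(image, background, mask):
    """Blend `background` into the non-object regions of `image`.

    image, background: float arrays of shape (H, W, C).
    mask: float array of shape (H, W) with values in [0, 1], where 1 marks
          object pixels (assumed here to come from an object localizer).
    The linear blend below is an assumed form, not the paper's definition.
    """
    m = mask[..., None]  # add a channel axis so the mask broadcasts over C
    return m * image + (1.0 - m) * background


def object_aware_random_crop(image, box, rng, min_scale=0.5):
    """Sample a random crop that lies entirely inside `box`.

    box: (top, left, bottom, right) in pixels, with exclusive bottom/right.
    min_scale: hypothetical lower bound on crop side relative to the box side.
    """
    top, left, bottom, right = box
    box_h, box_w = bottom - top, right - left
    # Crop sizes are drawn between min_scale * box side and the full box side.
    crop_h = int(rng.integers(max(1, int(min_scale * box_h)), box_h + 1))
    crop_w = int(rng.integers(max(1, int(min_scale * box_w)), box_w + 1))
    # Top-left corner is chosen so the crop stays within the box.
    y0 = int(rng.integers(top, bottom - crop_h + 1))
    x0 = int(rng.integers(left, right - crop_w + 1))
    return image[y0:y0 + crop_h, x0:x0 + crop_w]
```

In this sketch, restricting crops to a single object's box is meant to keep two views of the same image from pairing different objects, while the blend replaces background pixels so that the object is seen against varied contexts.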