Labeling videos at scale is impractical. Consequently, self-supervised visual representation learning is key for efficient video analysis. Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge. However, when applied to real-world videos, contrastive learning can inadvertently separate instances that contain semantically similar events. In our work, we introduce a cooperative variant of contrastive learning that utilizes complementary information across views to address this issue. We use data-driven sampling to leverage implicit relationships between multiple input video views, whether observed (e.g., RGB) or inferred (e.g., flow, segmentation masks, poses). We are among the first to exploit inter-instance relationships to drive learning. We experimentally evaluate our representations on the downstream task of action recognition. Our method achieves competitive performance on standard benchmarks (UCF101, HMDB51, Kinetics400). Furthermore, qualitative experiments illustrate that our models can capture higher-order class relationships.