A large part of the current success of deep learning lies in the effectiveness of data, more precisely, of labelled data. Yet labelling a dataset with human annotation continues to carry high costs, especially for videos. While in the image domain recent methods have made it possible to generate meaningful (pseudo-)labels for unlabelled datasets without supervision, this development is missing for the video domain, where learning feature representations is the current focus. In this work, we a) show that unsupervised labelling of a video dataset does not come for free from strong feature encoders and b) propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations, by leveraging the natural correspondence between the audio and visual modalities. An extensive analysis shows that the resulting clusters have high semantic overlap with ground-truth human labels. We further introduce the first benchmarking results on unsupervised labelling of the common video datasets Kinetics, Kinetics-Sound, VGG-Sound and AVE.