Abstract
Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation learning model for music understanding. Unlike previous studies that adopt random projection or an existing neural codec, the proposed model, named MuQ, is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ). Mel-RVQ uses a residual linear projection structure to quantize the Mel spectrum, which improves the stability and efficiency of target extraction and leads to better performance. Experiments on a wide variety of downstream tasks demonstrate that MuQ outperforms previous self-supervised music representation models with only 0.9K hours of open-source pre-training data. Scaling up the data to over 160K hours and adopting iterative training consistently improve model performance. To further validate the strength of our model, we present MuQ-MuLan, a joint music-text embedding model based on contrastive learning, which achieves state-of-the-art performance on the zero-shot music tagging task on the MagnaTagATune dataset. Code and checkpoints are open-sourced at https://github.com/tencent-ailab/MuQ.
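
To illustrate the residual quantization idea behind Mel-RVQ, the following is a minimal sketch under stated assumptions, not the released implementation. It assumes that each stage linearly projects the current residual of a Mel frame, takes the index of the nearest codebook entry as a token, and subtracts a linear reconstruction of that entry before the next stage. The function names (`make_stage`, `residual_quantize`), the dimensions, and the random (untrained) weights are hypothetical and chosen only for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def nearest_index(codebook, point):
    """Return the index of the codebook entry closest to `point` (squared L2)."""
    def sq_dist(code):
        return sum((c - p) ** 2 for c, p in zip(code, point))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def make_stage(rng, mel_dim, code_dim, codebook_size):
    """Build one quantization stage with random weights (assumption: untrained)."""
    def rand_matrix(rows, cols):
        return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]
    return {
        "encode": rand_matrix(code_dim, mel_dim),      # Mel space -> code space
        "decode": rand_matrix(mel_dim, code_dim),      # code space -> Mel space
        "codebook": rand_matrix(codebook_size, code_dim),
    }


def residual_quantize(frame, stages):
    """Quantize one Mel frame into one token per stage.

    Each stage quantizes what the previous stages failed to explain.
    Returns the token list and the final residual.
    """
    residual = list(frame)
    tokens = []
    for stage in stages:
        latent = matvec(stage["encode"], residual)
        index = nearest_index(stage["codebook"], latent)
        tokens.append(index)
        approx = matvec(stage["decode"], stage["codebook"][index])
        residual = [r - a for r, a in zip(residual, approx)]
    return tokens, residual


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical sizes: 8 Mel bins, 4-dim codes, 16 entries, 3 residual stages.
    stages = [make_stage(rng, mel_dim=8, code_dim=4, codebook_size=16) for _ in range(3)]
    frame = [rng.random() for _ in range(8)]
    tokens, _ = residual_quantize(frame, stages)
    print(tokens)  # one token index per stage
```

In an SSL setup of the kind described above, token sequences like these would serve as prediction targets for the representation model. Because every stage consists only of linear maps and a codebook lookup, target extraction stays lightweight compared with running a full neural codec.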