Codified audio language modeling learns useful representations for music information retrieval

Abstract

We demonstrate that language models pre-trained on codified (discretely-encoded) music audio learn representations that are useful for downstream MIR tasks. Specifically, we explore representations from Jukebox (Dhariwal et al. 2020): a music generation system containing a language model trained on codified audio from 1M songs. To determine if Jukebox's representations contain useful information for MIR, we use them as input features to train shallow models on several MIR tasks. Relative to representations from conventional MIR models which are pre-trained on tagging, we find that using representations from Jukebox as input features yields 30% stronger performance on average across four MIR tasks: tagging, genre classification, emotion recognition, and key detection. For key detection, we observe that representations from Jukebox are considerably stronger than those from models pre-trained on tagging, suggesting that pre-training via codified audio language modeling may address blind spots in conventional approaches. We interpret the strength of Jukebox's representations as evidence that modeling audio instead of tags provides richer representations for MIR.
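
The probing setup described above (frozen representations used as input features to a shallow model) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the features are synthetic Gaussian vectors standing in for pre-extracted representations, the task is a hypothetical two-class problem, and the probe is a plain logistic regression trained by batch gradient descent. It is not the paper's pipeline and does not use Jukebox.

```python
import math
import random


def make_synthetic_features(n_per_class, dim, rng):
    """Hypothetical stand-in for frozen, pre-extracted representations."""
    features, labels = [], []
    for label in (0, 1):
        center = 1.0 if label == 1 else -1.0
        for _ in range(n_per_class):
            features.append([rng.gauss(center, 1.0) for _ in range(dim)])
            labels.append(label)
    return features, labels


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(features, labels, epochs=200, lr=0.1):
    """Fit a logistic-regression probe; the input features are never updated."""
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = sigmoid(z) - y  # gradient of the log loss w.r.t. z
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Fraction of examples where the probe's decision matches the label."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((1 if z > 0 else 0) == y)
    return correct / len(labels)


def run_probe_demo(seed=0, dim=8):
    """Train on one synthetic split and evaluate on a held-out one."""
    rng = random.Random(seed)
    train_x, train_y = make_synthetic_features(100, dim, rng)
    test_x, test_y = make_synthetic_features(100, dim, rng)
    weights, bias = train_linear_probe(train_x, train_y)
    return weights, probe_accuracy(weights, bias, test_x, test_y)


if __name__ == "__main__":
    _, accuracy = run_probe_demo()
    print(f"held-out probe accuracy: {accuracy:.3f}")
```

Because only the probe's weights are trained, held-out accuracy in a setup like this reflects how much task-relevant information the fixed features already carry, which is the sense in which the abstract compares representations.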
