Abstract
Although audio generation shares commonalities across different types ofaudio, such as speech, music, and sound effects, designing models for each typerequires careful consideration of specific objectives and biases that cansignificantly differ from those of other types. To bring us closer to a unifiedperspective of audio generation, this paper proposes a framework that utilizesthe same learning method for speech, music, and sound effect generation. Ourframework introduces a general representation of audio, called language ofaudio (LOA). Any audio can be translated into LOA based on AudioMAE, aself-supervised pre-trained representation learning model. In the generationprocess, we translate any modalities into LOA by using a GPT-2 model, and weperform self-supervised audio generation learning with a latent diffusion modelconditioned on LOA. The proposed framework naturally brings advantages such asin-context learning abilities and reusable self-supervised pretrained AudioMAEand latent diffusion models. Experiments on the major benchmarks oftext-to-audio, text-to-music, and text-to-speech demonstrate newstate-of-the-art or competitive performance to previous approaches. Our demoand code are available at https://audioldm.github.io/audioldm2.