Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image Generation

Abstract

Training audio-to-image generative models requires an abundance of diverse audio-visual pairs that are semantically aligned. Such data is almost always curated from in-the-wild videos, given the cross-modal semantic correspondence that is inherent to them. In this work, we hypothesize that insisting on the absolute need for ground truth audio-visual correspondence is not only unnecessary, but also leads to severe restrictions in the scale, quality, and diversity of the data, ultimately impairing its use in modern generative models. To this end, we propose a scalable image sonification framework in which instances from a variety of high-quality yet disjoint uni-modal origins can be artificially paired through a retrieval process powered by the reasoning capabilities of modern vision-language models. To demonstrate the efficacy of this approach, we use our sonified images to train an audio-to-image generative model that performs competitively against the state of the art. Finally, through a series of ablation studies, we exhibit several intriguing auditory capabilities, such as semantic mixing and interpolation, loudness calibration, and acoustic space modeling through reverberation, that our model has implicitly developed to guide the image generation process.
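
As a rough illustration of retrieval-based pairing, the sketch below matches each image to its most similar audio clip by cosine similarity. It assumes that images and audio clips have already been mapped into a shared embedding space (for instance by embedding textual descriptions of each); the abstract does not specify this step, so the embeddings, the function names `cosine` and `pair_images_with_audio`, and the toy data are all hypothetical and should not be read as the framework's actual pipeline.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_images_with_audio(image_embs, audio_embs):
    """For each image embedding, retrieve the index of the closest audio embedding.

    Returns a list of (image_index, audio_index, similarity) tuples.
    Both inputs are assumed to live in the same embedding space.
    """
    pairs = []
    for i, img in enumerate(image_embs):
        scores = [cosine(img, aud) for aud in audio_embs]
        j = max(range(len(scores)), key=scores.__getitem__)
        pairs.append((i, j, scores[j]))
    return pairs


# Toy, hand-made 2-D embeddings (hypothetical data, for illustration only).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
audio_embs = [[0.1, 0.9], [0.9, 0.1], [0.7, 0.7]]

pairs = pair_images_with_audio(image_embs, audio_embs)
for image_idx, audio_idx, score in pairs:
    print(f"image {image_idx} <- audio {audio_idx} (similarity {score:.3f})")
```

In this toy setting the first image retrieves the second audio clip and the second image retrieves the first. Because no ground truth audio-visual pair is involved, the image and audio collections can come from entirely separate sources.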
