Abstract
Objects produce different sounds when hit, and humans can intuitively infer how an object might sound based on its appearance and material properties. Inspired by this intuition, we propose Visual Acoustic Fields, a framework that bridges hitting sounds and visual signals within a 3D space using 3D Gaussian Splatting (3DGS). Our approach features two key modules: sound generation and sound localization. The sound generation module leverages a conditional diffusion model, which takes multiscale features rendered from a feature-augmented 3DGS to generate realistic hitting sounds. Meanwhile, the sound localization module enables querying the 3D scene, represented by the feature-augmented 3DGS, to localize hitting positions based on the sound sources. To support this framework, we introduce a novel pipeline for collecting scene-level visual-sound sample pairs, achieving alignment between captured images, impact locations, and corresponding sounds. To the best of our knowledge, this is the first dataset to connect visual and acoustic signals in a 3D context. Extensive experiments on our dataset demonstrate the effectiveness of Visual Acoustic Fields in generating plausible impact sounds and accurately localizing impact sources. Our project page is at https://yuelei0428.github.io/projects/Visual-Acoustic-Fields/.