Abstract
Audio texture manipulation involves modifying the perceptual characteristics of a sound to achieve specific transformations, such as adding, removing, or replacing auditory elements. In this paper, we propose an exemplar-based analogy model for audio texture manipulation. Instead of conditioning on text-based instructions, our method uses paired speech examples, where one clip represents the original sound and another illustrates the desired transformation. The model learns to apply the same transformation to new input, allowing for the manipulation of sound textures. We construct a quadruplet dataset representing various editing tasks, and train a latent diffusion model in a self-supervised manner. We show through quantitative evaluations and perceptual studies that our model outperforms text-conditioned baselines and generalizes to real-world, out-of-distribution, and non-speech scenarios. Project page: https://berkeley-speech-group.github.io/audio-texture-analogy/