Abstract
We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models through a temporally-aligned self-attention operation. Unlike prior work that uses feature extractors pretrained for other tasks for the conditioning signal, AV-Link can directly leverage features obtained by the complementary modality in a single framework, i.e., video features to generate audio, or audio features to generate video. We extensively evaluate our design choices and demonstrate the ability of our method to generate synchronized and high-quality audiovisual content, showcasing its potential for applications in immersive media generation. Project Page: snap-research.github.io/AVLink/
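
To make the idea of a temporally-aligned self-attention across modalities concrete, the following is a minimal sketch and not the AV-Link Fusion Block itself. It assumes that video and audio tokens covering the same clip are concatenated into one sequence, that each token is placed on a shared clock in seconds, and that alignment is encouraged by a time-distance penalty on the attention logits. The function names, the penalty, the parameter `tau`, and the absence of learned query/key/value projections are all hypothetical choices made for illustration.

```python
import math


def timestamps(num_tokens, duration_s):
    """Centre time (in seconds) of each token on a shared clock."""
    step = duration_s / num_tokens
    return [(i + 0.5) * step for i in range(num_tokens)]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def aligned_self_attention(video, audio, duration_s, tau=0.25):
    """Joint self-attention over concatenated video and audio tokens.

    Illustrative assumptions: raw tokens serve as queries, keys and
    values, and temporal alignment is a penalty |t_q - t_k| / tau
    subtracted from the scaled dot-product logits.
    """
    tokens = video + audio
    times = timestamps(len(video), duration_s) + timestamps(len(audio), duration_s)
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)

    weights = []
    for q, tq in zip(tokens, times):
        logits = [
            scale * sum(a * b for a, b in zip(q, k)) - abs(tq - tk) / tau
            for k, tk in zip(tokens, times)
        ]
        weights.append(softmax(logits))

    # Each fused token is a weighted mix of tokens from BOTH modalities.
    fused = [
        [sum(w * v[c] for w, v in zip(row, tokens)) for c in range(dim)]
        for row in weights
    ]
    n_v = len(video)
    return fused[:n_v], fused[n_v:], weights


if __name__ == "__main__":
    # Two video tokens and four audio tokens spanning the same 1 s clip.
    video = [[1.0, 0.0], [1.0, 0.0]]
    audio = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    fused_video, fused_audio, _ = aligned_self_attention(video, audio, 1.0)
    print(len(fused_video), len(fused_audio))
```

Because both modalities share one attention matrix, information flows in both directions in a single operation: video tokens read from nearby audio tokens and vice versa, which is the property the bidirectional exchange described above relies on.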