Abstract
Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: https://sites.google.com/view/lddbm/home.