Abstract
We address the challenge of relighting a single image or video, a task that demands precise scene intrinsic understanding and high-quality light transport synthesis. Existing end-to-end relighting models are often limited by the scarcity of paired multi-illumination data, restricting their ability to generalize across diverse scenes. Conversely, two-stage pipelines that combine inverse and forward rendering can mitigate data requirements but are susceptible to error accumulation and often fail to produce realistic outputs under complex lighting conditions or with sophisticated materials. In this work, we introduce a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models. This joint formulation enhances implicit scene comprehension and facilitates the creation of realistic lighting effects and intricate material interactions, such as shadows, reflections, and transparency. Trained on synthetic multi-illumination data and extensive automatically labeled real-world videos, our model demonstrates strong generalization across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency.