Abstract
Text-guided color editing in images and videos is a fundamental yet unsolved problem, requiring fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free methods offer broad applicability across editing tasks but struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. Our method modifies only the intended regions specified by the prompt, leaving unrelated areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performance in both edit quality and consistency. Furthermore, our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models like CogVideoX, our approach exhibits greater advantages, particularly in maintaining temporal coherence and editing stability. Finally, our method also generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.