Abstract
Diffusion models have achieved impressive success in generating photorealistic images, but challenges remain in ensuring precise semantic alignment with input prompts. Optimizing the initial noisy latent offers a more efficient alternative to modifying model architectures or prompt engineering for improving semantic alignment. A recent approach, InitNo, refines the initial noisy latent by leveraging attention maps; however, these maps capture only limited information, and the effectiveness of InitNo is highly dependent on the starting point, as it tends to converge on a local optimum near this point. To address these limitations, this paper proposes leveraging the language comprehension capabilities of large vision-language models (LVLMs) to guide the optimization of the initial noisy latent, and introduces the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency. Furthermore, we provide a theoretical analysis of the condition under which the update improves semantic faithfulness. Experimental results demonstrate the effectiveness and adaptability of our framework, which consistently enhances semantic alignment across various diffusion models. The code is available at https://github.com/Bomingmiao/NoiseDiffusion.