HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts

Abstract

The potential for higher-resolution image generation using pretrained diffusion models is immense, yet these models often struggle with object repetition and structural artifacts, especially when scaling to 4K resolution and higher. We find that the problem arises because a single prompt is not effective enough to guide generation across multiple scales. In response, we propose HiPrompt, a new tuning-free solution that tackles these problems by introducing hierarchical prompts, which offer both global and local guidance. Specifically, the global guidance comes from the user input that describes the overall content, while the local guidance uses patch-wise descriptions from MLLMs to guide regional structure and texture generation in detail. Furthermore, during the inverse denoising process, the generated noise is decomposed into low- and high-frequency spatial components. These components are conditioned on multiple prompt levels, including detailed patch-wise descriptions and broader image-level prompts, facilitating prompt-guided denoising under hierarchical semantic guidance. This further allows the generation to focus more on local spatial regions and ensures that the generated images maintain coherent local and global semantics, structures, and textures at high definition. Extensive experiments demonstrate that HiPrompt outperforms state-of-the-art works in higher-resolution image generation, significantly reducing object repetition and enhancing structural quality.
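
The frequency decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the box-blur low-pass filter, the kernel size `k`, the function names, and the choice to take the low band from an image-level (global) noise estimate and the high band from a patch-level (local) estimate are all assumptions made for illustration. The abstract states only that the two components are conditioned on multiple prompt levels.

```python
import numpy as np


def box_blur(x, k):
    """Low-pass filter: mean over a k x k window with edge padding (k must be odd)."""
    pad = k // 2
    padded = np.pad(x, pad, mode="edge")
    out = np.zeros_like(x, dtype=float)
    # Sum the k*k shifted views of the padded array, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / (k * k)


def split_frequencies(noise, k=5):
    """Split a 2D noise map into a low-frequency part and a high-frequency residual."""
    low = box_blur(noise, k)
    high = noise - low
    return low, high


def combine_hierarchical(noise_global, noise_local, k=5):
    """Hypothetical recombination (an assumption, not taken from the paper):
    low band from the global-prompt estimate, high band from the local-prompt estimate."""
    low, _ = split_frequencies(noise_global, k)
    _, high = split_frequencies(noise_local, k)
    return low + high
```

By construction the two bands sum back to the original map, so passing the same estimate for both arguments returns it unchanged. In an actual diffusion pipeline the two inputs would be noise predictions from the denoiser under different prompts, which this sketch does not model.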
