Fine-Grained Perturbation Guidance via Attention Head Selection

Abstract

Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
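
The interpolation that the abstract attributes to SoftPAG can be illustrated with a small sketch. This is not the authors' implementation. It assumes that each head's attention map is a square, row-stochastic matrix and that the interpolation takes the form (1 - u) * A + u * I for a strength u in [0, 1]. The function name, the parameter name u, and the nested-list layout are hypothetical choices made for illustration only.

```python
def soft_perturb_attention(attn, selected_heads, u):
    """Interpolate selected heads' attention maps toward the identity.

    attn:           list of heads, each a square matrix (list of rows)
                    whose rows sum to 1 (assumed layout, for illustration).
    selected_heads: set of head indices to perturb (hypothetical input).
    u:              interpolation strength in [0, 1]; 0 leaves the map
                    unchanged, 1 replaces it with the identity matrix.
    """
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must lie in [0, 1]")
    perturbed = []
    for h, head in enumerate(attn):
        if h in selected_heads:
            n = len(head)
            # (1 - u) * A + u * I, computed entry by entry.
            perturbed.append([
                [(1.0 - u) * head[i][j] + (u if i == j else 0.0)
                 for j in range(n)]
                for i in range(n)
            ])
        else:
            # Unselected heads are copied through untouched.
            perturbed.append([row[:] for row in head])
    return perturbed


# Two heads over two tokens; only head 0 is perturbed, at half strength.
attn = [
    [[0.5, 0.5], [0.25, 0.75]],
    [[0.5, 0.5], [0.5, 0.5]],
]
half = soft_perturb_attention(attn, {0}, 0.5)
```

Because both the original map and the identity have rows summing to 1, every interpolated row also sums to 1, so under these assumptions the perturbed map remains a valid attention distribution for any u in [0, 1].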
