Abstract
Text-to-image generation has seen groundbreaking advancements with diffusion models, enabling high-fidelity synthesis and precise image editing through cross-attention manipulation. Recently, autoregressive (AR) models have re-emerged as powerful alternatives, leveraging next-token generation to match the performance of diffusion models. However, existing editing techniques designed for diffusion models fail to translate directly to AR models due to fundamental differences in structural control. Specifically, AR models suffer from spatial poverty of attention maps and sequential accumulation of structural errors during image editing, which disrupt object layouts and global consistency. In this work, we introduce Implicit Structure Locking (ISLock), the first training-free editing strategy for AR visual models. Rather than relying on explicit attention manipulation or fine-tuning, ISLock preserves structural blueprints by dynamically aligning self-attention patterns with reference images through the Anchor Token Matching (ATM) protocol. By implicitly enforcing structural consistency in latent space, ISLock enables structure-aware editing while maintaining generative autonomy. Extensive experiments demonstrate that ISLock achieves high-quality, structure-consistent edits without additional training and is superior or comparable to conventional editing techniques. Our findings pave the way for efficient and flexible AR-based image editing, further bridging the performance gap between diffusion and autoregressive generative models. The code will be publicly available at https://github.com/hutaiHang/ATM