Abstract
Current generative models, such as autoregressive and diffusion approaches, decompose high-dimensional data distribution learning into a series of simpler subtasks. However, inherent conflicts arise during the joint optimization of these subtasks, and existing solutions fail to resolve such conflicts without sacrificing efficiency or scalability. We propose a novel equivariant image modeling framework that inherently aligns optimization targets across subtasks by leveraging the translation invariance of natural visual signals. Our method introduces (1) column-wise tokenization, which enhances translational symmetry along the horizontal axis, and (2) windowed causal attention, which enforces consistent contextual relationships across positions. Evaluated on class-conditioned ImageNet generation at 256×256 resolution, our approach achieves performance comparable to state-of-the-art AR models while using fewer computational resources. Systematic analysis demonstrates that enhanced equivariance reduces inter-task conflicts, significantly improving zero-shot generalization and enabling ultra-long image synthesis. This work establishes the first framework for task-aligned decomposition in generative modeling, offering insights into efficient parameter sharing and conflict-free optimization. The code and models are publicly available at https://github.com/drx-code/EquivariantModeling.
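To make the two components concrete, the following is a minimal illustrative sketch, not the released implementation. It assumes raw pixel columns stand in for the learned column-wise tokens, and it takes windowed causal attention to mean that each token attends only to itself and a fixed number of preceding tokens. The function names `columnwise_tokens` and `windowed_causal_mask` and the `window` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def columnwise_tokens(image: np.ndarray) -> np.ndarray:
    """Split an (H, W, C) array into W tokens, one per full-height column.

    Assumption: raw pixel columns are used here in place of a learned tokenizer.
    """
    h, w, c = image.shape
    # Put the width axis first so that each row of the result is one column.
    return image.transpose(1, 0, 2).reshape(w, h * c)


def windowed_causal_mask(num_tokens: int, window: int) -> np.ndarray:
    """Boolean mask where token i may attend to token j iff 0 <= i - j < window."""
    idx = np.arange(num_tokens)
    offset = idx[:, None] - idx[None, :]  # offset[i, j] = i - j
    return (offset >= 0) & (offset < window)


# Toy example: a 2x3 single-channel "image" and a window of 2 tokens.
image = np.arange(6).reshape(2, 3, 1)
tokens = columnwise_tokens(image)
mask = windowed_causal_mask(num_tokens=tokens.shape[0], window=2)
```

Under these assumptions, every token past the first `window - 1` positions sees the same relative context (itself plus the `window - 1` columns to its left), so the prediction subtask has the same form at each horizontal position. This is one way to read the abstract's claim that the two components align optimization targets across subtasks.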