Abstract
Autoregressive image generation models like Janus-Pro produce high-quality images, but at the cost of high memory and ever-growing computational demands due to the large number of visual tokens. While KV cache compression has been extensively studied in language modeling, it remains largely unexplored for image generation. In this work, we begin by identifying a distinct and prominent attention phenomenon, which we term spatial locality and emergent semantic sink. To leverage this insight, we introduce a novel KV cache compression framework. Specifically, we compress the KV cache for all visual tokens by adaptively decoupling attention heads into two types: for spatial-locality heads, our method maintains a short window of recent tokens; for semantic-sink heads, it preserves a compact set of highly attended tokens. Our experiments demonstrate that the proposed method achieves a 5$\times$ reduction in memory usage and a 6.6$\times$ speedup in overall throughput with only minimal visual quality loss, thereby enabling efficient native autoregressive image generation on resource-constrained hardware.
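
As an illustration only, the head-wise decoupling described above might be sketched as follows. The abstract does not specify how heads are classified or how tokens are scored, so everything here is an assumption: the function names `classify_head` and `kept_indices`, the use of accumulated per-token attention as the score, and the 0.5 threshold on recent attention mass are all hypothetical choices, not the method's actual criteria.

```python
from typing import List


def classify_head(scores: List[float], window: int, threshold: float = 0.5) -> str:
    """Label one head from its accumulated attention over cached tokens.

    Assumption: a head counts as "locality" when at least `threshold` of its
    attention mass falls on the last `window` tokens; otherwise it is "sink".
    Requires window >= 1.
    """
    total = sum(scores)
    if total <= 0:
        return "locality"  # no signal yet: fall back to the recent window
    recent_mass = sum(scores[-window:]) / total
    return "locality" if recent_mass >= threshold else "sink"


def kept_indices(scores: List[float], window: int, budget: int,
                 threshold: float = 0.5) -> List[int]:
    """Return the cache positions this head would retain.

    Locality heads keep the last `window` positions; sink heads keep the
    `budget` positions with the highest accumulated attention.
    """
    n = len(scores)
    if classify_head(scores, window, threshold) == "locality":
        return list(range(max(0, n - window), n))
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(top)  # restore positional order for the compressed cache


# Hypothetical accumulated attention for two heads over six cached tokens.
local_head = [0.0, 0.1, 0.0, 0.1, 0.4, 0.4]
sink_head = [0.5, 0.0, 0.3, 0.05, 0.05, 0.1]

print(kept_indices(local_head, window=2, budget=2))
print(kept_indices(sink_head, window=2, budget=2))
```

In this toy setting each head retains two of six cached entries; in practice the window size and token budget would be chosen to meet a target compression ratio.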