PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs

Abstract

Panoramic Image Generation (PIG) aims to create coherent images of arbitrary lengths. Most existing methods fall in the joint diffusion paradigm, but their complex and heuristic crop connection designs often limit their ability to achieve multilevel coherence. By deconstructing this challenge into its core components, we find it naturally aligns with next-token prediction, leading us to adopt an autoregressive (AR) paradigm for PIG modeling. However, existing visual AR (VAR) models are limited to fixed-size generation, lacking the capability to produce panoramic images. In this paper, we propose PanoLlama, a novel framework that achieves endless and coherent panorama generation with the autoregressive paradigm. Our approach develops a training-free strategy that utilizes token redirection to overcome the size limitations of existing VAR models, enabling next-crop prediction in both horizontal and vertical directions. This refreshes the PIG pipeline while achieving SOTA performance in coherence (47.50%), fidelity (28.16%), and aesthetics (15%). Additionally, PanoLlama supports applications other PIG methods cannot achieve, including mask-free layout control, multi-scale and multi-guidance synthesis. To facilitate standardized evaluation, we also establish a dataset with 1,000 prompts spanning 100+ themes, providing a new testing benchmark for PIG research. The code is available at https://github.com/0606zt/PanoLlama.
