Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective

Abstract

Autoregression in large language models (LLMs) has shown impressive scalability by unifying all language tasks into the next-token prediction paradigm. Recently, there has been growing interest in extending this success to vision foundation models. In this survey, we review recent advances and discuss future directions for autoregressive vision foundation models. First, we present the trend for the next generation of vision foundation models, i.e., unifying both understanding and generation in vision tasks. We then analyze the limitations of existing vision foundation models and present a formal definition of autoregression along with its advantages. Next, we categorize autoregressive vision foundation models by their vision tokenizers and autoregression backbones. Finally, we discuss several promising research challenges and directions. To the best of our knowledge, this is the first survey to comprehensively summarize autoregressive vision foundation models under the trend of unifying understanding and generation. A collection of related resources is available at https://github.com/EmmaSRH/ARVFM.
