Abstract
Large language models are built on top of a transformer-based architecture to process textual inputs. For example, LLaMA stands out among many open-source implementations. Can the same transformer be used to process 2D images? In this paper, we answer this question by unveiling a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored for this purpose. VisionLLaMA is a unified and generic modelling framework for solving most vision tasks. We extensively evaluate its effectiveness using typical pre-training paradigms across a good portion of downstream tasks in image perception and especially image generation. In many cases, VisionLLaMA has exhibited substantial gains over the previous state-of-the-art vision transformers. We believe that VisionLLaMA can serve as a strong new baseline model for vision generation and understanding. Our code is released at https://github.com/Meituan-AutoML/VisionLLaMA.