The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

Abstract

This paper introduces SAIL, a single-transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a singular architecture. Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, presenting a more minimalist architecture design. Instead of introducing novel architectural components, SAIL adapts mix-attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of visual and textual modalities. We systematically compare SAIL's properties, including scalability, cross-modal information flow patterns, and visual representation capabilities, with those of modular MLLMs. By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs. Notably, the removal of pre-trained ViT components enhances SAIL's scalability and results in significantly different cross-modal information flow patterns. Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT-22B in vision tasks such as semantic segmentation. Code and models are available at https://github.com/bytedance/SAIL.
