Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding

  • 2025-04-14 18:52:22
  • Tao Zhang, Xiangtai Li, Zilong Huang, Yanwei Li, Weixian Lei, Xueqing Deng, Shihao Chen, Shunping Ji, Jiashi Feng

Abstract

Multimodal Large Language Models (MLLMs) achieve remarkable performance on fine-grained pixel-level understanding tasks. However, all existing works rely heavily on extra components, such as a vision encoder (CLIP) and segmentation experts, leading to high system complexity and limiting model scaling. In this work, our goal is to explore a highly simplified MLLM without introducing extra components. Our work is motivated by recent works on the Single trAnsformer as a unified vIsion-Language Model (SAIL) design, which jointly learn vision tokens and text tokens in transformers. We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks. In particular, we present three technical improvements on the plain baseline. First, we design a learnable upsampling module to refine visual token features. Second, we propose a novel visual prompt injection strategy to enable the single transformer to understand visual prompt inputs and benefit from the early fusion of visual prompt embeddings and vision tokens. Third, we introduce a vision expert distillation strategy to efficiently enhance the single transformer's fine-grained feature extraction capability. In addition, we have collected a comprehensive pixel understanding benchmark (PerBench), verified through manual checking. It includes three tasks: detailed object description, visual prompt-based question answering, and visual-text referring segmentation. Extensive experiments on four referring segmentation benchmarks, one visual prompt benchmark, and our PerBench show that our Pixel-SAIL achieves comparable or even better results with a much simpler pipeline. Code and model will be released at https://github.com/magic-research/Sa2VA.
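
The abstract does not say how the visual prompt injection is implemented, so the following is only a minimal sketch of what "early fusion of visual prompt embeddings and vision tokens" could look like. It assumes the visual prompt is a binary pixel mask and that a single prompt embedding is added to every patch token whose patch overlaps the mask, before the tokens enter the transformer. The function name `inject_visual_prompt`, its parameters, and the use of NumPy are hypothetical choices for illustration and are not taken from the paper or its code release.

```python
import numpy as np

def inject_visual_prompt(vision_tokens, prompt_mask, prompt_embedding, patch_size):
    """Add a prompt embedding to every vision token whose patch overlaps the mask.

    Hypothetical illustration only; not the authors' implementation.

    vision_tokens:    (gh * gw, D) patch tokens in row-major order
    prompt_mask:      (H, W) binary mask marking the prompted region
    prompt_embedding: (D,) vector (learnable in a real model, fixed here)
    patch_size:       side length p of each square patch, so gh = H // p, gw = W // p
    """
    h, w = prompt_mask.shape
    gh, gw = h // patch_size, w // patch_size
    # Pool the pixel mask down to the patch grid:
    # a patch counts as prompted if any pixel inside it is set.
    cropped = prompt_mask[:gh * patch_size, :gw * patch_size]
    patches = cropped.reshape(gh, patch_size, gw, patch_size)
    patch_hit = patches.max(axis=(1, 3)).reshape(-1).astype(vision_tokens.dtype)
    # Early fusion: the embedding is added before the tokens reach the transformer.
    return vision_tokens + patch_hit[:, None] * prompt_embedding[None, :]

# Toy usage: a 4x4 image split into 2x2 patches gives 4 tokens of width 3.
tokens = np.zeros((4, 3), dtype=np.float32)
mask = np.zeros((4, 4), dtype=np.int64)
mask[0, 3] = 1  # one prompted pixel, inside the top-right patch (token index 1)
embedding = np.array([1.0, 2.0, 3.0], dtype=np.float32)
fused = inject_visual_prompt(tokens, mask, embedding, patch_size=2)
```

In this toy setup only the token for the top-right patch is modified; the other three tokens pass through unchanged. In a trained model the embedding would be a learned parameter, and the fused tokens would be processed jointly with the text tokens.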

 
