UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

Abstract

Generalist models have achieved remarkable success in both language and vision-language tasks, showcasing the potential of unified modeling. However, effectively integrating fine-grained perception tasks like detection and segmentation into these models remains a significant challenge. This is primarily because these tasks often rely heavily on task-specific designs and architectures that can complicate the modeling process. To address this challenge, we present UFO, a framework that Unifies Fine-grained visual perception tasks through an Open-ended language interface. By transforming all perception targets into the language space, UFO unifies object-level detection, pixel-level segmentation, and image-level vision-language tasks into a single model. Additionally, we introduce a novel embedding retrieval approach that relies solely on the language interface to support segmentation tasks. Our framework bridges the gap between fine-grained perception and vision-language tasks, significantly simplifying architectural design and training strategies while achieving comparable or superior performance to methods with intricate task-specific designs. After multi-task training on five standard visual perception datasets, UFO outperforms the previous state-of-the-art generalist models by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation. Furthermore, our method seamlessly integrates with existing MLLMs, effectively combining fine-grained perception capabilities with their advanced language abilities, thereby enabling more challenging tasks such as reasoning segmentation. Code and models are available at https://github.com/nnnth/UFO.
