PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?

Abstract

Multiple works have emerged to push the boundaries of multi-modal large language models (MLLMs) towards pixel-level understanding. The current trend in pixel-level MLLMs is to train with pixel-level grounding supervision on large-scale labelled data with specialized decoders for the segmentation task. However, we show that such MLLMs, when evaluated on recent challenging vision-centric benchmarks, exhibit a weak ability in visual question answering (VQA). Surprisingly, some of these methods even downgrade the grounding ability of MLLMs that were never trained with such pixel-level supervision. In this work, we propose two novel challenging benchmarks with paired evaluation for both VQA and grounding. We show that MLLMs without pixel-level grounding supervision can outperform the state of the art in such tasks. Our paired benchmarks and evaluation enable additional analysis of the reasons for failure with respect to VQA and/or grounding. Furthermore, we propose simple baselines, which we call PixFoundation, to extract grounding information; these can be plugged into any MLLM. More importantly, we study the research question "When does grounding emerge in MLLMs that are not trained with pixel-level grounding supervision?" We show that grounding can coincide with an object's parts, location, appearance, context or state, with 27-45% of the examples in both benchmarks exhibiting this phenomenon. Our code and datasets will be made publicly available, and some are in the supplemental.
