Abstract
Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants, etc.). The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework. We observe that 1) the models are not close to the state-of-the-art specialist models at any task. However, 2) they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks, 6) reasoning models, e.g., o3, show improvements in geometric tasks, and 7) a preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks like hallucinations and spatial misalignments.