MIEB: Massive Image Embedding Benchmark

Abstract

Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We benchmark 50 models across our benchmark, finding that no single method dominates across all task categories. We reveal hidden capabilities in advanced vision models such as their accurate visual representation of texts, and their yet limited capabilities in interleaved encodings and matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.
