Abstract
Universal multimodal embedding models have achieved great success in capturing semantic relevance between queries and candidates. However, current methods either condense queries and candidates into a single vector, potentially limiting the expressiveness for fine-grained information, or produce too many vectors that are prohibitively expensive for multi-vector retrieval. In this work, we introduce MetaEmbed, a new framework for multimodal retrieval that rethinks how multimodal embeddings are constructed and interacted with at scale. During training, a fixed number of learnable Meta Tokens are appended to the input sequence. At test-time, their last-layer contextualized representations serve as compact yet expressive multi-vector embeddings. Through the proposed Matryoshka Multi-Vector Retrieval training, MetaEmbed learns to organize information by granularity across multiple vectors. As a result, we enable test-time scaling in multimodal retrieval, where users can balance retrieval quality against efficiency demands by selecting the number of tokens used for indexing and retrieval interactions. Extensive evaluations on the Massive Multimodal Embedding Benchmark (MMEB) and the Visual Document Retrieval Benchmark (ViDoRe) confirm that MetaEmbed achieves state-of-the-art retrieval performance while scaling robustly to models with 32B parameters.
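To make the test-time trade-off concrete, the following is a minimal sketch of scoring a query against a candidate using only a prefix of each side's vectors. It assumes a MaxSim-style late-interaction score (sum over query vectors of the best-matching candidate vector), which the abstract does not specify; the function name `late_interaction_score`, the parameters `num_query` and `num_cand`, and the toy embeddings are all hypothetical and chosen for illustration only.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(
    query_vecs: List[Vector],
    cand_vecs: List[Vector],
    num_query: int,
    num_cand: int,
) -> float:
    """Score a query against a candidate using only vector prefixes.

    Assumption (not stated in the abstract): MaxSim-style late interaction.
    For each of the first `num_query` query vectors, take the best inner
    product among the first `num_cand` candidate vectors, then sum.
    """
    if not 1 <= num_query <= len(query_vecs):
        raise ValueError("num_query out of range")
    if not 1 <= num_cand <= len(cand_vecs):
        raise ValueError("num_cand out of range")
    q = query_vecs[:num_query]  # coarse-to-fine prefix on the query side
    c = cand_vecs[:num_cand]    # coarse-to-fine prefix on the candidate side
    return sum(max(dot(qv, cv) for cv in c) for qv in q)


# Toy embeddings, made up for illustration: 2 query vectors, 3 candidate vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

# Small budget: one vector per side (cheapest index and interaction).
cheap = late_interaction_score(query, candidate, num_query=1, num_cand=1)
# Larger budget: all available vectors on both sides.
full = late_interaction_score(query, candidate, num_query=2, num_cand=3)
```

Under this assumed scoring rule, the cost of one comparison grows with `num_query * num_cand`, so choosing smaller prefixes reduces both index size and interaction cost, while larger prefixes let more vectors contribute to the score.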