Abstract
Multimodal encoders have pushed the boundaries of visual document retrieval, matching textual query tokens directly to image patches and achieving state-of-the-art performance on public benchmarks. Recent models relying on this paradigm have massively scaled the sizes of their query and document representations, presenting obstacles to deployment and scalability in real-world pipelines. Furthermore, purely vision-centric approaches may be constrained by the inherent modality gap still exhibited by modern vision-language models. In this work, we connect these challenges to the paradigm of hybrid retrieval, investigating whether a lightweight dense text retriever can enhance a stronger vision-centric model. Existing hybrid methods, which rely on coarse-grained fusion of ranks or scores, fail to exploit the rich interactions within each model's representation space. To address this, we introduce Guided Query Refinement (GQR), a novel test-time optimization method that refines a primary retriever's query embedding using guidance from a complementary retriever's scores. Through extensive experiments on visual document retrieval benchmarks, we demonstrate that GQR allows vision-centric models to match the performance of models with significantly larger representations, while being up to 14x faster and requiring 54x less memory. Our findings show that GQR effectively pushes the Pareto frontier for performance and efficiency in multimodal retrieval. We release our code at https://github.com/IBM/test-time-hybrid-retrieval.
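
The abstract does not state the GQR objective, so the following is only a minimal sketch of the general idea of test-time query refinement, not the released implementation. It assumes single-vector embeddings scored by dot product, and it assumes the query is nudged by gradient descent on a KL divergence toward the average of the primary and complementary score distributions over a shared candidate list. The names `refine_query`, `steps`, and `lr` are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def kl(p, q):
    """KL(p || q) for two strictly positive distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def refine_query(query, doc_embs, guide_scores, steps=10, lr=0.1):
    """Refine a primary retriever's query embedding at test time.

    query        : (d,)   primary query embedding
    doc_embs     : (n, d) primary embeddings of the n candidate documents
    guide_scores : (n,)   complementary retriever's scores for the same candidates
    Assumption: the target is the average of the two softmax distributions.
    """
    p_guide = softmax(guide_scores)
    p_init = softmax(doc_embs @ query)
    target = 0.5 * (p_init + p_guide)  # fixed target distribution (assumed)
    q = query.copy()                   # leave the caller's query untouched
    for _ in range(steps):
        p = softmax(doc_embs @ q)
        # Gradient of KL(target || softmax(doc_embs @ q)) with respect to q
        grad = doc_embs.T @ (p - target)
        q = q - lr * grad
    return q, target
```

In this sketch, the candidates would then be re-ranked by `doc_embs @ q` using the refined query. Only the query vector changes, so document representations stay as indexed.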