Abstract
Contrastive vision-language models such as CLIP have demonstrated strong performance across a wide range of multimodal tasks by learning from aligned image-text pairs. However, their ability to handle complex, real-world web documents remains limited, particularly in scenarios where text and images are interleaved, loosely aligned, or embedded in visual form. To address these challenges, we propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images, thus eliminating the need for OCR, text tokenization, or a modality fusion strategy. To capture complex cross-modal relationships in multimodal web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments, leveraging the inherent coherence of documents without requiring explicitly paired image-text data. To assess the effectiveness of this approach, we introduce three retrieval benchmarks, AnyCIR, SeqCIR, and CSR, designed to evaluate cross-modal retrieval, fine-grained sequential understanding, and generalization to unseen data, respectively. Empirical results show that VC2L achieves competitive or superior performance compared to CLIP-style models on both the proposed benchmarks and established datasets such as M-BEIR and MTEB. These findings underscore the potential of multimodal web data as a valuable training resource for contrastive learning and illustrate the scalability of a unified, vision-centric approach for multimodal representation learning. Code and models are available at: https://github.com/showlab/VC2L.
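
As an illustration of the snippet-level objective described above, the following is a minimal sketch that assumes a symmetric InfoNCE loss of the kind used by CLIP-style models. The abstract does not specify the exact loss, so the function names, the temperature value, and the assumption that each snippet has already been rendered to an image and encoded into a non-zero embedding vector are all hypothetical and are not taken from the released code.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy over rows, where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def snippet_contrastive_loss(first, second, temperature=0.07):
    """Symmetric InfoNCE over a batch of consecutive snippet pairs (assumed form).

    first[i] and second[i] are embeddings of two consecutive snippets from the
    same document (hypothetical encoder outputs); every other snippet in the
    batch acts as a negative. The temperature is an illustrative value.
    """
    a = [l2_normalize(v) for v in first]
    b = [l2_normalize(v) for v in second]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the first-to-second and second-to-first retrieval directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

In this sketch, a batch whose consecutive snippets have matching embeddings yields a loss near zero, while a batch whose pairs are swapped yields a large loss. That behavior is what lets document coherence serve as supervision without explicitly paired image-text data.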