Abstract
Pre-trained Vision Foundation Models (VFMs) provide strong visual representations for a wide range of applications. In this paper, we continually pre-train prevailing VFMs in a multimodal manner such that they can effortlessly process visual inputs of varying sizes and produce visual representations that are more aligned with language representations, regardless of their original pre-training process. To this end, we introduce CoMP, a carefully designed multimodal pre-training pipeline. CoMP uses a Continual Rotary Position Embedding to accommodate visual inputs with different resolutions, and an Alignment Loss between visual and textual features for better cross-modal alignment. After continual pre-training, leading VFMs like DINOv2, SigLIP and AIMv2 achieve remarkable improvements not only in multimodal understanding tasks but also in generic classification and segmentation tasks. Remarkably, CoMP-AIMv2 achieves scores of 64.9 on ChartQA with a 0.5B LLM, while maintaining an 87.3% accuracy on ImageNet-1K and a 51.8 mIoU on ADE20K under frozen chunk evaluation.