Abstract
Pre-trained Vision Foundation Models (VFMs) provide strong visual representations for a wide range of applications. In this paper, we continually pre-train prevailing VFMs in a multimodal manner such that they can effortlessly process visual inputs of varying sizes and produce visual representations that are more aligned with language representations, regardless of their original pre-training process. To this end, we introduce CoMP, a carefully designed multimodal pre-training pipeline. CoMP uses a Continual Rotary Position Embedding to support native-resolution continual pre-training, and an Alignment Loss between visual and textual features through language prototypes to align multimodal representations. Through three-stage training, our VFMs achieve remarkable improvements not only in multimodal understanding but also in other downstream tasks such as classification and segmentation. In particular, CoMP-SigLIP achieves scores of 66.7 on ChartQA and 75.9 on DocVQA with a 0.5B LLM, while maintaining 87.4% accuracy on ImageNet-1K and 49.5 mIoU on ADE20K under frozen chunk evaluation.