Abstract
Recently there has been a significant surge in multimodal learning in terms of both image-to-text and text-to-image generation. However, the success is typically limited to English, leaving other languages largely behind. Building a competitive counterpart in other languages is highly challenging due to the low-resource nature of non-English multimodal data (i.e., the lack of large-scale, high-quality image-text data). In this work, we propose MPM, an effective paradigm for training large multimodal models in non-English languages. MPM demonstrates that Multilingual language models can Pivot zero-shot Multimodal learning across languages. Specifically, when built on a strong multilingual large language model, multimodal models pretrained on English-only image-text data can generalize well to other languages in a (quasi)-zero-shot manner, even surpassing models trained on image-text data in the native languages. Taking Chinese as a case study of MPM, we build the large multimodal models VisCPM for image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese. To facilitate future research, we open-source the code and model weights at https://github.com/OpenBMB/VisCPM.git.