Vision-and-language pre-training has achieved impressive success in learning multimodal representations between vision and language. To generalize this success to non-English languages, we introduce UC2, the first machine translation-augmented framework for cross-lingual cross-modal representation learning. To tackle the scarcity of multilingual captions for image datasets, we first augment existing English-only datasets with other languages via machine translation (MT). We then extend the standard Masked Language Modeling and Image-Text Matching training objectives to a multilingual setting, where alignment between different languages is captured through shared visual context (i.e., using the image as a pivot). To facilitate the learning of a joint embedding space of images and all languages of interest, we further propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM), leveraging MT-enhanced translated data. Evaluation on multilingual image-text retrieval and multilingual visual question answering benchmarks demonstrates that our proposed framework achieves new state-of-the-art results on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.