In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality. This is motivated by the nature of image-text paired data: the image and the text convey almost the same information, but in different formats. Reconstructing the masked signal of one modality conditioned on the other can also implicitly learn cross-modal alignment between language tokens and image patches. Our experiments on various V+L tasks show that the proposed method not only achieves state-of-the-art performance when trained on a large amount of data, but also outperforms competing methods by a significant margin in regimes with limited training data.
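To make the idea of cross-modal conditioning concrete, the following is a minimal sketch of how the inputs for the two reconstruction tasks could be prepared, not the implementation used in this work. It only illustrates that each task masks one modality and leaves the other intact. The function names, the `[MASK]` placeholder, the use of `None` for a masked patch, and the masking ratios are all hypothetical choices made for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked text token


def mask_items(items, ratio, rng, placeholder):
    """Replace a random subset of `items` with `placeholder`.

    Returns the masked copy and the sorted indices that were masked.
    """
    n_mask = max(1, round(len(items) * ratio))
    idx = sorted(rng.sample(range(len(items)), n_mask))
    masked = list(items)
    for i in idx:
        masked[i] = placeholder
    return masked, idx


def build_joint_inputs(tokens, patches, text_ratio=0.3, image_ratio=0.6, seed=0):
    """Build inputs for the two conditioned reconstruction tasks.

    The ratios are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    masked_tokens, text_idx = mask_items(tokens, text_ratio, rng, MASK)
    masked_patches, image_idx = mask_items(patches, image_ratio, rng, None)
    return {
        # Masked text is reconstructed while seeing the full image.
        "text_reconstruction": {
            "text": masked_tokens,
            "image": list(patches),
            "targets": {i: tokens[i] for i in text_idx},
        },
        # Masked image patches are reconstructed while seeing the full text.
        "image_reconstruction": {
            "image": masked_patches,
            "text": list(tokens),
            "targets": {i: patches[i] for i in image_idx},
        },
    }


if __name__ == "__main__":
    tokens = ["a", "dog", "on", "the", "grass"]
    patches = [f"patch_{i}" for i in range(8)]  # stand-ins for image patches
    batch = build_joint_inputs(tokens, patches)
    print(batch["text_reconstruction"]["text"])
    print(batch["image_reconstruction"]["image"])
```

In this sketch a model would predict the `targets` of each task from the masked modality together with the unmasked one, which is the conditioning described above; the model and the reconstruction losses are deliberately left out.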