Can a pre-trained BERT for one language and a pre-trained GPT for another be glued together to translate text? Self-supervised training using only monolingual data has led to the success of pre-trained (masked) language models in many NLP tasks. However, directly connecting BERT as an encoder and GPT as a decoder is challenging in machine translation, because GPT-like models lack the cross-attention component that seq2seq decoders need. In this paper, we propose Graformer to graft separately pre-trained (masked) language models for machine translation. With monolingual data for pre-training and parallel data for grafting training, we make maximal use of both types of data. Experiments on 60 directions show that our method achieves average improvements of 5.8 BLEU in x2en and 2.9 BLEU in en2x directions compared with a multilingual Transformer of the same size.