Multimodal headline generation uses both video frames and transcripts to generate a natural language title for a video. Large-scale, manually annotated data is lacking, since annotating grounded headlines for videos is labor intensive and impractical. Previous research on pre-trained language models and video-language models has achieved significant progress on related downstream tasks. However, none of these models can be directly applied to a multimodal headline architecture, which requires both a multimodal encoder and a sentence decoder. A major challenge in simply gluing a language model to a video-language model is modality balance, that is, combining the complementary abilities of the visual and language modalities. In this paper, we propose a novel approach that grafts the video encoder from a pre-trained video-language model onto a generative pre-trained language model. We also present a consensus fusion mechanism that integrates the different components via inter- and intra-modality relations. Experiments show that the grafted model achieves strong results on a brand-new dataset collected from real-world applications.