Abstract
Large-scale pretrained language models have demonstrated state-of-the-art performance in language understanding tasks. Their application has recently expanded into multimodal learning, leading to improved representations combining vision and language. However, progress in adapting language models towards conditional Natural Language Generation (NLG) has been limited to a single modality, generally text. We propose MAnTiS, Multimodal Adaptation for Text Synthesis, a general approach for multimodal conditionality in transformer-based NLG models. In this method, we pass inputs from each modality through modality-specific encoders, project them to textual token space, and finally join them to form a conditionality prefix. We fine-tune the pretrained language model and encoders with the conditionality prefix guiding the generation. We apply MAnTiS to the task of product description generation, conditioning a network on both product images and titles to generate descriptive text. We demonstrate that MAnTiS outperforms strong baseline approaches on standard NLG scoring metrics. Furthermore, qualitative assessments demonstrate that MAnTiS can generate human-quality descriptions consistent with the given multimodal inputs.
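
The prefix construction described above (encode each modality, project to token space, join) can be illustrated with a minimal sketch. This is not the authors' implementation: the dimensions, the random linear map standing in for a learned projection, the helper names, and the ordering of image tokens before title tokens are all assumptions made for illustration.

```python
import random

EMBED_DIM = 8   # hypothetical token-embedding width of the language model
IMAGE_DIM = 16  # hypothetical width of the image encoder's output features


def make_projection(in_dim, out_dim, rng):
    """Random linear map standing in for a learned projection layer (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(features, weights):
    """Map each feature vector into the token-embedding space."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weights]
            for vec in features]


def build_prefix(image_features, title_embeddings, image_proj):
    """Project image features, then join them with the title token embeddings.

    The ordering (image first, title second) is an illustrative choice.
    """
    image_tokens = project(image_features, image_proj)
    return image_tokens + title_embeddings


rng = random.Random(0)
# Stand-ins for encoder outputs: 4 image feature vectors and 3 title tokens.
image_features = [[rng.random() for _ in range(IMAGE_DIM)] for _ in range(4)]
title_embeddings = [[rng.random() for _ in range(EMBED_DIM)] for _ in range(3)]
image_proj = make_projection(IMAGE_DIM, EMBED_DIM, rng)

prefix = build_prefix(image_features, title_embeddings, image_proj)
print(len(prefix), len(prefix[0]))  # prints "7 8"
```

In a full system, the resulting sequence of prefix vectors would be placed ahead of the embeddings of the description tokens, so that the language model generates text conditioned on both modalities.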