Abstract
Transformer-based models are widely used in natural language understanding (NLU) tasks, and multimodal transformers have been effective in visual-language tasks. This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders. Our framework is inspired by the success of cross-modal encoders in visual-language tasks, but we alter the learning objective to cater to the language-heavy characteristics of NLU. After a small number of extra adapting steps followed by finetuning, the proposed XDBERT (cross-modal distilled BERT) outperforms pretrained BERT on the general language understanding evaluation (GLUE) benchmark, the situations with adversarial generations (SWAG) benchmark, and readability benchmarks. We analyze the performance of XDBERT on GLUE to show that the improvement is likely visually grounded.