XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding

Abstract

Transformer-based models are widely used in natural language understanding (NLU) tasks, and multimodal transformers have been effective in visual-language tasks. This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders. Our framework is inspired by the success of cross-modal encoders in visual-language tasks, but we alter the learning objective to cater to the language-heavy characteristics of NLU. After a small number of extra adaptation steps followed by finetuning, the proposed XDBERT (cross-modal distilled BERT) outperforms pretrained BERT on the General Language Understanding Evaluation (GLUE) benchmark, the Situations With Adversarial Generations (SWAG) benchmark, and readability benchmarks. We analyze the performance of XDBERT on GLUE to show that the improvement is likely visually grounded.
