VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer

Abstract

Since visual perception can give rich information beyond text descriptions for world understanding, there has been increasing interest in leveraging visual grounding for language learning. Recently, vokenization (Tan and Bansal, 2020) has attracted attention by using the predictions of a text-to-image retrieval model as labels for language model supervision. Despite its success, the method suffers from approximation error of using finite image labels and the lack of vocabulary diversity of a small image-text dataset. To overcome these limitations, we present VidLanKD, a video-language knowledge distillation method for improving language understanding. We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset. To avoid approximation error, we propose to use different knowledge distillation objectives. In addition, the use of a large-scale video-text dataset helps learn diverse and richer vocabularies. In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models, on several downstream language understanding tasks including GLUE, SQuAD, and SWAG. We also demonstrate the improved world knowledge, physical reasoning, and temporal reasoning capabilities of our model by evaluating on the GLUE-diagnostics, PIQA, and TRACIE datasets. Lastly, we present comprehensive ablation studies as well as visualizations of the learned text-to-video grounding results of our teacher and student language models. Our code and models are available at: https://github.com/zinengtang/VidLanKD
