Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

Abstract

Humans learn language by listening, speaking, writing, reading, and also, viainteraction with the multimodal real world. Existing language pre-trainingframeworks show the effectiveness of text-only self-supervision while weexplore the idea of a visually-supervised language model in this paper. We findthat the main reason hindering this exploration is the large divergence inmagnitude and distributions between the visually-grounded language datasets andpure-language corpora. Therefore, we develop a technique named "vokenization"that extrapolates multimodal alignments to language-only data by contextuallymapping language tokens to their related images (which we call "vokens"). The"vokenizer" is trained on relatively small image captioning datasets and wethen apply it to generate vokens for large language corpora. Trained with thesecontextually generated vokens, our visually-supervised language models showconsistent improvements over self-supervised alternatives on multiplepure-language tasks such as GLUE, SQuAD, and SWAG. Code and pre-trained modelspublicly available at https://github.com/airsplay/vokenization

Quick Read (beta)

loading the full paper ...