Multimodal Embeddings from Language Models

Abstract

Word embeddings such as ELMo have recently been shown to model word semantics with greater efficacy through contextualized learning on large-scale language corpora, resulting in significant improvement in state of the art across many natural language tasks. In this work we integrate acoustic information into contextualized lexical embeddings through the addition of multimodal inputs to a pretrained bidirectional language model. The language model is trained on spoken language that includes text and audio modalities. The resulting representations from this model are multimodal and contain paralinguistic information which can modify word meanings and provide affective information. We show that these multimodal embeddings can be used to improve over previous state of the art multimodal models in emotion recognition on the CMU-MOSEI dataset.
