Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues

Abstract

Research in linguistics shows that non-verbal cues, such as gestures, play a crucial role in spoken discourse. For example, speakers perform hand gestures to indicate topic shifts, helping listeners identify transitions in discourse. In this work, we investigate whether the joint modeling of gestures using human motion sequences and language can improve spoken discourse modeling in language models. To integrate gestures into language models, we first encode 3D human motion sequences into discrete gesture tokens using a VQ-VAE. These gesture token embeddings are then aligned with text embeddings through feature alignment, mapping them into the text embedding space. To evaluate the gesture-aligned language model on spoken discourse, we construct text infilling tasks targeting three key discourse cues grounded in linguistic research: discourse connectives, stance markers, and quantifiers. Results show that incorporating gestures enhances marker prediction accuracy across the three tasks, highlighting the complementary information that gestures can offer in modeling spoken discourse. We view this work as an initial step toward leveraging non-verbal cues to advance spoken language modeling in language models.
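
The two integration steps described above (discretizing motion into gesture tokens, then mapping gesture token embeddings into the text embedding space) can be illustrated with a toy sketch. This is not the paper's implementation: the dimensions, the random weights standing in for a trained VQ-VAE codebook and a learned alignment projection, and the helper names `quantize` and `project` are all assumptions made for illustration.

```python
import random


def quantize(encoded_frames, codebook):
    """Return, for each encoded motion vector, the index of its nearest
    codebook entry (squared Euclidean distance). This is the lookup step
    of vector quantization; the encoder itself is omitted here."""
    tokens = []
    for frame in encoded_frames:
        dists = [sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def project(vec, weight):
    """Apply a linear map (rows of `weight`) to `vec`; a stand-in for a
    learned projection into the text embedding space."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


rng = random.Random(0)

# Hypothetical sizes, chosen only to keep the example small.
LATENT_DIM, CODEBOOK_SIZE, GESTURE_DIM, TEXT_DIM = 4, 8, 6, 10

# Random values stand in for trained parameters (assumption).
codebook = [[rng.gauss(0, 1) for _ in range(LATENT_DIM)] for _ in range(CODEBOOK_SIZE)]
gesture_embeddings = [[rng.gauss(0, 1) for _ in range(GESTURE_DIM)] for _ in range(CODEBOOK_SIZE)]
projection = [[rng.gauss(0, 1) for _ in range(GESTURE_DIM)] for _ in range(TEXT_DIM)]

# Five encoded motion frames (random placeholders for encoder outputs).
encoded_frames = [[rng.gauss(0, 1) for _ in range(LATENT_DIM)] for _ in range(5)]

# Step 1: motion -> discrete gesture tokens.
gesture_tokens = quantize(encoded_frames, codebook)

# Step 2: gesture token embeddings -> vectors in the text embedding space.
aligned = [project(gesture_embeddings[t], projection) for t in gesture_tokens]
```

In this sketch, each vector in `aligned` has the same dimensionality as a text embedding, so it could in principle be interleaved with text token embeddings in a language model's input sequence.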
