Abstract
In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task designed to produce synchronized verbal and non-verbal listener feedback online, based on the speaker's multimodal inputs. OMCRG captures natural dyadic interactions and introduces new challenges in aligning generated audio with listeners' facial responses. To tackle these challenges, we incorporate text as an intermediate modality to connect audio and facial responses. We propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates accurate multimodal listener responses. OmniResponse leverages a pretrained LLM enhanced with two core components: Chrono-Text Markup, which precisely timestamps generated text tokens, and TempoVoice, a controllable online text-to-speech (TTS) module that outputs speech synchronized with facial responses. To advance OMCRG research, we offer ResponseNet, a dataset of 696 detailed dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors. Comprehensive evaluations on ResponseNet demonstrate that OmniResponse outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality. Our dataset, code, and models are publicly available.