Abstract
In this work, we present and evaluate SELMA, a Speech-Enabled Language Model for virtual Assistant interactions that integrates audio and text as inputs to a Large Language Model (LLM). SELMA is designed to handle three primary and two auxiliary tasks related to interactions with virtual assistants simultaneously within a single end-to-end model. We employ low-rank adaptation modules for parameter-efficient training of both the audio encoder and the LLM. Additionally, we implement a feature pooling strategy enabling the system to recognize global patterns and improve accuracy on tasks less reliant on individual sequence elements. Experimental results on Voice Trigger (VT) detection, Device-Directed Speech Detection (DDSD), and Automatic Speech Recognition (ASR) demonstrate that our approach both significantly simplifies the typical input processing pipeline of virtual assistants and improves performance compared to dedicated models for each individual task. SELMA yields relative Equal-Error Rate improvements of 64% on the VT detection task and 22% on DDSD, while also achieving word error rates close to the baseline.