Abstract
One common approach for question answering over speech data is to first transcribe speech using automatic speech recognition (ASR) and then employ text-based retrieval-augmented generation (RAG) on the transcriptions. While this cascaded pipeline has proven effective in many practical settings, ASR errors can propagate to the retrieval and generation steps. To overcome this limitation, we introduce SpeechRAG, a novel framework designed for open-question answering over spoken data. Our proposed approach fine-tunes a pre-trained speech encoder into a speech adapter fed into a frozen large language model (LLM)-based retrieval model. By aligning the embedding spaces of text and speech, our speech retriever directly retrieves audio passages from text-based queries, leveraging the retrieval capacity of the frozen text retriever. Our retrieval experiments on spoken question answering datasets show that direct speech retrieval does not degrade relative to the text-based baseline, and outperforms the cascaded systems using ASR. For generation, we use a speech language model (SLM) as a generator, conditioned on audio passages rather than transcripts. Without fine-tuning of the SLM, this approach outperforms cascaded text-based models when the transcripts have a high word error rate (WER).