DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations

Abstract

Recent studies on end-to-end (E2E) speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing E2E approaches primarily fall into two categories: (1) methods that generate discrete speech tokens independently, without incorporating them into the LLM's autoregressive process, so that text generation is unaware of concurrent speech synthesis; and (2) models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling, featuring dual-resolution speech representations. Notably, while current methods mainly use a 12.5Hz input audio representation, our proposed dual-resolution mechanism reduces the input frequency for the LLM to 5Hz, significantly reducing computational cost and alleviating the frequency discrepancy between speech and text tokens, which in turn better exploits LLMs' capabilities. Experimental results demonstrate that DrVoice-7B establishes a new state of the art (SOTA) on the OpenAudioBench and Big Bench Audio benchmarks, while achieving performance comparable to the SOTA on the VoiceBench and UltraEval-Audio benchmarks, making it a leading open-source speech foundation model among ~7B models.
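
As a rough illustration of the frame-rate reduction described above, the following sketch counts how many speech positions an LLM would process at 12.5Hz versus 5Hz. The 10-second clip length and the helper name `speech_positions` are hypothetical, chosen only for illustration, and the quadratic estimate assumes standard full self-attention; it is back-of-the-envelope arithmetic, not the paper's measured cost.

```python
import math


def speech_positions(duration_s: float, rate_hz: float) -> int:
    """Number of speech positions the LLM sees for a clip at a given frame rate."""
    return math.ceil(duration_s * rate_hz)


# Hypothetical clip length, used only for illustration.
duration_s = 10.0

baseline = speech_positions(duration_s, 12.5)  # 12.5Hz input representation
reduced = speech_positions(duration_s, 5.0)    # 5Hz input representation

# Ratio of sequence lengths between the two frame rates.
length_ratio = baseline / reduced

# Assumption: self-attention cost grows with the square of sequence length,
# so the attention cost ratio is the length ratio squared.
attention_ratio = length_ratio ** 2

print(baseline, reduced, length_ratio, attention_ratio)
```

Under these assumptions, the shorter speech sequence is also closer in length to the text token sequence for the same utterance, which is the frequency discrepancy the abstract refers to.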
