Abstract
Text-driven speech style transfer aims to mold the intonation, pace, and timbre of a spoken utterance to match stylistic cues from text descriptions. While existing methods leverage large-scale neural architectures or pre-trained language models, the computational costs often remain high. In this paper, we present \emph{ReverBERT}, an efficient framework for text-driven speech style transfer that draws inspiration from a state space model (SSM) paradigm, loosely motivated by the image-based method of Wang and Liu~\cite{wang2024stylemamba}. Unlike image-domain techniques, our method operates in the speech space and integrates a discrete Fourier transform of latent speech features to enable smooth and continuous style modulation. We also propose a novel \emph{Transformer-based SSM} layer for bridging textual style descriptors with acoustic attributes, dramatically reducing inference time while preserving high-quality speech characteristics. Extensive experiments on benchmark speech corpora demonstrate that \emph{ReverBERT} significantly outperforms baselines in terms of naturalness, expressiveness, and computational efficiency. We release our model and code publicly to foster further research in text-driven speech style transfer.