Abstract
The field of spoken language processing is undergoing a shift from training custom-built, task-specific models toward using and optimizing spoken language models (SLMs) which act as universal speech processing systems. This trend is similar to the progression toward universal language models that has taken place in the field of (text) natural language processing. SLMs include both "pure" language models of speech -- models of the distribution of tokenized speech sequences -- and models that combine speech encoders with text language models, often including both spoken and written input or output. Work in this area is very diverse, with a range of terminology and evaluation settings. This paper aims to contribute an improved understanding of SLMs via a unifying literature survey of recent work in the context of the evolution of the field. Our survey categorizes the work in this area by model architecture, training, and evaluation choices, and describes some key challenges and directions for future work.