Abstract
Speech language models have recently demonstrated great potential as universal speech processing systems. Such models can model the rich acoustic information present in audio signals beyond the spoken content, such as emotion and background noise. Despite this, evaluation benchmarks that assess awareness of a wide range of acoustic aspects are lacking. To help bridge this gap, we introduce SALMon, a novel evaluation suite encompassing background noise, emotion, speaker identity, and room impulse response. The proposed benchmarks evaluate both the consistency of the inspected element and how well it matches the spoken text. We follow a modelling-based approach, measuring whether a model gives correct samples higher scores than incorrect ones. This approach makes the benchmark fast to compute even for large models. We evaluated several speech language models on SALMon, highlighting the strengths and weaknesses of each evaluated method. We make the code and data publicly available at https://pages.cs.huji.ac.il/adiyoss-lab/salmon/.
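The modelling-based evaluation described above can be illustrated with a minimal sketch. It assumes the metric is the fraction of (correct, incorrect) sample pairs for which the model scores the correct sample strictly higher. The names `pairwise_accuracy` and `score` are hypothetical and not taken from the released code, and the score itself (for example, a log-likelihood the model assigns to an audio sample) is an assumption.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Sample = TypeVar("Sample")


def pairwise_accuracy(
    score: Callable[[Sample], float],
    pairs: Iterable[Tuple[Sample, Sample]],
) -> float:
    """Fraction of (correct, incorrect) pairs where the correct sample scores higher.

    `score` is a hypothetical stand-in for whatever scalar the model assigns
    to a sample (assumed here to be something like a log-likelihood).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one (correct, incorrect) pair is required")
    # Count a pair as a success only when the correct sample scores strictly higher.
    wins = sum(1 for correct, incorrect in pairs if score(correct) > score(incorrect))
    return wins / len(pairs)


# Toy usage: a lookup table stands in for a real model's scores.
toy_scores = {"a_pos": -1.0, "a_neg": -2.0, "b_pos": -3.0, "b_neg": -2.5}
toy_pairs = [("a_pos", "a_neg"), ("b_pos", "b_neg")]
accuracy = pairwise_accuracy(toy_scores.__getitem__, toy_pairs)
```

Because each pair needs only two scoring passes and no generation, a metric of this form stays cheap to compute as model size grows, which is consistent with the speed claim above.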