Abstract
This paper explores a novel perspective on speech quality assessment by leveraging natural language descriptions, which offer richer, more nuanced insights than traditional numerical scoring methods. Natural language feedback provides instructive recommendations and detailed evaluations, yet existing datasets lack the comprehensive annotations needed for this approach. To bridge this gap, we introduce QualiSpeech, a comprehensive low-level speech quality assessment dataset encompassing 11 key aspects and detailed natural language comments that include reasoning and contextual insights. Additionally, we propose the QualiSpeech Benchmark to evaluate the low-level speech understanding capabilities of auditory large language models (LLMs). Experimental results demonstrate that finetuned auditory LLMs can reliably generate detailed descriptions of noise and distortion, effectively identifying their types and temporal characteristics. The results further highlight the potential for incorporating reasoning to enhance the accuracy and reliability of quality assessments. The dataset will be released at https://huggingface.co/datasets/tsinghua-ee/QualiSpeech.