Abstract
Large language models (LLMs) for audio have excelled at recognizing and analyzing human speech, music, and environmental sounds. However, their potential for understanding other types of sounds, particularly biomedical sounds, remains largely underexplored despite significant scientific interest. In this study, we focus on diagnosing cardiovascular diseases using phonocardiograms, i.e., heart sounds. Most existing deep neural network (DNN) paradigms are restricted to heart murmur classification (healthy vs. unhealthy) and do not predict other acoustic features of the murmur, such as timing, grading, harshness, pitch, and quality, which are important in helping physicians diagnose the underlying heart conditions. We propose to fine-tune an audio LLM, Qwen2-Audio, on the PhysioNet CirCor DigiScope phonocardiogram (PCG) dataset and evaluate its performance in classifying 11 expert-labeled murmur features. Additionally, we aim to achieve a more noise-robust and generalizable system by exploring a preprocessing segmentation algorithm that uses an audio representation model, SSAMBA. Our results indicate that the LLM-based model outperforms state-of-the-art methods on 8 of the 11 features and performs comparably on the remaining 3. Moreover, the LLM successfully classifies long-tail murmur features with limited training data, a task at which all previous methods have failed. These findings underscore the potential of audio LLMs as assistants to human cardiologists in enhancing heart disease diagnosis.