Abstract
We introduce Speech-based Intelligence Quotient (SIQ) as a new form of human cognition-inspired evaluation pipeline for voice understanding large language models, LLM Voice, designed to assess their voice understanding ability. Moving beyond popular voice understanding metrics such as word error rate (WER), SIQ examines LLM Voice across three cognitive levels motivated by Bloom's Taxonomy: (1) Remembering (i.e., WER for verbatim accuracy); (2) Understanding (i.e., similarity of LLM's interpretations); and (3) Application (i.e., QA accuracy for simulating downstream tasks). We demonstrate that SIQ not only quantifies voice understanding abilities but also provides unified comparisons between cascaded methods (e.g., ASR LLM) and end-to-end models, identifies annotation errors in existing benchmarks, and detects hallucinations in LLM Voice. Our framework represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks, while exposing overlooked challenges in multi-modal training.