Abstract
Despite rapid advancements in TTS models, a consistent and robust human evaluation framework is still lacking. For example, MOS tests fail to differentiate between similar models, and CMOS's pairwise comparisons are time-intensive. The MUSHRA test is a promising alternative for evaluating multiple TTS systems simultaneously, but in this work we show that its reliance on matching human reference speech unduly penalises the scores of modern TTS systems that can exceed human speech quality. More specifically, we conduct a comprehensive assessment of the MUSHRA test, focusing on its sensitivity to factors such as rater variability, listener fatigue, and reference bias. Based on our extensive evaluation involving 471 human listeners across Hindi and Tamil we identify two primary shortcomings: (i) reference-matching bias, where raters are unduly influenced by the human reference, and (ii) judgement ambiguity, arising from a lack of clear fine-grained guidelines. To address these issues, we propose two refined variants of the MUSHRA test. The first variant enables fairer ratings for synthesized samples that surpass human reference quality. The second variant reduces ambiguity, as indicated by the relatively lower variance across raters. By combining these approaches, we achieve both more reliable and more fine-grained assessments. We also release MANGO, a massive dataset of 47,100 human ratings, the first-of-its-kind collection for Indian languages, aiding in analyzing human preferences and developing automatic metrics for evaluating TTS systems.