Abstract
Voice disorders significantly impact patient quality of life, yet non-invasive automated diagnosis remains under-explored due to both the scarcity of pathological voice data and the variability in recording sources. This work introduces MVP (Multi-source Voice Pathology detection), a novel approach that leverages transformers operating directly on raw voice signals. We explore three fusion strategies to combine sentence reading and sustained vowel recordings: waveform concatenation, intermediate feature fusion, and decision-level combination. Empirical validation across the German, Portuguese, and Italian languages shows that intermediate feature fusion using transformers best captures the complementary characteristics of both recording types. Our approach achieves up to +13% AUC improvement over single-source methods.