Abstract
Detecting Mild Cognitive Impairment (MCI) from picture descriptions is critical yet challenging, especially in multilingual and multi-picture settings. Prior work has primarily focused on English speakers describing a single picture (e.g., the 'Cookie Theft'). The TAUKDIAL-2024 challenge expands this scope by introducing multilingual speakers and multiple pictures, which presents new challenges in analyzing picture-dependent content. To address these challenges, we propose a framework with three components: (1) enhancing discriminative representation learning via supervised contrastive learning, (2) incorporating the image modality rather than relying solely on the speech and text modalities, and (3) applying a Product of Experts (PoE) strategy to mitigate spurious correlations and overfitting. Our framework improves MCI detection performance over the text unimodal baseline, raising Unweighted Average Recall (UAR) by 7.1 percentage points (from 68.1% to 75.2%) and F1 score by 2.9 percentage points (from 80.6% to 83.5%). Notably, the contrastive learning component yields greater gains for the text modality than for speech. These results highlight our framework's effectiveness in multilingual and multi-picture MCI detection.
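For readers less familiar with the headline metric: UAR is the unweighted mean of per-class recalls, so the MCI and control classes count equally regardless of how many speakers fall in each. The following is a minimal sketch of that standard definition. The function name and the toy labels are illustrative assumptions and are not taken from the challenge data or the authors' evaluation code.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    total = defaultdict(int)    # number of true samples per class
    correct = defaultdict(int)  # correctly predicted samples per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Made-up labels for illustration only (1 = MCI, 0 = control).
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

# Recall for class 1 is 3/4 and for class 0 is 1/2, so UAR = 0.625.
uar = unweighted_average_recall(y_true, y_pred)
print(f"UAR = {uar:.3f}")
```

Because each class contributes one recall value to the mean, a classifier that always predicts the majority class scores only 0.5 on a two-class task, which is why UAR is informative when the classes are imbalanced.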