Abstract
Depression has proven to be a significant public health issue, profoundly affecting the psychological well-being of individuals. If it remains undiagnosed, depression can lead to severe health issues, which can manifest physically and even lead to suicide. Generally, clinicians and mental health professionals diagnose depression or any other mental disorder by conducting semi-structured interviews alongside supplementary questionnaires, including variants of the Patient Health Questionnaire (PHQ). This approach relies heavily on the experience and judgment of trained physicians, making the diagnosis susceptible to personal biases. Given that the underlying mechanisms causing depression are still being actively researched, physicians often face challenges in diagnosing and treating the condition, particularly in its early stages of clinical presentation. Recently, significant strides have been made in artificial neural computing to solve problems involving text, image, and speech in various domains. Our analysis aims to leverage these state-of-the-art (SOTA) models across multiple modalities to achieve optimal outcomes. The experiments were performed on the Extended Distress Analysis Interview Corpus Wizard of Oz (E-DAIC) dataset presented in the Audio/Visual Emotion Challenge (AVEC) 2019. The proposed solutions, based on proprietary and open-source Large Language Models (LLMs), achieved a Root Mean Square Error (RMSE) of 3.98 on the textual modality, beating the AVEC 2019 challenge baseline results and current SOTA regression analysis architectures. Additionally, the proposed solution achieved an accuracy of 71.43% in the classification task. The paper also includes a novel audio-visual multi-modal network that predicts PHQ-8 scores with an RMSE of 6.51.