Synthetic Data Generation with LLM for Improved Depression Prediction

Abstract

Automatic detection of depression is a rapidly growing field of research at the intersection of psychology and machine learning. However, this rapid growth in interest brings growing concern over data privacy and scarcity, owing to the sensitivity of the topic. In this paper, we propose a pipeline in which Large Language Models (LLMs) generate synthetic data to improve the performance of depression prediction models. Starting from unstructured, naturalistic text data from recorded transcripts of clinical interviews, we use an open-source LLM to generate synthetic data through chain-of-thought prompting. The pipeline involves two key steps: the first generates a synopsis and sentiment analysis from the original transcript and depression score, while the second generates a synthetic synopsis and sentiment analysis from the summaries produced in the first step and a new depression score. Not only was the synthetic data satisfactory in terms of fidelity and privacy-preserving metrics, but it also balanced the distribution of severity in the training dataset, thereby significantly enhancing the model's capability in predicting the intensity of the patient's depression. By leveraging LLMs to generate synthetic data that can augment limited and imbalanced real-world datasets, we demonstrate a novel approach to addressing the data scarcity and privacy concerns commonly faced in automatic depression detection, all while maintaining the statistical integrity of the original dataset. This approach offers a robust framework for future mental health research and applications.
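
The two-step pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names, and the `llm` callable (any function that maps a prompt string to a completion string) are all assumptions, since the abstract does not specify the model, the prompts, or the scoring scale.

```python
from typing import Callable, Dict

# Assumption: the LLM is exposed as a plain "prompt in, text out" callable.
LLM = Callable[[str], str]

# Hypothetical prompt wording; the paper's actual prompts are not given here.
STEP1_PROMPT = (
    "Think step by step. Given the interview transcript and the depression "
    "score below, write a synopsis and a sentiment analysis.\n"
    "Depression score: {score}\n"
    "Transcript:\n{transcript}\n"
)
STEP2_PROMPT = (
    "Think step by step. Rewrite the synopsis and sentiment analysis below so "
    "that they are consistent with a new depression score.\n"
    "New depression score: {new_score}\n"
    "Original summary:\n{summary}\n"
)


def summarize_transcript(llm: LLM, transcript: str, score: int) -> str:
    """Step 1: synopsis and sentiment analysis from the real transcript and score."""
    return llm(STEP1_PROMPT.format(score=score, transcript=transcript))


def synthesize_summary(llm: LLM, summary: str, new_score: int) -> str:
    """Step 2: synthetic synopsis/sentiment from the step-1 summary and a new score."""
    return llm(STEP2_PROMPT.format(new_score=new_score, summary=summary))


def generate_synthetic_sample(
    llm: LLM, transcript: str, score: int, new_score: int
) -> Dict[str, object]:
    """Chain both steps and return the synthetic text labelled with the new score."""
    summary = summarize_transcript(llm, transcript, score)
    synthetic = synthesize_summary(llm, summary, new_score)
    return {"summary": summary, "synthetic": synthetic, "score": new_score}
```

In this sketch, only the step-1 summary, never the raw transcript, is passed to the second prompt, and the new score can be chosen to fill under-represented severity levels in the training set.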
