Abstract
In this paper, we introduce ConversaSynth, a framework designed to generate synthetic conversation audio using large language models (LLMs) with multiple persona settings. The framework first creates diverse and coherent text-based dialogues across various topics, which are then converted into audio using text-to-speech (TTS) systems. Our experiments demonstrate that ConversaSynth effectively generates high-quality synthetic audio datasets, which can significantly enhance the training and evaluation of models for audio tagging, audio classification, and multi-speaker speech recognition. The results indicate that the synthetic datasets generated by ConversaSynth exhibit substantial diversity and realism, making them suitable for developing robust, adaptable audio-based AI systems.