Open-Source Large Language Models as Multilingual Crowdworkers: Synthesizing Open-Domain Dialogues in Several Languages With No Examples in Targets and No Machine Translation

Abstract

The prevailing paradigm in the domain of Open-Domain Dialogue agents predominantly focuses on the English language, encompassing both models and datasets. Furthermore, the financial and temporal investments required for crowdsourcing such datasets for finetuning are substantial, particularly when multiple languages are involved. Fortunately, advancements in Large Language Models (LLMs) have unveiled a plethora of possibilities across diverse tasks. Specifically, instruction-tuning has enabled LLMs to execute tasks based on natural language instructions, occasionally surpassing the performance of human crowdworkers. Additionally, these models possess the capability to function in various languages within a single thread. Consequently, to generate new samples in different languages, we propose leveraging these capabilities to replicate the data collection process. We introduce a pipeline for generating Open-Domain Dialogue data in multiple Target Languages using LLMs, with demonstrations provided in a single Source Language. By eschewing explicit Machine Translation in this approach, we enhance the adherence to language-specific nuances. We apply this methodology to the PersonaChat dataset. To enhance the openness of generated dialogues and mimic real-life scenarios, we add the notion of speech events, corresponding to the type of conversation the speakers are involved in, and that of common ground, which represents the premises of a conversation.
