Scaling Synthetic Data Creation with 1,000,000,000 Personas

Abstract

We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub -- a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub's use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs and tools (functions) at scale, we demonstrate persona-driven data synthesis is versatile, scalable, flexible, and easy to use, potentially driving a paradigm shift in synthetic data creation and applications in practice, which may have a profound impact on LLM research and development.
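The core idea, pairing one synthesis task with many personas so that each prompt steers the LLM toward a different perspective, can be illustrated with a minimal sketch. The template wording, the function name `build_persona_prompts`, and the two example personas are assumptions made for illustration; they are not taken from the paper, and the step that sends each prompt to an LLM is left out.

```python
from typing import List

# Hypothetical prompt template; the exact wording is an assumption.
PROMPT_TEMPLATE = "Create a {task} with the following persona:\n{persona}"


def build_persona_prompts(personas: List[str], task: str) -> List[str]:
    """Return one data-synthesis prompt per non-empty persona description."""
    return [
        PROMPT_TEMPLATE.format(task=task, persona=persona.strip())
        for persona in personas
        if persona.strip()  # skip blank persona entries
    ]


# Illustrative personas only; a real collection would be far larger.
example_personas = [
    "a moving company driver",
    "a chemical kinetics researcher",
]

# Each prompt would then be passed to an LLM of choice to produce one sample.
prompts = build_persona_prompts(example_personas, "math problem")
for prompt in prompts:
    print(prompt)
    print("---")
```

Because the task string is held fixed while the persona varies, the diversity of the resulting data in this sketch comes entirely from the persona collection.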
