Abstract
Synthetic data generation has emerged as an invaluable solution in scenarios where real-world data collection and usage are limited by cost and scarcity. Large language models (LLMs) have demonstrated remarkable capabilities in producing high-fidelity, domain-relevant samples across various fields. However, existing approaches that directly use LLMs to generate each record individually impose prohibitive time and cost burdens, particularly when large volumes of synthetic data are required. In this work, we propose a fast, cost-effective method for realistic tabular data synthesis that leverages LLMs to infer and encode each field's distribution into a reusable sampling script. By automatically classifying fields into numerical, categorical, or free-text types, the LLM generates distribution-based scripts that can efficiently produce diverse, realistic datasets at scale without continuous model inference. Experimental results show that our approach outperforms traditional direct methods in both diversity and data realism, substantially reducing the burden of high-volume synthetic data generation. We plan to apply this methodology to accelerate testing in production pipelines, thereby shortening development cycles and improving overall system efficiency. We believe our insights and lessons learned will aid researchers and practitioners seeking scalable, cost-effective solutions for synthetic data generation.
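To make the idea of a reusable sampling script concrete, the following is a minimal Python sketch of what such a script might look like once the field distributions have been inferred. It is an illustration under stated assumptions, not the implementation evaluated in this work: the field names, the distribution parameters, the `FIELD_SPECS` layout, and the helpers `sample_field` and `generate_rows` are all hypothetical, and drawing free-text values from a small fixed pool is a simplification chosen for brevity.

```python
import random

# Hypothetical field specification, standing in for what an LLM might infer
# once per schema. All names and parameters below are illustrative assumptions.
FIELD_SPECS = {
    "age": {
        "type": "numerical",
        "mean": 38.0,
        "std": 12.0,
        "min": 18,
        "max": 90,
    },
    "plan": {
        "type": "categorical",
        "values": ["free", "basic", "pro"],
        "weights": [0.6, 0.3, 0.1],
    },
    "note": {
        "type": "free_text",
        # Assumption: free text is drawn from a small pool prepared in advance.
        "pool": ["Requested invoice copy.", "Asked about upgrade.", "No issues."],
    },
}


def sample_field(spec, rng):
    """Draw one value for a field according to its declared type."""
    kind = spec["type"]
    if kind == "numerical":
        # Sample from a normal distribution, then clamp to the allowed range.
        value = rng.gauss(spec["mean"], spec["std"])
        return round(max(spec["min"], min(spec["max"], value)))
    if kind == "categorical":
        # Weighted draw over the category values.
        return rng.choices(spec["values"], weights=spec["weights"], k=1)[0]
    if kind == "free_text":
        # Uniform draw from the prepared text pool.
        return rng.choice(spec["pool"])
    raise ValueError(f"unknown field type: {kind}")


def generate_rows(specs, n, seed=None):
    """Generate n records without any further model inference."""
    rng = random.Random(seed)
    return [
        {name: sample_field(spec, rng) for name, spec in specs.items()}
        for _ in range(n)
    ]
```

Under these assumptions, a call such as `generate_rows(FIELD_SPECS, 100_000, seed=0)` produces all records locally, so the model is consulted once per schema and not once per record, which is where the time and cost savings would come from.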