Abstract
High-fidelity test data is essential in industrial settings where access to production data is largely restricted. Traditional data generation methods often fall short: they produce low-fidelity data and cannot model the complex data structures and semantic relationships that are critical for testing complex SQL code generation services such as Natural Language to SQL (NL2SQL). In this paper, we address the need for syntactically correct and semantically relevant high-fidelity mock data for complex data structures, including the columns with nested structures that we frequently encounter in Google workloads. We highlight the limitations of existing approaches used in production, particularly their inability to handle large and complex data structures and their lack of semantically coherent test data, which leads to limited test coverage. We demonstrate that by leveraging Large Language Models (LLMs) and incorporating strategic pre- and post-processing steps, we can generate syntactically correct and semantically relevant high-fidelity test data that adheres to complex structural constraints and maintains semantic integrity with respect to the SQL test targets (queries/functions). This approach supports comprehensive testing of complex SQL queries involving joins, aggregations, and even deeply nested subqueries, ensuring robust evaluation of SQL code generation services such as NL2SQL and SQL Code Assistant. Our results demonstrate the practical utility of LLM-based (\textit{gemini}) test data generation for industrial SQL code generation services, where generating high-fidelity test data is essential because production datasets are frequently unavailable or inaccessible for testing.