Abstract
Recent advances in Text-to-SQL have achieved strong results in static, single-turn tasks, where models generate SQL queries from natural language questions. However, these systems fall short in real-world interactive scenarios, where user intents evolve and queries must be refined over multiple turns. In applications such as finance and business analytics, users iteratively adjust query constraints or dimensions based on intermediate results. To evaluate such dynamic capabilities, we introduce DySQL-Bench, a benchmark assessing model performance under evolving user interactions. Unlike previous manually curated datasets, DySQL-Bench is built through an automated two-stage pipeline of task synthesis and verification. Structured tree representations derived from raw database tables guide LLM-based task generation, followed by interaction-oriented filtering and expert validation. Human evaluation confirms 100% correctness of the synthesized data. We further propose a multi-turn evaluation framework simulating realistic interactions among an LLM-simulated user, the model under test, and an executable database. The model must adapt its reasoning and SQL generation as user intents change. DySQL-Bench covers 13 domains across BIRD and Spider 2 databases, totaling 1,072 tasks. Even GPT-4o attains only 58.34% overall accuracy and 23.81% on the Pass@5 metric, underscoring the benchmark's difficulty. All code and data are released at https://github.com/Aurora-slz/Real-World-SQL-Bench.
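
As an illustration of the three-party interaction described above, the following is a minimal sketch and not the released DySQL-Bench implementation. It assumes a toy in-memory SQLite table, a scripted list of user turns in place of the LLM-simulated user, and a lookup-table stand-in for the model under test; the names `run_dialogue`, `toy_model`, and the `sales` schema are hypothetical.

```python
import sqlite3


def run_dialogue(conn, user_turns, model):
    """Feed each user turn to the model, execute its SQL, return the last result."""
    history, result = [], None
    for utterance in user_turns:
        # The model under test proposes SQL given the new turn and the history.
        sql = model(utterance, history)
        try:
            # The executable database provides the feedback for this turn.
            result = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            result = f"error: {exc}"
        history.append((utterance, sql, result))
    return result


# Toy database (assumed schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EU", 2023, 10.0), ("EU", 2024, 15.0), ("US", 2024, 20.0)],
)


def toy_model(utterance, history):
    """Hypothetical stand-in for an LLM: maps each utterance to canned SQL."""
    canned = {
        "Total sales by region": (
            "SELECT region, SUM(amount) FROM sales "
            "GROUP BY region ORDER BY region"
        ),
        "Only 2024, please": (
            "SELECT region, SUM(amount) FROM sales WHERE year = 2024 "
            "GROUP BY region ORDER BY region"
        ),
    }
    return canned[utterance]


# The user's intent evolves: the second turn adds a year constraint.
final = run_dialogue(
    conn, ["Total sales by region", "Only 2024, please"], toy_model
)
# A task counts as solved here if the final result matches the expected rows.
task_passed = final == [("EU", 15.0), ("US", 20.0)]
```

In this sketch a task is judged only on the result of the final turn; the benchmark's actual scoring criteria may differ.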