Abstract
Utilizing large language models (LLMs) for transforming natural languagequestions into SQL queries (text-to-SQL) is a promising yet challengingapproach, particularly when applied to real-world databases with complex andextensive schemas. In particular, effectively incorporating data catalogs anddatabase values for SQL generation remains an obstacle, leading to suboptimalsolutions. We address this problem by proposing a new pipeline that effectivelyretrieves relevant data and context, selects an efficient schema, andsynthesizes correct and efficient SQL queries. To increase retrieval precision,our pipeline introduces a hierarchical retrieval method leveragingmodel-generated keywords, locality-sensitive hashing indexing, and vectordatabases. Additionally, we have developed an adaptive schema pruning techniquethat adjusts based on the complexity of the problem and the model's contextsize. Our approach generalizes to both frontier proprietary models like GPT-4and open-source models such as Llama-3-70B. Through a series of ablationstudies, we demonstrate the effectiveness of each component of our pipeline andits impact on the end-to-end performance. Our method achieves newstate-of-the-art performance on the cross-domain challenging BIRD dataset.