LinkAlign: Scalable Schema Linking for Real-World Large-Scale Multi-Database Text-to-SQL

Abstract

Schema linking is a critical bottleneck in achieving human-level performance in Text-to-SQL tasks, particularly in real-world large-scale multi-database scenarios. Schema linking poses two major challenges: (1) Database Retrieval: selecting the correct database from a large schema pool in multi-database settings while filtering out irrelevant ones. (2) Schema Item Grounding: accurately identifying the relevant tables and columns within a large and redundant schema for SQL generation. To address these challenges, we introduce LinkAlign, a novel framework that can effectively adapt existing baselines to real-world environments by systematically addressing schema linking. Our framework comprises three key steps: multi-round semantic enhanced retrieval and irrelevant information isolation for Challenge 1, and schema extraction enhancement for Challenge 2. We evaluate our method's schema linking performance on the SPIDER and BIRD benchmarks, and its ability to adapt existing Text-to-SQL models to real-world environments on the SPIDER 2.0-lite benchmark. Experiments show that LinkAlign outperforms existing baselines in multi-database settings, demonstrating its effectiveness and robustness. In addition, our method ranks highest among models excluding those using long chain-of-thought reasoning LLMs. This work bridges the gap between current research and real-world scenarios, providing a practical solution for robust and scalable schema linking. The code is available at https://github.com/Satissss/LinkAlign.
