Understanding the Effects of Noise in Text-to-SQL: An Examination of the BIRD-Bench Benchmark

Abstract

Text-to-SQL, which involves translating natural language into Structured Query Language (SQL), is crucial for enabling broad access to structured databases without expert knowledge. However, designing models for such tasks is challenging due to numerous factors, including the presence of 'noise,' such as ambiguous questions and syntactical errors. This study provides an in-depth analysis of the distribution and types of noise in the widely used BIRD-Bench benchmark and the impact of noise on models. While BIRD-Bench was created to model dirty and noisy database values, it was not created to contain noise and errors in the questions and gold queries. We found that noise in questions and gold queries is prevalent in the dataset, with varying amounts across domains, and with an uneven distribution between noise types. The presence of incorrect gold SQL queries, which then generate incorrect gold answers, has a significant impact on the benchmark's reliability. Surprisingly, when evaluating models on corrected SQL queries, zero-shot baselines surpassed the performance of state-of-the-art prompting methods. We conclude that informative noise labels and reliable benchmarks are crucial to developing new Text-to-SQL methods that can handle varying types of noise.
