How well do LLMs reason over tabular data, really?

Abstract

Large Language Models (LLMs) excel in natural language tasks, but less is known about their reasoning capabilities over tabular data. Prior analyses devise evaluation strategies that poorly reflect an LLM's realistic performance on tabular queries. Moreover, we have a limited understanding of the robustness of LLMs towards realistic variations in tabular inputs. Therefore, we ask: "Can general-purpose LLMs reason over tabular data, really?", and focus on two questions: 1) are tabular reasoning capabilities of general-purpose LLMs robust to real-world characteristics of tabular inputs, and 2) how can we realistically evaluate an LLM's performance on analytical tabular queries? Building on a recent tabular reasoning benchmark, we first surface shortcomings of its multiple-choice prompt evaluation strategy, as well as commonly used free-form text metrics such as SacreBleu and BERT-score. We show that an LLM-as-a-judge procedure yields more reliable performance insights and unveil a significant deficit in tabular reasoning performance of LLMs. We then extend the tabular inputs reflecting three common characteristics in practice: 1) missing values, 2) duplicate entities, and 3) structural variations. Experiments show that the tabular reasoning capabilities of general-purpose LLMs suffer from these variations, stressing the importance of improving their robustness for realistic tabular inputs.
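
To make the three input variations concrete, the following is a minimal sketch of how a small table could be perturbed before it is serialized into a prompt. It is an illustration under stated assumptions, not the procedure used in the paper: the function names (`add_missing_values`, `add_duplicate_entities`, `shuffle_columns`), the toy `table`, and the choice of column shuffling as the structural variation are all hypothetical.

```python
import copy
import random

# Toy table as a list of row dictionaries (hypothetical data for illustration).
table = [
    {"id": 1, "city": "Amsterdam", "population": 900},
    {"id": 2, "city": "Rotterdam", "population": 650},
    {"id": 3, "city": "Utrecht", "population": 360},
    {"id": 4, "city": "Eindhoven", "population": 240},
]

def add_missing_values(rows, fraction, rng):
    """Return a copy with a `fraction` of all cells set to None."""
    out = copy.deepcopy(rows)
    cells = [(i, key) for i, row in enumerate(out) for key in row]
    for i, key in rng.sample(cells, k=int(len(cells) * fraction)):
        out[i][key] = None
    return out

def add_duplicate_entities(rows, n_duplicates, rng):
    """Return a copy with `n_duplicates` randomly chosen rows appended again."""
    out = copy.deepcopy(rows)
    for _ in range(n_duplicates):
        out.append(copy.deepcopy(rng.choice(rows)))
    return out

def shuffle_columns(rows, rng):
    """Return a copy with the column order permuted (one assumed kind of
    structural variation); cell values are left unchanged."""
    keys = list(rows[0])
    rng.shuffle(keys)
    return [{k: row[k] for k in keys} for row in rows]

rng = random.Random(0)  # fixed seed so the perturbations are reproducible
with_missing = add_missing_values(table, fraction=0.25, rng=rng)
with_duplicates = add_duplicate_entities(table, n_duplicates=2, rng=rng)
with_shuffled_columns = shuffle_columns(table, rng=rng)
```

Each helper returns a new table and leaves the original untouched, so the same query can be posed against the clean and perturbed versions and the answers compared.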
