TabR1: Taming GRPO for tabular reasoning LLMs

Abstract

Tabular prediction has traditionally relied on gradient-boosted decision trees and specialized deep learning models, which excel within tasks but provide limited interpretability and weak transfer across tables. Reasoning large language models (LLMs) promise cross-task adaptability with transparent reasoning traces, yet their potential has not been fully realized for tabular data. This paper presents TabR1, the first reasoning LLM for tabular prediction with multi-step reasoning. At its core is Permutation Relative Policy Optimization (PRPO), a simple yet efficient reinforcement learning method that encodes column-permutation invariance as a structural prior. By constructing multiple label-preserving permutations per sample and estimating advantages both within and across permutations, PRPO transforms sparse rewards into dense learning signals and improves generalization. With limited supervision, PRPO activates the reasoning ability of LLMs for tabular prediction, enhancing few-shot and zero-shot performance as well as interpretability. Comprehensive experiments demonstrate that TabR1 achieves performance comparable to strong baselines under full-supervision fine-tuning. In the zero-shot setting, TabR1 approaches the performance of strong baselines under the 32-shot setting. Moreover, TabR1 (8B) substantially outperforms much larger LLMs across various tasks, achieving up to 53.17% improvement over DeepSeek-R1 (685B).
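To make the idea of estimating advantages "both within and across permutations" concrete, the following is a minimal Python sketch and not the paper's implementation. The abstract does not give the PRPO formula, so everything specific here is an assumption made for illustration: the helper names (`permute_columns`, `serialize`, `two_level_advantages`), the "name is value" serialization template, the GRPO-style mean/std normalization, and the mixing weight `alpha` that combines the two advantage levels.

```python
import random
import statistics


def permute_columns(row, n_perms, seed=0):
    """Return n_perms column orderings of one row (dict: column -> value).

    The first ordering is the original; the rest are random shuffles.
    The label is not part of `row`, so every ordering is label-preserving.
    """
    rng = random.Random(seed)
    cols = list(row)
    orders = [cols[:]]
    for _ in range(n_perms - 1):
        shuffled = cols[:]
        rng.shuffle(shuffled)
        orders.append(shuffled)
    return [[(c, row[c]) for c in order] for order in orders]


def serialize(cells):
    """Hypothetical text template for feeding a permuted row to an LLM."""
    return ", ".join(f"{name} is {value}" for name, value in cells)


def normalize(rewards, eps=1e-6):
    """GRPO-style group normalization: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def two_level_advantages(rewards, alpha=0.5):
    """rewards[p][k] is the reward of the k-th response for permutation p.

    Assumed combination: alpha * within-permutation advantage
    + (1 - alpha) * advantage over all responses of all permutations.
    """
    flat = [r for group in rewards for r in group]
    across = normalize(flat)
    combined, offset = [], 0
    for group in rewards:
        within = normalize(group)
        combined.append(
            [alpha * w + (1 - alpha) * across[offset + k]
             for k, w in enumerate(within)]
        )
        offset += len(group)
    return combined
```

Under these assumptions, a permutation whose responses all receive reward 0 would contribute no gradient with within-group normalization alone, since every advantage is zero. With the across-permutation term, calling `two_level_advantages([[0, 0], [1, 0]])` gives the all-zero group a negative advantage, because the other permutation produced a correct answer. This is one plausible reading of how sparse rewards could become a denser learning signal.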
