Offline Reinforcement Learning for LLM Multi-Step Reasoning

Abstract

Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi-step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for multi-step reasoning tasks, and (2) it treats all tokens uniformly, making it ineffective for credit assignment in multi-step reasoning tasks, which often come with sparse rewards. In this work, we propose OREO (Offline Reasoning Optimization), an offline RL method for enhancing LLM multi-step reasoning. Building on insights from prior work on maximum entropy reinforcement learning, it jointly learns a policy model and a value function by optimizing the soft Bellman equation. We show in principle that it reduces the need to collect pairwise data and enables better credit assignment. Empirically, OREO surpasses existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH) and embodied agent control (ALFWorld). The approach can be extended to a multi-iteration framework when additional resources are available. Furthermore, the learned value function can be leveraged to guide tree search for free, which can further boost performance at test time.
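
To make the idea of jointly fitting a policy and a value function through the soft Bellman equation more concrete, the following is a minimal sketch and not the paper's implementation. It assumes the standard KL-regularized consistency condition from maximum entropy RL with deterministic transitions and no discounting, V(s_t) - V(s_{t+1}) = r_t - beta * (log pi(a_t|s_t) - log pi_ref(a_t|s_t)). The function names, the reference policy pi_ref, the coefficient beta, and the toy numbers are all illustrative assumptions; the abstract does not specify the exact objective.

```python
from typing import List


def soft_bellman_residuals(
    values: List[float],       # V(s_0..s_T); values[T] is the terminal value (assumed 0)
    logp_policy: List[float],  # log pi(a_t | s_t) for t = 0..T-1
    logp_ref: List[float],     # log pi_ref(a_t | s_t), hypothetical reference policy
    rewards: List[float],      # r_t; typically zero except at the final step (sparse reward)
    beta: float,               # KL-regularization strength (assumed hyperparameter)
) -> List[float]:
    """Per-step violation of the assumed soft Bellman consistency condition."""
    n_steps = len(rewards)
    if len(values) != n_steps + 1:
        raise ValueError("values must have one more entry than rewards")
    if len(logp_policy) != n_steps or len(logp_ref) != n_steps:
        raise ValueError("log-probabilities must align with rewards")
    residuals = []
    for t in range(n_steps):
        log_ratio = logp_policy[t] - logp_ref[t]
        # Zero when value and policy agree with the soft Bellman equation at step t.
        residuals.append(values[t] - values[t + 1] - rewards[t] + beta * log_ratio)
    return residuals


def consistency_loss(residuals: List[float]) -> float:
    """Mean squared residual; both policy and value would be trained to shrink it."""
    return sum(r * r for r in residuals) / len(residuals)


# Toy two-step trajectory with a single terminal reward of 1.
toy_residuals = soft_bellman_residuals(
    values=[1.0, 1.0, 0.0],
    logp_policy=[-1.0, -1.0],
    logp_ref=[-2.0, -1.0],
    rewards=[0.0, 1.0],
    beta=0.5,
)
toy_loss = consistency_loss(toy_residuals)
```

Because each step contributes its own residual, a loss of this shape needs only single trajectories with a final reward rather than preference pairs, and the per-step value differences indicate which steps account for the outcome. This is one way to read the abstract's claims about pairwise data and credit assignment.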
