RLSF: Reinforcement Learning via Symbolic Feedback

  • 2024-10-06 00:17:18
  • Piyush Jha, Prithwish Jana, Pranavkrishna Suresh, Arnav Arora, Vijay Ganesh

Abstract

Reinforcement Learning with Human Feedback (RLHF) is considered a standard approach to fine-tuning Large Language Models (LLMs). However, such methods often face limitations such as unsound black-box reward models, difficulties in collecting human preference data, and the reliance on sparse scalar rewards. These methods often fall short when applied to tasks that require complex domain-specific understanding.

To address these challenges, we propose a new fine-tuning paradigm we refer to as Reinforcement Learning via Symbolic Feedback (RLSF), which aims to improve domain-specific understanding of LLMs more effectively than traditional reward signals. In the RLSF setting, the LLM being fine-tuned is considered an RL agent, while the environment is allowed access to reasoning or domain knowledge tools (e.g., solvers, provers, algebra systems, or knowledge bases). Crucially, in RLSF, these reasoning tools can provide feedback to the LLMs via poly-sized certificates (e.g., proofs) that characterize errors in the LLM-generated object with respect to some correctness specification. As a bonus, our RLSF approach does not require the reasoning systems we use to be differentiable. The ability of RLSF-based fine-tuning to leverage certificate-generating symbolic tools enables sound fine-grained (token-level) reward signals to LLMs, and thus addresses the limitations of traditional reward models mentioned above.

Via extensive evaluations, we show that our RLSF-based fine-tuning of LLMs outperforms traditional approaches on five different applications, namely, program synthesis from natural language pseudo-code to programming language, three chemistry tasks, and solving the Game of 24. A takeaway is that fine-tuning via RLSF enables relatively smaller LLMs to significantly outperform closed-source models that are orders of magnitude larger (e.g., GPT-4).
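To make the idea of token-level symbolic feedback concrete, here is a minimal sketch for the Game of 24 setting. It is an illustration only, not the authors' implementation: the function name `token_rewards`, the reward values (+1, 0, -1), and the choice of Python's `ast` module plus exact rational arithmetic as the "symbolic tool" are all assumptions made for this example. The checker flags individual tokens that violate the specification (a number that is not among the inputs, or is reused too often) and rewards every token only when the whole expression verifies.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction

# Hypothetical mapping from AST operator nodes to exact arithmetic.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}
_SYMBOLS = {"+", "-", "*", "/", "(", ")"}


def _evaluate(node):
    """Exactly evaluate a parsed arithmetic expression using rationals."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported syntax")


def token_rewards(tokens, numbers, target=24):
    """Return one reward per token (illustrative scheme, assumed values).

    -1.0 for a token the checker flags as an error,
    +1.0 for every token when the whole expression verifies,
     0.0 otherwise.
    """
    available = Counter(numbers)
    flagged = set()
    for i, tok in enumerate(tokens):
        if tok.isdecimal():
            n = int(tok)
            if available[n] > 0:
                available[n] -= 1      # consume one copy of this input number
            else:
                flagged.add(i)         # number not available (or reused)
        elif tok not in _SYMBOLS:
            flagged.add(i)             # token outside the allowed vocabulary

    verified = False
    if not flagged and sum(available.values()) == 0:
        try:
            value = _evaluate(ast.parse(" ".join(tokens), mode="eval"))
            verified = (value == target)
        except (SyntaxError, ValueError, ZeroDivisionError):
            verified = False

    return [-1.0 if i in flagged else (1.0 if verified else 0.0)
            for i in range(len(tokens))]
```

For example, with inputs `[1, 4, 7, 8]`, the candidate `( 8 - 4 ) * ( 7 - 1 )` verifies and every token is rewarded, whereas in `8 * 3 + 4 - 4` only the offending tokens (`3` and the second `4`) are penalized. A scalar reward would instead assign the same single value to the whole sequence.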

 
