Learning to Reason for Factuality

Abstract

Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly leveraging such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or less relevant responses. We propose a novel reward function that simultaneously considers the factual precision, response detail level, and answer relevance, and apply online RL to learn high-quality factual reasoning. Evaluated on six long-form factuality benchmarks, our factual reasoning model achieves an average reduction of 23.1 percentage points in hallucination rate, a 23% increase in answer detail level, and no degradation in the overall response helpfulness.
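
The abstract names three signals in the reward (factual precision, detail level, and relevance) but does not give a formula. The following is a minimal sketch of how such signals could be combined, not the authors' reward. The function name, the log-scaled detail bonus, the fixed penalty, and the parameters `detail_weight` and `relevance_threshold` are all illustrative assumptions. The claim counts and relevance score are assumed to come from an external verifier and judge.

```python
import math


def factuality_reward(
    num_supported: int,
    num_claims: int,
    relevance: float,
    detail_weight: float = 0.1,        # hypothetical weight, not from the paper
    relevance_threshold: float = 0.5,  # hypothetical cutoff, not from the paper
) -> float:
    """Illustrative reward combining precision, detail, and relevance.

    num_supported / num_claims: counts of verified and total atomic claims,
        assumed to come from a FActScore-style verifier.
    relevance: score in [0, 1], assumed to come from a separate judge.
    """
    if num_supported < 0 or num_supported > num_claims:
        raise ValueError("num_supported must be between 0 and num_claims")

    # Empty or off-topic responses get a fixed penalty, so the policy cannot
    # raise precision by saying nothing or by answering a different question.
    if num_claims == 0 or relevance < relevance_threshold:
        return -1.0

    # Factual precision: fraction of claims that are supported.
    precision = num_supported / num_claims

    # Detail bonus: grows with the number of supported claims, with
    # diminishing returns, so short but safe answers are not optimal.
    detail = math.log1p(num_supported)

    return precision + detail_weight * detail
```

Under these assumptions, a fully supported answer with 20 claims scores higher than a fully supported answer with 2 claims, and both score higher than an answer judged irrelevant. A reward based on precision alone would rate the first two equally, which is the kind of gap that invites the reward hacking described above.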
