RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents

Abstract

The development of autonomous agents for complex, long-horizon tasks is a central goal in AI. However, dominant training paradigms face a critical limitation: reinforcement learning (RL) methods that optimize solely for final task success often reinforce flawed or inefficient reasoning paths, a problem we term inefficient exploration. This leads to agents that are brittle and fail to generalize, as they learn to find solutions without learning how to reason coherently. To address this, we introduce RLVMR, a novel framework that integrates dense, process-level supervision into end-to-end RL by rewarding verifiable, meta-reasoning behaviors. RLVMR equips an agent to explicitly tag its cognitive steps, such as planning, exploration, and reflection, and provides programmatic, rule-based rewards for actions that contribute to effective problem-solving. These process-centric rewards are combined with the final outcome signal and optimized using a critic-free policy gradient method. On the challenging ALFWorld and ScienceWorld benchmarks, RLVMR achieves new state-of-the-art results, with our 7B model reaching an 83.6% success rate on the most difficult unseen task split. Our analysis confirms these gains stem from improved reasoning quality, including significant reductions in redundant actions and enhanced error recovery, leading to more robust, efficient, and interpretable agents.
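As a rough illustration of the idea described above, the sketch below combines rule-based rewards for tagged steps with a final outcome signal and then computes critic-free advantages by normalizing returns within a group of sampled trajectories. Everything specific here is an assumption made for illustration and is not taken from the paper: the bonus values in `TAG_BONUS`, the `useful` flag standing in for a programmatic check, the `outcome_weight` parameter, and the choice of group-relative normalization as the critic-free estimator.

```python
from statistics import mean, pstdev

# Hypothetical per-tag bonuses; the abstract gives no numeric values.
TAG_BONUS = {"planning": 0.1, "exploration": 0.1, "reflection": 0.2}


def process_reward(steps):
    """Sum rule-based bonuses over tagged steps.

    Each step is a (tag, useful) pair. `useful` is a stand-in for a
    programmatic check that the step contributed to solving the task.
    """
    return sum(TAG_BONUS.get(tag, 0.0) for tag, useful in steps if useful)


def trajectory_return(steps, success, outcome_weight=1.0):
    """Combine the final outcome signal with the process-level reward."""
    return outcome_weight * float(success) + process_reward(steps)


def group_advantages(returns, eps=1e-8):
    """Critic-free advantages: normalize returns within a sampled group.

    This replaces a learned value function with the group mean as baseline
    (an assumed estimator, chosen here only because it needs no critic).
    """
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


if __name__ == "__main__":
    good = [("planning", True), ("reflection", True), ("exploration", False)]
    bad = [("exploration", False)]
    returns = [trajectory_return(good, True), trajectory_return(bad, False)]
    print(returns, group_advantages(returns))
```

In this sketch, a trajectory that succeeds and contains verified planning and reflection steps receives a higher return than one that merely succeeds, which is the sense in which the process signal is denser than the outcome signal alone.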
