Abstract
Language agents have become a promising solution to complex interactive tasks. One of the key ingredients in the success of language agents is the reward model on the trajectory of the agentic workflow, which provides valuable guidance during training or inference. However, due to the lack of annotations of intermediate interactions, most existing works use an outcome reward model to optimize policies across entire trajectories. This may lead to sub-optimal policies and hinder overall performance. To address this, we propose QLASS (Q-guided Language Agent Stepwise Search), which automatically generates annotations by estimating Q-values in a stepwise manner for open language agents. By introducing a reasoning tree and performing process reward modeling, QLASS provides effective intermediate guidance for each step. With this stepwise guidance, we propose a Q-guided generation strategy that enables language agents to better adapt to long-term value, resulting in significant performance improvement during model inference on complex interactive agent tasks. Notably, even with almost half the annotated data, QLASS retains strong performance, demonstrating its efficiency in handling limited supervision. We also empirically demonstrate through qualitative analysis that QLASS can lead to more effective decision making. We will release our code and data.