Sparse Rewards Can Self-Train Dialogue Agents

  • 2025-07-18 17:06:00
  • Barrett Martin Lattimer, Varun Gangal, Ryan McDonald, Yi Yang

Abstract

Recent advancements in state-of-the-art (SOTA) Large Language Model (LLM) agents, especially in multi-turn dialogue tasks, have been primarily driven by supervised fine-tuning and high-quality human feedback. However, as base LLM models continue to improve, acquiring meaningful human feedback has become increasingly challenging and costly. In certain domains, base LLM agents may eventually exceed human capabilities, making traditional feedback-driven methods impractical. In this paper, we introduce a novel self-improvement paradigm that empowers LLM agents to autonomously enhance their performance without external human feedback. Our method, Juxtaposed Outcomes for Simulation Harvesting (JOSH), is a self-alignment algorithm that leverages a sparse reward simulation environment to extract ideal behaviors and further train the LLM on its own outputs. We present ToolWOZ, a sparse reward tool-calling simulation environment derived from MultiWOZ. We demonstrate that models trained with JOSH, both small and frontier, significantly improve tool-based interactions while preserving general model capabilities across diverse benchmarks. Our code and data are publicly available on GitHub at https://github.com/asappresearch/josh-llm-simulation-training

 
