Abstract
Question answering (QA) agents automatically answer questions posed in natural language. In this work, we train QA agents to ask clarifying questions. The key idea in our method is to simulate conversations that contain clarifying questions and learn from them using reinforcement learning (RL). To make RL practical, we propose and analyze offline RL objectives that can be viewed as reward-weighted supervised fine-tuning (SFT) and are easily optimized in large language models. Our work stands in stark contrast to recently proposed methods, based on SFT and direct preference optimization, which have additional hyper-parameters and do not directly optimize rewards. We compare to these methods empirically and report gains in both optimized rewards and language quality.
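To make "reward-weighted SFT" concrete, the sketch below shows one common form of such an objective: the negative log-likelihood of each simulated conversation, scaled by its reward. This is an illustration under stated assumptions and not necessarily the exact objective analyzed in this work. The function names, the representation of a trajectory as a list of per-token probabilities, and the toy rewards are all hypothetical.

```python
import math


def sequence_log_prob(token_probs):
    """Log-probability of a response, given the model's per-token probabilities."""
    return sum(math.log(p) for p in token_probs)


def reward_weighted_sft_loss(trajectories):
    """Reward-weighted negative log-likelihood over logged trajectories.

    trajectories: list of (token_probs, reward) pairs, where token_probs are
    the probabilities the current model assigns to the logged tokens.
    Assumption: rewards are non-negative, so a higher reward means a stronger
    pull toward imitating that trajectory.
    """
    total = 0.0
    for token_probs, reward in trajectories:
        total += reward * sequence_log_prob(token_probs)
    return -total / len(trajectories)


# Hypothetical batch of two simulated conversations:
# the first earned reward 1.0, the second earned reward 0.0.
batch = [
    ([0.9, 0.8], 1.0),
    ([0.5, 0.5], 0.0),
]
loss = reward_weighted_sft_loss(batch)
```

With all rewards set to 1, this reduces to the ordinary SFT loss; a trajectory with zero reward contributes nothing to the loss. In practice the per-token probabilities would come from a language model and the loss would be minimized by gradient descent.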