LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

Abstract

Large language models (LLMs) provide excellent text-generation capabilities, but standard prompting and generation methods generally do not lead to intentional or goal-directed agents and might necessitate considerable prompt tuning. This becomes particularly apparent in multi-turn conversations: even the best current LLMs rarely ask clarifying questions, engage in explicit information gathering, or take actions now that lead to better decisions after multiple turns. Reinforcement learning has the potential to leverage the powerful modeling capabilities of LLMs, as well as their internal representation of textual interactions, to create capable goal-directed language agents. This can enable intentional and temporally extended interactions, such as with humans, through coordinated persuasion and carefully crafted questions, or in goal-directed play through text games to bring about desired final outcomes. However, enabling this requires the community to develop stable and reliable reinforcement learning algorithms that can effectively train LLMs. Developing such algorithms requires tasks that can gauge progress on algorithm design, provide accessible and reproducible evaluations for multi-turn interactions, and cover a range of task properties and challenges in improving reinforcement learning algorithms. Our paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs, together with an open-source research framework containing a basic toolkit for getting started on multi-turn RL with offline value-based and policy-based RL methods. Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
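
The claim that agents should "take actions now that lead to better decisions after multiple turns" can be made concrete with a toy episode. The sketch below is hypothetical and is not an LMRL-Gym task or the paper's code: `GuessingGame`, `guess_in_order`, `binary_search`, `run_episode`, and `mean_return` are illustrative names, the reward of -1 per turn is an assumption, and scripted policies stand in for an LLM. It only shows that a policy which spends turns gathering information can earn a higher return than one that acts myopically.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical action encoding: ("ask", lo, hi) or ("guess", i).
Action = Tuple
History = List[Tuple[Action, str]]
Policy = Callable[[History, int], Action]


@dataclass
class GuessingGame:
    """Toy multi-turn environment (hypothetical; not an LMRL-Gym task).

    A secret index is hidden among `n` candidates. Each turn the agent either
    asks whether the secret lies in [lo, hi) or guesses an index. Every turn
    costs -1, except a correct guess, which earns 0 and ends the episode.
    """
    n: int
    secret: int
    max_turns: int = 50

    def step(self, action: Action) -> Tuple[str, float, bool]:
        if action[0] == "ask":
            _, lo, hi = action
            answer = "yes" if lo <= self.secret < hi else "no"
            return answer, -1.0, False
        _, guess = action
        if guess == self.secret:
            return "correct", 0.0, True
        return "wrong", -1.0, False


def guess_in_order(history: History, n: int) -> Action:
    # Myopic policy: never gathers information, just guesses 0, 1, 2, ...
    return ("guess", len(history))


def binary_search(history: History, n: int) -> Action:
    # Information-gathering policy: rebuild the feasible interval from past
    # answers, ask about its lower half, and guess once one candidate is left.
    lo, hi = 0, n
    for action, obs in history:
        if action[0] == "ask":
            _, a, b = action
            if obs == "yes":
                lo, hi = a, b
            else:
                lo = b  # questions always cover [lo, mid), so "no" means [mid, hi)
    if hi - lo == 1:
        return ("guess", lo)
    return ("ask", lo, (lo + hi) // 2)


def run_episode(env: GuessingGame, policy: Policy) -> float:
    """Roll out one episode and return the undiscounted sum of rewards."""
    history: History = []
    total = 0.0
    for _ in range(env.max_turns):
        action = policy(history, env.n)
        obs, reward, done = env.step(action)
        history.append((action, obs))
        total += reward
        if done:
            break
    return total


def mean_return(policy: Policy, n: int = 16) -> float:
    # Average return over every possible secret in [0, n).
    returns = [run_episode(GuessingGame(n=n, secret=s), policy) for s in range(n)]
    return sum(returns) / len(returns)


print(mean_return(guess_in_order))  # -7.5
print(mean_return(binary_search))   # -4.0
```

With 16 candidates, asking four questions before guessing beats guessing immediately on average, even though each question yields no reward on its own turn. The questioning policy needs roughly log2(n) turns against about n/2 for in-order guessing, so the gap widens as n grows; an RL agent would have to discover such a strategy from delayed reward alone.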
