Abstract
Large Language Models (LLMs) often struggle when prompted to generate content under specific constraints. However, in such cases it is often easy to check whether these constraints are satisfied or violated. Recent works have shown that LLMs can benefit from such "corrective feedback". Here we claim that this skill of LLMs can be significantly enhanced via training. We introduce an RL framework for teaching models to use such rewards, by simulating interaction sessions, and rewarding the model according to its ability to satisfy the constraints. We refer to our method as CORGI (Controlled Generation with RL for Guided Interaction), and evaluate it on a variety of controlled generation tasks using unlabeled training data. We find that CORGI consistently outperforms the baseline reinforcement learning method that does not incorporate conversational feedback. Furthermore, CORGI's interactive framework enables meta-learning, allowing the LLM to generalize better to guided interaction in new tasks. Our results clearly show that conversational optimization, when combined with reinforcement learning, significantly improves the effectiveness of LLMs in controlled generation contexts.
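
To make the idea of a simulated interaction session concrete, the following is a minimal sketch, not the CORGI implementation. It assumes a toy word-limit constraint, a binary reward given when the constraint is met within the turn budget, and a stand-in `toy_model`; the names `check_max_words`, `simulate_session`, and `toy_model` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Optional, Tuple

Message = Tuple[str, str]  # (role, text)


def check_max_words(text: str, limit: int) -> Optional[str]:
    """Return None if the constraint holds, else a corrective feedback message."""
    n_words = len(text.split())
    if n_words <= limit:
        return None
    return f"Your answer has {n_words} words; use at most {limit}."


def simulate_session(
    model: Callable[[List[Message]], str],
    prompt: str,
    checker: Callable[[str], Optional[str]],
    max_turns: int = 3,
) -> Tuple[List[Message], float]:
    """Run one simulated session and return (history, reward).

    Assumption: reward is 1.0 if some reply within max_turns satisfies
    the constraint, and 0.0 otherwise.
    """
    history: List[Message] = [("user", prompt)]
    for _ in range(max_turns):
        reply = model(history)
        history.append(("assistant", reply))
        feedback = checker(reply)
        if feedback is None:
            return history, 1.0
        # Constraint violated: return the automatic check as a user turn.
        history.append(("user", feedback))
    return history, 0.0


def toy_model(history: List[Message]) -> str:
    """Hypothetical stand-in for an LLM: shortens its reply after each feedback."""
    n_feedback = sum(1 for role, _ in history if role == "user") - 1
    return " ".join(["word"] * max(1, 8 - 3 * n_feedback))


history, reward = simulate_session(
    toy_model, "Describe a corgi.", lambda t: check_max_words(t, 4)
)
```

In an RL setup, the returned reward would be the training signal for the policy that produced the replies; the choice of a binary, session-level reward here is an assumption made for brevity.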