Abstract
Reinforcement learning (RL) has emerged as a pivotal technique for fine-tuning large language models (LLMs) on specific tasks. However, prevailing RL fine-tuning methods predominantly rely on PPO and its variants. Though these algorithms are effective in general RL settings, they often exhibit suboptimal performance and vulnerability to distribution collapse when applied to the fine-tuning of LLMs. In this paper, we propose CORY, extending the RL fine-tuning of LLMs to a sequential cooperative multi-agent reinforcement learning framework, to leverage the inherent coevolution and emergent capabilities of multi-agent systems. In CORY, the LLM to be fine-tuned is initially duplicated into two autonomous agents: a pioneer and an observer. The pioneer generates responses based on queries, while the observer generates responses using both the queries and the pioneer's responses. The two agents are trained together. During training, the agents exchange roles periodically, fostering cooperation and coevolution between them. Experiments evaluate CORY's performance by fine-tuning GPT-2 and Llama-2 under subjective and objective reward functions on the IMDB Review and GSM8K datasets, respectively. Results show that CORY outperforms PPO in terms of policy optimality, resistance to distribution collapse, and training robustness, thereby underscoring its potential as a superior methodology for refining LLMs in real-world applications.
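
The pioneer/observer data flow and the periodic role exchange described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `cory_rollout`, `train_sketch`, and `swap_every`, the newline-joined observer prompt, and the toy string-valued agents are all assumptions made for illustration, and the RL update itself is omitted.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for an LLM policy: maps a prompt string to a response string.
Policy = Callable[[str], str]


def cory_rollout(pioneer: Policy, observer: Policy, query: str) -> Tuple[str, str]:
    """One forward pass: the pioneer sees only the query, while the
    observer sees the query together with the pioneer's response."""
    pioneer_response = pioneer(query)
    # Assumption: the observer's input is the query and the pioneer's
    # response joined by a newline; the actual prompt format may differ.
    observer_prompt = query + "\n" + pioneer_response
    observer_response = observer(observer_prompt)
    return pioneer_response, observer_response


def train_sketch(agent_a: Policy, agent_b: Policy,
                 queries: List[str], swap_every: int) -> List[Tuple[str, str]]:
    """Run rollouts and exchange the two roles every `swap_every` steps."""
    pioneer, observer = agent_a, agent_b
    log = []
    for step, query in enumerate(queries, start=1):
        log.append(cory_rollout(pioneer, observer, query))
        # An RL update of both agents would happen here; it is omitted.
        if step % swap_every == 0:
            pioneer, observer = observer, pioneer  # periodic role exchange
    return log


# Toy agents that simply wrap their input, so the data flow is visible.
def agent_a(prompt: str) -> str:
    return "A(" + prompt + ")"


def agent_b(prompt: str) -> str:
    return "B(" + prompt + ")"


log = train_sketch(agent_a, agent_b, ["q1", "q2", "q3"], swap_every=2)
```

With `swap_every=2`, agent A acts as pioneer for the first two queries and agent B takes over that role for the third, so each copy of the model is exposed to both roles over the course of training.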