G-Core: A Simple, Scalable and Balanced RLHF Trainer

Abstract

Reinforcement Learning from Human Feedback (RLHF) has become an increasingly popular paradigm for training large language models (LLMs) and diffusion models. While existing RLHF training systems have enabled significant progress, they often face challenges in scaling to multi-modal and diffusion workflows and adapting to dynamic workloads. In particular, current approaches may encounter limitations in controller scalability, flexible resource placement, and efficient orchestration when handling complex RLHF pipelines, especially in scenarios involving dynamic sampling or generative reward modeling. In this paper, we present G-Core, a simple, scalable, and balanced RLHF training framework designed to address these challenges. G-Core introduces a parallel controller programming model, enabling flexible and efficient orchestration of complex RLHF workflows without the bottlenecks of a single centralized controller. Furthermore, we propose a dynamic placement schema that adaptively partitions resources and schedules workloads, significantly reducing hardware idle time and improving utilization, even under highly variable training conditions. G-Core has successfully trained models that support WeChat product features serving a large-scale user base, demonstrating its effectiveness and robustness in real-world scenarios. Our results show that G-Core advances the state of the art in RLHF training, providing a solid foundation for future research and deployment of large-scale, human-aligned models.
