Abstract
Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers, which can be populated scalably by distributed off-policy actors to enhance exploration as compute increases. We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA), a massively scalable LLM RL system. In contrast to existing approaches, TBA uses a larger fraction of compute on search, constantly generating off-policy data for a central replay buffer. A training node simultaneously samples data from this buffer based on reward or recency to update the policy using Trajectory Balance (TB), a diversity-seeking RL objective introduced for GFlowNets. TBA offers three key advantages: (1) decoupled training and search, speeding up training wall-clock time by 4x or more; (2) improved diversity through large-scale off-policy sampling; and (3) scalable search for sparse reward settings. On mathematical reasoning, preference-tuning, and automated red-teaming (diverse and representative post-training tasks), TBA produces speed and performance improvements over strong baselines.
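To make the two ingredients named above concrete, the following is a minimal sketch, not the system's implementation. It assumes the standard TB residual for autoregressive generation, where the backward policy is trivial and the loss reduces to the squared difference between log Z plus the sequence log-probability and the log-reward. The names `trajectory_balance_loss` and `ReplayBuffer` and the specific prioritization rule (weights proportional to the exponentiated log-reward) are illustrative assumptions. The abstract only states that sampling is based on reward or recency.

```python
import math
import random


def trajectory_balance_loss(log_z, token_logprobs, log_reward):
    """Squared TB residual for one sampled sequence.

    Assumes autoregressive generation, so the backward policy
    contributes nothing and only the forward log-probs appear.
    """
    residual = log_z + sum(token_logprobs) - log_reward
    return residual ** 2


class ReplayBuffer:
    """Hypothetical central buffer filled by off-policy actors."""

    def __init__(self):
        self.items = []  # (sequence, log_reward) pairs in insertion order

    def add(self, sequence, log_reward):
        self.items.append((sequence, log_reward))

    def sample(self, k, by="reward", rng=random):
        # Assumes a non-empty buffer and k >= 1.
        if by == "recency":
            # Most recently added entries.
            return self.items[-k:]
        # Reward-prioritized (assumed rule): weight ~ exp(log_reward),
        # shifted by the maximum for numerical stability.
        max_lr = max(lr for _, lr in self.items)
        weights = [math.exp(lr - max_lr) for _, lr in self.items]
        return rng.choices(self.items, weights=weights, k=k)
```

In this sketch the loss is zero exactly when log Z plus the policy's log-probability of a sequence equals its log-reward, which holds regardless of which policy generated the sequence. That property is what allows training on off-policy data drawn from a buffer.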