Entropy-guided sequence weighting for efficient exploration in RL-based LLM fine-tuning

Abstract

We introduce Entropy-Guided Sequence Weighting (EGSW), a novel approach thatenhances the exploration-exploitation tradeoff by dynamically assigning weightsto generated outputs based on their advantage and entropy for ReinforcementLearning-based Large Language Model fine-tuning. EGSW integrates entropyregularization with advantage-based weighting to balance policy updates,enabling efficient exploration in high-dimensional state spaces. By employingtemperature-scaled softmax weighting over sequences, EGSW prioritizinghigh-reward, high-uncertainty steps while maintaining training stability.Although originally developed to improve Group Relative Policy Optimization(GRPO) during large language model (LLM) fine-tuning, EGSW is generalizable toother reinforcement learning (RL) algorithms and can be implemented in bothstep-wise and trajectory-wise settings. Empirical evaluations demonstrate thatEGSW enhances GRPO reasoning ability, yielding improvements in sampleefficiency. Future work will explore the application of EGSW to advanced RLmethodologies.

Quick Read (beta)

loading the full paper ...