Defeating the Training-Inference Mismatch via FP16

Abstract

Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating-point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that break the consistency between training and inference. In this work, we demonstrate that simply reverting to FP16 effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks, requires only a few lines of code, and needs no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms, and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.
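To make the rounding-error claim concrete, the following is a minimal sketch, not taken from the paper, that emulates the two 16-bit formats using only the Python standard library. The helper names `round_to_fp16` and `round_to_bf16` are hypothetical and introduced here for illustration. The sketch assumes round-to-nearest-even and ordinary finite inputs; it does not handle NaN, infinity, or overflow. FP16 keeps 10 mantissa bits, while BF16 keeps 7, so BF16 spaces representable values 8 times further apart at the same magnitude.

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (10 mantissa bits)."""
    # The struct format "e" packs a float as binary16 and rounds to nearest-even.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (7 mantissa bits).

    Illustrative emulation: bfloat16 is the upper 16 bits of a float32,
    so we round the float32 bit pattern to those bits (nearest-even).
    Assumes a finite input that does not overflow; NaN/inf are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a rounding bias: 0x7FFF plus the lowest kept bit gives ties-to-even.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFFFFFF
    bits &= 0xFFFF0000  # drop the 16 low-order mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    # Hypothetical token probabilities, chosen only for illustration.
    for p in (0.3, 0.7, 0.123456):
        fp16_err = abs(round_to_fp16(p) - p)
        bf16_err = abs(round_to_bf16(p) - p)
        print(f"p={p}: fp16 error={fp16_err:.2e}, bf16 error={bf16_err:.2e}")
```

For a value such as 0.3, the FP16 result lies within about 5e-5 of the input, while the BF16 result is off by roughly 8e-4. When the training and inference engines compute the same quantities through different kernels, per-operation errors of the larger size leave more room for their outputs to diverge, which is the mismatch the abstract describes.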
