Distributionally Robust Token Optimization in RLHF

  • 2026-05-11 17:42:04
  • Yeping Jin, Jiaming Hu, Ioannis Ch. Paschalidis

Abstract

Large Language Models (LLMs) tend to respond correctly to prompts that align well with the data they were trained and fine-tuned on. Yet small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this problem, we propose Distributionally Robust Token Optimization (DRTO), an approach that combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO constructs f-divergence ambiguity sets over span-level actor losses, providing a principled way to emphasize difficult response segments during policy optimization. Empirically, DRTO improves consistency under distribution shift on multiple reasoning benchmarks across different tasks, gaining $+4.4$ percentage points on MATH-500 and $+2.7$ percentage points on LiveCodeBench over standard RTO.
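
To make the idea of emphasizing difficult response segments concrete, the following is a minimal sketch and not the paper's implementation. It assumes the f-divergence is the KL divergence, that the nominal distribution over spans is uniform, and that the dual variable is fixed as a temperature. Under those assumptions, the worst-case distribution of the KL-penalized problem is an exponential tilt of the span losses, $q_i \propto \exp(\ell_i / \tau)$. The function names, the `temperature` parameter, and the example loss values are all hypothetical.

```python
import math


def kl_tilted_weights(span_losses, temperature=1.0):
    """Return weights q_i proportional to exp(loss_i / temperature).

    Assumption: KL-penalized DRO over a uniform nominal distribution,
    with the dual variable fixed to `temperature` (hypothetical name).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum loss before exponentiating for numerical stability.
    max_loss = max(span_losses)
    exps = [math.exp((loss - max_loss) / temperature) for loss in span_losses]
    total = sum(exps)
    return [e / total for e in exps]


def robust_span_loss(span_losses, temperature=1.0):
    """Average the span losses under the tilted (worst-case) weights."""
    weights = kl_tilted_weights(span_losses, temperature)
    return sum(w * loss for w, loss in zip(weights, span_losses))


# Illustrative per-span actor losses (made-up values, not from the paper).
losses = [0.2, 0.5, 2.0]
weights = kl_tilted_weights(losses, temperature=1.0)
reweighted = robust_span_loss(losses, temperature=1.0)
```

In this sketch a small `temperature` concentrates weight on the hardest span, while a large `temperature` recovers the plain average of the span losses. How DRTO itself defines spans, chooses the divergence, and sets the ambiguity radius is not specified in the abstract.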

 
