Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?

Abstract

Recent advances have demonstrated the effectiveness of reinforcement learning (RL) finetuning in enhancing the reasoning capabilities of large language models (LLMs). The optimization process often requires numerous iterations to achieve satisfactory performance, resulting in high computational costs due to the need for frequent prompt evaluations under intensive LLM interactions and repeated policy updates. Appropriate online prompt selection methods reduce the number of iterations by prioritizing informative prompts during training, but the pipeline's reliance on exhaustive prompt evaluation and subset selection for optimization still incurs substantial computational overhead due to frequent LLM inference calls. In contrast to these direct evaluate-then-select schemes, this work investigates iterative approximate evaluation for arbitrary prompts and introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework that estimates prompt difficulty online without requiring costly LLM interactions. Technically, MoPPS models each prompt's success rate as a latent variable, performs streaming Bayesian inference, and employs posterior sampling in a constructed multi-armed bandit machine, enabling sample-efficient and adaptive prompt selection. Extensive experiments across mathematics, planning, and vision-based geometry tasks show that MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced LLM rollouts.
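
The three ingredients named in the abstract (a latent per-prompt success rate, streaming Bayesian updates, and posterior sampling over a bandit of prompts) can be illustrated with a minimal Beta-Bernoulli sketch. This is not the paper's implementation. The class name `BetaPromptSelector`, the uniform Beta(1, 1) prior, and the rule of preferring prompts whose sampled success rate is closest to a `target` of 0.5 are all assumptions made for illustration; the abstract does not specify them.

```python
import random


class BetaPromptSelector:
    """Hypothetical sketch: one Beta posterior per prompt (one bandit arm each)."""

    def __init__(self, num_prompts, alpha0=1.0, beta0=1.0):
        # Assumed prior: Beta(alpha0, beta0) over each prompt's latent success rate.
        self.alpha = [alpha0] * num_prompts
        self.beta = [beta0] * num_prompts

    def select(self, batch_size, target=0.5, rng=random):
        # Posterior (Thompson) sampling: draw one plausible success rate per
        # prompt, with no LLM call involved.
        sampled = [rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        # Assumed selection rule: prefer prompts whose sampled success rate is
        # closest to the target (neither too easy nor too hard).
        ranked = sorted(range(len(sampled)), key=lambda i: abs(sampled[i] - target))
        return ranked[:batch_size]

    def update(self, prompt_id, successes, failures):
        # Streaming conjugate update from the rollouts that training already
        # produced for the selected prompts.
        self.alpha[prompt_id] += successes
        self.beta[prompt_id] += failures

    def posterior_mean(self, prompt_id):
        # Current point estimate of the prompt's success rate.
        a, b = self.alpha[prompt_id], self.beta[prompt_id]
        return a / (a + b)
```

In a training loop under these assumptions, `select` would pick the next batch, the policy's rollouts on that batch would be scored, and `update` would fold the observed successes and failures back into the posteriors, so that only the selected prompts ever require LLM inference.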
