Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning

Abstract

Large language models trained with reinforcement learning with verifiable rewards tend to trade accuracy for length, inflating response lengths to achieve gains in accuracy. While longer answers may be warranted for harder problems, many tokens are merely "filler": repetitive, verbose text that makes no real progress. We introduce GFPO (Group Filtered Policy Optimization), which curbs this length explosion by sampling larger groups per problem during training and filtering which responses to train on using two key metrics: (1) response length and (2) token efficiency (the reward-per-token ratio). By sampling more at training time, we teach models to think less at inference time. On the Phi-4-reasoning model, GFPO cuts GRPO's length inflation by 46-71% across challenging STEM and coding benchmarks (AIME 24/25, GPQA, Omni-MATH, LiveCodeBench) while maintaining accuracy. Optimizing for reward per token further increases reductions in length inflation to 71-85%. We also propose Adaptive Difficulty GFPO, which dynamically allocates more training resources to harder problems based on real-time difficulty estimates, improving the balance between computational efficiency and accuracy, especially on difficult questions. GFPO demonstrates that increased training-time compute directly translates to reduced test-time compute, a simple yet effective trade-off for efficient reasoning.
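The filtering step described in the abstract can be illustrated with a minimal Python sketch. It assumes each sampled response carries a token count and a scalar reward. The names `Response`, `filter_group`, `k`, and `metric` are hypothetical and chosen only for illustration. The abstract does not specify how the retained responses enter the policy update, so that part is left out.

```python
from dataclasses import dataclass


@dataclass
class Response:
    text: str
    num_tokens: int  # response length in tokens
    reward: float    # verifiable reward, e.g. 1.0 if correct, else 0.0


def filter_group(group, k, metric="length"):
    """Keep k responses from a sampled group, ranked by the chosen metric.

    Hypothetical helper: this is a sketch of the idea, not the authors' code.
    """
    if metric == "length":
        # Shortest responses first.
        key = lambda r: r.num_tokens
    elif metric == "token_efficiency":
        # Highest reward per token first (negated so ascending sort works).
        key = lambda r: -r.reward / max(r.num_tokens, 1)
    else:
        raise ValueError(f"unknown metric: {metric}")
    return sorted(group, key=key)[:k]


# A toy group of four sampled responses for one problem.
group = [
    Response("a", 1200, 1.0),
    Response("b", 300, 0.0),
    Response("c", 450, 1.0),
    Response("d", 900, 1.0),
]

shortest = [r.text for r in filter_group(group, 2, "length")]
efficient = [r.text for r in filter_group(group, 2, "token_efficiency")]
print(shortest)   # ['b', 'c']
print(efficient)  # ['c', 'd']
```

In this toy group, filtering by length alone keeps the short but incorrect response "b", while filtering by reward per token prefers short responses that also earn reward.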
