Outcome-based Reinforcement Learning to Predict the Future

  • 2025-05-26 16:34:33
  • Benjamin Turtel, Danny Franklin, Kris Skotheim, Luke Hewitt, Philipp Schoenegger

Abstract

Reinforcement learning with verifiable rewards (RLVR) has boosted math and coding in large language models, yet there has been little effort to extend RLVR into messier, real-world domains like forecasting. One sticking point is that outcome-based reinforcement learning for forecasting must learn from binary, delayed, and noisy rewards, a regime where standard fine-tuning is brittle. We show that outcome-only online RL on a 14B model can match frontier-scale accuracy and surpass it in calibration and hypothetical prediction market betting by adapting two leading algorithms, Group-Relative Policy Optimisation (GRPO) and ReMax, to the forecasting setting. Our adaptations remove per-question variance scaling in GRPO, apply baseline-subtracted advantages in ReMax, hydrate training with 100k temporally consistent synthetic questions, and introduce lightweight guard-rails that penalise gibberish, non-English responses and missing rationales, enabling a single stable pass over 110k events. Scaling ReMax to 110k questions and ensembling seven predictions yields a 14B model that matches frontier baseline o1 on accuracy on our holdout set (Brier = 0.193, p = 0.23) while beating it in calibration (ECE = 0.042, p < 0.001). A simple trading rule turns this calibration edge into $127 of hypothetical profit versus $92 for o1 (p = 0.037). This demonstrates that refined RLVR methods can convert small-scale LLMs into potentially economically valuable forecasting tools, with implications for scaling this to larger models.
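
The quantities named in the abstract can be illustrated with a minimal Python sketch. It shows the standard definitions of the Brier score and an equal-width-bin expected calibration error (ECE), plus one plausible reading of the two advantage computations: a group-mean-centred advantage with the per-question standard-deviation division made optional (GRPO), and a reward minus a baseline reward (ReMax). All function names, the 10-bin choice, the `eps` constant, and the use of a single scalar baseline per question are assumptions made for illustration; the abstract does not specify the reward function or implementation details.

```python
from statistics import mean, pstdev


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |mean forecast - observed rate|.

    The bin count is an assumption; the abstract does not state it.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_forecast = mean(p for p, _ in members)
        observed_rate = mean(y for _, y in members)
        ece += len(members) / total * abs(avg_forecast - observed_rate)
    return ece


def grpo_advantages(rewards, scale_by_std=False, eps=1e-8):
    """Advantages for one question's group of sampled responses.

    scale_by_std=True divides by the group standard deviation (the usual
    GRPO normalisation); False only subtracts the group mean, which is how
    this sketch interprets "remove per-question variance scaling".
    """
    centre = mean(rewards)
    centred = [r - centre for r in rewards]
    if not scale_by_std:
        return centred
    spread = pstdev(rewards) + eps
    return [c / spread for c in centred]


def remax_advantages(rewards, baseline_reward):
    """Baseline-subtracted advantages: sampled reward minus a baseline reward.

    In ReMax the baseline is typically the reward of a greedy-decoded
    response to the same prompt; here it is passed in as a plain number.
    """
    return [r - baseline_reward for r in rewards]
```

For example, `grpo_advantages([1.0, 0.0, 0.5])` centres the three rewards on their mean of 0.5 without rescaling them, so questions whose sampled responses barely differ contribute correspondingly small advantages rather than being inflated by a small standard deviation.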

 
