Outcome-based Reinforcement Learning to Predict the Future

Abstract

Reinforcement learning with verifiable rewards (RLVR) has boosted math and coding in large language models, yet there has been little effort to extend RLVR into messier, real-world domains like forecasting. One sticking point is that outcome-based reinforcement learning for forecasting must learn from binary, delayed, and noisy rewards, a regime where standard fine-tuning is brittle. We show that outcome-only online RL on a 14B model can match frontier-scale accuracy and surpass it in calibration and hypothetical prediction market betting by adapting two leading algorithms, Group-Relative Policy Optimisation (GRPO) and ReMax, to the forecasting setting. Our adaptations remove per-question variance scaling in GRPO, apply baseline-subtracted advantages in ReMax, hydrate training with 100k temporally consistent synthetic questions, and introduce lightweight guard-rails that penalise gibberish, non-English responses and missing rationales, enabling a single stable pass over 110k events. Scaling ReMax to 110k questions and ensembling seven predictions yields a 14B model that matches frontier baseline o1 on accuracy on our holdout set (Brier = 0.193, p = 0.23) while beating it in calibration (ECE = 0.042, p < 0.001). A simple trading rule turns this calibration edge into $127 of hypothetical profit versus $92 for o1 (p = 0.037). This demonstrates that refined RLVR methods can convert small-scale LLMs into potentially economically valuable forecasting tools, with implications for scaling this to larger models.
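
As an illustration of the quantities named above, the following minimal Python sketch computes a Brier score, an equal-width-bin expected calibration error (ECE), and mean-centred group advantages with and without per-question standard-deviation scaling. It is not the authors' implementation: the function names, the ten-bin ECE, and the use of a negative squared error as the per-sample reward are all assumptions made for this example.

```python
from statistics import mean, pstdev


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Equal-width-bin ECE (bin count is an assumption, not from the paper)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            avg_forecast = mean(p for p, _ in b)
            observed_freq = mean(y for _, y in b)
            ece += len(b) / n * abs(avg_forecast - observed_freq)
    return ece


def group_advantages(rewards, scale_by_std=False, eps=1e-8):
    """Mean-centred advantages for the answers sampled for one question.

    scale_by_std=True divides by the group standard deviation, as in
    standard GRPO; False keeps only the mean subtraction, i.e. no
    per-question variance scaling.
    """
    mu = mean(rewards)
    centred = [r - mu for r in rewards]
    if scale_by_std:
        sd = pstdev(rewards)
        return [c / (sd + eps) for c in centred]
    return centred


# Hypothetical example: four sampled forecasts for a question that resolved "yes".
samples = [0.9, 0.7, 0.6, 0.2]
rewards = [-(p - 1) ** 2 for p in samples]  # assumed reward: negative squared error
print(group_advantages(rewards))
print(group_advantages(rewards, scale_by_std=True))
```

Without the standard-deviation division, a question whose sampled forecasts nearly agree produces small advantages, whereas the scaled variant inflates them to roughly unit magnitude regardless of how small the reward differences are.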
