Outcome-based Reinforcement Learning to Predict the Future

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has been an effective approach for improving Large Language Models' reasoning in domains such as coding and mathematics. Here, we apply RLVR methods to forecasting future real-world events, a challenging task for RL because the outcomes involved are very noisy (and delayed). Using a novel dataset of recent questions from a prediction market, and accompanying relevant news headlines, we show that a compact (14B) reasoning model can be trained to match or surpass the predictive accuracy of frontier models like o1, while greatly improving probabilistic calibration. The model's performance is also practically meaningful: in a Polymarket trading simulation, we estimate that its bets would have yielded a return on investment of over 10% across all questions in the test set. We detail and compare approaches used in training our model, including augmenting our training data with synthetic prediction questions, guardrails for learning stability, and median prediction sampling at inference time.
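As a rough illustration of median prediction sampling at inference time, the sketch below assumes the model is queried several times on the same question and that each query returns one probability. The function name `median_forecast`, the clipping step, and the sample values are hypothetical; the abstract does not specify how samples are drawn or how many are used.

```python
from statistics import median


def median_forecast(samples):
    """Aggregate several sampled probabilities into one forecast.

    Assumption (not from the paper): each sample is a probability for the
    same binary question, produced by an independent model query.
    """
    if not samples:
        raise ValueError("need at least one sampled probability")
    # Clip to [0, 1] so a malformed sample cannot push the forecast out of range.
    clipped = [min(1.0, max(0.0, p)) for p in samples]
    return median(clipped)


# Hypothetical samples for one question; 0.95 is an outlier.
samples = [0.62, 0.55, 0.70, 0.58, 0.95]
forecast = median_forecast(samples)
print(forecast)
```

One plausible motivation for using the median rather than the mean is robustness: a single extreme sample shifts the mean but leaves the median largely unchanged.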
