Predictions as Surrogates: Revisiting Surrogate Outcomes in the Age of AI

Abstract

We establish a formal connection between the decades-old surrogate outcome model in biostatistics and economics and the emerging field of prediction-powered inference (PPI). The connection treats predictions from pre-trained models, prevalent in the age of AI, as cost-effective surrogates for expensive outcomes. Building on the surrogate outcomes literature, we develop recalibrated prediction-powered inference, a more efficient approach to statistical inference than existing PPI proposals. Our method departs from the existing proposals by using flexible machine learning techniques to learn the optimal "imputed loss" through a step we call recalibration. Importantly, the method always improves upon the estimator that relies solely on the data with available true outcomes, even when the optimal imputed loss is estimated imperfectly, and it achieves the smallest asymptotic variance among PPI estimators if the estimate is consistent. Computationally, our optimization objective is convex whenever the loss function that defines the target parameter is convex. We further analyze the benefits of recalibration, both theoretically and numerically, in several common scenarios where machine learning predictions systematically deviate from the outcome of interest. We demonstrate significant gains in effective sample size over existing PPI proposals via three applications leveraging state-of-the-art machine learning/AI models.
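
To make the idea of recalibration concrete, the following is a minimal sketch for the simplest target, a population mean. It is not the authors' implementation: the function names, the simulated data, and the choice of a cross-fitted linear map g(f) = a + b*f are all assumptions made for illustration, whereas the abstract describes learning the imputed loss with flexible machine learning. The sketch only shows the general shape of the approach: average a recalibrated prediction over the unlabeled data, then correct it with held-out labeled residuals.

```python
import numpy as np

def ppi_mean(y_lab, f_lab, f_unlab):
    """Basic prediction-powered mean: prediction average plus labeled bias correction."""
    return float(f_unlab.mean() + (y_lab - f_lab).mean())

def recalibrated_ppi_mean(y_lab, f_lab, f_unlab, n_folds=2, seed=0):
    """Hypothetical helper: cross-fitted linear recalibration of the predictions.

    For each fold, g(f) = a + b*f is fit on the other labeled folds, and the
    correction term is evaluated on the held-out fold only.
    """
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y_lab)) % n_folds  # near-equal fold sizes
    estimates = []
    for k in range(n_folds):
        train, held = folds != k, folds == k
        # np.polyfit with deg=1 returns [slope, intercept]
        b, a = np.polyfit(f_lab[train], y_lab[train], deg=1)
        g_held = a + b * f_lab[held]
        g_unlab = a + b * f_unlab
        estimates.append(g_unlab.mean() + (y_lab[held] - g_held).mean())
    return float(np.mean(estimates))

# Simulated data (assumption): the true mean of Y is 2, and the predictions
# are informative but systematically off in both scale and offset.
rng = np.random.default_rng(1)
n, N = 500, 20000                      # labeled and unlabeled sample sizes
signal_lab = 2 + rng.normal(size=n)
signal_unlab = 2 + rng.normal(size=N)
y_lab = signal_lab + rng.normal(scale=0.5, size=n)
f_lab = 3 * signal_lab + 1             # miscalibrated predictions, labeled set
f_unlab = 3 * signal_unlab + 1         # miscalibrated predictions, unlabeled set

classical = float(y_lab.mean())
ppi = ppi_mean(y_lab, f_lab, f_unlab)
recal = recalibrated_ppi_mean(y_lab, f_lab, f_unlab)
print(f"classical={classical:.3f}  ppi={ppi:.3f}  recalibrated={recal:.3f}")
```

In this simulated setup the raw predictions deviate systematically from the outcome, so the uncorrected residuals Y - f are noisy, while the residuals after recalibration are much smaller. That is the kind of scenario in which the abstract reports gains from recalibration; the sketch does not reproduce any result from the paper.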
