Revisiting inference after prediction

Abstract

Recent work has focused on the very common practice of prediction-based inference: that is, (i) using a pre-trained machine learning model to predict an unobserved response variable, and then (ii) conducting inference on the association between that predicted response and some covariates. As pointed out by Wang et al. (2020), applying a standard inferential approach in (ii) does not accurately quantify the association between the unobserved (as opposed to the predicted) response and the covariates. In recent work, Wang et al. (2020) and Angelopoulos et al. (2023) propose corrections to step (ii) in order to enable valid inference on the association between the unobserved response and the covariates. Here, we show that the method proposed by Angelopoulos et al. (2023) successfully controls the type 1 error rate and provides confidence intervals with correct nominal coverage, regardless of the quality of the pre-trained machine learning model used to predict the unobserved response. However, the method proposed by Wang et al. (2020) provides valid inference only under very strong conditions that rarely hold in practice: for instance, if the machine learning model perfectly estimates the true regression function in the study population of interest.
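To make the contrast concrete, the following is a minimal simulation sketch, not the authors' code. It assumes a small labeled sample (response observed) alongside a large unlabeled sample, a linear data-generating model with true slope 2, and a deliberately poor stand-in for the pre-trained model; the names `simulate`, `ols`, and `f` are hypothetical. The correction is written in the spirit of the prediction-powered estimator of Angelopoulos et al. (2023): fit the regression on the predicted responses, then subtract a "rectifier" estimated from the prediction errors on the labeled sample. Only point estimates are shown; standard errors and confidence intervals are omitted.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for y ~ X (X already contains an intercept column)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def simulate(seed=0, n_labeled=2000, n_unlabeled=20000):
    """Compare naive and rectified slope estimates (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)

    def draw(n):
        x = rng.normal(size=n)
        y = 1.0 + 2.0 * x + rng.normal(size=n)  # assumed truth: slope = 2
        X = np.column_stack([np.ones(n), x])
        return X, x, y

    # Hypothetical stand-in for the pre-trained model: it badly shrinks the slope.
    def f(x):
        return 1.0 + 0.5 * x

    X_lab, x_lab, y_lab = draw(n_labeled)   # response observed
    X_unl, x_unl, _ = draw(n_unlabeled)     # response treated as unobserved

    # Step (ii) done naively: regress the *predicted* response on the covariate.
    naive = ols(X_unl, f(x_unl))

    # Rectifier: regression of the prediction error on the covariate, labeled data only.
    rectifier = ols(X_lab, f(x_lab) - y_lab)

    # Rectified estimate: naive fit minus the estimated bias of the predictions.
    corrected = naive - rectifier
    return naive, corrected


if __name__ == "__main__":
    naive, corrected = simulate()
    print("naive slope:    ", naive[1])
    print("corrected slope:", corrected[1])
```

Under these assumptions the naive slope recovers the model's slope (0.5) rather than the true one, while the rectified slope lands close to 2 even though the predictions are poor, which is the behavior the abstract attributes to the Angelopoulos et al. (2023) approach.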
