Tracking the risk of a deployed model and detecting harmful distribution shifts

Abstract

When deployed in the real world, machine learning models inevitably encounter changes in the data distribution, and certain -- but not all -- distribution shifts could result in significant performance degradation. In practice, it may make sense to ignore benign shifts, under which the performance of a deployed model does not degrade substantially, making interventions by a human expert (or model retraining) unnecessary. While several works have developed tests for distribution shifts, these typically either use non-sequential methods, or detect arbitrary shifts (benign or harmful), or both. We argue that a sensible method for firing off a warning has to both (a) detect harmful shifts while ignoring benign ones, and (b) allow continuous monitoring of model performance without increasing the false alarm rate. In this work, we design simple sequential tools for testing if the difference between source (training) and target (test) distributions leads to a significant drop in a risk function of interest, like accuracy or calibration. Recent advances in constructing time-uniform confidence sequences allow efficient aggregation of statistical evidence accumulated during the tracking process. The designed framework is applicable in settings where (some) true labels are revealed after the prediction is performed, or when batches of labels become available in a delayed fashion. We demonstrate the efficacy of the proposed framework through an extensive empirical study on a collection of simulated and real datasets.
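
The monitoring idea described above can be illustrated with a minimal sketch. It assumes per-example losses bounded in [0, 1] and labels revealed one at a time. The function names, the `tolerance` parameter, and the even split of `alpha` between source and target are illustrative assumptions, not taken from the paper. The bound used here (Hoeffding's inequality combined with a union bound over time) is a deliberately crude stand-in for the tighter time-uniform confidence sequences the abstract refers to.

```python
import math


def source_upper_bound(losses, alpha):
    """Fixed-sample Hoeffding upper bound on the source risk (losses in [0, 1])."""
    n = len(losses)
    return sum(losses) / n + math.sqrt(math.log(1.0 / alpha) / (2 * n))


def target_lower_bound(loss_sum, t, alpha):
    """Lower bound on the running target risk, valid simultaneously for all t.

    The error budget alpha is spread over time as alpha / (t * (t + 1)),
    which sums to alpha over t = 1, 2, ...
    """
    alpha_t = alpha / (t * (t + 1))
    return loss_sum / t - math.sqrt(math.log(1.0 / alpha_t) / (2 * t))


def monitor(source_losses, target_losses, tolerance, alpha=0.05):
    """Return the first time step at which a harmful shift is flagged, else None.

    An alarm fires only when the target risk is confidently above the
    source risk by more than `tolerance`, so benign shifts are ignored.
    """
    upper = source_upper_bound(source_losses, alpha / 2)
    loss_sum = 0.0
    for t, loss in enumerate(target_losses, start=1):
        loss_sum += loss
        if target_lower_bound(loss_sum, t, alpha / 2) > upper + tolerance:
            return t
    return None


if __name__ == "__main__":
    source = [0.0] * 900 + [1.0] * 100       # source error rate 0.1
    benign = ([0.0] * 9 + [1.0]) * 200       # same error rate on the target
    harmful = [1.0] * 200                    # every target prediction is wrong
    print("benign alarm:", monitor(source, benign, tolerance=0.05))
    print("harmful alarm:", monitor(source, harmful, tolerance=0.05))
```

Because the lower bound holds at every time step simultaneously, the check can run after each newly revealed label without inflating the false alarm rate beyond `alpha`. The price of the union bound is a slower alarm than a purpose-built confidence sequence would give.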
