Aligning Evaluation with Clinical Priorities: Calibration, Label Shift, and Error Costs

Abstract

Machine learning-based decision support systems are increasingly deployed in clinical settings, where probabilistic scoring functions are used to inform and prioritize patient management decisions. However, widely used scoring rules, such as accuracy and AUC-ROC, fail to adequately reflect key clinical priorities, including calibration, robustness to distributional shifts, and sensitivity to asymmetric error costs. In this work, we propose a principled yet practical evaluation framework for selecting calibrated thresholded classifiers that explicitly accounts for the uncertainty in class prevalences and domain-specific cost asymmetries often found in clinical settings. Building on the theory of proper scoring rules, particularly the Schervish representation, we derive an adjusted variant of cross-entropy (log score) that averages cost-weighted performance over clinically relevant ranges of class balance. The resulting evaluation is simple to apply, sensitive to clinical deployment conditions, and designed to prioritize models that are both calibrated and robust to real-world variations.
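
As an illustration of the general idea only, and not necessarily the exact formulation derived in the paper, the sketch below restricts the Schervish mixture behind the log score to a bounded range of decision thresholds. Under the Schervish representation, the log score is an average of cost-weighted 0/1 losses over all thresholds c in (0, 1) with weight 1/(c(1-c)); keeping only thresholds in an interval [a, b] yields a log score evaluated on clipped predictions. The function names and the interval endpoints `a` and `b` are hypothetical choices made for this example.

```python
import numpy as np

def bounded_log_score(y, p, a=0.05, b=0.5):
    """Log score restricted to thresholds c in [a, b] (illustrative sketch).

    Closed form of the integral of cost-weighted loss with weight
    1 / (c * (1 - c)) over c in [a, b]. The endpoints a and b are
    assumptions for this example, not values taken from the paper.
    """
    y = np.asarray(y, dtype=float)
    q = np.clip(np.asarray(p, dtype=float), a, b)   # predictions clipped to [a, b]
    loss_pos = np.log(b) - np.log(q)                # loss when the label is 1
    loss_neg = np.log1p(-a) - np.log1p(-q)          # loss when the label is 0
    return float(np.mean(y * loss_pos + (1.0 - y) * loss_neg))

def bounded_log_score_numeric(y, p, a=0.05, b=0.5, n=200_000):
    """Same quantity by direct midpoint integration over thresholds."""
    y = np.asarray(y, dtype=float)[:, None]
    p = np.asarray(p, dtype=float)[:, None]
    dc = (b - a) / n
    c = a + dc * (np.arange(n) + 0.5)               # midpoints of the threshold grid
    treat = p > c                                   # act as "positive" when p exceeds c
    # A missed positive costs (1 - c); a false alarm costs c.
    cost = y * (1.0 - c) * (~treat) + (1.0 - y) * c * treat
    weight = 1.0 / (c * (1.0 - c))                  # Schervish weight of the log score
    return float(np.mean(np.sum(cost * weight, axis=1) * dc))
```

With `a` near 0 and `b` near 1 this reduces to ordinary cross-entropy, while a narrower interval ignores differences between predictions that would lead to the same decision at every threshold in the range.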
