Geometric Median Matching for Robust k-Subset Selection from Noisy Data

Abstract

Data pruning -- the combinatorial task of selecting a small andrepresentative subset from a large dataset, is crucial for mitigating theenormous computational costs associated with training data-hungry modern deeplearning models at scale. Since large scale data collections are invariablynoisy, developing data pruning strategies that remain robust even in thepresence of corruption is critical in practice. However, existing data pruningmethods often fail under high corruption rates due to their reliance onempirical mean estimation, which is highly sensitive to outliers. In response, we propose Geometric Median (GM) Matching, a novel k-subsetselection strategy that leverages Geometric Median -- a robust estimator withan optimal breakdown point of 1/2; to enhance resilience against noisy data.Our method iteratively selects a k-subset such that the mean of the subsetapproximates the GM of the (potentially) noisy dataset, ensuring robustnesseven under arbitrary corruption. We provide theoretical guarantees, showingthat GM Matching enjoys an improved O(1/k) convergence rate -- a quadraticimprovement over random sampling, even under arbitrary corruption. Extensiveexperiments across image classification and image generation tasks demonstratethat GM Matching consistently outperforms existing pruning approaches,particularly in high-corruption settings and at high pruning rates; making it astrong baseline for robust data pruning.

Quick Read (beta)

loading the full paper ...