Batch Active Learning at Scale

Abstract

The ability to train complex and highly effective models often requires an abundance of training data, which can easily become a bottleneck in cost, time, and computational resources. Batch active learning, which adaptively issues batched queries to a labeling oracle, is a common approach for addressing this problem. The practical benefits of batch sampling come with the downside of less adaptivity and the risk of sampling redundant examples within a batch, a risk that grows with the batch size. In this work, we analyze an efficient active learning algorithm, which focuses on the large batch setting. In particular, we show that our sampling method, which combines notions of uncertainty and diversity, easily scales to batch sizes (100K-1M) several orders of magnitude larger than used in previous studies and provides significant improvements in model training efficiency compared to recent baselines. Finally, we provide an initial theoretical analysis, proving label complexity guarantees for a related sampling method, which we show is approximately equivalent to our sampling method in specific settings.
