Leveraging Variation Theory in Counterfactual Data Augmentation for Optimized Active Learning

Abstract

Active Learning (AL) allows models to learn interactively from user feedback. This paper introduces a counterfactual data augmentation approach to AL, particularly addressing the selection of datapoints for user querying, a pivotal concern in enhancing data efficiency. Our approach is inspired by Variation Theory, a theory of human concept learning that emphasizes the essential features of a concept by focusing on what stays the same and what changes. Instead of just querying with existing datapoints, our approach synthesizes artificial datapoints that highlight potential key similarities and differences among labels using a neuro-symbolic pipeline combining large language models (LLMs) and rule-based models. Through an experiment in the example domain of text classification, we show that our approach achieves significantly higher performance when less annotated data is available. As the annotated training data grows, the impact of the generated data starts to diminish, which shows its capability to address the cold start problem in AL. This research sheds light on integrating theories of human learning into the optimization of AL.
