The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators

Abstract

Large pretrained models can be used as annotators, helping replace or augment crowdworkers and enabling distilling generalist models into smaller specialist models. Unfortunately, this comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls, while the resulting datasets are static and challenging to audit. To address these challenges, we propose a simple alternative: rather than directly querying labels from pretrained models, we task models to generate programs that can produce labels. These programs can be stored and applied locally, re-used and extended, and cost orders of magnitude less. Our system, Alchemist, achieves performance comparable to or better than large language model-based annotation in a range of tasks for a fraction of the cost: on average, improvements amount to a 12.9% enhancement while the total labeling costs across all datasets are reduced by a factor of approximately 500x.
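To make the idea of labeling with generated programs concrete, the following is a minimal sketch under stated assumptions, not the Alchemist implementation. The three labeling programs are hand-written stand-ins for code a model might be asked to produce once, the sentiment task is invented for illustration, and combining program outputs by majority vote is our assumption; the abstract does not specify how outputs are aggregated. All function names are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Optional

# A program may decline to label an example by returning ABSTAIN.
ABSTAIN = None
Label = Optional[int]


# Hypothetical stand-ins for model-generated labeling programs
# (1 = positive sentiment, 0 = negative sentiment).
def lf_positive_words(text: str) -> Label:
    positive = {"great", "excellent", "love"}
    return 1 if any(w in positive for w in text.lower().split()) else ABSTAIN


def lf_negative_words(text: str) -> Label:
    negative = {"terrible", "awful", "hate"}
    return 0 if any(w in negative for w in text.lower().split()) else ABSTAIN


def lf_exclamation(text: str) -> Label:
    return 1 if text.count("!") >= 2 else ABSTAIN


def majority_vote(votes: List[Label]) -> Label:
    """Combine program outputs (assumed aggregation rule, not from the paper)."""
    cast = [v for v in votes if v is not ABSTAIN]
    if not cast:
        return ABSTAIN
    return Counter(cast).most_common(1)[0][0]


def label_dataset(texts: List[str],
                  programs: List[Callable[[str], Label]]) -> List[Label]:
    """Apply every stored program locally; no per-example model query is made."""
    return [majority_vote([p(t) for p in programs]) for t in texts]


if __name__ == "__main__":
    programs = [lf_positive_words, lf_negative_words, lf_exclamation]
    texts = [
        "I love this great phone",
        "awful battery, I hate it",
        "It arrived on Tuesday",
    ]
    print(label_dataset(texts, programs))
```

In this sketch the model would be queried only when the programs are written, so the cost of labeling does not grow with the number of examples, and each program can be read, audited, and edited like any other code.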
