Abstract
Large pretrained models can be used as annotators, helping replace or augment crowdworkers and enabling the distillation of generalist models into smaller specialist models. Unfortunately, this comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls, while the resulting datasets are static and challenging to audit. To address these challenges, we propose a simple alternative: rather than directly querying labels from pretrained models, we task models with generating programs that can produce labels. These programs can be stored and applied locally, re-used and extended, and cost orders of magnitude less. Our system, Alchemist, obtains comparable to or better performance than large language model-based annotation in a range of tasks for a fraction of the cost: on average, performance improves by 12.9%, while the total labeling costs across all datasets are reduced by a factor of approximately 500.
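
To make the idea of labeling with programs rather than per-example model queries concrete, the following is a minimal sketch, not Alchemist's actual implementation. The spam-detection task, the three hand-written functions standing in for model-generated programs, the abstain convention, and the majority-vote aggregation are all assumptions introduced here for illustration.

```python
from collections import Counter
from typing import Callable, List

ABSTAIN = -1  # assumed convention: a program may decline to label an example

# Hypothetical stand-ins for programs a pretrained model might generate
# for a spam-detection task (1 = spam, 0 = not spam).
def has_link(text: str) -> int:
    return 1 if "http://" in text or "https://" in text else ABSTAIN

def mentions_prize(text: str) -> int:
    lowered = text.lower()
    return 1 if "prize" in lowered or "winner" in lowered else ABSTAIN

def short_plain_message(text: str) -> int:
    return 0 if len(text.split()) <= 5 and "http" not in text else ABSTAIN

def majority_vote(votes: List[int]) -> int:
    """Combine program outputs, ignoring abstentions (assumed aggregation rule)."""
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN  # no program produced a label
    return counted.most_common(1)[0][0]

def label_dataset(texts: List[str], programs: List[Callable[[str], int]]) -> List[int]:
    """Apply every stored program locally to every example; no API calls needed."""
    return [majority_vote([program(text) for program in programs]) for text in texts]

PROGRAMS = [has_link, mentions_prize, short_plain_message]

if __name__ == "__main__":
    examples = [
        "You are a winner! Claim at http://example.com",
        "hi there",
        "Meeting notes attached for the quarterly review of the project",
    ]
    print(label_dataset(examples, PROGRAMS))
```

In this sketch, the model would be queried once per program rather than once per example, so the stored programs can be re-run, inspected, or extended as the dataset grows without further API cost.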