Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

Abstract

Prompt ensembling of Large Language Model (LLM) generated category-specific prompts has emerged as an effective method to enhance zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, the present methods rely on hand-crafting the prompts to the LLMs for generating VLM prompts for the downstream tasks. However, this requires manually composing these task-specific prompts and still, they might not cover the diverse set of visual concepts and task-specific styles associated with the categories of interest. To effectively take humans out of the loop and completely automate the prompt generation process for zero-shot recognition, we propose Meta-Prompting for Visual Recognition (MPVR). Taking as input only minimal information about the target task, in the form of its short natural language description, and a list of associated class labels, MPVR automatically produces a diverse set of category-specific prompts resulting in a strong zero-shot classifier. MPVR generalizes effectively across various popular zero-shot image recognition benchmarks belonging to widely different domains when tested with multiple LLMs and VLMs. For example, MPVR obtains a zero-shot recognition improvement over CLIP by up to 19.8% and 18.2% (5.0% and 4.5% on average over 20 datasets) leveraging GPT and Mixtral LLMs, respectively.
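To make the idea of prompt ensembling concrete, here is a minimal sketch of one common way to turn several category-specific prompts into a zero-shot classifier: embed each prompt, average the normalized embeddings per class, and score an image by cosine similarity. The abstract does not say how MPVR combines its prompts, so the averaging step is an assumption. The function names (`build_classifier`, `classify`) and the toy three-dimensional vectors are hypothetical stand-ins for real VLM text and image embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def build_classifier(prompt_embeddings):
    """Average the normalized prompt embeddings of each class.

    prompt_embeddings maps a class label to a list of embedding vectors,
    one per LLM-generated prompt for that class (assumed precomputed).
    """
    classifier = {}
    for label, embeds in prompt_embeddings.items():
        normed = [l2_normalize(e) for e in embeds]
        dim = len(normed[0])
        mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
        classifier[label] = l2_normalize(mean)
    return classifier


def classify(image_embedding, classifier):
    """Return the class whose ensembled embedding is most similar to the image."""
    img = l2_normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(img, weight))
        for label, weight in classifier.items()
    }
    return max(scores, key=scores.get), scores


# Toy vectors standing in for VLM text embeddings (hypothetical values).
toy_prompt_embeddings = {
    "cat": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "dog": [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]],
}
classifier = build_classifier(toy_prompt_embeddings)
label, scores = classify([0.7, 0.3, 0.0], classifier)
print(label, scores)
```

In a real setup, the toy vectors would be replaced by the text-encoder outputs of a VLM such as CLIP for each generated prompt, and the image vector by the corresponding image-encoder output.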
