Language Models as Black-Box Optimizers for Vision-Language Models

Abstract

Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities across a variety of vision and multimodal tasks. Currently, fine-tuning methods for VLMs mainly operate in a white-box setting, requiring access to model parameters for backpropagation. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. Given that popular private large language models (LLMs) like ChatGPT still offer a language-based user interface, we aim to develop a novel fine-tuning approach for VLMs through natural language prompts, thereby avoiding the need to access model parameters, feature embeddings, or output logits. In this setup, we propose employing chat-based LLMs as black-box optimizers to search for the best text prompt on the illustrative task of few-shot image classification using CLIP. Specifically, we adopt an automatic "hill-climbing" procedure that converges on an effective prompt by evaluating the accuracy of current prompts and asking LLMs to refine them based on textual feedback, all within a conversational process without human-in-the-loop. In a challenging 1-shot learning setup, our simple approach surpasses the white-box continuous prompting method CoOp by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms OpenAI's manually crafted prompts and is more efficient than other black-box methods like iterative APE. Additionally, we highlight the advantage of conversational feedback incorporating both positive and negative prompts, suggesting that LLMs can utilize the implicit "gradient" direction in textual feedback for a more efficient search. Lastly, we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different CLIP architectures in a black-box manner.
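
The hill-climbing procedure described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names hill_climb_prompts, score_fn, and propose_fn are hypothetical, and both callables are replaced by toy stand-ins so the loop runs without any model. In the setting of the abstract, score_fn would presumably return the few-shot classification accuracy of CLIP for a prompt template, and propose_fn would send the best and worst prompts found so far to a chat-based LLM and parse its suggestions.

```python
from typing import Callable, Dict, List, Tuple


def hill_climb_prompts(
    initial_prompts: List[str],
    score_fn: Callable[[str], float],
    propose_fn: Callable[[List[str], List[str]], List[str]],
    n_iters: int = 5,
    k: int = 1,
) -> Tuple[str, float]:
    """Black-box search: only prompt strings and scalar scores are exchanged."""
    scored: Dict[str, float] = {p: score_fn(p) for p in initial_prompts}
    for _ in range(n_iters):
        ranked = sorted(scored, key=scored.get, reverse=True)
        positives = ranked[:k]    # best prompts so far (positive feedback)
        negatives = ranked[-k:]   # worst prompts so far (negative feedback)
        for candidate in propose_fn(positives, negatives):
            if candidate not in scored:  # score each new prompt only once
                scored[candidate] = score_fn(candidate)
    best = max(scored, key=scored.get)
    return best, scored[best]


# --- Toy stand-ins (assumptions for illustration only) ---------------------
VOCAB = ["a", "photo", "of", "blurry", "sketch"]
TARGET = {"a", "photo", "of"}


def toy_score(prompt: str) -> float:
    """Bag-of-words proxy for accuracy: reward target words, penalize others."""
    words = set(prompt.replace("{}", "").split())
    return len(words & TARGET) / len(TARGET) - 0.1 * len(words - TARGET)


def toy_propose(positives: List[str], negatives: List[str]) -> List[str]:
    """Stand-in for the LLM: extend good prompts, avoiding words that occur
    only in the bad prompts."""
    good_words = {w for p in positives for w in p.split()}
    bad_words = {w for p in negatives for w in p.split()} - good_words
    candidates = []
    for p in positives:
        for w in VOCAB:
            if w not in p.split() and w not in bad_words:
                candidates.append(f"{w} {p}")
    return candidates


best_prompt, best_score = hill_climb_prompts(
    ["{}", "blurry {}", "sketch {}"], toy_score, toy_propose
)
print(best_prompt, best_score)
```

Because the toy score ignores word order, the prompt it finds is only meaningful as a set of words. The point of the sketch is the control flow: the optimizer never sees parameters, embeddings, or logits, only prompts and their scores, and each round passes both positive and negative examples to the proposer.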
