Abstract
In this work, we propose GLOV, which enables Large Language Models (LLMs) to act as implicit optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. GLOV prompts an LLM with the downstream task description, querying it for suitable VLM prompts (e.g., for zero-shot classification with CLIP). These prompts are ranked according to their fitness for the downstream vision task. In each optimization step, the ranked prompts are fed as in-context examples (with their accuracies) to equip the LLM with knowledge of the type of prompts preferred by the downstream VLM. Furthermore, we explicitly guide the LLM's generation at each optimization step by adding an offset vector, calculated from the embedding differences between previous positive and negative solutions, to the intermediate layer of the network for the next generation. This offset vector biases the LLM generation toward the type of language the downstream VLM prefers, resulting in enhanced performance on the downstream vision tasks. We comprehensively evaluate GLOV on two tasks: object recognition and the critical task of enhancing VLM safety. GLOV improves object recognition performance by up to 15.0% and 57.5% for dual-encoder (e.g., CLIP) and encoder-decoder (e.g., LLaVA) models, respectively, and reduces the attack success rate (ASR) on state-of-the-art VLMs by up to 60.7%.
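
The two mechanisms described above, ranking prompts by fitness and steering generation with an offset vector, can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the offset is the difference between the mean embedding of positive solutions and the mean embedding of negative ones, and that it is added to a hidden state with a scaling factor. The function names and the `alpha` parameter are hypothetical, and plain Python lists stand in for LLM activations.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def rank_prompts(scored: Sequence[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Sort (prompt, accuracy) pairs from highest to lowest accuracy."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def offset_vector(positive: Sequence[Vector], negative: Sequence[Vector]) -> Vector:
    """Assumed form of the offset: mean(positive) - mean(negative)."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def apply_offset(hidden: Vector, offset: Vector, alpha: float = 1.0) -> Vector:
    """Add the scaled offset to a hidden state (alpha is a hypothetical knob)."""
    return [h + alpha * o for h, o in zip(hidden, offset)]


if __name__ == "__main__":
    # Toy prompts with made-up accuracies, used only to exercise the functions.
    ranked = rank_prompts([
        ("a photo of a {}", 0.61),
        ("a blurry {}", 0.42),
        ("a {} in the wild", 0.66),
    ])
    print(ranked[0][0])

    # Toy 2-D "embeddings" of positive (high-accuracy) and negative prompts.
    offset = offset_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0]])
    print(apply_offset([0.5, 0.5], offset, alpha=0.5))
```

In an actual system the vectors would be activations from an intermediate layer of the LLM, and the offset would be recomputed at each optimization step from the current best and worst prompts.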