Abstract
Vision-language pre-training has recently emerged as a promising alternative for representation learning. It shifts from the tradition of using images and discrete labels for learning a fixed set of weights, seen as visual concepts, to aligning images and raw text for two separate encoders. Such a paradigm benefits from a broader source of supervision and allows zero-shot transfer to downstream tasks, since visual concepts can be directly generated from natural language, known as a prompt. In this paper, we identify that a major challenge of deploying such models in practice is prompt engineering. This is because designing a proper prompt, especially for the context words surrounding a class name, requires domain expertise and typically takes a significant amount of time for word tuning, since a slight change in wording could have a huge impact on performance. Moreover, different downstream tasks require specific designs, further hampering the efficiency of deployment. To overcome this challenge, we propose a novel approach named context optimization (CoOp). The main idea is to model context in prompts using continuous representations and perform end-to-end learning from data while keeping the pre-trained parameters fixed. In this way, the design of task-relevant prompts can be fully automated. Experiments on 11 datasets show that CoOp effectively turns pre-trained vision-language models into data-efficient visual learners, requiring as few as one or two shots to beat hand-crafted prompts by a decent margin and gaining significant improvements when using more shots (e.g., at 16 shots the average gain is around 17%, with the highest reaching over 50%). CoOp also exhibits strong robustness to distribution shift.