Abstract
Vision-Language Pretraining (VLP) models have recently successfully facilitated many cross-modal downstream tasks. Most existing works evaluate their systems by comparing fine-tuned downstream task performance. However, average downstream task accuracy alone provides little information about the pros and cons of each VLP method, let alone insights into how the community can improve these systems in the future. Inspired by the CheckList for testing natural language processing, we introduce VL-CheckList, a novel framework for understanding the capabilities of VLP models. The proposed method divides the image-texting ability of a VLP model into three categories: objects, attributes, and relations, and uses a novel taxonomy to further break down these three aspects. We conduct comprehensive studies to analyze seven recently popular VLP models via the proposed framework. Results confirm the effectiveness of the proposed method by revealing fine-grained differences among the compared models that were not visible from downstream task-only evaluation. Further results point to promising research directions for building better VLP models. Our data and code are available at: https://github.com/om-ai-lab/VL-CheckList.