How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey

Abstract

The exploration of various vision-language tasks, such as visual captioning, visual question answering, and visual commonsense reasoning, is an important area in artificial intelligence and continuously attracts the research community's attention. Despite the improvements in overall performance, classic challenges still exist in vision-language tasks and hinder the development of this area. In recent years, the rise of pre-trained models is driving the research on vision-language tasks. Thanks to the massive scale of training data and model parameters, pre-trained models have exhibited excellent performance in numerous downstream tasks. Inspired by the powerful capabilities of pre-trained models, new paradigms have emerged to solve the classic challenges. Such methods have become mainstream in current research with increasing attention and rapid advances. In this paper, we present a comprehensive overview of how vision-language tasks benefit from pre-trained models. First, we review several main challenges in vision-language tasks and discuss the limitations of previous solutions before the era of pre-training. Next, we summarize the recent advances in incorporating pre-trained models to address the challenges in vision-language tasks. Finally, we analyze the potential risks associated with the inherent limitations of pre-trained models and discuss possible solutions, attempting to provide future research directions.
