Abstract
The analysis and prediction of visual attention have long been crucial tasks in the fields of computer vision and image processing. In practical applications, images are generally accompanied by various text descriptions; however, few studies have explored the influence of text descriptions on visual attention, let alone developed visual saliency prediction models that consider text guidance. In this paper, we conduct a comprehensive study on text-guided image saliency (TIS) from both subjective and objective perspectives. Specifically, we construct a TIS database named SJTU-TIS, which includes 1200 text-image pairs and the corresponding collected eye-tracking data. Based on the established SJTU-TIS database, we analyze the influence of various text descriptions on visual attention. Then, to facilitate the development of saliency prediction models that consider text influence, we construct a benchmark for the established SJTU-TIS database using state-of-the-art saliency models. Finally, since text descriptions affect visual attention yet most existing saliency models ignore this effect, we further propose a text-guided saliency (TGSal) prediction model, which extracts and integrates both image features and text features to predict image saliency under various text-description conditions. Our proposed model significantly outperforms the state-of-the-art saliency models on both the SJTU-TIS database and the pure image saliency databases in terms of various evaluation metrics. The SJTU-TIS database and the code of the proposed TGSal model will be released at: https://github.com/IntMeGroup/TGSal.