Abstract
Commonsense is defined as the knowledge that is shared by everyone. However, certain types of commonsense knowledge are correlated with culture and geographic location and are shared only locally. For example, the scenarios of wedding ceremonies vary across regions due to different customs influenced by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models' ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art vision-and-language models, VisualBERT and ViLBERT, trained on VCR, a standard multimodal commonsense benchmark with images primarily from Western regions. We then evaluate how well the trained models generalize to answering the questions in GD-VCR. We find that the performance of both models on non-Western regions, including East Asia, South Asia, and Africa, is significantly lower than on Western regions. We analyze the reasons behind the performance disparity and find that the performance gap is larger on QA pairs that: 1) are concerned with culture-related scenarios, e.g., weddings, religious activities, and festivals; 2) require high-level geo-diverse commonsense reasoning rather than low-order perception and recognition. The dataset and code are released at https://github.com/WadeYin9712/GD-VCR.