Abstract
Domain generalizability is a crucial aspect of a deep learning model since it determines the capability of the model to perform well on data from unseen domains. However, research on the domain generalizability of deep learning models for vision-language tasks remains limited, primarily because of the lack of required datasets. To address this challenge, we propose VolDoGer: Vision-Language Dataset for Domain Generalization, a dedicated dataset designed for domain generalization that addresses three vision-language tasks: image captioning, visual question answering, and visual entailment. We constructed VolDoGer by extending LLM-based data annotation techniques to vision-language tasks, thereby alleviating the burden of recruiting human annotators. Using VolDoGer, we evaluated the domain generalizability of various models, ranging from fine-tuned models to a recent multimodal large language model.