Abstract
Remote sensing visual question answering (RSVQA) opens new opportunities for the use of overhead imagery by the general public, by enabling human-machine interaction with natural language. Building on recent advances in natural language processing and computer vision, RSVQA aims to answer a question formulated in natural language about a remote sensing image. Language understanding is essential to the success of the task, but has not yet been thoroughly examined in RSVQA. In particular, the problem of language biases is often overlooked in the remote sensing community, which can impact model robustness and lead to wrong conclusions about the performance of the model. Thus, the present work aims at highlighting the problem of language biases in RSVQA with a threefold analysis strategy: visually blind models, adversarial testing and dataset analysis. This analysis focuses on both model and data. Moreover, we motivate the use of more informative and complementary evaluation metrics sensitive to the issue. The gravity of language biases in RSVQA is then exposed for all of these methods by training models that discard the image data and by manipulating the visual input during inference. Finally, a detailed analysis of the question-answer distribution demonstrates that the root of the problem lies in the data itself. Thanks to this analytical study, we observed that biases in remote sensing are more severe than in standard VQA, likely due to the specifics of existing remote sensing datasets for the task, e.g. geographical similarities and sparsity, as well as a simpler vocabulary and question generation strategies. While new, improved and less-biased datasets appear as a necessity for the development of the promising field of RSVQA, we demonstrate that more informed, relative evaluation metrics remain much needed to transparently communicate results of future RSVQA methods.