Abstract
Image geolocalization, in which an AI model traditionally predicts the precise GPS coordinates of an image, is a challenging task with many downstream applications. However, the user cannot utilize the model to further their knowledge beyond the GPS coordinates; the model lacks an understanding of the location and the conversational ability to communicate with the user. Recently, with the tremendous progress of proprietary and open-source large multimodal models (LMMs), researchers have attempted to geolocalize images via LMMs. However, these issues remain unaddressed: beyond general tasks, LMMs struggle with more specialized downstream tasks, one of which is geolocalization. In this work, we propose to solve this problem by introducing a conversational model, GAEA, that can provide information regarding the location of an image, as required by a user. No large-scale dataset enabling the training of such a model exists. Thus, we propose a comprehensive dataset, GAEA, with 800K images and around 1.6M question-answer pairs constructed by leveraging OpenStreetMap (OSM) attributes and geographical context clues. For quantitative evaluation, we propose a diverse benchmark comprising 4K image-text pairs, equipped with diverse question types, to evaluate conversational capabilities. We consider 11 state-of-the-art open-source and proprietary LMMs and demonstrate that GAEA significantly outperforms the best open-source model, LLaVA-OneVision, by 25.69% and the best proprietary model, GPT-4o, by 8.28%. Our dataset, model, and code are available.